Set Up Monitoring & Observability Stack
Generates comprehensive configurations for monitoring and observability stacks, including Prometheus, Grafana, Jaeger, and Loki, for any specified application.
How to use
Provide details about the system or application you want to monitor in {{args}}. Replace {{app_name}} with your application's name. Optionally, specify {{metrics_path}} if different from /metrics.
Prompt
Monitoring & Observability
Please help set up monitoring and observability for:
{{args}}
Observability Stack
1. Prometheus Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
  - job_name: {{app_name}}
    metrics_path: {{metrics_path|default: /metrics}}
    static_configs:
      - targets:
          - {{app_name}}:8080
  - job_name: kube-state-metrics
    static_configs:
      - targets:
          - kube-state-metrics:8080
2. Alert Rules
groups:
  - name: {{app_name}}-alerts
    rules:
      - alert: HighErrorRate
        # Aggregate both sides before dividing; without sum() each 5xx series
        # is only matched against itself and the ratio is always 1.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: Error rate has been above 5% for the last 5 minutes
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High latency detected
          description: P95 latency is above 2 seconds
      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes /
          container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High memory usage
          description: Memory usage is above 90%
      - alert: HighCPUUsage
        # quota / period is the number of CPU cores the container may use;
        # the parentheses keep it from being evaluated as (rate / quota) / period.
        expr: |
          rate(container_cpu_usage_seconds_total[5m]) /
          (container_spec_cpu_quota / container_spec_cpu_period) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage
          description: CPU usage is above 90%
3. Grafana Dashboards
{
  "dashboard": {
    "title": "{{app_name}} Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{status}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ]
      },
      {
        "title": "Latency P95",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P95 Latency"
          }
        ]
      }
    ]
  }
}
4. Distributed Tracing (Jaeger)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  collector:
    replicas: 2
    options:
      collector:
        zipkin:
          # The option expects a host:port string, so the port is written as ":9411".
          host-port: ":9411"
  query:
    replicas: 1
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
5. Logging (Loki)
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: loki-stack
spec:
  size: 1x.small
  storage:
    secret:
      name: loki-storage
      type: s3
  tenants:
    mode: openshift-logging
Metrics to Collect
Application Metrics
- Request rate (total, by endpoint)
- Error rate (by status code)
- Latency (p50, p95, p99)
- Active connections
- Queue lengths
System Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
- File descriptor usage
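Host-level metrics such as these are usually exposed by node_exporter rather than by the application itself. A minimal sketch of an extra scrape job follows; the target host name node-exporter is an assumption (9100 is node_exporter's default port), so adjust it to your environment.

```yaml
# Sketch only: merge this job into the scrape_configs list of the
# Prometheus configuration. The target host name is hypothetical.
scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets:
          - node-exporter:9100
```

With this job in place, series such as node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_network_receive_bytes_total, and node_filefd_allocated become available for the system panels and alerts.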
Business Metrics
- User signups
- Orders placed
- Revenue
- Conversion rate
Best Practices
SLIs/SLOs
- Availability: 99.9%
- Latency: P95 < 500ms
- Error rate: < 0.1%
- Throughput: X requests/second
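The SLIs behind these targets can be precomputed with Prometheus recording rules so that dashboards and alerts query one cheap series instead of repeating the full expression. The sketch below assumes the metric names used in the alert rules (http_requests_total with a status label, and http_request_duration_seconds_bucket); the group name and the recorded series names are illustrative choices, not required values.

```yaml
groups:
  - name: slo-recording-rules   # hypothetical group name
    rules:
      # Error-rate SLI: fraction of requests answered with a 5xx status.
      - record: job:http_requests:error_ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Latency SLI: P95 request duration in seconds, aggregated per job.
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      # Throughput SLI: requests per second per job.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Under these assumptions, the error-rate target of 0.1% corresponds to job:http_requests:error_ratio_rate5m staying below 0.001, and the latency target of 500ms corresponds to job:http_request_duration_seconds:p95_5m staying below 0.5.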
Alerting
- Use severity levels appropriately
- Avoid alert fatigue
- Set an appropriate "for" duration on each alert so short spikes do not fire it
- Include runbooks in annotations
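As an illustration of these points, the rule below combines a severity label, a "for" duration, and a runbook link. Note that runbook_url is a widely used annotation convention rather than a field Prometheus interprets, and the group name and URL shown are placeholders, not real values.

```yaml
groups:
  - name: example-alerts   # hypothetical group name
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        # Fire only after the condition has held for 5 minutes.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: Error rate has been above 5% for the last 5 minutes
          # Placeholder URL; point this at the runbook for this alert.
          runbook_url: https://runbooks.example.com/high-error-rate
```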
Dashboards
- Use consistent colors
- Include relevant time ranges
- Add helpful descriptions
- Group related metrics
Output Requirements
Provide:
- Prometheus configuration
- Alert rules
- Grafana dashboards
- Recording rules
- Alertmanager configuration
- Documentation for each metric
- Runbooks for alerts
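For the Alertmanager configuration item, a starting point might look like the sketch below. It assumes the two severity values used in the alert rules (critical and warning); the receiver names and webhook URLs are hypothetical placeholders, and the timing values are illustrative rather than recommended.

```yaml
route:
  # Everything not matched by a child route goes to the default receiver.
  receiver: default
  # Bundle alerts that share a name and job into one notification.
  group_by: [alertname, job]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts are sent to the on-call receiver instead.
    - matchers:
        - severity="critical"
      receiver: oncall
receivers:
  - name: default
    webhook_configs:
      - url: http://example.internal/alerts   # placeholder URL
  - name: oncall
    webhook_configs:
      - url: http://example.internal/page     # placeholder URL
```

Grouping and a longer repeat_interval reduce the number of notifications sent for the same incident, which is one way to address alert fatigue.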