Set Up Monitoring & Observability Stack

Generates comprehensive configurations for monitoring and observability stacks, including Prometheus, Grafana, Jaeger, and Loki, for any specified application.

How to use

Describe the system or application you want to monitor in {{args}}. Replace {{app_name}} with your application's name. Optionally, set {{metrics_path}} if the metrics endpoint differs from the default /metrics.

Prompt

Monitoring & Observability

Please help set up monitoring and observability for:

{{args}}

Observability Stack

1. Prometheus Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090

  - job_name: {{app_name}}
    metrics_path: {{metrics_path|default: /metrics}}
    static_configs:
      - targets:
          - {{app_name}}:8080

  - job_name: kube-state-metrics
    static_configs:
      - targets:
          - kube-state-metrics:8080

2. Alert Rules

groups:
  - name: {{app_name}}-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: Error rate has been above 5% for the last 5 minutes

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High latency detected
          description: P95 latency is above 2 seconds

      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes /
          (container_spec_memory_limit_bytes > 0) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High memory usage
          description: Memory usage is above 90%

      - alert: HighCPUUsage
        expr: |
          rate(container_cpu_usage_seconds_total[5m]) /
          (container_spec_cpu_quota / container_spec_cpu_period) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage
          description: CPU usage is above 90%
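
The Prometheus configuration above forwards alerts to alertmanager:9093, and the rules label each alert with a severity of critical or warning. A minimal Alertmanager routing sketch based on those labels might look like the following. The receiver names and webhook URLs are hypothetical placeholders, not part of the original configuration.

```yaml
# alertmanager.yml (sketch; receiver names and URLs are assumptions)
route:
  receiver: default
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts page the on-call engineer and repeat more often.
    - matchers:
        - 'severity="critical"'
      receiver: oncall-pager
      repeat_interval: 1h
    # Warnings go to a lower-urgency channel.
    - matchers:
        - 'severity="warning"'
      receiver: team-chat

receivers:
  - name: default
  - name: oncall-pager
    webhook_configs:
      - url: http://pager-gateway.example.internal/alert   # hypothetical endpoint
  - name: team-chat
    webhook_configs:
      - url: http://chat-gateway.example.internal/alert    # hypothetical endpoint
```

Routing on the severity label keeps the alert rules themselves free of delivery details, so a receiver can be swapped without touching the rules.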

3. Grafana Dashboards

{
  "dashboard": {
    "title": "{{app_name}} Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum by (status) (rate(http_requests_total[5m]))",
            "legendFormat": "{{status}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))",
            "legendFormat": "5xx errors"
          }
        ]
      },
      {
        "title": "Latency P95",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "P95 Latency"
          }
        ]
      }
    ]
  }
}
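
The dashboard panels need data sources to query. A minimal sketch using Grafana's file-based data source provisioning could look like the following. The service addresses prometheus:9090, loki:3100, and jaeger-query:16686 are assumptions that do not appear elsewhere in this template; adjust them to match your deployment.

```yaml
# grafana/provisioning/datasources/datasources.yml (sketch; URLs are assumptions)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090        # assumed service address
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100              # assumed service address

  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger-query:16686     # assumed query service address
```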

4. Distributed Tracing (Jaeger)

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  collector:
    replicas: 2
    options:
      collector:
        zipkin:
          host-port: ":9411"
  query:
    replicas: 1
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy

5. Logging (Loki)

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: loki-stack
spec:
  size: 1x.small
  storage:
    secret:
      name: loki-storage
      type: s3
  tenants:
    mode: openshift-logging

Metrics to Collect

Application Metrics

  • Request rate (total, by endpoint)
  • Error rate (by status code)
  • Latency (p50, p95, p99)
  • Active connections
  • Queue lengths

System Metrics

  • CPU usage
  • Memory usage
  • Disk I/O
  • Network I/O
  • File descriptor usage

Business Metrics

  • User signups
  • Orders placed
  • Revenue
  • Conversion rate

Best Practices

SLIs/SLOs

  • Availability: 99.9%
  • Latency: P95 < 500ms
  • Error rate: < 0.1%
  • Throughput: X requests/second
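
SLIs such as these are usually easier to alert on and graph when they are precomputed. One possible set of recording rules is sketched below, assuming the http_requests_total and http_request_duration_seconds_bucket metrics used in the alert rules. The group name and the recorded series names are illustrative choices that follow the level:metric:operations naming convention.

```yaml
# rules/sli-recording.yml (sketch; group and record names are illustrative)
groups:
  - name: {{app_name}}-sli-recording
    interval: 30s
    rules:
      # Throughput SLI: requests per second, per job.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error-rate SLI: fraction of requests answered with a 5xx status.
      - record: job:http_requests_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Latency SLI: P95 request duration in seconds.
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

For scale, a 99.9% availability target leaves an error budget of 0.1%, which is about 43.2 minutes over a 30-day window.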

Alerting

  • Use severity levels appropriately
  • Avoid alert fatigue
  • Set an appropriate "for" duration on each alert
  • Include runbooks in annotations
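
As an illustration of these points, the rule below is a sketch of an alert with an explicit severity, a "for" duration, and a runbook link. The alert name, the 0.5 second threshold (taken from the P95 < 500ms target), and the runbooks.example.com URL are hypothetical and should be replaced with your own.

```yaml
# rules/slo-alerts.yml (sketch; alert name and runbook URL are hypothetical)
groups:
  - name: {{app_name}}-slo-alerts
    rules:
      - alert: LatencySLOBreach
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        # Require the condition to hold for 10 minutes to filter out short spikes.
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: P95 latency is above the 500ms SLO target
          runbook_url: https://runbooks.example.com/{{app_name}}/latency-slo-breach
```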

Dashboards

  • Use consistent colors
  • Include relevant time ranges
  • Add helpful descriptions
  • Group related metrics

Output Requirements

Provide:

  1. Prometheus configuration
  2. Alert rules
  3. Grafana dashboards
  4. Recording rules
  5. Alertmanager configuration
  6. Documentation for each metric
  7. Runbooks for alerts