Provide details about the system or application you want to monitor in {{args}}. Replace {{app_name}} with your application's name. Optionally, set {{metrics_path}} if the metrics endpoint is not /metrics.

Monitoring & Observability

Please help set up monitoring and observability for:

{{args}}

Observability Stack

1. Prometheus Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090

  - job_name: {{app_name}}
    metrics_path: {{metrics_path|default: /metrics}}
    static_configs:
      - targets:
          - {{app_name}}:8080

  - job_name: kube-state-metrics
    static_configs:
      - targets:
          - kube-state-metrics:8080

2. Alert Rules

groups:
  - name: {{app_name}}-alerts
    rules:
      - alert: HighErrorRate
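        # Aggregate both sides with sum(); dividing the raw series would only
        # match 5xx series against themselves and always yield a ratio of 1.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05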
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: Error rate is above 5% for the last 5 minutes

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, 
            rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High latency detected
          description: P95 latency is above 2 seconds

      - alert: HighMemoryUsage
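        # The "> 0" filter drops containers without a memory limit, whose
        # limit of 0 would otherwise produce a ratio of +Inf.
        expr: |
          container_memory_usage_bytes
          / (container_spec_memory_limit_bytes > 0) > 0.9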
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High memory usage
          description: Memory usage is above 90%

      - alert: HighCPUUsage
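        # quota / period is the number of CPU cores the container may use;
        # the parentheses are required because "/" is left-associative.
        expr: |
          rate(container_cpu_usage_seconds_total[5m])
          / (container_spec_cpu_quota / container_spec_cpu_period) > 0.9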
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage
          description: CPU usage is above 90%
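
The output requirements also ask for recording rules. As a rough sketch of what such a file might look like, the request-rate, error-ratio, and latency expressions used in the alerts above can be precomputed once and reused by alerts and dashboards. The group name, the recorded series names, and the 30s interval are illustrative assumptions, as is the presence of a job label on the application metrics.

```yaml
# Sketch only: group name, record names, and interval are assumptions.
groups:
  - name: http-recording-rules
    interval: 30s
    rules:
      # Per-job request rate over the last 5 minutes.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Per-job rate of 5xx responses over the last 5 minutes.
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      # Fraction of requests that failed with a 5xx status.
      - record: job:http_requests_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))

      # Per-job P95 latency; "le" must be kept for histogram_quantile.
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

An alert could then compare job:http_requests_errors:ratio_rate5m against a threshold instead of repeating the full expression.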

3. Grafana Dashboards

{
  "dashboard": {
    "title": "{{app_name}} Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum by (status) (rate(http_requests_total[5m]))",
            "legendFormat": "{{status}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))",
            "legendFormat": "5xx errors"
          }
        ]
      },
      {
        "title": "Latency P95",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P95 Latency"
          }
        ]
      }
    ]
  }
}

4. Distributed Tracing (Jaeger)

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: production
  collector:
    replicas: 2
    options:
      collector:
        zipkin:
          host-port: ":9411"
  query:
    replicas: 1
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy

5. Logging (Loki)

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: loki-stack
spec:
  size: 1x.small
  storage:
    secret:
      name: loki-storage
      type: s3
  tenants:
    mode: openshift-logging

Metrics to Collect

Application Metrics

Request rate (total, by endpoint)
Error rate (by status code)
Latency (p50, p95, p99)
Active connections
Queue lengths

System Metrics

CPU usage
Memory usage
Disk I/O
Network I/O
File descriptor usage

Business Metrics

User signups
Orders placed
Revenue
Conversion rate

Best Practices

SLIs/SLOs

Availability: 99.9%
Latency: P95 < 500ms
Error rate: < 0.1%
Throughput: X requests/second
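
To show how an SLO target can drive alerting, here is a minimal sketch of a fast-burn alert for the 99.9% availability target above. It assumes a 30-day SLO window, so the error budget is 0.1% of requests. A burn rate of 14.4 would use up 2% of that budget in one hour, which gives a threshold of 14.4 x 0.001 = 0.0144. The alert name, the group name, and the choice of 1h and 5m windows are illustrative assumptions.

```yaml
# Sketch only: names and window choices are assumptions.
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        # Fires only when both the long (1h) and short (5m) windows exceed
        # the threshold, so the alert resolves soon after errors stop.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: Error budget is burning 14.4x faster than the SLO allows
```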

Alerting

Use severity levels appropriately
Avoid alert fatigue
Set appropriate "for" durations so that brief spikes do not fire alerts
Include runbooks in annotations
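
As one possible illustration of these points, the Alertmanager sketch below routes alerts by the severity label, groups related alerts to reduce noise, and suppresses a warning while a critical alert with the same name is firing. The receiver names, webhook URLs, and timing values are hypothetical placeholders, not recommendations for any particular environment.

```yaml
# Sketch only: receiver names, URLs, and timings are hypothetical.
route:
  receiver: default
  group_by: [alertname, severity]
  group_wait: 30s        # wait to batch alerts that fire together
  group_interval: 5m     # minimum gap between updates for a group
  repeat_interval: 4h    # re-notify if the alert is still firing
  routes:
    # Critical alerts go to the on-call receiver.
    - matchers:
        - severity="critical"
      receiver: oncall

receivers:
  - name: default
    webhook_configs:
      - url: http://alert-webhook.example.internal:5001/default
  - name: oncall
    webhook_configs:
      - url: http://alert-webhook.example.internal:5001/oncall

inhibit_rules:
  # Mute a warning while a critical alert with the same name is active.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: [alertname]
```

A runbook link can be attached to each alert rule as an extra annotation, for example a runbook_url key next to summary and description.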

Dashboards

Use consistent colors
Include relevant time ranges
Add helpful descriptions
Group related metrics

Output Requirements

Provide:

Prometheus configuration
Alert rules
Grafana dashboards
Recording rules
Alertmanager configuration
Documentation for each metric
Runbooks for alerts
