Trace Issue
How to use
Describe the issue or error in place of {{args}} to get a step-by-step root cause analysis and troubleshooting guide.
Prompt
Root Cause Analysis - Trace Issue
Please perform a comprehensive root cause analysis for the following issue:
{{args}}
Root Cause Analysis Framework
1. Issue Definition
Problem Statement
- What is the observed issue?
- What is the expected behavior?
- What is the actual behavior?
- When was it first observed?
- How frequently does it occur?
Impact Assessment
- Who is affected? (All users, specific users, admins)
- How severe is the impact? (Critical, High, Medium, Low)
- What functionality is broken?
- Is there a workaround available?
2. Information Gathering
Symptoms Collection
- Error messages or logs
- Stack traces
- Screenshots or recordings
- User reports
- Monitoring/metrics data
Environmental Factors
- Environment: (Development, Staging, Production)
- Time of occurrence
- Frequency pattern (constant, intermittent, periodic)
- Affected platforms/browsers
- Affected user segments
Recent Changes
- Code deployments
- Configuration changes
- Infrastructure changes
- Dependency updates
- Data migrations
3. The 5 Whys Method
Use iterative "why" questioning to dig deeper:
Issue: Users can't log in
Why? → The authentication service is returning 500 errors
Why? → The database connection pool is exhausted
Why? → Connections are not being released properly
Why? → The ORM is not closing connections on error
Why? → Missing error handling in the database layer
Root Cause: Inadequate error handling causing connection leaks
4. Timeline Analysis
Create a timeline of events:
T-0: Issue first reported
T-3 hours: Last successful login
T-4 hours: Database migration deployed
T-6 hours: Traffic spike observed
Identify correlations between events and the issue.
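One way to make the correlation step concrete is to sort the collected events and express each one as an offset from the first report. The sketch below assumes events are recorded as objects with an ISO timestamp and a label. The `buildTimeline` helper and the timestamps are illustrative, not taken from a real incident.

```javascript
// Hypothetical helper: order events and label each relative to the first report.
function buildTimeline(events, reportedAt) {
  const t0 = Date.parse(reportedAt);
  return events
    .map((e) => ({ ...e, offsetHours: (Date.parse(e.at) - t0) / 3_600_000 }))
    .sort((a, b) => a.offsetHours - b.offsetHours)
    .map((e) => {
      const h = e.offsetHours;
      const tag = h === 0 ? "T-0" : `T${h < 0 ? "-" : "+"}${Math.abs(h)} hours`;
      return `${tag}: ${e.label}`;
    });
}

// Made-up timestamps that mirror the example timeline.
const timeline = buildTimeline(
  [
    { at: "2024-01-15T14:00:00Z", label: "Issue first reported" },
    { at: "2024-01-15T08:00:00Z", label: "Traffic spike observed" },
    { at: "2024-01-15T10:00:00Z", label: "Database migration deployed" },
    { at: "2024-01-15T11:00:00Z", label: "Last successful login" },
  ],
  "2024-01-15T14:00:00Z"
);
console.log(timeline.join("\n"));
```

Listing events oldest first makes it easier to see which change preceded the first failure.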
5. Reproduce the Issue
Reproduction Steps
- Detailed steps to reproduce
- Required preconditions
- Expected vs actual results
- Frequency of reproduction
Minimal Reproduction
- Simplest case that reproduces the issue
- Isolate variables
- Test in controlled environment
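When the suspect is one of several ordered changes (deployments, configuration edits, dependency updates), isolating variables can be approached as a binary search. The following is a minimal sketch, assuming the changes are ordered oldest to newest, the state with no changes applied is known to be good, and the failure persists once introduced. The `findFirstBadChange` name, the `isBroken` callback, and the change labels are all hypothetical.

```javascript
// Find the first change that introduces the failure.
// Assumptions: applying no changes is good, and the failure is monotonic
// (once a change breaks things, later states stay broken).
function findFirstBadChange(changes, isBroken) {
  if (!isBroken(changes)) return null; // not reproducible with every change applied
  let lo = 0; // invariant: applying changes[0..lo) is good
  let hi = changes.length; // invariant: applying changes[0..hi) is broken
  while (hi - lo > 1) {
    const mid = Math.floor((lo + hi) / 2);
    if (isBroken(changes.slice(0, mid))) hi = mid;
    else lo = mid;
  }
  return changes[hi - 1];
}

// Stand-in check: the failure appears once "db-migration" is applied.
const changes = ["config-update", "dep-bump", "db-migration", "feature-flag"];
const culprit = findFirstBadChange(changes, (applied) => applied.includes("db-migration"));
console.log(culprit);
```

In practice `isBroken` would run the minimal reproduction against a controlled environment with only the given changes applied.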
6. Hypothesis Formation
Potential Causes
List all potential root causes:
Hypothesis A: Database connection leak
- Evidence: Connection pool exhaustion
- Likelihood: High
- Test: Monitor connection usage
Hypothesis B: Memory leak in service
- Evidence: Increasing memory usage
- Likelihood: Medium
- Test: Profile memory over time
Hypothesis C: Network timeout misconfiguration
- Evidence: Intermittent failures
- Likelihood: Low
- Test: Check timeout settings
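Recording hypotheses as data keeps the evidence, likelihood, and test together and makes it easy to decide which test to run first. A small sketch follows. The field names and the ordering rule (highest likelihood first) are assumptions made for illustration.

```javascript
// Numeric rank for each likelihood label used above.
const LIKELIHOOD_RANK = { High: 3, Medium: 2, Low: 1 };

const hypotheses = [
  { id: "C", cause: "Network timeout misconfiguration", evidence: "Intermittent failures", likelihood: "Low", test: "Check timeout settings" },
  { id: "A", cause: "Database connection leak", evidence: "Connection pool exhaustion", likelihood: "High", test: "Monitor connection usage" },
  { id: "B", cause: "Memory leak in service", evidence: "Increasing memory usage", likelihood: "Medium", test: "Profile memory over time" },
];

// Return a new array ordered from most to least likely; the input is not mutated.
function prioritize(list) {
  return [...list].sort(
    (a, b) => LIKELIHOOD_RANK[b.likelihood] - LIKELIHOOD_RANK[a.likelihood]
  );
}

for (const h of prioritize(hypotheses)) {
  console.log(`Hypothesis ${h.id} (${h.likelihood}): ${h.test}`);
}
```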
7. Investigation Techniques
Log Analysis
# Search for errors in time window
grep "ERROR" app.log | grep "2024-01-15 14:"
# Count error occurrences (assumes the error type is the 5th space-separated field)
grep "ERROR" app.log | cut -d' ' -f5 | sort | uniq -c
# Correlate with other events
grep -A 5 -B 5 "Connection timeout" app.log
Code Analysis
- Review recent changes (git diff)
- Check related code paths
- Look for similar past issues
- Review error handling
Data Analysis
- Check database state
- Review recent data changes
- Analyze query performance
- Check for data anomalies
Performance Profiling
- CPU profiling
- Memory profiling
- Network analysis
- Database query analysis
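As a complement to the grep commands under Log Analysis, error counts can be bucketed by minute to show when a problem started and whether it is constant or bursty. This is a sketch only: it assumes each log line starts with `YYYY-MM-DD HH:MM:SS LEVEL`, which may not match your log format, and the sample lines are made up.

```javascript
// Count ERROR lines per minute.
// Assumed (hypothetical) line format: "YYYY-MM-DD HH:MM:SS LEVEL message".
function errorsPerMinute(logText) {
  const counts = new Map();
  for (const line of logText.split("\n")) {
    const match = /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2} ERROR\b/.exec(line);
    if (!match) continue;
    counts.set(match[1], (counts.get(match[1]) ?? 0) + 1);
  }
  return counts;
}

// Made-up sample lines for illustration.
const sampleLog = [
  "2024-01-15 14:01:07 ERROR Connection timeout",
  "2024-01-15 14:01:45 ERROR Connection timeout",
  "2024-01-15 14:02:03 INFO Retry succeeded",
  "2024-01-15 14:03:10 ERROR Pool exhausted",
].join("\n");

for (const [minute, count] of errorsPerMinute(sampleLog)) {
  console.log(`${minute}  ${count}`);
}
```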
8. Common Root Cause Patterns
Resource Exhaustion
- Memory leaks
- Connection pool exhaustion
- File descriptor limits
- Disk space issues
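Connection pool exhaustion often comes from a release call that is skipped on the error path. The sketch below contrasts a leaky pattern with a `try`/`finally` version. `ToyPool` is a hypothetical in-memory stand-in, not a real database driver, and the function names are illustrative.

```javascript
// Toy pool (hypothetical, for illustration only): tracks checked-out connections.
class ToyPool {
  constructor(size) {
    this.size = size;
    this.inUse = 0;
  }
  async acquire() {
    if (this.inUse >= this.size) throw new Error("pool exhausted");
    this.inUse += 1;
    return { release: () => { this.inUse -= 1; } };
  }
}

// Leaky: release() is skipped whenever work() throws.
async function leakyQuery(pool, work) {
  const conn = await pool.acquire();
  const result = await work(conn);
  conn.release();
  return result;
}

// Safe: the finally block runs on both success and failure.
async function safeQuery(pool, work) {
  const conn = await pool.acquire();
  try {
    return await work(conn);
  } finally {
    conn.release();
  }
}

// Run three failing queries through each variant and compare what stays checked out.
async function leakDemo() {
  const failing = async () => { throw new Error("query failed"); };
  const leaky = new ToyPool(5);
  const safe = new ToyPool(5);
  for (let i = 0; i < 3; i++) {
    await leakyQuery(leaky, failing).catch(() => {});
    await safeQuery(safe, failing).catch(() => {});
  }
  return { leaked: leaky.inUse, safeInUse: safe.inUse };
}

leakDemo().then((r) => console.log(r));
```

With the leaky variant every failed query permanently consumes one slot, so the pool eventually rejects all callers.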
Race Conditions
- Concurrent access issues
- Timing-dependent bugs
- Synchronization problems
- State inconsistencies
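A typical timing-dependent bug is a read-modify-write sequence with an `await` in the middle, which lets concurrent calls overwrite each other's updates. The following sketch reproduces a lost update and shows one possible mitigation: serializing calls through a promise chain. All names are hypothetical, and the `sleep` call stands in for real I/O.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Unsafe read-modify-write: the await between read and write lets other calls interleave.
async function unsafeIncrement(store) {
  const current = store.value;
  await sleep(1); // stands in for I/O such as a database round trip
  store.value = current + 1;
}

// Serialize calls by chaining each one onto the previous (a simple promise-based lock).
function makeSerialized(fn) {
  let tail = Promise.resolve();
  return (...args) => {
    const run = tail.then(() => fn(...args));
    tail = run.catch(() => {}); // keep the chain usable after a failure
    return run;
  };
}

async function raceDemo() {
  const racy = { value: 0 };
  await Promise.all(Array.from({ length: 10 }, () => unsafeIncrement(racy)));

  const guarded = { value: 0 };
  const safeIncrement = makeSerialized(unsafeIncrement);
  await Promise.all(Array.from({ length: 10 }, () => safeIncrement(guarded)));

  return { racy: racy.value, guarded: guarded.value };
}

raceDemo().then((r) => console.log(r));
```

Here all ten unguarded calls read the initial value before any of them writes, so nine updates are lost. The serialized version applies all ten.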
Configuration Issues
- Wrong environment variables
- Missing configuration
- Incorrect timeouts
- Misconfigured feature flags
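Many configuration issues can be surfaced at startup by checking settings against explicit rules instead of waiting for a runtime failure. The sketch below is one way to do that. The variable names, the rules, and the example URL are hypothetical.

```javascript
// Report missing or malformed settings. Returns a list of problem descriptions.
function validateConfig(env, rules) {
  const problems = [];
  for (const [name, check] of Object.entries(rules)) {
    const raw = env[name];
    if (raw === undefined || raw === "") {
      problems.push(`${name}: missing`);
    } else if (!check(raw)) {
      problems.push(`${name}: invalid value "${raw}"`);
    }
  }
  return problems;
}

// Hypothetical rules: each maps a variable name to a predicate on its string value.
const rules = {
  DATABASE_URL: (v) => v.startsWith("postgres://"),
  DB_POOL_SIZE: (v) => Number.isInteger(Number(v)) && Number(v) > 0,
  REQUEST_TIMEOUT_MS: (v) => Number.isInteger(Number(v)) && Number(v) >= 100,
};

// Example environment with one invalid and one missing setting.
const problems = validateConfig(
  { DATABASE_URL: "postgres://db.internal/app", DB_POOL_SIZE: "0" },
  rules
);
console.log(problems);
```

In a Node.js service the first argument would typically be `process.env`.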
Dependency Problems
- Version incompatibilities
- Breaking changes in dependencies
- Missing dependencies
- Conflicting dependencies
Data-Related
- Data corruption
- Missing data
- Invalid data format
- Schema mismatches
Infrastructure
- Network issues
- Server overload
- DNS problems
- Load balancer misconfiguration
9. Testing Hypotheses
For each hypothesis, design tests:
// Hypothesis: Connection leak in error path
// getConnectionCount() and triggerErrorCondition() are placeholders for your own test helpers.
async function testConnectionLeak() {
  const initialConnections = await getConnectionCount();

  // Trigger error condition 100 times
  for (let i = 0; i < 100; i++) {
    try {
      await triggerErrorCondition();
    } catch (e) {
      // Expected to fail
    }
  }

  const finalConnections = await getConnectionCount();
  const leaked = finalConnections - initialConnections;
  console.log(`Leaked connections: ${leaked}`);
  return leaked;
}
10. Fishbone Diagram (Cause & Effect)
Organize potential causes by category:
Problem: Authentication Failures
│
├─ People
│ ├─ Insufficient error handling in code
│ └─ Lack of monitoring
│
├─ Process
│ ├─ No rollback procedure
│ └─ Inadequate testing
│
├─ Technology
│ ├─ Database connection pool size
│ ├─ Network timeout configuration
│ └─ ORM connection management
│
├─ Environment
│ ├─ High traffic load
│ └─ Infrastructure capacity
│
└─ Data
  ├─ Database migration issues
  └─ Data corruption
11. Root Cause Determination
Evidence-Based Conclusion
- List all supporting evidence
- Rule out alternative causes
- Confirm through testing
- Verify fix resolves issue
Root Cause Statement
"The authentication failures are caused by {{specific technical cause}} which occurs when {{conditions}}, resulting in {{observed behavior}}."
12. Impact Analysis
Affected Components
- List all affected systems
- Dependency map
- Blast radius assessment
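Given a dependency map, the blast radius of a failing component can be estimated by walking everything downstream of it. The sketch below uses a breadth-first traversal over a hypothetical map. The component names and the `blastRadius` helper are illustrative assumptions.

```javascript
// Dependency map: each key lists the components that depend on it.
// The component names are hypothetical.
const dependents = {
  database: ["auth-service", "billing-service"],
  "auth-service": ["web-app", "mobile-api"],
  "billing-service": ["web-app"],
  "web-app": [],
  "mobile-api": [],
};

// Breadth-first walk from the failing component to everything downstream of it.
function blastRadius(graph, failed) {
  const affected = new Set();
  const queue = [failed];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const next of graph[current] ?? []) {
      if (!affected.has(next)) {
        affected.add(next);
        queue.push(next);
      }
    }
  }
  return [...affected];
}

console.log(blastRadius(dependents, "auth-service"));
console.log(blastRadius(dependents, "database"));
```

The visited set keeps the walk finite even if the map contains cycles.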
Affected Users
- User segments impacted
- Severity of impact
- Duration of impact
13. Solution Design
Immediate Fix (Hotfix)
- Quick mitigation
- Minimal risk changes
- Deploy urgently
Long-term Solution
- Proper fix addressing root cause
- Architecture improvements
- Prevention measures
Prevention
- Tests to add
- Monitoring to implement
- Alerts to create
- Documentation to write
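For the alerting item, a rule that requires a sustained breach avoids firing on a single spike. Below is a minimal sketch of such a rule applied to connection pool utilization. The threshold, the window length, and the sample values are made up for illustration and would need tuning against real metrics.

```javascript
// Hypothetical alert rule: fire only when the metric stays at or above the
// threshold for `consecutive` samples in a row.
function shouldAlert(samples, threshold, consecutive) {
  let streak = 0;
  for (const s of samples) {
    streak = s >= threshold ? streak + 1 : 0;
    if (streak >= consecutive) return true;
  }
  return false;
}

// Made-up utilization samples (fraction of the pool in use).
const utilization = [0.4, 0.92, 0.5, 0.91, 0.95, 0.97];
console.log(shouldAlert(utilization, 0.9, 3));
```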
14. Verification Plan
Verification Steps
- Apply fix in test environment
- Attempt to reproduce the original issue with the fix applied
- Verify issue is resolved
- Test edge cases
- Check for side effects
- Monitor in production
15. Output Format
Provide:
1. Executive Summary
- Issue description
- Root cause (one sentence)
- Impact
- Resolution
2. Detailed Analysis
- Timeline
- Investigation process
- Evidence collected
- Hypotheses tested
3. Root Cause
- Technical explanation
- Why it wasn't caught earlier
- Contributing factors
4. Solution
- Immediate fix
- Long-term solution
- Implementation steps
5. Prevention
- Tests to add
- Monitoring improvements
- Process changes
- Documentation updates
6. Lessons Learned
- What went wrong
- What went right
- What to improve
Generate a complete root cause analysis following this framework.