01.Incident 1 — The Missing Log That Cost 6 Hours
A Lambda function was failing silently in production. No errors, no alerts — just documents that never got processed. The issue: an unhandled Promise rejection that didn't propagate to the Lambda runtime. Six hours of debugging later, we added structured logging to every async operation entry point. Now every function logs its input, output, and any exceptions — always.
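One way to make "log input, output, and exceptions at every async entry point" hard to forget is a small wrapper. The sketch below is an illustration, not the code we shipped: `withStructuredLogging` and `processDocument` are hypothetical names, and it assumes one JSON object per log line is acceptable to your log pipeline.

```javascript
// Hypothetical helper: wraps an async handler so that its input, output,
// and any thrown error are each emitted as one JSON log line.
function withStructuredLogging(name, handler, log = console.log) {
  return async function wrapped(input) {
    log(JSON.stringify({ level: "info", fn: name, event: "start", input }));
    try {
      const output = await handler(input);
      log(JSON.stringify({ level: "info", fn: name, event: "success", output }));
      return output;
    } catch (err) {
      log(JSON.stringify({
        level: "error",
        fn: name,
        event: "failure",
        message: err instanceof Error ? err.message : String(err),
        stack: err instanceof Error ? err.stack : undefined,
      }));
      // Rethrow so the caller (for example the Lambda runtime) sees the failure.
      throw err;
    }
  };
}

// Usage with a hypothetical document processor.
const processDocument = withStructuredLogging("processDocument", async (doc) => {
  if (!doc.id) throw new Error("document has no id");
  return { id: doc.id, status: "processed" };
});
```

The rethrow matters: logging the error and then swallowing it would recreate the silent failure. If inputs can contain sensitive data, redact them before logging.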
02.Incident 2 — The N+1 That Took Down the Database
A new feature shipped that accidentally triggered 200+ database queries per API request under load. The ORM was lazy-loading relations in a loop. Database CPU hit 100%, connections were exhausted, and the service became unresponsive within minutes of a traffic spike. The fix took 10 minutes. The detection took 3 hours. Lesson: always test with production-like data volumes.
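The shape of the bug is easier to see in code. The sketch below does not use our ORM; it uses an in-memory stand-in that counts queries, and every name in it (`createFakeDb`, `findAuthorById`, `findAuthorsByIds`) is hypothetical. It contrasts one query per item in a loop with a single batched lookup.

```javascript
// In-memory stand-in for a database that counts how many queries it receives.
function createFakeDb(authors) {
  const db = {
    queryCount: 0,
    async findAuthorById(id) {
      db.queryCount += 1;
      return authors.find((a) => a.id === id) ?? null;
    },
    async findAuthorsByIds(ids) {
      db.queryCount += 1; // one query, like WHERE id IN (...)
      const wanted = new Set(ids);
      return authors.filter((a) => wanted.has(a.id));
    },
  };
  return db;
}

// N+1 pattern: one query per post inside the loop.
async function attachAuthorsNaive(db, posts) {
  const result = [];
  for (const post of posts) {
    const author = await db.findAuthorById(post.authorId);
    result.push({ ...post, author });
  }
  return result;
}

// Batched pattern: collect the ids, fetch once, join in memory.
async function attachAuthorsBatched(db, posts) {
  const ids = [...new Set(posts.map((p) => p.authorId))];
  const authors = await db.findAuthorsByIds(ids);
  const byId = new Map(authors.map((a) => [a.id, a]));
  return posts.map((post) => ({ ...post, author: byId.get(post.authorId) ?? null }));
}
```

With a handful of rows both versions look equally fast, which is why this kind of bug tends to pass tests on small fixtures. Asserting on the query count in a test with a realistic number of rows is one way to catch it before production does.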
03.Always Expect Network Failures
Every external call — database, Redis, third-party API — will fail at some point. Design for it from day one. Use retry logic with exponential backoff and jitter. Set explicit timeouts on every HTTP client call. Wrap external dependencies in circuit breakers. Log every failure with enough context to diagnose it without reproducing it.
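Of the techniques above, the circuit breaker is the least obvious to implement, so here is a minimal sketch. The class name, option names, and defaults are illustrative assumptions and do not match any specific library's API; in production you would more likely reach for an existing package than hand-roll this.

```javascript
// Minimal circuit breaker sketch with three implicit states:
// closed (calls pass), open (calls fail fast), half-open (calls pass again
// after the reset timeout, and a single failure reopens the circuit).
class CircuitBreaker {
  constructor(fn, { failureThreshold = 3, resetTimeoutMs = 10000, now = Date.now } = {}) {
    this.fn = fn;
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, which makes the breaker testable
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  async call(...args) {
    if (this.openedAt !== null && this.now() - this.openedAt < this.resetTimeoutMs) {
      // Open: fail fast instead of piling more load on a failing dependency.
      throw new Error("circuit open");
    }
    try {
      const result = await this.fn(...args);
      // Success closes the circuit and clears the failure count.
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // open (or reopen) the circuit
      }
      throw err;
    }
  }
}
```

Wrap each dependency in its own breaker, for example `new CircuitBreaker(fetchUser)` for a hypothetical `fetchUser` function, so that one failing service does not block calls to healthy ones.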
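import axios from "axios";
import axiosRetry from "axios-retry";

// axios has no built-in retry options; the axios-retry plugin adds them.
axiosRetry(axios, {
  retries: 3,
  // exponential backoff plus up to 100 ms of random jitter
  retryDelay: (attempt) => Math.pow(2, attempt) * 100 + Math.random() * 100,
});
const response = await axios.get(url, { timeout: 5000 }); // never omit the timeout
04.Distributed Tracing Is Not Optional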
In a microservices environment, a single user request may touch 5–10 services. When it fails, you need a trace ID that follows the request across all of them. Use AWS X-Ray, OpenTelemetry, or Datadog APM. Inject the trace ID into every log line. This can turn a 6-hour debugging session into a 10-minute one.
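To show what "inject the trace ID into every log line" means mechanically, here is a minimal sketch using Node's built-in `AsyncLocalStorage`. It is not a replacement for the tools named above. The `x-trace-id` header and the function names are assumptions made for illustration; OpenTelemetry, for example, propagates context with the W3C `traceparent` header instead.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");
const { randomUUID } = require("node:crypto");

// Holds the trace context for the request currently being handled.
const traceContext = new AsyncLocalStorage();

// Run `fn` with a trace ID: reuse the incoming one if the caller sent it,
// otherwise start a new trace.
function runWithTrace(incomingHeaders, fn) {
  const traceId = incomingHeaders["x-trace-id"] || randomUUID();
  return traceContext.run({ traceId }, fn);
}

// Build a JSON log line that picks up the current trace ID automatically.
function logLine(message, fields = {}) {
  const traceId = traceContext.getStore()?.traceId ?? null;
  return JSON.stringify({ traceId, message, ...fields });
}

// Headers to attach to outgoing calls so the next service continues the trace.
function outgoingHeaders() {
  const traceId = traceContext.getStore()?.traceId;
  return traceId ? { "x-trace-id": traceId } : {};
}
```

A request handler would wrap its work in `runWithTrace(request.headers, ...)`, print with `console.log(logLine("fetching user", { userId }))`, and pass `outgoingHeaders()` to every downstream call. Searching the logs for one trace ID then returns the request's path across services.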
05.Post-Mortems Without Blame
After every significant incident, write a post-mortem. Not to assign blame — to understand the system failure. What failed? Why did it fail? Why didn't our monitoring catch it? What would have prevented it? The goal is to make the system more resilient, not to find a scapegoat. The best engineering cultures treat incidents as learning opportunities.
- Timeline of what happened and when
- Root cause (systems, not people)
- Detection gap: why didn't we know sooner?
- Action items with owners and deadlines
- Follow-up review in 2 weeks