Incident Response Playbook: When Your App Goes Down at 2 AM

Veld Systems | 7 min read

Your phone buzzes at 2 AM. The monitoring alert says your API is returning 500 errors at a 40% rate and climbing. Users are hitting your support channels. Revenue is bleeding every minute the system is down. What you do in the next 30 minutes determines whether this is a minor blip or a company-defining crisis.

We have handled production incidents for systems serving hundreds of thousands of users. The difference between teams that resolve incidents in 15 minutes and teams that fumble for hours is never raw technical skill. It is preparation. Teams with a rehearsed playbook stay calm, move fast, and fix the problem. Teams without one panic, chase the wrong leads, and make things worse.

This is the playbook we use. Adapt it to your team and your systems.

Phase 1: Detection (The First 5 Minutes)

The best incident response starts before the incident. If you find out your app is down because a customer tweets about it, you have already failed the detection phase. Automated monitoring should alert you before any user notices.

Your detection stack needs:

- Uptime monitoring that pings critical endpoints every 30 to 60 seconds

- Error rate alerting that fires when 5xx errors exceed a threshold (we use 1% as the trigger)

- Latency alerting that fires when p95 response times exceed acceptable limits

- Infrastructure alerting for CPU, memory, disk, and database connection saturation

- Business metric alerting for sudden drops in signups, transactions, or other key metrics

If you do not have monitoring and observability in place, stop reading this playbook and set that up first. You cannot respond to incidents you do not know about.
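That 1% error-rate trigger can be expressed as a sliding-window check. A minimal sketch, using only the standard library; the class name, window size, and threshold are illustrative, not any particular monitoring product's API:

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the share of 5xx responses in a sliding window exceeds a threshold."""

    def __init__(self, window_size=1000, threshold=0.01):
        self.samples = deque(maxlen=window_size)  # most recent status codes
        self.threshold = threshold

    def record(self, status_code):
        self.samples.append(status_code)

    def should_alert(self):
        if not self.samples:
            return False
        errors = sum(1 for code in self.samples if code >= 500)
        return errors / len(self.samples) > self.threshold
```

Note that the threshold is a strict "greater than": a window sitting exactly at 1% does not fire, which avoids flapping right at the boundary.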

When the alert fires, the first responder acknowledges it immediately. This is not about fixing anything yet. It is about telling the team: someone is on it. In Slack, PagerDuty, or whatever your alerting tool is, acknowledge the alert so that the escalation chain does not keep firing.

Phase 2: Triage (Minutes 5 Through 15)

Triage has one goal: understand the scope and severity of the incident. Do not start fixing things yet. Gather information first.

Check the basics in order:

1. Is the application running? Check your deployment platform. Are all instances healthy? Did a recent deployment happen?

2. Is the database accessible? Check connection counts, query latency, and replication lag. The database is the root cause of the majority of production outages we see.

3. Are external dependencies up? Check the status pages of your payment processor, email provider, CDN, and any third-party APIs you depend on. If Stripe is down, your checkout is down, and there is nothing you can fix on your end.

4. What changed recently? Check deployment history, configuration changes, and infrastructure modifications. In our experience, 80% of production incidents are caused by a change deployed in the last 24 hours.
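The first of those checks can be scripted so the responder is not typing URLs into a browser at 2 AM. A sketch using only the standard library; the endpoints here are hypothetical placeholders for your own services and status pages:

```python
import urllib.request
import urllib.error

# Hypothetical endpoints; substitute your own services and status pages.
CHECKS = {
    "app":      "https://example.com/healthz",
    "database": "https://example.com/healthz/db",
    "payments": "https://status.example-processor.com/",
}

def check(name, url, timeout=5):
    """Return (name, ok, detail) without raising, so one failure doesn't stop triage."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return (name, resp.status < 400, f"HTTP {resp.status}")
    except urllib.error.URLError as exc:
        return (name, False, str(exc.reason))

def triage():
    return [check(name, url) for name, url in CHECKS.items()]

if __name__ == "__main__":
    for name, ok, detail in triage():
        print(f"{'OK  ' if ok else 'FAIL'} {name}: {detail}")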

Assign a severity level:

- SEV1: Total outage. The application is completely unavailable. All users are affected. Revenue is stopping. This gets all hands on deck immediately.

- SEV2: Major degradation. Core functionality is impaired but the application is partially usable. Some users are affected. This gets the on-call engineer plus one backup.

- SEV3: Minor degradation. A non-critical feature is broken or performance is degraded. Most users are unaffected. This gets the on-call engineer.
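The mapping from scope to severity is simple enough to encode, which removes one judgment call during the incident. A sketch; the two inputs are still the responder's assessment:

```python
SEVERITY_RESPONSE = {
    "SEV1": "all hands on deck",
    "SEV2": "on-call engineer plus one backup",
    "SEV3": "on-call engineer",
}

def classify(total_outage: bool, core_impaired: bool) -> str:
    """Map the responder's assessment to a severity level."""
    if total_outage:
        return "SEV1"
    if core_impaired:
        return "SEV2"
    return "SEV3"
```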

Communicate immediately. Post a status update on your status page and in your internal incident channel. Even if you do not know the cause yet, communicate that you are aware and investigating. Silence during an outage is worse than admitting you do not have answers yet.

Phase 3: Containment (Minutes 15 Through 30)

Containment is about stopping the bleeding, not performing surgery. The goal is to get the system back to a stable state, even if it means reduced functionality.

Your containment options, in order of preference:

Roll back the last deployment. If a deployment happened in the last 24 hours and correlates with the incident start time, roll it back. Do not debug the deployment during an outage. Roll back first, investigate later. This single action resolves the majority of production incidents.

Scale up resources. If the database is connection-saturated, increase the instance size or connection pool limits. If application servers are CPU-bound, add more instances. This buys time while you identify the root cause.

Disable non-critical features. If a specific feature is causing cascading failures (a runaway background job, a broken webhook handler, a misbehaving third-party integration), disable it via feature flag. Keep the core application running.

Failover to a secondary region. If you have multi-region infrastructure, trigger a failover. This is the nuclear option, but it works when an entire region is having issues.

Enable maintenance mode. As a last resort, show a maintenance page. This is better than showing error pages because it communicates intentional action rather than system failure.
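Feature-flag containment only works if flags can be flipped without a deploy. A minimal sketch of a flag store; real systems typically back this with a database or a flag service, and the file path here is purely illustrative:

```python
import json
import os
import tempfile

# Illustrative flag store; in production this would live in a database
# or a dedicated flag service, not a local temp file.
FLAGS_PATH = os.path.join(tempfile.gettempdir(), "feature_flags.json")

def load_flags():
    try:
        with open(FLAGS_PATH) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def set_flag(name, enabled):
    """Flip a flag at runtime, no deploy required."""
    flags = load_flags()
    flags[name] = enabled
    with open(FLAGS_PATH, "w") as f:
        json.dump(flags, f)

def is_enabled(name, default=True):
    """Unknown flags default to enabled so new code ships on."""
    return load_flags().get(name, default)
```

During containment, `set_flag("broken_webhooks", False)` takes the misbehaving feature out of the request path while the core application keeps running.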

Phase 4: Resolution

Once the system is stable (even if degraded), you can investigate and fix the root cause properly. This phase can take minutes or hours depending on the issue.

Work methodically:

- Check logs. Application logs, database logs, infrastructure logs. Look for error patterns that correlate with the incident start time. Your observability stack should make this fast.

- Reproduce the issue in staging if possible before applying a fix to production.

- Test the fix before deploying. Do not deploy untested code to production during an incident. That is how you turn a SEV2 into a SEV1.

- Deploy the fix and monitor closely for 15 to 30 minutes afterward to confirm the issue is resolved.

- Restore full functionality by re-enabling any features you disabled during containment.
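Correlating log errors with the incident start time can itself be scripted. A sketch that assumes a hypothetical `TIMESTAMP LEVEL message` log format; adjust the parsing to whatever your own logs look like:

```python
from collections import Counter
from datetime import datetime, timedelta

def errors_near(log_lines, incident_start, window_minutes=10):
    """Count ERROR messages within a window around the incident start.

    Assumes lines like '2024-05-01T02:03:04 ERROR message...' (a
    hypothetical format; adapt the parsing to your own logs).
    """
    lo = incident_start - timedelta(minutes=window_minutes)
    hi = incident_start + timedelta(minutes=window_minutes)
    counts = Counter()
    for line in log_lines:
        try:
            stamp, level, message = line.split(" ", 2)
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that don't match the expected format
        if level == "ERROR" and lo <= ts <= hi:
            counts[message.strip()] += 1
    return counts.most_common()
```

The most frequent message in the window is usually the place to start reading, not necessarily the root cause itself.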

Phase 5: Communication

Throughout the incident, keep three audiences informed:

Your team. Use a dedicated incident channel. Post updates every 15 minutes, even if the update is "still investigating." This prevents people from interrupting the responders with status requests.

Your customers. Update your status page when the incident starts, when you have containment, when you deploy a fix, and when the incident is fully resolved. Be honest about what happened. Customers forgive outages. They do not forgive dishonesty.

Your stakeholders. After the incident is resolved, send a brief summary to leadership: what happened, how long it lasted, what the impact was, and what you are doing to prevent recurrence.

Phase 6: Post-Mortem

Every SEV1 and SEV2 incident gets a post-mortem within 48 hours. This is non-negotiable. Without it, you will have the same incident again.

The post-mortem document includes:

- Timeline of the incident from first alert to full resolution

- Root cause identified and verified

- Contributing factors that made the incident worse or delayed resolution

- What went well in the response

- What went poorly in the response

- Action items with owners and deadlines

The most important rule of post-mortems: they are blameless. The goal is to improve systems and processes, not to punish individuals. If someone deployed a bad change, the question is not "why did they deploy bad code?" but "why did our testing, review, and deployment process allow bad code to reach production?"

Action items from post-mortems must be tracked and completed. A post-mortem that generates action items nobody follows up on is worse than no post-mortem at all, because it creates the illusion of improvement.

Building the Playbook Before You Need It

Do not wait for an outage to create your incident response process. Build it now, while things are calm.

Write a runbook for every critical system. For each service, document: how to check its health, how to restart it, how to roll back its deployment, how to scale it, and who owns it. Store these runbooks somewhere accessible during an outage (not behind the authentication system that might also be down).

Establish an on call rotation. Someone needs to be reachable 24/7. Rotate weekly to prevent burnout. Provide compensatory time off for overnight incidents.

Run game days. Intentionally break things in staging (or production, if you are brave enough for chaos engineering) and practice your incident response. The first time your team runs through the playbook should not be during a real outage.

Automate common fixes. If you have rolled back a deployment during three separate incidents, build a one-click rollback button. If database connection saturation has caused two outages, build an auto-scaling rule. Every incident is a signal pointing at an automation opportunity.
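A one-click rollback can be as small as a wrapper around your deployment CLI. A sketch; `mydeploy` and its `releases rollback` subcommand are placeholders for whatever tool your platform actually uses:

```python
import subprocess

def rollback(service, deploy_cli="mydeploy"):
    """One-click rollback: redeploy the previous release of a service.

    `mydeploy` is a placeholder for your platform's CLI (e.g. a wrapper
    around kubectl, your PaaS tool, or your CI system).
    """
    cmd = [deploy_cli, "releases", "rollback", service]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"rollback of {service} failed: {result.stderr.strip()}")
    return result.stdout
```

Wire this to a Slack slash command or a big button in your internal dashboard, and the 2 AM responder no longer needs to remember the exact CLI incantation.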

The Cost of Not Having a Playbook

We worked with a company that had a 6-hour outage because nobody knew who was supposed to respond, nobody knew how to roll back the deployment, and the one person who understood the infrastructure was on a flight. Six hours of downtime, over $200,000 in lost revenue, and weeks of customer trust recovery. The engineering effort to prevent that was about 2 days of playbook documentation and automation.

If your team does not have an incident response playbook, you are one bad deployment away from the same story. We build these systems as part of our ongoing management engagements because we know from experience that operational readiness is not optional for production systems.

Do not wait for the 2 AM wake-up call to start thinking about incident response. Talk to us about building a response framework that keeps your systems, and your team, resilient when things go wrong.
