Zero Downtime Deployments: How to Ship Without Downtime

Veld Systems · 7 min read

Every deployment is a risk. Traditional deployments involve stopping the old version, starting the new one, and hoping everything works. During that window, your users see errors, your API returns 502s, and your team sweats through a Slack thread. For a consumer product, even 30 seconds of downtime during peak hours can mean lost revenue and frustrated users. For a B2B SaaS product, downtime means breached SLAs and eroded trust.

Zero downtime deployment eliminates that window entirely. Users never see an error page. Traffic flows continuously. The new version rolls out while the old version is still serving requests. Here is how to make it work.

Why Traditional Deployments Cause Downtime

The typical deployment process looks like this: pull the new code, stop the application process, run database migrations, start the application with the new code. Between the stop and start, there is a gap. Depending on your stack, that gap lasts anywhere from 2 seconds to 5 minutes.

Even with a "quick restart" approach, there is a period where in-flight requests are dropped, WebSocket connections are severed, and background jobs may fail mid-execution. If the new version has a startup error, the gap extends until someone notices and rolls back.

The core problem is that deployment and availability are coupled. Zero downtime deployment decouples them by ensuring the old version keeps running until the new version is verified and ready to serve traffic.

Deployment Strategies

Blue-Green Deployment

You maintain two identical environments: Blue (currently serving traffic) and Green (idle). You deploy the new version to Green, run smoke tests against it, then switch the load balancer to point at Green. Blue becomes idle and serves as your instant rollback target.
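The cutover logic can be sketched in a few lines. This is a minimal illustration with a hypothetical LoadBalancer abstraction; in practice the "switch" is a target group change or DNS update, not application code.

```python
# Minimal sketch of a blue-green cutover. The LoadBalancer and Environment
# classes are hypothetical stand-ins; real setups flip a target group or
# DNS record rather than an in-process flag.

class Environment:
    def __init__(self, version):
        self.version = version

    def handle(self, request):
        return f"{self.version}: {request}"

class LoadBalancer:
    def __init__(self, blue, green):
        self.pools = {"blue": blue, "green": green}
        self.active = "blue"  # Blue serves traffic initially

    def route(self, request):
        # All traffic goes to whichever environment is active
        return self.pools[self.active].handle(request)

    def cut_over(self):
        # Atomic switch: the idle environment starts serving,
        # the other becomes the instant rollback target
        self.active = "green" if self.active == "blue" else "blue"

lb = LoadBalancer(Environment("v1"), Environment("v2"))
before = lb.route("GET /")    # served by Blue (v1)
lb.cut_over()                 # smoke tests passed; flip traffic to Green
after = lb.route("GET /")     # served by Green (v2)
lb.cut_over()                 # instant rollback: flip back to Blue
rolled_back = lb.route("GET /")
```

The key property is that the switch is a single atomic operation, so rollback is the same operation run in reverse.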

Advantages: Instant rollback (just switch the load balancer back), full environment isolation, and you can run comprehensive tests against the new version before any user sees it.

Tradeoffs: You need double the infrastructure (though only one environment carries production load at a time), and database migrations need special handling since both environments share the same database.

Best for: Applications where rollback speed is critical and you can afford the infrastructure overhead. This is the strategy we use most often for client projects because the rollback story is clean.

Rolling Deployment

If you run multiple instances of your application behind a load balancer, you update them one at a time. The load balancer routes traffic away from the instance being updated, the instance restarts with the new version, and the load balancer adds it back to the pool. Repeat for each instance.
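That loop can be sketched directly. This is a simplified illustration with hypothetical Instance objects; orchestrators like Kubernetes automate exactly this drain, update, verify, re-add sequence.

```python
# Sketch of a rolling update loop. Instance and its fields are hypothetical;
# a real orchestrator handles draining and health-checking for you.

class Instance:
    def __init__(self, version):
        self.version = version
        self.in_pool = True

def rolling_update(instances, new_version, health_check):
    updated = []
    for instance in instances:
        instance.in_pool = False         # load balancer stops routing here
        instance.version = new_version   # restart with the new code
        if not health_check(instance):
            # Pause the rollout; already-updated instances keep serving
            return updated
        instance.in_pool = True          # healthy: back into the pool
        updated.append(instance)
    return updated

fleet = [Instance("v1") for _ in range(3)]
rolling_update(fleet, "v2", health_check=lambda i: True)
versions = [i.version for i in fleet]    # every instance now runs v2
```

Note that a failed health check leaves the fleet in a mixed state, which is why backward compatibility between versions is a hard requirement for this strategy.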

Advantages: No extra infrastructure needed, works naturally with container orchestrators like Kubernetes or ECS, and you get a gradual rollout that can be paused if issues appear.

Tradeoffs: During the rollout, some users hit the old version and some hit the new version simultaneously. If the versions are not backward compatible, this causes errors. Rollback is slower because you have to roll the old version back out across every instance.

Best for: Stateless applications with backward compatible releases running on container orchestration platforms. For more on choosing between orchestration approaches, see our comparison of serverless vs. Kubernetes.

Canary Deployment

Route a small percentage of traffic (1% to 10%) to the new version while the rest continues hitting the old version. Monitor error rates, response times, and business metrics on the canary. If everything looks good, gradually increase the percentage until 100% of traffic is on the new version.
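One common way to implement the split is to hash a stable key, such as the user ID, into buckets, so each user consistently lands on the same version throughout the rollout. A minimal sketch (the function name and key choice are illustrative):

```python
# Sketch of deterministic canary routing: hash a stable key into 100
# buckets so a given user always hits the same version at a given
# rollout percentage (no request-to-request flapping).
import hashlib

def pick_version(user_id: str, canary_percent: int) -> str:
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] % 100   # stable bucket in 0..99 for this user
    return "canary" if bucket < canary_percent else "stable"
```

Increasing canary_percent from 1 toward 100 gradually moves users onto the new version without ever bouncing a single user back and forth between versions.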

Advantages: Minimizes blast radius. If the new version has a bug, only a small percentage of users are affected. Provides real production validation before full rollout.

Tradeoffs: Requires sophisticated traffic routing (weighted load balancing), good monitoring to detect issues at low traffic volumes, and the ability to quickly route traffic away from the canary. Both versions must be backward compatible.

Best for: High traffic applications where even a brief full rollout of a buggy version would affect thousands of users.

Database Migrations: The Hard Part

The deployment strategy handles your application code. Database migrations are where zero downtime deployments get genuinely difficult. A standard migration that locks a table or changes a column type will cause downtime regardless of your deployment strategy.

Rules for Zero Downtime Migrations

Never rename a column in a single step. Instead: add the new column, backfill data, update the application to write to both columns, deploy, then remove the old column in a later release.

Never drop a column in the same release that stops using it. The old version of your application (still running during the rollout) will try to read from or write to the dropped column and fail. Drop columns in a subsequent release after the previous version is fully retired.

Never run long-running locking operations on large tables. Adding an index to a table with millions of rows can lock it for minutes. Use CREATE INDEX CONCURRENTLY (Postgres) or your database's equivalent non-locking syntax.

Never change a column type in place. The same add, backfill, migrate, drop pattern applies.

The general principle is expand then contract: first expand your schema to support both old and new, deploy the new application code, then contract the schema by removing the old columns or tables. This adds a release cycle to any schema change but eliminates the risk of downtime.
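As a concrete illustration, here is the expand-then-contract plan for renaming a column, spread across three releases. The table and column names (users.fullname to users.display_name) are hypothetical:

```python
# Sketch of expand-then-contract for renaming users.fullname to
# users.display_name (hypothetical table and columns). Each phase ships
# in its own release so every running app version always finds the
# columns it expects.

EXPAND = [
    # Release 1: add the new column and backfill; app code writes to BOTH
    # columns but still reads from the old one
    "ALTER TABLE users ADD COLUMN display_name TEXT",
    "UPDATE users SET display_name = fullname WHERE display_name IS NULL",
]

MIGRATE = [
    # Release 2: no schema change; app code now reads and writes
    # display_name exclusively
]

CONTRACT = [
    # Release 3: no running version touches fullname anymore; safe to drop
    "ALTER TABLE users DROP COLUMN fullname",
]

migration_plan = EXPAND + MIGRATE + CONTRACT
```

At no point does any deployed version of the application reference a column that does not exist, which is exactly the property that makes the rollout downtime-free.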

Health Checks and Readiness Probes

Your load balancer needs to know when an instance is ready to receive traffic. This requires two types of health checks:

Liveness check. Is the process running? This is a simple HTTP endpoint that returns 200 if the application is alive. If it fails, the instance should be restarted.

Readiness check. Is the process ready to handle requests? This is more nuanced. It should verify database connectivity, cache availability, and any required external dependencies. The load balancer should not route traffic to an instance that is alive but not ready (for example, still warming caches or waiting for a database connection).

Do not skip readiness checks. Without them, the load balancer will send traffic to an instance that just started but has not finished initialization, resulting in a burst of errors during deployment.
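The two checks can be sketched as minimal endpoint handlers. The function names and dependency checks here are illustrative; a real service would probe its actual database and cache clients:

```python
# Sketch of liveness vs. readiness handlers returning (status, body)
# pairs. The checks dict is hypothetical; real probes would run e.g.
# SELECT 1 against the database or PING against Redis.

def liveness():
    # If we can execute this at all, the process is alive
    return 200, "alive"

def readiness(checks):
    # Only report ready when every dependency check passes; the load
    # balancer keeps traffic away until then
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        return 503, "not ready: " + ", ".join(failed)
    return 200, "ready"

checks = {
    "database": lambda: True,   # connection pool established
    "cache": lambda: False,     # still warming up in this example
}
status, body = readiness(checks)   # 503 until the cache comes up
```

The asymmetry matters: a failed liveness check means restart the instance, while a failed readiness check just means keep traffic away for now.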

Monitoring During Deployment

Zero downtime deployment does not mean zero risk deployment. You still need to watch for issues as the new version takes traffic. At minimum, monitor these during every deployment:

- Error rate: Any increase above the baseline should pause the rollout

- Response time: P50 and P99 latency. A spike in P99 often indicates a performance regression

- Business metrics: Conversion rate, checkout completions, API success rates

- Resource utilization: CPU, memory, and database connection counts

Automated rollback based on these metrics is the gold standard. If error rate exceeds 1% or P99 latency doubles, the system automatically routes traffic back to the old version. This requires robust monitoring and observability infrastructure, but it transforms deployments from high stress events into routine operations.
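The decision rule from those thresholds is simple enough to sketch directly (the thresholds match the ones above; the function name is illustrative):

```python
# Sketch of the automated rollback decision: error rate above 1% or
# P99 latency at least double the baseline triggers a rollback.

def should_rollback(baseline_p99_ms, current_p99_ms, current_error_rate):
    if current_error_rate > 0.01:              # error rate exceeds 1%
        return True
    if current_p99_ms >= 2 * baseline_p99_ms:  # P99 latency doubled
        return True
    return False
```

In a real pipeline this runs on a short evaluation loop against your metrics backend, and a True result routes traffic back to the old version without waiting for a human.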

Practical Implementation on AWS

For teams running on AWS, here is a concrete implementation path:

ECS with blue-green deployment via CodeDeploy. Define two target groups in your Application Load Balancer. CodeDeploy manages the cutover, routing traffic from the old target group to the new one. You configure test listeners, health check grace periods, and automatic rollback triggers. The entire process is managed and repeatable.

Lambda with versioned aliases. Each deployment creates a new version. A weighted alias shifts traffic gradually from the old version to the new one. Rollback is instant by pointing the alias back to the old version.

For infrastructure cost considerations, which become relevant when running double environments for blue-green deployments, our guide on reducing AWS cloud costs covers strategies that keep zero downtime deployment affordable.

The CI/CD Pipeline

Zero downtime deployment only works if it is automated. Manual deployments with zero downtime strategies are possible but error prone and stressful. A proper CI/CD pipeline for zero downtime deployment includes:

1. Code pushed to main branch triggers the pipeline

2. Automated tests run (unit, integration, end-to-end)

3. New version is built and containerized

4. Container is deployed to the staging environment

5. Smoke tests run against staging

6. Deployment to production begins (blue-green, rolling, or canary)

7. Health checks verify the new version

8. Traffic shifts to the new version

9. Old version is kept alive for a rollback window (15 to 30 minutes)

10. Old version is decommissioned

The entire process should complete in under 15 minutes for most applications and require zero human intervention for routine deployments.
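The pipeline above is a sequence of gates: each step must succeed before the next runs, and a failure after traffic has shifted should trigger an automatic rollback rather than leave the deploy half-finished. A minimal sketch of that control flow (step names and the runner are hypothetical):

```python
# Sketch of pipeline gating: run steps in order, stop at the first
# failure, and roll back automatically if the failure happens after
# traffic has already shifted to the new version.

def run_pipeline(steps, rollback):
    completed = []
    for name, step in steps:
        if not step():
            if "shift_traffic" in completed:
                rollback()            # old version is still alive: route back
            return completed, name    # report which gate failed
        completed.append(name)
    return completed, None            # all gates passed

events = []
steps = [
    ("tests", lambda: True),
    ("build", lambda: True),
    ("deploy_staging", lambda: True),
    ("smoke_tests", lambda: True),
    ("deploy_production", lambda: True),
    ("health_checks", lambda: True),
    ("shift_traffic", lambda: True),
    ("verify_metrics", lambda: False),  # regression detected post-shift
]
done, failed = run_pipeline(steps, rollback=lambda: events.append("rollback"))
```

Keeping the old version alive through step 9 is what makes the automatic rollback in this sketch possible at all.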

What It Takes to Set This Up

Setting up zero downtime deployment from scratch typically takes 1 to 3 weeks depending on your current infrastructure. If you are already running on containers with a load balancer, it is closer to a week. If you are deploying to a single server with manual SSH commands, expect 3 weeks to containerize, set up orchestration, and build the pipeline.

This is core infrastructure work that pays dividends on every subsequent deployment. Instead of a stressful, error prone event that happens after hours, deployments become a non event that happens multiple times per day.

Our cloud and DevOps service includes zero downtime deployment setup as a standard part of infrastructure buildouts. If you want to stop scheduling deployments around low traffic windows and start shipping with confidence, get in touch. We will get your pipeline to a point where deploying is boring, and boring is exactly what you want.
