Database Backup Strategy That Actually Works

Veld Systems | 7 min read

Most companies think they have a backup strategy. Someone set up automated snapshots when the database was first provisioned, saw a green checkmark in the console, and never thought about it again. Then something goes wrong: a bad migration wipes a table, a disgruntled employee deletes records, or a cloud region goes offline, and they discover their "strategy" has not actually been tested since the day it was created.

We have been on the receiving end of those calls more than once. The pattern is always the same: panic, followed by the realization that backups exist but nobody knows if they work, how old they are, or how long a restore will take. A backup you have never restored is not a backup. It is a hope.

The Three Pillars of a Real Backup Strategy

Every production database needs three layers of protection: automated snapshots, continuous replication, and tested restores. Miss any one of these and you have a gap that will eventually cost you.

Automated snapshots are the baseline. Every managed database service, whether it is AWS RDS, Supabase, or Google Cloud SQL, offers point-in-time snapshots. Set these to run every hour at minimum for production systems. Daily snapshots are not enough when your business processes thousands of transactions per day. An 8-hour-old snapshot means 8 hours of lost data, and for many businesses that is unrecoverable.

Continuous replication adds a second layer. This means streaming your write-ahead log (WAL) to a separate storage location, ideally in a different region. With PostgreSQL, you can configure WAL archiving to ship logs to S3 or equivalent object storage. This gives you point-in-time recovery down to the second, not just to the last snapshot. If your snapshot is from 2 AM and the incident happens at 10 AM, WAL replay bridges that gap.
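For a self-managed PostgreSQL instance, the core of this is a handful of settings in postgresql.conf. A minimal sketch, assuming the AWS CLI is installed on the database host and using a placeholder bucket name:

```
# postgresql.conf -- minimal WAL archiving to S3 (bucket name is a placeholder)
wal_level = replica        # emit enough WAL detail for archive recovery
archive_mode = on          # changing this requires a server restart
archive_command = 'aws s3 cp %p s3://example-wal-archive/%f'   # %p = file path, %f = file name
archive_timeout = 60       # switch segments at least every 60 seconds so archives stay fresh
```

Managed services handle this for you; the settings above apply to databases you run yourself. In production you would reach for a tool like pgBackRest or WAL-G rather than a raw archive_command, since those add compression, encryption, and retry logic.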

Tested restores are the layer everyone skips. We schedule restore drills quarterly for every production system we manage. The drill is simple: take the most recent backup, restore it to a staging environment, run validation queries against it, and measure how long the entire process takes. If you cannot restore your backup under pressure at 2 AM, the drill will show you that before a real incident does.

Designing Your Snapshot Schedule

Not all data has the same value or the same rate of change. A user accounts table that changes a few times per day needs a different snapshot cadence than a transactions table that processes hundreds of writes per minute.

Tier your backup frequency by data criticality:

- Critical data (transactions, payments, user records): Hourly snapshots plus continuous WAL archiving

- Important data (application state, configurations, content): Every 6 hours plus daily full backups

- Archival data (logs, analytics, historical reports): Daily snapshots with 30-day retention

Most cloud providers charge by storage volume, not snapshot frequency, so the cost difference between daily and hourly snapshots is negligible. The cost of losing 23 hours of data versus losing 1 hour of data is not.

Retention policy matters just as much as frequency. We typically recommend keeping hourly snapshots for 7 days, daily snapshots for 30 days, and weekly snapshots for 90 days. For regulated industries, extend that to match your compliance requirements. Storage is cheap. Explaining to an auditor why you cannot produce records from 60 days ago is expensive.
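That tiered retention policy can be expressed as a small pruning function. A sketch, assuming snapshots are identified only by their creation timestamps, that daily snapshots land at midnight, and that the weekly keeper is Monday's:

```python
from datetime import datetime, timedelta

def snapshots_to_keep(snapshots, now):
    """Apply a 7d-hourly / 30d-daily / 90d-weekly retention policy.

    `snapshots` is a list of datetime objects (snapshot creation times).
    Returns the sorted subset that should be retained.
    """
    keep = set()
    for ts in snapshots:
        age = now - ts
        if age <= timedelta(days=7):
            keep.add(ts)  # keep every snapshot for 7 days
        elif age <= timedelta(days=30) and ts.hour == 0:
            keep.add(ts)  # keep the midnight snapshot for 30 days
        elif age <= timedelta(days=90) and ts.hour == 0 and ts.weekday() == 0:
            keep.add(ts)  # keep Monday's midnight snapshot for 90 days
    return sorted(keep)
```

Run something like this against your snapshot inventory on a schedule and delete whatever it does not return; for regulated industries, stretch the windows to match your compliance requirements.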

Offsite and Cross Region Replication

If all your backups live in the same region as your primary database, a regional outage takes out everything: your data and your ability to recover it. This is not theoretical. AWS us-east-1 has had multiple significant outages in recent years, and every one of them reminded teams that single-region backup strategies are incomplete.

Cross-region replication copies your backups to a geographically separate location. For AWS, this means replicating snapshots from us-east-1 to us-west-2. For multi-cloud setups, it means copying to a completely different provider.

The implementation is straightforward. Most managed services support cross-region snapshot copy as a built-in feature. For custom PostgreSQL deployments, tools like pgBackRest handle this with a few configuration lines. The key decisions are:

- Which regions? Pick regions that are geographically distant enough that a natural disaster or network event would not hit both simultaneously

- Encryption in transit and at rest? Always. No exceptions.

- Replication lag tolerance? For disaster recovery, a few minutes of lag is acceptable. For compliance, you may need synchronous replication
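For a self-managed PostgreSQL instance, pgBackRest's multi-repository support (version 2.33 and later) covers the cross-region copy: each backup and archived WAL segment is written to both repositories. A sketch of the configuration, with bucket names, regions, and the stanza name as placeholders, and credentials and cipher settings omitted:

```
# /etc/pgbackrest/pgbackrest.conf -- placeholder names; add credentials and
# repo cipher settings before using anything like this in production
[global]
repo1-type=s3
repo1-s3-bucket=example-backups-primary
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.us-east-1.amazonaws.com
repo1-retention-full=4

repo2-type=s3
repo2-s3-bucket=example-backups-dr
repo2-s3-region=us-west-2
repo2-s3-endpoint=s3.us-west-2.amazonaws.com
repo2-retention-full=4

[main]
pg1-path=/var/lib/postgresql/16/main
```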

If your cloud infrastructure does not include cross region backup replication today, that should be the first thing you fix after reading this post.

Automating Restore Validation

Manual restore testing is better than no testing, but it depends on someone remembering to do it. Automate the entire process instead.

Here is what an automated restore validation pipeline looks like:

1. Trigger on a weekly or monthly schedule via cron or your CI/CD system

2. Provision a temporary database instance from the latest backup

3. Run validation queries that check row counts, data integrity, and schema correctness against known baselines

4. Measure restore time and compare it against your recovery time objective (RTO)

5. Alert if any validation fails or restore time exceeds your target

6. Tear down the temporary instance to avoid cost accumulation
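Steps 3 through 5 boil down to comparing the restored instance against known baselines. A minimal sketch of that comparison, assuming the row counts have already been queried from the restored database; the table names and tolerance are illustrative:

```python
def validate_restore(row_counts, baselines, restore_minutes, rto_minutes,
                     tolerance=0.10):
    """Compare a restored database against known-good baselines.

    row_counts: {table_name: rows} measured on the restored instance.
    baselines:  {table_name: rows} recorded from the last healthy check.
    Returns a list of human-readable failures; an empty list means the
    drill passed and no alert is needed.
    """
    failures = []
    for table, expected in baselines.items():
        actual = row_counts.get(table)
        if actual is None:
            failures.append(f"table missing after restore: {table}")
        elif expected and abs(actual - expected) / expected > tolerance:
            failures.append(
                f"row count drift in {table}: {actual} vs baseline {expected}")
    if restore_minutes > rto_minutes:
        failures.append(
            f"restore took {restore_minutes} min, RTO is {rto_minutes} min")
    return failures
```

Wire the returned list into whatever alerting you already have; a non-empty result is exactly the signal you want long before a real incident.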

This pipeline costs almost nothing to run (a few dollars per month in compute for the temporary instances) and gives you continuous confidence that your backups are viable. If you already have a CI/CD pipeline, adding restore validation as another stage is a natural extension.

Recovery Time and Recovery Point Objectives

Two numbers define whether your backup strategy is adequate: RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

RTO is how long you can afford to be down. If your e-commerce platform generates $10,000 per hour in revenue, a 4-hour restore time means $40,000 in lost sales plus the reputational damage of extended downtime.

RPO is how much data you can afford to lose. If your RPO is 1 hour but your snapshots only run daily, there is a 23-hour gap between what you promised and what you can deliver.
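The arithmetic behind both numbers is simple enough to sanity-check in a few lines, using the figures from the paragraphs above:

```python
def downtime_cost(rto_hours, revenue_per_hour):
    """Direct revenue lost while the system is down (reputation not modeled)."""
    return rto_hours * revenue_per_hour

def worst_case_data_loss_hours(snapshot_interval_hours):
    """Worst case: the incident lands just before the next snapshot would run."""
    return snapshot_interval_hours

print(downtime_cost(4, 10_000))        # 4 hour restore at $10,000/hour -> 40000
print(worst_case_data_loss_hours(24))  # daily snapshots -> up to 24 hours lost
```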

Work backwards from business impact to set these numbers. Ask your stakeholders: how long can the business operate without this system? How much data loss triggers a compliance violation or a customer trust issue? Those answers determine your snapshot frequency, replication strategy, and restore automation requirements.

For most of the systems we build and manage, we target an RTO under 30 minutes and an RPO under 5 minutes. Achieving that requires the combination of frequent snapshots, WAL archiving, cross region replication, and automated restore runbooks. It sounds complex, but once it is set up, it runs itself.

Common Mistakes We See

Backing up the database but not the application state. Your database backup is useless if you cannot also restore the application version that is compatible with that schema. Tag your backups with the corresponding application version or migration number.

Ignoring blob storage. If your application stores files in S3 or equivalent, those need their own backup strategy. Database snapshots do not capture your file storage. Enable versioning and cross region replication on every production bucket.

No encryption. Backups contain your most sensitive data. If your backups are unencrypted, a single compromised storage bucket exposes everything. Use AES-256 encryption at rest and TLS in transit, always.

Single person knowledge. If only one engineer knows how to restore from backup, you do not have a strategy. You have a single point of failure. Document the restore process, store it somewhere accessible during an outage (not in the database you are trying to restore), and make sure at least two people have practiced it.

What a Production Grade Setup Looks Like

For a PostgreSQL database on AWS (the stack we use most frequently), our standard configuration includes:

- Hourly automated snapshots with 7-day retention

- Continuous WAL archiving to S3 with 30-day retention

- Cross-region snapshot replication to a secondary region

- Weekly automated restore validation with Slack alerts

- Documented runbook stored in the repository, not a wiki behind a login

- Monitoring and alerting on backup job success and failure

This setup takes about a day to implement from scratch and costs under $50 per month for most small to mid-size databases. Compare that to the cost of a single data loss event, and the ROI is immediate.

Stop Hoping Your Backups Work

Your database is likely the most valuable technical asset your business owns. Every customer record, every transaction, every piece of content lives there. Treating its protection as a set-and-forget checkbox is one of the most common and most preventable mistakes we see.

If you do not have a tested, automated, cross region backup strategy in place today, you are gambling with your business. We build and manage these systems for companies that cannot afford to lose their data. Reach out to us and we will audit your current backup setup and close the gaps before they become emergencies.

Ready to Build?

Let us talk about your project

We take on 3-4 projects at a time. Get an honest assessment within 24 hours.