Background Jobs and Queue Architecture: A Practical Guide

Veld Systems · 7 min read

Every application reaches a point where some operations take too long to handle during a user request. Sending an email should not make the user wait 2 seconds for an SMTP response. Generating a PDF report should not block the API for 30 seconds. Processing a payment webhook should not fail silently because the handler timed out.

Background jobs solve this problem by moving slow, unreliable, or resource-intensive operations out of the request/response cycle and into a separate processing layer. The user gets an immediate response. The work happens asynchronously. If it fails, it retries automatically.

We have built queue architectures for applications processing millions of jobs per day and for startups handling a few hundred. The principles are the same at every scale. This post covers how to design a background job system that is reliable, observable, and does not become a maintenance burden.

When You Need Background Jobs

Not every operation belongs in a queue. The rule is simple: if the user does not need to see the result immediately, it should be a background job. Here are the most common use cases we see across projects we have shipped:

Email and notifications. Sending email through an SMTP provider or an API like SendGrid takes 200 to 2000ms. Push notifications, SMS, and webhook deliveries add similar latency. Queue them all. The user should see "Order confirmed" instantly, not after every notification has been delivered.

File processing. Image resizing, PDF generation, CSV imports, and video transcoding are all CPU-intensive operations that should never run during an HTTP request. A 50MB CSV import can take minutes. A user uploading that file should see a progress indicator, not a loading spinner that eventually times out.

Third-party API calls. External services are unreliable: they have rate limits, they go down, they respond slowly. Wrapping third-party calls in background jobs gives you automatic retries, rate limit compliance, and failure isolation. If Stripe's API is slow for 10 minutes, your checkout still works and the payment confirmation processes when the API recovers.

Data aggregation and reports. Calculating analytics dashboards, generating financial reports, and building search indexes are operations that can run on their own schedule. A nightly job that rebuilds your reporting tables is simpler and more reliable than computing aggregates in real time on every request. We have covered how to design a database schema that supports both transactional and analytical workloads in a previous post.

Scheduled tasks. Subscription renewals, trial expiration notices, data retention cleanup, and periodic health checks are all jobs that run on a schedule. They are not triggered by user actions but need to happen reliably and on time.

Choosing a Queue Technology

The queue technology landscape is crowded. Here is what we recommend, based on years of production experience:

PostgreSQL-based queues (pg-boss, Graphile Worker, PGMQ). If your application already uses PostgreSQL, and most of ours do, a Postgres-based queue is the simplest starting point. Your jobs are stored in the same database as your application data, which means they participate in transactions. You can enqueue a job and insert a record atomically. No separate infrastructure to manage. No eventual consistency issues. For applications processing up to 50,000 jobs per hour, this is our default recommendation.

Redis-based queues (BullMQ, Bee-Queue). Redis gives you higher throughput than Postgres and supports more sophisticated features like priority queues, rate limiting, and delayed jobs. BullMQ is the standard for Node.js applications. The tradeoff is that Redis is an additional piece of infrastructure you need to operate, and Redis persistence is not as durable as Postgres. For applications processing 50,000 to 500,000 jobs per hour, Redis queues are the sweet spot.

Managed services (AWS SQS, Google Cloud Tasks, Azure Queue Storage). If you are already on a cloud provider and want zero infrastructure management, managed queues are a solid choice. SQS in particular is effectively infinitely scalable and costs almost nothing at moderate volumes. The tradeoff is vendor lock-in and less flexibility in job scheduling and retry logic.

Kafka or RabbitMQ. For event-driven architectures processing millions of events per hour with complex routing, replay requirements, and multi-consumer patterns, Kafka or RabbitMQ become necessary. But they are significantly more complex to operate. Do not reach for them unless your scale genuinely demands it. We discuss when this level of architecture is warranted in our scaling guide.

Designing for Reliability

The single most important property of a background job system is reliability. A job that is enqueued must eventually be processed, even if workers crash, deployments happen, or infrastructure fails temporarily.

Idempotency is not optional. Every job handler must be safe to run more than once. Network failures, worker crashes, and timeout-based retries all mean that a job might be delivered twice. If your "send welcome email" job is not idempotent, a retry will send two emails. Use deduplication keys, check for existing records before creating new ones, and design every handler to produce the same result regardless of how many times it runs.
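As a minimal sketch of the deduplication-key approach (the handler and store names here are illustrative, not from any specific queue library), an idempotent welcome-email handler records a key per job and refuses to repeat the side effect:

```python
# Illustrative sketch: the in-memory set stands in for a database table
# with a UNIQUE constraint on the deduplication key.
sent_emails: set[str] = set()   # in production: a uniquely keyed table
emails_delivered: list[str] = []  # stands in for the SMTP provider

def send_welcome_email(job_id: str, user_email: str) -> None:
    dedup_key = f"welcome:{job_id}"
    if dedup_key in sent_emails:
        return  # already processed; a retry must not send a second email
    emails_delivered.append(user_email)  # the real side effect
    sent_emails.add(dedup_key)

# A retry delivers the same job twice; only one email goes out.
send_welcome_email("job-42", "ada@example.com")
send_welcome_email("job-42", "ada@example.com")
assert len(emails_delivered) == 1
```

In a real handler the deduplication key lives in a database table with a unique constraint, and the check-and-record step shares a transaction with any other writes the job performs.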

Retry with exponential backoff. When a job fails, retry it with increasing delays: 1 second, then 10 seconds, then 1 minute, then 10 minutes. This gives transient failures (network blips, rate limits, brief outages) time to resolve without hammering the failing service. Set a maximum retry count, typically 5 to 10, and move permanently failed jobs to a dead letter queue for manual review.

Dead letter queues. Jobs that exhaust their retries need to go somewhere visible. A dead letter queue holds these failed jobs so you can inspect them, fix the underlying issue, and replay them. Without a dead letter queue, failed jobs silently disappear and the problems they represent go unnoticed until a customer complains. In our experience, monitoring the dead letter queue is the single most important operational practice for any background job system.

Transactional enqueuing. When possible, enqueue jobs within the same database transaction as the action that triggers them. If you create an order and then enqueue a "send confirmation email" job, and the enqueue fails after the order is committed, the customer never gets their confirmation. With Postgres based queues, this is trivial because the job lives in the same database. With external queues, you need the outbox pattern: write the job to an outbox table in your database transaction, then have a separate process poll the outbox and forward jobs to the queue.
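A minimal sketch of the outbox pattern, using SQLite as a stand-in for the application database (table and column names are illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE outbox (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        job_type TEXT NOT NULL,
        payload  TEXT NOT NULL,
        sent     INTEGER NOT NULL DEFAULT 0
    );
""")

# Create the order and enqueue its confirmation email in ONE transaction:
# either both rows exist or neither does.
with conn:
    cur = conn.execute("INSERT INTO orders (total) VALUES (?)", (99.0,))
    conn.execute(
        "INSERT INTO outbox (job_type, payload) VALUES (?, ?)",
        ("send_confirmation_email", json.dumps({"order_id": cur.lastrowid})),
    )

# A separate relay process polls the outbox and forwards rows to the queue.
pending = conn.execute("SELECT id, job_type FROM outbox WHERE sent = 0").fetchall()
for row_id, job_type in pending:
    # queue.publish(job_type, payload) would go here
    conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
conn.commit()
```

The relay loop should only mark a row as sent after the queue acknowledges it, which means the forward step can run twice on a crash; this is exactly why the handlers themselves must be idempotent.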

Observability and Monitoring

A queue system without monitoring is a queue system that will fail silently. At minimum, you need visibility into:

Queue depth. How many jobs are waiting to be processed? If the depth is growing faster than workers can drain it, you have a throughput problem. Set alerts for queue depth exceeding your expected baseline by 2x to 3x.

Processing latency. How long does it take from enqueue to completion? This should be stable and predictable. A sudden increase means either workers are overloaded or job processing time has increased, both of which need investigation.

Error rates. What percentage of jobs are failing? A steady baseline of 0.1% might be normal. A spike to 5% means something broke. Break this down by job type so you can identify which specific handler is problematic.

Worker health. Are all workers alive and processing? If a worker goes down and nobody notices, queue depth will climb until the remaining workers cannot keep up. Health checks and auto-restart policies handle this automatically.

We typically integrate job monitoring into the same observability stack used for the rest of the application. Structured logs, metrics dashboards, and alert policies all apply to background jobs just as they apply to API endpoints. Our cloud and DevOps practice includes queue monitoring setup as a standard part of any deployment.

Common Patterns We Use

Fan out. A single event triggers multiple jobs. A new order enqueues: send confirmation email, update inventory, notify warehouse, create invoice, trigger analytics event. Each job runs independently and fails independently.
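The fan-out pattern is little more than a loop over independent enqueues. A sketch, where `queue` and `enqueue` are stand-ins for your real queue client:

```python
# Hypothetical fan-out: one "order created" event enqueues several
# independent jobs. Each runs, retries, and fails on its own.
queue: list[dict] = []  # stands in for the real queue backend

def enqueue(job_type: str, **payload) -> None:
    queue.append({"type": job_type, "payload": payload})

def on_order_created(order_id: int) -> None:
    for job_type in (
        "send_confirmation_email",
        "update_inventory",
        "notify_warehouse",
        "create_invoice",
        "track_analytics_event",
    ):
        enqueue(job_type, order_id=order_id)

on_order_created(1001)
assert len(queue) == 5  # five independent jobs, one triggering event
```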

Chain. Jobs execute in sequence where each depends on the previous result. Process payment, then generate invoice, then send receipt. If payment processing fails, the subsequent jobs never run.

Batch processing. Instead of processing items one at a time, accumulate them and process in batches. This is common for database writes, API calls with batch endpoints, and notification digests. Processing 100 items in one batch is often 10x faster than processing them individually due to reduced overhead.
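The accumulation step usually reduces to a chunking helper like this sketch (the function name is ours):

```python
from typing import Iterator, TypeVar

T = TypeVar("T")

def batches(items: list[T], size: int) -> Iterator[list[T]]:
    """Yield successive fixed-size chunks; the last chunk may be smaller."""
    for start in range(0, len(items), size):
        yield items[start : start + size]

# 250 queued items become three batch calls instead of 250 round trips.
sizes = [len(b) for b in batches(list(range(250)), 100)]
assert sizes == [100, 100, 50]
```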

Scheduled jobs with distributed locks. For jobs that run on a schedule (like nightly reports), you need to ensure only one instance runs even if you have multiple workers. Distributed locks using your database or Redis prevent duplicate execution.

Getting Started Without Over-Engineering

If you are building a new application and anticipate needing background jobs, start simple. A Postgres based queue with 1 to 2 worker processes will handle most early stage workloads. Build your job handlers to be idempotent from day one because retrofitting idempotency is painful. Add monitoring early because debugging queue issues without observability is guesswork.

You can always migrate to Redis or a managed queue later when your throughput demands it. The job handlers do not change; only the transport layer does. This is one of the advantages of clean system architecture: separating concerns so that infrastructure decisions can evolve without rewriting business logic.

If you are dealing with queue reliability issues, processing bottlenecks, or need to design a background job system from scratch, get in touch. We will architect a solution that handles your current load and scales cleanly as your product grows.

Ready to Build?

Let us talk about your project

We take on 3-4 projects at a time. Get an honest assessment within 24 hours.