Webhook Architecture: How to Build Reliable Integrations

Veld Systems | 7 min read

Webhooks are the simplest integration pattern in modern software and also the most frequently broken. The concept is straightforward: when something happens in your system, send an HTTP POST to a URL the consumer provides. Every payment processor, CRM, and SaaS tool uses them. They are the backbone of the modern integration ecosystem.

The problem is that "send an HTTP POST" is where most teams stop thinking about the architecture. They fire off an HTTP request from their application code, maybe log whether it succeeded, and move on. Then the receiving server goes down for 30 seconds and they lose events. Or the consumer's endpoint is slow and it backs up their entire request queue. Or they deploy a change that subtly alters the payload format and break every integration without knowing.

Building webhooks that are reliable enough for production is not hard. It just requires treating webhook delivery as its own system, not an afterthought bolted onto your API. We have built webhook infrastructure for products handling millions of events per month, and the patterns are consistent.

The Core Architecture

A reliable webhook system has four components: event generation, a delivery queue, the delivery engine, and the observability layer. Trying to collapse these into "just send an HTTP request" is where reliability falls apart.

Event Generation

When something happens in your system (a payment succeeds, a user signs up, an order ships), you create a webhook event record. This record contains the event type, the payload, a timestamp, and a unique event ID. The critical detail is that creating this record is part of the same database transaction as the action that triggered it.

If a payment succeeds and you commit that to your database, the webhook event record must be committed in the same transaction. If you generate the webhook event asynchronously after the transaction commits, you have a window where the payment succeeded but no webhook event was created. That window will eventually bite you.

This is the transactional outbox pattern, and it is the foundation of reliable event delivery. Your events table becomes the source of truth for what needs to be delivered, and a separate process reads from that table and handles delivery.
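As a minimal sketch of the outbox write, the example below uses Python's built-in sqlite3 module so it runs without any setup. The table names (payments, webhook_events), the column names, and the record_payment helper are assumptions made for illustration; a real system would use its own schema and database.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def init_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one business table plus the outbox table.
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS payments (
            id TEXT PRIMARY KEY,
            amount_cents INTEGER NOT NULL,
            status TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS webhook_events (
            id TEXT PRIMARY KEY,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            created_at TEXT NOT NULL,
            delivered_at TEXT
        );
        """
    )


def record_payment(conn: sqlite3.Connection, payment_id: str, amount_cents: int) -> str:
    """Store the payment and its webhook event in one transaction."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"payment_id": payment_id, "amount_cents": amount_cents})
    created_at = datetime.now(timezone.utc).isoformat()
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO payments (id, amount_cents, status) VALUES (?, ?, 'succeeded')",
            (payment_id, amount_cents),
        )
        conn.execute(
            "INSERT INTO webhook_events (id, event_type, payload, created_at) "
            "VALUES (?, 'payment.succeeded', ?, ?)",
            (event_id, payload, created_at),
        )
    return event_id
```

Because both rows are written inside the same transaction, there is no state in which the payment exists without its event, or the event without its payment.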

The Delivery Queue

Your delivery engine reads undelivered events from the outbox and sends them. This should be a background worker, not part of your web request lifecycle. Webhook delivery should never block or slow down your API responses.
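One way to picture the background worker is a function that a separate process calls on a schedule. The sketch below assumes a single subscriber, a hypothetical webhook_events outbox table with a seq column for ordering, and a send callable (also hypothetical) that returns True when the consumer accepted the event.

```python
import sqlite3
from typing import Callable

# Hypothetical outbox table; seq gives a stable delivery order.
SCHEMA = """
CREATE TABLE IF NOT EXISTS webhook_events (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    id TEXT NOT NULL UNIQUE,
    payload TEXT NOT NULL,
    delivered_at TEXT
);
"""


def deliver_pending(
    conn: sqlite3.Connection,
    send: Callable[[str, str], bool],
    batch_size: int = 100,
) -> int:
    """Send undelivered events in order; return how many were delivered."""
    rows = conn.execute(
        "SELECT id, payload FROM webhook_events "
        "WHERE delivered_at IS NULL ORDER BY seq LIMIT ?",
        (batch_size,),
    ).fetchall()
    delivered = 0
    for event_id, payload in rows:
        if not send(event_id, payload):
            # Stop at the first failure so later events are not sent out of order.
            break
        with conn:
            conn.execute(
                "UPDATE webhook_events SET delivered_at = datetime('now') WHERE id = ?",
                (event_id,),
            )
        delivered += 1
    return delivered
```

If the worker crashes after send succeeds but before the update commits, the event is sent again on the next run, which is one reason consumers need to deduplicate by event ID.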

The queue needs to support:

- Ordering guarantees per subscriber. Events for the same webhook endpoint should be delivered in order. Events for different endpoints are independent.

- Retry with backoff. When delivery fails, retry on an escalating schedule in which the gap between attempts grows each time (for example 1 minute, 5 minutes, 30 minutes, 2 hours, 24 hours). After a configurable number of failures, mark the endpoint as unhealthy and stop attempting delivery.

- Concurrency limits per endpoint. If a consumer's endpoint is slow, do not let it consume all your delivery workers. Limit concurrent deliveries per endpoint so one slow consumer does not affect others.
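The retry schedule and the per-endpoint concurrency limit from the list above can be sketched as follows. The names next_attempt_at and EndpointLimiter, and the default of two concurrent deliveries, are assumptions for illustration; the delays are the ones given in the list.

```python
import threading
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Optional

# Delay before each retry, indexed by how many attempts have already failed.
RETRY_DELAYS = [
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=24),
]


def next_attempt_at(failed_attempts: int, now: datetime) -> Optional[datetime]:
    """Return when to retry, or None once the schedule is exhausted."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts > len(RETRY_DELAYS):
        return None
    return now + RETRY_DELAYS[failed_attempts - 1]


class EndpointLimiter:
    """Caps in-flight deliveries per endpoint so one slow consumer cannot
    occupy every worker."""

    def __init__(self, max_concurrent: int = 2) -> None:
        self._max = max_concurrent
        self._lock = threading.Lock()
        self._in_flight: Dict[str, int] = defaultdict(int)

    def try_acquire(self, endpoint_id: str) -> bool:
        # Non-blocking: a worker that gets False should pick up other work.
        with self._lock:
            if self._in_flight[endpoint_id] >= self._max:
                return False
            self._in_flight[endpoint_id] += 1
            return True

    def release(self, endpoint_id: str) -> None:
        with self._lock:
            if self._in_flight[endpoint_id] > 0:
                self._in_flight[endpoint_id] -= 1
```

A worker would call try_acquire before sending and release in a finally block afterwards. This limiter only works within one process; several worker processes would need a shared counter, for example in the database.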

The Delivery Engine

The actual HTTP delivery needs to be more thoughtful than a simple POST request. Here is what your delivery engine should handle:

Timeouts. Set an aggressive timeout on webhook deliveries, typically 5 to 10 seconds. If the consumer's server does not respond in that window, treat it as a failure and retry. Do not let a hanging connection tie up your workers.
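A minimal sketch of a delivery call with a timeout, using only Python's standard library. The deliver function and its opener parameter are hypothetical; opener defaults to urllib.request.urlopen and exists only so the function can be exercised without a network.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable


def deliver(
    url: str,
    body: bytes,
    timeout_seconds: float = 5.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """POST the body and return True only for a 2xx response.

    Timeouts, connection errors, and non-2xx statuses all count as
    failures that the caller should schedule for retry.
    """
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    try:
        # Note: this timeout applies to each blocking socket operation,
        # not to the request as a whole.
        with opener(request, timeout=timeout_seconds) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

Because the timeout is per socket operation, a consumer that trickles its response slowly could still hold a worker longer than expected, so a production engine may want an overall deadline as well.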

Signature verification. Every webhook request should include a cryptographic signature (HMAC-SHA256 is standard), computed with a secret shared with the consumer, that the consumer can verify. This proves the request came from your system and has not been tampered with. Include a timestamp in the signed content so the consumer can reject requests older than a few minutes, which limits replay attacks.
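A minimal sketch of signing and verifying with HMAC-SHA256. The "timestamp.body" message layout, the 300-second tolerance, and the function names are assumptions for illustration, not a standard; what matters is that the sender and the consumer agree on the same layout.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Sender side: sign the timestamp together with the raw request body."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(
    secret: bytes,
    timestamp: int,
    body: bytes,
    signature: str,
    tolerance_seconds: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Consumer side: reject stale timestamps, then compare signatures."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > tolerance_seconds:
        return False  # too old (or too far in the future): possible replay
    expected = sign(secret, timestamp, body)
    # compare_digest runs in constant time; comparing bytes avoids a
    # TypeError if the received signature contains non-ASCII characters.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The consumer should verify against the raw bytes it received, before any JSON parsing, because re-serializing the payload can change the bytes and invalidate the signature.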

Idempotency keys. Include a unique event ID in every delivery. Consumers should use this to deduplicate events. Your system will sometimes deliver the same event twice, especially around retries and network failures: if a consumer processes an event but its response is lost or times out, the retry delivers it again. This is not a bug. It is expected behavior under at-least-once delivery, and the consumer needs to handle it.
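On the consumer side, deduplication can be as simple as recording each event ID in the same transaction as the work it triggers. The sketch below uses sqlite3; the processed_events table and the handle_once helper are hypothetical names.

```python
import sqlite3
from typing import Callable

# Hypothetical consumer-side table of event IDs already handled.
SCHEMA = "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY);"


def handle_once(conn: sqlite3.Connection, event_id: str, apply: Callable[[], None]) -> bool:
    """Run apply() only the first time an event ID is seen.

    Returns True if the event was processed and False if it was a duplicate.
    """
    with conn:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cursor.rowcount == 0:
            return False  # ID already recorded: duplicate delivery
        # If apply() raises, the transaction rolls back and the ID is not
        # recorded, so a later retry of the same event is processed.
        apply()
    return True
```

A duplicate should still be answered with a success status so the sender stops retrying it.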

Payload versioning. Include a version identifier in your webhook payload. When you need to change the payload format, increment the version and give consumers time to update their handlers. Breaking webhook payloads without warning is one of the fastest ways to destroy trust with your integration partners.
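As an illustration, a versioned envelope and a consumer that branches on the version might look like the sketch below. The field names (id, type, version, created_at, data) and the two order.shipped payload shapes are invented for the example.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def build_envelope(event_id: str, event_type: str, data: Dict[str, Any], version: str = "1") -> str:
    """Sender side: wrap the event data in an envelope that names its version."""
    envelope = {
        "id": event_id,
        "type": event_type,
        "version": version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }
    return json.dumps(envelope, separators=(",", ":"), sort_keys=True)


def extract_tracking_number(raw: str) -> str:
    """Consumer side: choose the parsing logic from the declared version."""
    event = json.loads(raw)
    version = event["version"]
    if version == "1":
        # Hypothetical v1 shape: {"tracking": "..."}
        return event["data"]["tracking"]
    if version == "2":
        # Hypothetical v2 shape: {"shipment": {"tracking_number": "..."}}
        return event["data"]["shipment"]["tracking_number"]
    raise ValueError("unsupported payload version: %s" % version)
```

With an explicit version field, a consumer can support the old and new shapes side by side during a migration and fail loudly on a version it does not recognize.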

Handling Failures at Scale

Webhook failures are not exceptional. They are routine. Servers go down. Networks partition. SSL certificates expire. Consumers deploy broken code. Your system needs to handle all of this gracefully.

Circuit breakers per endpoint. If an endpoint fails consistently (say, 5 consecutive failures), stop attempting delivery and mark the endpoint as unhealthy. Notify the consumer through whatever channel you have (email, dashboard notification, in-app alert). Periodically send a test delivery to check whether the endpoint has recovered.
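A per-endpoint circuit breaker can be sketched as a small state holder. The class name EndpointBreaker, the 5-minute probe interval, and the injectable clock are assumptions for illustration; the threshold of 5 matches the example above.

```python
import time
from typing import Callable, Optional


class EndpointBreaker:
    """Tracks consecutive failures for one endpoint."""

    def __init__(
        self,
        failure_threshold: int = 5,
        probe_interval_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._probe_interval = probe_interval_seconds
        self._clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # set while the endpoint is unhealthy

    @property
    def healthy(self) -> bool:
        return self.opened_at is None

    def allow_delivery(self) -> bool:
        if self.opened_at is None:
            return True
        # Unhealthy: permit a probe only after the interval has elapsed.
        return self._clock() - self.opened_at >= self._probe_interval

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self._threshold:
            # Opens the breaker, or restarts the probe timer after a failed probe.
            self.opened_at = self._clock()
```

The point where opened_at is first set is also the natural place to notify the consumer. This sketch does not stop several workers from probing at once; that would need coordination outside the class.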

Dead letter queues. Events that exhaust all retry attempts go into a dead letter queue. The consumer should be able to view these events in a dashboard and replay them once they have fixed their endpoint. This is essential. Without it, failed events are just lost.
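A minimal sketch of dead-lettering and replay, again with sqlite3. For brevity it models the dead letter queue as a status value on a hypothetical webhook_deliveries table rather than as a separate queue; the table, the MAX_ATTEMPTS limit, and the function names are all assumptions.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE IF NOT EXISTS webhook_deliveries (
    event_id TEXT PRIMARY KEY,
    endpoint_id TEXT NOT NULL,
    payload TEXT NOT NULL,
    attempts INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'pending'  -- pending | delivered | dead
);
"""
MAX_ATTEMPTS = 5


def record_failed_attempt(conn: sqlite3.Connection, event_id: str) -> str:
    """Count a failure; dead-letter the event once retries are exhausted."""
    with conn:
        conn.execute(
            "UPDATE webhook_deliveries SET attempts = attempts + 1 WHERE event_id = ?",
            (event_id,),
        )
        conn.execute(
            "UPDATE webhook_deliveries SET status = 'dead' "
            "WHERE event_id = ? AND attempts >= ?",
            (event_id, MAX_ATTEMPTS),
        )
        row = conn.execute(
            "SELECT status FROM webhook_deliveries WHERE event_id = ?", (event_id,)
        ).fetchone()
    if row is None:
        raise KeyError(event_id)
    return row[0]


def list_dead_letters(conn: sqlite3.Connection, endpoint_id: str) -> List[str]:
    """What a consumer dashboard would show for one endpoint."""
    rows = conn.execute(
        "SELECT event_id FROM webhook_deliveries "
        "WHERE endpoint_id = ? AND status = 'dead' ORDER BY event_id",
        (endpoint_id,),
    ).fetchall()
    return [row[0] for row in rows]


def replay(conn: sqlite3.Connection, event_id: str) -> bool:
    """Put a dead-lettered event back in line with a fresh retry budget."""
    with conn:
        cursor = conn.execute(
            "UPDATE webhook_deliveries SET status = 'pending', attempts = 0 "
            "WHERE event_id = ? AND status = 'dead'",
            (event_id,),
        )
    return cursor.rowcount == 1
```

Keeping the payload with the dead-lettered row is what makes replay possible without regenerating the event.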

Endpoint health dashboards. Give your consumers visibility into their webhook health. Show them delivery success rates, average response times, recent failures, and the current status of their endpoint. This reduces support tickets dramatically because consumers can diagnose their own issues.

If you are building a product that will depend on integrations like this, our system architecture service covers these design decisions in depth before any code is written.

Security Considerations

Webhooks introduce a unique security surface because you are sending data to URLs your consumers control. This means:

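Validate destination URLs. Do not allow webhooks to loopback or private IP ranges (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) or to cloud metadata endpoints (169.254.169.254). Check the addresses the hostname resolves to, not just the URL string, because a public-looking hostname can point at an internal address. This prevents SSRF attacks where a malicious consumer uses your webhook system to probe your internal network.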
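A minimal sketch of a destination check using Python's ipaddress module. The function name is_safe_destination and the injectable resolve parameter are assumptions for illustration; resolve defaults to socket.getaddrinfo.

```python
import ipaddress
import socket
from typing import Callable
from urllib.parse import urlsplit


def is_safe_destination(url: str, resolve: Callable = socket.getaddrinfo) -> bool:
    """Accept only HTTPS URLs whose hostname resolves solely to public addresses."""
    try:
        parts = urlsplit(url)
        port = parts.port or 443  # .port raises ValueError for an invalid port
    except ValueError:
        return False
    if parts.scheme != "https" or not parts.hostname:
        return False
    try:
        infos = resolve(parts.hostname, port, type=socket.SOCK_STREAM)
    except (socket.gaierror, UnicodeError):
        return False
    if not infos:
        return False
    for info in infos:
        # info[4][0] is the resolved address; drop any IPv6 zone suffix.
        raw_address = info[4][0].split("%", 1)[0]
        try:
            address = ipaddress.ip_address(raw_address)
        except ValueError:
            return False
        # is_global is False for private, loopback, link-local (which
        # includes 169.254.169.254), and other reserved ranges.
        if not address.is_global:
            return False
    return True
```

DNS answers can change between registration and delivery, so a check like this is worth repeating when each delivery is made, ideally against the address the connection actually uses.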

Use HTTPS only. Do not deliver webhooks over plain HTTP. The payload often contains sensitive data (user information, payment details, API keys), and sending it unencrypted over the internet is a security incident waiting to happen.

Rotate signing secrets. Give consumers the ability to rotate their webhook signing secret without downtime. Support a grace period where both the old and new secrets are valid, so the consumer can update their verification code before the old secret expires. We include this kind of thinking in every project, aligned with the patterns in our web app security checklist.
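One possible way to implement the grace period is for the sender to sign each request with every secret that is still active and send all of the signatures, so a consumer holding either the old or the new secret can verify. The sketch below assumes that approach; SigningSecret, signatures_for, and consumer_accepts are hypothetical names.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SigningSecret:
    key: bytes
    # None marks the current secret; otherwise the Unix time at which the
    # grace period for a rotated-out secret ends.
    expires_at: Optional[float] = None


def signatures_for(secrets: List[SigningSecret], body: bytes, now: float) -> List[str]:
    """Sender side: one signature per secret that is still active."""
    return [
        hmac.new(secret.key, body, hashlib.sha256).hexdigest()
        for secret in secrets
        if secret.expires_at is None or now < secret.expires_at
    ]


def consumer_accepts(key: bytes, body: bytes, signatures: List[str]) -> bool:
    """Consumer side: accept if any received signature matches our secret."""
    expected = hmac.new(key, body, hashlib.sha256).hexdigest().encode("utf-8")
    return any(
        hmac.compare_digest(expected, candidate.encode("utf-8"))
        for candidate in signatures
    )
```

During the grace period both signatures are sent, so the consumer can switch secrets at any point; once the old secret expires, only requests verified with the new one pass.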

Limit payload size. Do not send entire objects in webhook payloads if they can be large. Send the event type and the resource ID, and let the consumer fetch the full resource through your API. This keeps payloads small, reduces bandwidth, and means the consumer always gets the latest state of the resource.

What Your Consumer Experience Should Look Like

The difference between a webhook system that developers love and one they dread is the consumer experience. Build these features:

A testing tool. Let consumers send test events to their endpoint from your dashboard. This lets them verify their handler works before going live.

Event logs with payloads. Show consumers a log of every event sent to their endpoint, including the full payload, the response status code, and the response body. This makes debugging trivial.

Retry controls. Let consumers manually retry a specific failed delivery. Sometimes the fix is on their end and they just need to replay the event.

Webhook management API. In addition to a dashboard, provide an API for managing webhook subscriptions. Developer tools that integrate with your product will need to register webhooks programmatically, not through a UI. Good API design for this follows the same principles we cover in our API best practices guide.

Build vs. Buy

For many products, the right answer is to build a lightweight webhook system using the patterns above. If you are using PostgreSQL, the transactional outbox pattern with a background worker is straightforward to implement and gives you full control.

If you are processing a high volume of events (millions per day) and do not want to operate the delivery infrastructure, managed webhook delivery services exist. The tradeoff is cost and a dependency on another vendor versus not having to build retry logic, monitoring, and scaling yourself.

Our general recommendation: build the event generation and queue yourself (this is core to your product and tightly coupled to your data model) and evaluate managed services for delivery at scale if operational burden becomes a concern.

Getting It Right the First Time

Webhooks are the kind of system where getting the architecture right early saves enormous pain later. Retrofitting reliability (retries, idempotency, dead letter queues) onto a fire-and-forget webhook implementation means migrating every existing integration, and that is never fun.

If you are designing a product that needs reliable webhook delivery or integrating with systems that rely on them, talk to our team. We will architect a webhook system that your integration partners can depend on.
