Every growing application eventually faces the same problem: business logic that changes faster than your team can ship code. An approval workflow that had two steps now needs five. A pricing rule that was simple multiplication now depends on customer tier, geography, and contract terms. A notification that went to one person now needs to route through a hierarchy. When these changes require pull requests, QA cycles, and deployments, your engineering team becomes a bottleneck for business decisions.
A well-designed workflow engine solves this by making business logic configurable rather than hard-coded. Not a low-code platform that replaces developers entirely, but an architecture that lets authorized users modify rules, approval chains, and routing logic without changing application code. We build these systems regularly as part of our system architecture practice, and the difference between a good workflow engine and a tangle of spaghetti conditionals is usually about three months of accumulated technical debt.
When You Need a Workflow Engine
Not every application needs one. If your business logic is stable, well understood, and changes less than once a quarter, just write code. A workflow engine adds abstraction overhead that is only worth it when the logic changes frequently or varies significantly between customers. Clear signals you need one include:
- Non-technical stakeholders regularly request logic changes that require developer time
- Multi-tenant SaaS where different customers need different approval chains or routing rules
- Compliance workflows where auditors need to see and verify the exact rules applied to each decision
- Operations teams spending hours on manual steps that follow a predictable pattern
If three or more of these apply, a workflow engine will pay for itself within months. We have seen this pattern repeatedly in the SaaS products we help teams build.
State Machines as the Foundation
The core of any workflow engine is a finite state machine. Every entity in the workflow (an order, an approval request, a support ticket) exists in exactly one state at any time, and transitions between states are governed by defined rules. This sounds academic but it is profoundly practical.
Consider an invoice approval workflow. States might include: draft, submitted, under_review, approved, rejected, paid. Transitions define what can happen: a draft can be submitted, a submitted invoice can move to under_review, an under_review invoice can be approved or rejected. Each transition can have guards (conditions that must be true) and actions (side effects that execute on transition).
```
draft -> submitted: guard(amount > 0, has_line_items)
submitted -> under_review: action(assign_reviewer, send_notification)
under_review -> approved: guard(reviewer_is_authorized), action(notify_submitter)
under_review -> rejected: guard(reviewer_is_authorized), action(notify_submitter, request_revision)
approved -> paid: guard(payment_processed), action(update_ledger, send_receipt)
```
Store workflow definitions as JSON or YAML in the database, not in application code. This is what makes the engine configurable. A workflow definition includes the list of states, valid transitions, guard conditions, and actions to execute. The engine reads this definition at runtime and enforces it.
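A stored definition for the invoice example above might look like the following sketch. The field names (`guards`, `actions`, `initialState`, and so on) are illustrative assumptions, not a prescribed schema:

```typescript
// Illustrative shape for a workflow definition stored as JSON in the database.
interface TransitionDef {
  from: string;
  to: string;
  guards: string[];   // names of guard conditions the engine evaluates
  actions: string[];  // names of actions enqueued when the transition fires
}

interface WorkflowDefinition {
  name: string;
  version: number;
  states: string[];
  initialState: string;
  transitions: TransitionDef[];
}

const invoiceApproval: WorkflowDefinition = {
  name: "invoice_approval",
  version: 1,
  states: ["draft", "submitted", "under_review", "approved", "rejected", "paid"],
  initialState: "draft",
  transitions: [
    { from: "draft", to: "submitted", guards: ["amount_positive", "has_line_items"], actions: [] },
    { from: "submitted", to: "under_review", guards: [], actions: ["assign_reviewer", "send_notification"] },
    { from: "under_review", to: "approved", guards: ["reviewer_is_authorized"], actions: ["notify_submitter"] },
    { from: "under_review", to: "rejected", guards: ["reviewer_is_authorized"], actions: ["notify_submitter", "request_revision"] },
    { from: "approved", to: "paid", guards: ["payment_processed"], actions: ["update_ledger", "send_receipt"] },
  ],
};

// The engine reads the definition at runtime: given a current state,
// which transitions are available?
function availableTransitions(def: WorkflowDefinition, state: string): TransitionDef[] {
  return def.transitions.filter((t) => t.from === state);
}
```

Because the definition is data, not code, changing an approval chain means updating a row, not shipping a deployment.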
The database schema for this is straightforward. A workflow_definitions table stores the JSON definition with versioning. A workflow_instances table tracks each entity progressing through a workflow, including its current state, the definition version it is using, and a history of state transitions. We discuss related data modeling in our database schema design guide.
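Modeled as TypeScript types, the two tables might look like this. Column names are illustrative assumptions; adapt them to your own conventions:

```typescript
// Sketch of the two core tables. workflow_definitions keeps every version;
// workflow_instances pins the version it started with.
interface WorkflowDefinitionRow {
  id: number;
  name: string;
  version: number;      // bumped on every edit; old versions are retained
  definition: unknown;  // the JSON document described above
  createdAt: string;    // ISO timestamp
}

interface TransitionRecord {
  from: string;
  to: string;
  actor: string;        // who or what triggered the transition
  at: string;           // ISO timestamp
}

interface WorkflowInstanceRow {
  id: number;
  definitionId: number;
  definitionVersion: number; // pinned at creation so later edits don't break it
  entityType: string;        // e.g. "invoice"
  entityId: number;
  currentState: string;
  history: TransitionRecord[];
}
```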
Rule Engine for Guard Conditions
Guard conditions determine whether a transition is allowed. Simple guards check a single field ("amount > 1000"). Complex guards combine multiple conditions with AND/OR logic, compare against external data, or evaluate time-based rules ("submitted more than 48 hours ago").
There are three levels of sophistication for rule evaluation:
Level 1: Hardcoded evaluators. Define a set of operators (equals, greater_than, contains, is_empty) and let the workflow definition reference them by name. The engine maps these to functions in code. This covers 80% of use cases and is easy to build and test.
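A Level 1 evaluator can be as small as a registry of named operators. The operator set and the `GuardDef` shape below are assumptions for illustration:

```typescript
// Level 1 sketch: a fixed operator registry that workflow definitions
// reference by name. Adding an operator means adding one function here.
type Operator = (fieldValue: unknown, operand: unknown) => boolean;

const operators: Record<string, Operator> = {
  equals: (a, b) => a === b,
  greater_than: (a, b) => typeof a === "number" && typeof b === "number" && a > b,
  contains: (a, b) => Array.isArray(a) && a.includes(b),
  is_empty: (a) => a == null || a === "" || (Array.isArray(a) && a.length === 0),
};

interface GuardDef {
  field: string;     // key into the instance's data payload
  operator: string;  // must match a registered operator name
  operand?: unknown;
}

// Evaluate one guard against the workflow instance's data payload.
function evaluateGuard(guard: GuardDef, data: Record<string, unknown>): boolean {
  const op = operators[guard.operator];
  if (!op) throw new Error(`Unknown operator: ${guard.operator}`);
  return op(data[guard.field], guard.operand);
}

// All guards on a transition must pass (AND semantics).
function guardsPass(guards: GuardDef[], data: Record<string, unknown>): boolean {
  return guards.every((g) => evaluateGuard(g, data));
}
```

Because every operator is a plain function in code, Level 1 is trivially unit-testable, and an unknown operator name fails loudly instead of silently allowing a transition.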
Level 2: Expression language. Use a safe expression evaluator like jsonata, expr-eval, or a sandboxed subset of JavaScript to evaluate conditions. The workflow definition contains expressions like `amount > 1000 AND department != 'engineering'`. More flexible, but requires careful sandboxing to prevent injection attacks.
Level 3: External rule engine. For complex decisioning (insurance underwriting, loan approval, fraud scoring), delegate to a dedicated rules engine. This is typically overkill for most SaaS workflows but necessary in regulated industries.
We recommend starting with Level 1 and adding expression evaluation when hardcoded operators become insufficient. The jump from Level 1 to Level 2 is small. The jump from Level 2 to Level 3 is a major architectural investment that should only happen when the business case is clear. This is the kind of decision we help teams make in our consulting engagements.
Async Execution and Side Effects
Workflow transitions trigger actions: sending emails, calling APIs, updating records, starting background jobs. These actions must be executed reliably, which means handling failures, retries, and timeouts.
The pattern that works is event-driven execution. When a transition occurs, the engine publishes an event (e.g., "invoice.approved") to a message queue or event stream. Separate workers consume these events and execute the associated actions. If sending a notification fails, the worker retries; after repeated failures the event lands in a dead-letter queue for inspection. The workflow state itself has already transitioned, so the user is not blocked.
This separation between state transitions (synchronous, in the database transaction) and side effects (asynchronous, via events) is critical. Mixing them means a failed email send can roll back a state transition, leaving the workflow in a confusing state. Keep the state machine pure and let side effects be eventually consistent.
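The shape of that separation can be sketched as follows. `Db` and `Queue` are hypothetical stand-ins for your database client and message queue, not real library APIs:

```typescript
// Sketch: the state change commits synchronously; side effects are published
// as an event and executed later by workers.
interface Db {
  transaction<T>(fn: () => T): T;
  // Should verify the row is still in `from` before updating (optimistic check).
  updateState(instanceId: number, from: string, to: string): void;
}
interface Queue {
  publish(topic: string, payload: object): void;
}

function transition(
  db: Db,
  queue: Queue,
  instanceId: number,
  from: string,
  to: string,
  actions: string[],
): void {
  // 1. State change: synchronous, inside the transaction. If this fails,
  //    nothing happened and no events are emitted.
  db.transaction(() => {
    db.updateState(instanceId, from, to);
  });
  // 2. Side effects: published after commit, consumed asynchronously.
  //    A failed email can never roll back the state change.
  queue.publish(`workflow.${to}`, { instanceId, actions });
}
```

If you need a hard guarantee that no committed transition ever loses its event, write the event to an outbox table inside the same transaction and publish from there.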
For time-based workflows (escalation after 24 hours, auto-approval after 72 hours), use a delayed job system. When entering a state, schedule a job to fire at the timeout. If the workflow transitions before the timeout, cancel the job. Tools like BullMQ, pg_cron, or cloud-native schedulers (EventBridge Scheduler, Cloud Scheduler) handle this well. We covered related async patterns in our piece on startup architecture mistakes, where missing timeout handling is a common failure mode.
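The schedule-on-entry, cancel-on-exit pattern reduces to a few lines. This in-memory sketch uses `setTimeout` to show the mechanics; a real system needs a durable scheduler such as the tools above, since in-process timers vanish on restart:

```typescript
// Minimal in-memory sketch of per-state timeouts, keyed by instance id.
const pendingTimeouts = new Map<number, ReturnType<typeof setTimeout>>();

function scheduleTimeout(instanceId: number, ms: number, fire: () => void): void {
  cancelTimeout(instanceId); // entering a new state replaces any old timer
  pendingTimeouts.set(
    instanceId,
    setTimeout(() => {
      pendingTimeouts.delete(instanceId);
      fire(); // e.g. transition the instance to "escalated"
    }, ms),
  );
}

function cancelTimeout(instanceId: number): void {
  const handle = pendingTimeouts.get(instanceId);
  if (handle !== undefined) {
    clearTimeout(handle);
    pendingTimeouts.delete(instanceId);
  }
}
```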
Multi-Tenant Workflow Configuration
In a SaaS product, different customers need different workflows. Company A requires two levels of approval for purchases over $5,000. Company B requires three levels for anything over $1,000. Company C does not use approval workflows at all.
Handle this by storing workflow definitions per tenant. Each organization gets its own workflow definition that can be customized through an admin interface. Provide sensible defaults (a starter workflow definition that new tenants inherit) and let admins modify transitions, guards, and notification rules.
The admin interface should make it impossible to create invalid workflows. Validate that every state is reachable, that there are no dead-end states (non-terminal states with no outgoing transitions), and that guard conditions reference valid fields. A visual workflow builder using a library like React Flow gives non-technical users a clear picture of the process.
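The reachability and dead-end checks are a simple graph traversal. A sketch, with an assumed minimal definition shape:

```typescript
// Validation sketch: every state must be reachable from the initial state,
// and every non-terminal state needs at least one outgoing transition.
interface Definition {
  states: string[];
  initialState: string;
  terminalStates: string[];
  transitions: { from: string; to: string }[];
}

function validate(def: Definition): string[] {
  const errors: string[] = [];

  // Breadth-first traversal from the initial state.
  const reachable = new Set<string>([def.initialState]);
  const frontier = [def.initialState];
  while (frontier.length > 0) {
    const state = frontier.shift()!;
    for (const t of def.transitions) {
      if (t.from === state && !reachable.has(t.to)) {
        reachable.add(t.to);
        frontier.push(t.to);
      }
    }
  }

  for (const s of def.states) {
    if (!reachable.has(s)) errors.push(`Unreachable state: ${s}`);
    const hasExit = def.transitions.some((t) => t.from === s);
    if (!hasExit && !def.terminalStates.includes(s)) errors.push(`Dead-end state: ${s}`);
  }
  return errors;
}
```

Run this on every save from the admin UI and refuse to persist a definition that returns errors.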
This is where workflow engines create the most value in SaaS products. Instead of building custom logic for each enterprise customer, you build the engine once and let customers configure their own processes. Sales cycles shorten because the answer to "can you customize this workflow?" is always yes.
Versioning and Migration
Workflow definitions change. When they do, you need to decide what happens to instances already in progress. The safest approach is version pinning: each workflow instance records the definition version it started with and continues using that version until completion. New instances use the latest version. This prevents mid-flight changes from breaking active workflows.
When a definition update is critical (fixing a security flaw in a guard condition, for example), provide an admin tool to migrate in-progress instances to the new version. This requires mapping old states to new states and handling cases where the current state no longer exists in the new definition. Always log these migrations in your audit trail.
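The per-instance core of such a tool is a mapping function. In this sketch, `stateMap` is an admin-supplied mapping for states that were renamed or removed; the function shape is an assumption:

```typescript
// Migration sketch: resolve an in-progress instance's current state against
// a new definition version. Refuses to guess: an unmappable state is an
// error for a human, not a silent default.
function migrateState(
  currentState: string,
  newStates: string[],
  stateMap: Record<string, string>,
): { state: string; migrated: boolean } {
  if (newStates.includes(currentState)) {
    return { state: currentState, migrated: false }; // state survives unchanged
  }
  const mapped = stateMap[currentState];
  if (mapped === undefined || !newStates.includes(mapped)) {
    throw new Error(`No migration target for state: ${currentState}`);
  }
  return { state: mapped, migrated: true }; // record this in the audit trail
}
```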
Observability
A workflow engine without observability is a black box. Track these metrics:
- Instances per state: How many items are in each state right now? A growing backlog in "under_review" means your review team is falling behind.
- Transition throughput: How many transitions per hour? Drops indicate problems.
- Time in state: How long do instances stay in each state? This reveals bottlenecks.
- Failed actions: Which side effects are failing and why?
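The time-in-state metric falls straight out of the transition history you are already storing. A sketch, assuming each history entry records the state entered and an ISO timestamp (the field names are illustrative):

```typescript
// Sketch: total milliseconds spent in each state, computed from an
// instance's state history. The current state accrues time up to `now`.
function timeInState(
  history: { state: string; enteredAt: string }[],
  now: string,
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (let i = 0; i < history.length; i++) {
    const start = Date.parse(history[i].enteredAt);
    const end = i + 1 < history.length ? Date.parse(history[i + 1].enteredAt) : Date.parse(now);
    totals[history[i].state] = (totals[history[i].state] ?? 0) + (end - start);
  }
  return totals;
}
```

Aggregating this across instances per state gives you the bottleneck view directly.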
Surface these in a dashboard. Connect them to monitoring and alerting so your team knows when workflows are stuck before customers complain.
Build vs Buy
Off-the-shelf workflow engines like Temporal, Camunda, or n8n solve the orchestration problem but add operational overhead and may not fit your data model. Building a custom engine takes 2 to 4 weeks for a solid v1 (state machine, guard evaluation, async actions, admin UI) and gives you full control over the data model and user experience. For most SaaS products, we recommend building a custom engine that fits naturally into your existing stack rather than absorbing the integration overhead and licensing costs of an external tool.
If you are building a product where configurable business logic is a competitive advantage, talk to us about the architecture. We have shipped workflow engines for operations platforms, fintech products, and enterprise SaaS, and we can help you build one that grows with your business.