Every business application eventually needs to generate documents: invoices, contracts, reports, shipping labels, compliance forms, certificates. PDF generation sounds simple until you face real requirements: pixel-perfect layouts matching brand guidelines, dynamic tables that span multiple pages, embedded charts, digital signatures, and generation at scale, where 10,000 invoices need to render in under 5 minutes.
We have built document automation systems for SaaS platforms, e-commerce operations, healthcare providers, and financial services. The technology choices and architecture patterns vary significantly based on volume, complexity, and customization needs.
The Three Approaches to PDF Generation
HTML to PDF Conversion
The most common and usually the best starting point. You build a template in HTML and CSS, populate it with data, and convert it to PDF using a headless browser (Puppeteer, Playwright) or a dedicated conversion service.
Advantages: Your team already knows HTML and CSS. Templates are easy to build, preview in a browser, and iterate on. Responsive layouts, web fonts, flexbox, and grid all work. You can even use React or Vue components as templates.
Disadvantages: Headless browsers are resource-heavy. A headless browser instance launched by Puppeteer consumes 100 to 300MB of RAM. Page break control is limited to the CSS `break-before`, `break-after`, and `break-inside` properties, which do not handle every edge case. Complex multi-page layouts with headers, footers, and page numbers require workarounds.
We use HTML to PDF for 70 percent of our document generation projects. For invoices, simple reports, receipts, and marketing materials, it is the fastest path to production quality output.
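As a minimal sketch of the "populate a template with data" step, the function below builds an invoice as an HTML string. The `Invoice` shape and the function names are hypothetical and exist only for illustration; a real system would more likely use a templating library or React/Vue components, and would store money as integer cents.

```typescript
// Hypothetical invoice shape, used only for this sketch.
interface InvoiceLine { description: string; quantity: number; unitPrice: number; }
interface Invoice { number: string; customer: string; lines: InvoiceLine[]; }

// Escape user-supplied text so it cannot inject markup into the template.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the full HTML document that a headless browser would print to PDF.
function renderInvoiceHtml(invoice: Invoice): string {
  const rows = invoice.lines
    .map((line) => {
      const amount = (line.quantity * line.unitPrice).toFixed(2);
      return `<tr><td>${escapeHtml(line.description)}</td><td>${line.quantity}</td><td>${amount}</td></tr>`;
    })
    .join("");
  const total = invoice.lines.reduce((sum, line) => sum + line.quantity * line.unitPrice, 0);
  return `<!doctype html><html><body>
<h1>Invoice ${escapeHtml(invoice.number)}</h1>
<p>${escapeHtml(invoice.customer)}</p>
<table><thead><tr><th>Item</th><th>Qty</th><th>Amount</th></tr></thead><tbody>${rows}</tbody></table>
<p class="total">Total: ${total.toFixed(2)}</p>
</body></html>`;
}
```

With Puppeteer, the resulting string would typically be loaded with `page.setContent(html)` and printed with `page.pdf({ format: "A4" })`. Because the template is plain HTML, the same string can be opened in a normal browser for preview.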
Programmatic PDF Libraries
Libraries like PDFKit (Node.js), ReportLab (Python), or iText (Java) give you direct control over every element on the page. You position text, draw lines, place images, and manage page breaks programmatically.
Advantages: Complete control over layout. No browser dependency means lower resource consumption. Better handling of complex page layouts with running headers, footers, and page numbers. Smaller deployment footprint.
Disadvantages: Building a template means writing code, not designing in HTML. Simple changes like moving a logo require code changes and redeployment. Development is slower because there is no visual preview during iteration.
We use programmatic libraries for documents with strict formatting requirements: financial statements with precise table layouts, government forms that must match exact specifications, and high volume generation where browser overhead is a bottleneck.
Template Based Generation
Services like Carbone, DocuPDF, or Docmosis let you design templates in Word, Excel, or LibreOffice with placeholder tags. At runtime, you merge data with the template to produce a PDF. This puts template control in the hands of non-developers.
Advantages: Business users can modify templates without developer involvement. Template design happens in familiar tools. Good for organizations that generate many similar documents with minor variations.
Disadvantages: Template syntax can be limiting for complex logic (conditional sections, nested loops). Debugging template errors is harder than debugging code. You depend on a third-party service or a self-hosted conversion engine.
We use template based generation when the client's team needs to modify document layouts independently. Insurance companies, legal firms, and HR departments often prefer this approach because their compliance and legal teams can update templates directly.
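To make the placeholder-merge idea concrete, here is a toy sketch of tag substitution. The `{d.field}` tag style is loosely modeled on Carbone-style markers, but this is not any vendor's engine: `mergeTemplate` and `lookup` are hypothetical helpers, and real services also handle loops, conditionals, formatting, and the conversion from an office document to PDF.

```typescript
// Data passed to the template: nested objects of strings and numbers.
type TemplateData = { [key: string]: string | number | TemplateData };

// Resolve a dotted path such as "customer.name" against the data object.
function lookup(data: TemplateData, path: string): string | undefined {
  let current: string | number | TemplateData | undefined = data;
  for (const key of path.split(".")) {
    if (typeof current !== "object") return undefined;
    current = current[key];
  }
  return typeof current === "string" || typeof current === "number"
    ? String(current)
    : undefined;
}

// Replace every {d.some.path} tag with its value.
// Unknown tags are left in place so missing data is easy to spot in a preview.
function mergeTemplate(template: string, data: TemplateData): string {
  return template.replace(
    /\{d\.([A-Za-z0-9_.]+)\}/g,
    (tag: string, path: string) => lookup(data, path) ?? tag,
  );
}
```

Leaving unresolved tags visible is a deliberate choice in this sketch: it addresses the debugging difficulty noted above by making template errors show up in the rendered output instead of failing silently.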
Architecture for Production Document Generation
Async Generation with Queues
Never generate PDFs synchronously in an API request handler. A complex document can take 2 to 10 seconds to render. During that time, the HTTP connection stays open, server resources are tied up by the render, and your user is staring at a spinner.
The pattern we implement on every document generation system:
1. User requests a document through the API.
2. API creates a job in a queue (Bull, SQS, or a similar job queue) and returns a job ID immediately.
3. A worker picks up the job, generates the PDF, and stores it in object storage (S3, Supabase Storage, R2).
4. The worker updates the job status and optionally notifies the user via WebSocket or email.
5. User downloads the completed document from a signed URL.
This pattern handles load spikes gracefully. If 500 users request invoices simultaneously, the queue absorbs the burst and workers process them at a sustainable rate. Without the queue, 500 simultaneous Puppeteer instances at 100 to 300MB each would need 50 to 150GB of RAM and would crash most servers.
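The five steps above can be sketched with an in-memory queue. This is an illustration of the control flow only: `InMemoryJobQueue` is a hypothetical stand-in for a real queue such as Bull or SQS, and the `render` and `store` callbacks are assumed placeholders for the PDF renderer and the object storage upload.

```typescript
type JobStatus = "queued" | "processing" | "done" | "failed";

interface Job {
  id: string;
  payload: string;
  status: JobStatus;
  resultUrl?: string;
}

class InMemoryJobQueue {
  private jobs = new Map<string, Job>();
  private pending: string[] = [];
  private nextId = 1;

  // Steps 1-2: record the job and return its ID immediately.
  enqueue(payload: string): string {
    const id = `job-${this.nextId++}`;
    this.jobs.set(id, { id, payload, status: "queued" });
    this.pending.push(id);
    return id;
  }

  // Steps 3-4: a worker takes the next job, renders it, stores the file,
  // and updates the status. Returns false when the queue is empty.
  async processNext(
    render: (payload: string) => Promise<Uint8Array>,
    store: (jobId: string, pdf: Uint8Array) => Promise<string>,
  ): Promise<boolean> {
    const id = this.pending.shift();
    if (id === undefined) return false;
    const job = this.jobs.get(id);
    if (!job) return false;
    job.status = "processing";
    try {
      const pdf = await render(job.payload);
      job.resultUrl = await store(id, pdf);
      job.status = "done";
    } catch {
      job.status = "failed";
    }
    return true;
  }

  // Step 5: the client polls (or is notified) and reads the download URL.
  getJob(id: string): Job | undefined {
    return this.jobs.get(id);
  }
}
```

The API handler only ever calls `enqueue` and `getJob`, so its response time is independent of how long rendering takes. A production queue would add persistence, retries, and visibility timeouts, which this sketch omits.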
Template Management
For applications where templates change over time, version your templates:
- Store templates in your database or object storage with version numbers.
- When generating a document, always record which template version was used.
- Never delete old template versions. A customer requesting a reprint of last year's invoice needs the template that was active when it was originally generated.
- Provide a template preview that shows a sample document with test data before activating a new version.
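A minimal sketch of these versioning rules, assuming an in-memory store. `TemplateStore`, `GenerationRecord`, and `recordGeneration` are hypothetical names; in practice the versions would live in a database table or object storage as described above.

```typescript
interface TemplateVersion {
  templateId: string;
  version: number;
  html: string;
  active: boolean;
}

interface GenerationRecord {
  documentId: string;
  templateId: string;
  templateVersion: number;
}

class TemplateStore {
  private versions: TemplateVersion[] = [];

  // New versions start inactive so they can be previewed with test data first.
  addVersion(templateId: string, html: string): TemplateVersion {
    const count = this.versions.filter((v) => v.templateId === templateId).length;
    const created: TemplateVersion = { templateId, version: count + 1, html, active: false };
    this.versions.push(created);
    return created;
  }

  // Activating a version deactivates the others but never deletes them.
  activate(templateId: string, version: number): void {
    if (!this.get(templateId, version)) {
      throw new Error(`Unknown template version: ${templateId} v${version}`);
    }
    for (const v of this.versions) {
      if (v.templateId === templateId) v.active = v.version === version;
    }
  }

  get(templateId: string, version: number): TemplateVersion | undefined {
    return this.versions.find((v) => v.templateId === templateId && v.version === version);
  }

  getActive(templateId: string): TemplateVersion | undefined {
    return this.versions.find((v) => v.templateId === templateId && v.active);
  }
}

// Store the template version next to every generated document.
function recordGeneration(documentId: string, template: TemplateVersion): GenerationRecord {
  return { documentId, templateId: template.templateId, templateVersion: template.version };
}
```

A reprint then calls `get(record.templateId, record.templateVersion)` instead of `getActive`, so last year's invoice renders with last year's layout.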
Batch Generation
Monthly invoice runs, quarterly reports, annual statements. Batch generation is a different problem from on-demand generation because you are optimizing for throughput, not latency.
Reuse browser instances instead of launching a new Puppeteer instance per document. A single browser with multiple pages can generate documents in parallel, reducing the per document overhead from seconds to hundreds of milliseconds.
Parallelize with worker pools. On a recent system architecture project, we built a batch invoice system that generates 50,000 invoices monthly. Four worker processes, each running a Puppeteer instance with 5 concurrent pages, generate the full batch in under 40 minutes. The same system generating sequentially would take over 12 hours.
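The arithmetic behind those figures can be checked with a small estimator. The 0.96 seconds per document is not a measured number from the project; it is derived here from the stated figures (40 minutes is 2,400 seconds, spread over 50,000 / 20 = 2,500 documents per page slot), and the function assumes throughput scales perfectly with concurrency, which real systems only approximate.

```typescript
// Estimate wall-clock minutes for a batch, assuming every concurrent page
// renders at the same average rate and scaling is perfectly linear.
function estimateBatchMinutes(
  documents: number,
  secondsPerDocument: number,
  workers: number,
  pagesPerWorker: number,
): number {
  const concurrency = workers * pagesPerWorker;
  return (documents * secondsPerDocument) / concurrency / 60;
}

// 4 workers x 5 pages = 20 concurrent renders.
const parallelMinutes = estimateBatchMinutes(50_000, 0.96, 4, 5);

// The same workload with a single page rendering one document at a time.
const sequentialHours = estimateBatchMinutes(50_000, 0.96, 1, 1) / 60;
```

Under these assumptions the parallel run comes to about 40 minutes and the sequential run to a little over 13 hours, which is consistent with the figures quoted above.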
Stream results to storage instead of holding them in memory. Generate a PDF, upload it to S3, release the memory, move to the next document. This keeps memory consumption flat regardless of batch size.
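One way to combine bounded parallelism with flat memory use is a small concurrency-limited pool. `runPool` is a hypothetical helper written for this sketch; the `task` callback is assumed to render one document on a reused browser page, upload it, and return only the storage key so the PDF bytes can be garbage collected.

```typescript
// Run `task` over all items with at most `concurrency` tasks in flight.
// Results are returned in the same order as the input items.
async function runPool<T, R>(
  items: T[],
  concurrency: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index.
  // Claiming is synchronous, so two workers never take the same item.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, items.length));
  const workers = Array.from({ length: workerCount }, () => worker());
  await Promise.all(workers);
  return results;
}
```

Because each task returns a short key instead of the document itself, the memory held at any moment is bounded by `concurrency` documents, not by the batch size. This sketch stops at the first rejected task; a batch system would more likely catch errors per document and record failures for retry.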
Handling Complex Layout Requirements
Multi Page Tables
Tables that span multiple pages are the most common layout challenge. You need the header row to repeat on each page, rows should not split mid-cell, and the footer needs to show "Page X of Y."
With HTML to PDF, the CSS rule `thead { display: table-header-group; }` repeats headers on each page. For row splitting, `tr { break-inside: avoid; }` prevents breaks mid-row. Page counters use CSS `@page` margin boxes with `counter(page)` and `counter(pages)`; support for these varies by rendering engine and version, so with Puppeteer the `headerTemplate` and `footerTemplate` options of `page.pdf()` are a common alternative.
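The rules above can be collected into a print stylesheet that is injected into every template. The constant and function names here are hypothetical. The `@page` margin box is standard CSS Paged Media syntax, but whether it renders depends on the engine and version, so the sketch also includes a footer string for Puppeteer's `footerTemplate` option, which fills in elements with the `pageNumber` and `totalPages` classes.

```typescript
// Print rules for multi-page tables, injected into the template's <head>.
const PRINT_TABLE_CSS = `
  thead { display: table-header-group; } /* repeat the header row on every page */
  tr { break-inside: avoid; }            /* never split a row across pages */
  @page {
    @bottom-center { content: "Page " counter(page) " of " counter(pages); }
  }
`;

// Alternative footer for Puppeteer: pass as `footerTemplate` to page.pdf()
// together with `displayHeaderFooter: true`.
const FOOTER_TEMPLATE =
  '<div style="font-size:9px; width:100%; text-align:center;">' +
  'Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>';

// Wrap already-rendered body HTML in a document that carries the print rules.
function withPrintStyles(bodyHtml: string): string {
  return `<!doctype html><html><head><style>${PRINT_TABLE_CSS}</style></head><body>${bodyHtml}</body></html>`;
}
```

Only one of the two page-numbering mechanisms should be enabled for a given document; using both would print the page number twice where the engine supports margin boxes.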
With programmatic libraries, you calculate remaining page space before each row and manually trigger page breaks when needed. This is more code but gives you precise control.
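A minimal sketch of that remaining-space calculation, independent of any particular PDF library. `paginateRows` is a hypothetical helper: it takes pre-measured row heights and returns which row indexes belong on each page, reserving room for the repeated header.

```typescript
// Assign table rows to pages. Heights are in the same unit as the page
// (for example PDF points). Every page reserves `headerHeight` for the
// repeated header row. Returns one array of row indexes per page.
function paginateRows(
  rowHeights: number[],
  pageHeight: number,
  headerHeight: number,
): number[][] {
  const pages: number[][] = [];
  let current: number[] = [];
  let remaining = pageHeight - headerHeight;

  rowHeights.forEach((height, index) => {
    // Break before a row that does not fit, unless the page is still empty.
    // A row taller than a whole page is placed alone instead of looping forever.
    if (height > remaining && current.length > 0) {
      pages.push(current);
      current = [];
      remaining = pageHeight - headerHeight;
    }
    current.push(index);
    remaining -= height;
  });

  if (current.length > 0) pages.push(current);
  return pages;
}
```

The length of the returned array is also the total page count, which is what the "Page X of Y" footer needs before the first page is drawn.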
Charts and Visualizations
Outside an HTML pipeline, embedding charts in PDFs requires rendering them as images or SVGs first. We typically:
1. Render charts using a library like Chart.js or D3 in a headless browser.
2. Export as SVG or PNG.
3. Embed the image in the PDF template.
For HTML to PDF, charts render natively in the headless browser because they are just HTML canvas or SVG elements. This is one of the significant advantages of the HTML approach.
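As an illustration of why this is convenient, a chart can be plain SVG markup placed inline in the template. The function below is a hand-rolled stand-in for Chart.js or D3, written only for this sketch; `barChartSvg` and its default dimensions are assumptions.

```typescript
// Build a simple vertical bar chart as an SVG string that can be placed
// directly inside an HTML template.
function barChartSvg(values: number[], width = 300, height = 120): string {
  const max = Math.max(...values, 1); // avoid dividing by zero
  const barWidth = values.length > 0 ? width / values.length : 0;
  const bars = values
    .map((value, i) => {
      const barHeight = (value / max) * height;
      const x = (i * barWidth + 2).toFixed(1);
      const y = (height - barHeight).toFixed(1);
      const w = Math.max(barWidth - 4, 0).toFixed(1);
      return `<rect x="${x}" y="${y}" width="${w}" height="${barHeight.toFixed(1)}" fill="#4a7" />`;
    })
    .join("");
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}" viewBox="0 0 ${width} ${height}">${bars}</svg>`;
}
```

SVG stays vector in the resulting PDF, so it prints sharply at any zoom level. Canvas-based charts are rasterized, and any animation should be disabled or finished before the page is printed.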
Digital Signatures
For legally binding documents, embedding digital signatures involves:
- Visual signature placement (an image of the signature at the specified location).
- Cryptographic signing using a certificate that proves the document has not been modified after signing.
- Timestamp authority integration that proves when the signature was applied.
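Libraries like pdf-lib (JavaScript) handle visual placement. Cryptographic signing generally needs an additional signing library (node-signpdf is one option in the Node.js ecosystem) or a toolkit with built-in signing support such as iText. For enterprise requirements, dedicated signing services like DocuSign or Adobe Sign handle the legal and compliance aspects.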
Choosing the Right Approach
The decision matrix we use:
| Requirement | Best approach |
|---|---|
| Simple invoices and receipts | HTML to PDF with Puppeteer |
| Pixel-perfect regulated forms | Programmatic library (PDFKit, iText) |
| Business users edit templates | Template-based (Carbone, Docmosis) |
| High-volume batch generation | Programmatic library with worker pools |
| Documents with embedded charts | HTML to PDF (charts render natively) |
| Multi-page financial reports | Hybrid: HTML for content, programmatic for page management |
For many business applications, the answer is HTML to PDF with Puppeteer or Playwright, running in a queue-based architecture. It covers 80 percent of use cases with the lowest development cost. We discussed how to evaluate custom development vs SaaS tools in another post, and document generation is one area where custom code typically outperforms off-the-shelf solutions because every business has unique formatting requirements.
Performance Optimization
A few techniques that make a measurable difference:
- Pre-warm browser instances. Keep Puppeteer running instead of launching per request. Cold start is 2 to 3 seconds; warm generation is 200 to 500ms.
- Minimize external resource loading. Inline CSS and convert images to base64 data URIs. Each external HTTP request during rendering adds latency.
- Use PDF/A format for archival documents. It requires all fonts to be embedded, so the document is designed to render the same way in 10 years.
- Compress images before embedding. A 5MB hero image in every invoice adds up fast when generating thousands.
We have seen document generation systems go from 8 seconds per document to under 400ms with these optimizations applied. That is the difference between a batch job that runs in 20 minutes and one that runs overnight.
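The pre-warming technique can be sketched as a lazily created, shared instance. `createWarmBrowser` and `BrowserLike` are hypothetical names; the `launch` callback is an assumed wrapper around something like `puppeteer.launch()`, injected so the sketch does not depend on a specific library.

```typescript
// The only capability this sketch needs from a browser object.
interface BrowserLike {
  close(): Promise<void>;
}

// Keep one browser alive across requests. The launch promise itself is
// cached, so concurrent first callers share a single launch.
function createWarmBrowser<B extends BrowserLike>(
  launch: () => Promise<B>,
): { get: () => Promise<B>; shutdown: () => Promise<void> } {
  let instance: Promise<B> | undefined;
  return {
    get: () => {
      if (!instance) instance = launch();
      return instance;
    },
    shutdown: async () => {
      if (!instance) return;
      const browser = await instance;
      instance = undefined;
      await browser.close();
    },
  };
}
```

Each request then pays the cold-start cost only once per process. A production version would also relaunch after a crash or disconnect and recycle the browser periodically to limit memory growth, which this sketch leaves out.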
If you are building document generation into your business application, whether it is invoices, contracts, reports, or compliance forms, reach out to us. We will design a system that handles your volume, meets your formatting requirements, and scales with your business.