Parsing and Generating PDFs in PHP: A Practical Library Comparison

Picking a PHP PDF library is the decision that quietly determines whether your invoices render correctly and whether your server falls over under load. Generating a PDF from HTML has four serious options — dompdf, mpdf, wkhtmltopdf via Snappy, and headless Chrome via spatie/browsershot — and they differ enormously in CSS fidelity, performance, and memory footprint. Parsing an existing PDF to pull text back out is a completely separate problem with a different tool, smalot/pdfparser. Pick the wrong generator and you spend a week fighting a flexbox layout that will never work, or you watch php-fpm OOM on a 200-row report. Here is how I choose, with the gotchas that actually bite in production.

Which library should I use to generate a PDF from HTML?

There is no single best option — there is the right tradeoff for your fidelity, performance, and operational constraints. The split that matters most is whether the library renders HTML/CSS itself in pure PHP, or shells out to an external rendering engine. Pure-PHP engines (dompdf, mpdf) are easy to deploy because they have no system dependencies, but they implement their own limited CSS support. External engines (wkhtmltopdf, headless Chrome) render with a real browser-grade layout engine, so modern CSS works, but now you are shipping and securing a binary.

dompdf — pure PHP, zero binaries, trivial to install. CSS support is roughly CSS 2.1 with a little CSS3. No flexbox, no grid, shaky float behavior. Fine for simple documents.
mpdf — pure PHP, much stronger CSS than dompdf (good table handling, RTL, embedded fonts, page headers/footers). Heavier and slower, and memory climbs fast on long documents.
wkhtmltopdf via knplabs/knp-snappy — uses a WebKit binary. Decent fidelity for older CSS, fast enough. The catch: wkhtmltopdf itself is archived and the binary is effectively unmaintained, so I no longer reach for it on new work.
spatie/browsershot — drives headless Chromium through Puppeteer (Node). Best modern-CSS fidelity by a wide margin: flexbox, grid, web fonts, even JS-rendered charts. Cost is the heaviest dependency chain — Node plus a Chromium install.

My default rule: if the layout is anything a designer touched — an invoice with a real grid, a branded report, a certificate — I reach for Browsershot. If it is a plain server-side document and I cannot install binaries (locked-down shared hosting, a minimal container I do not control), dompdf is the pragmatic choice. mpdf sits in the middle when I need solid table-heavy output and good font control without a Node toolchain.
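The rule above can be written down as a small decision function. This is only a sketch of my own heuristic: the function name `choosePdfEngine` and its three boolean inputs are hypothetical and not part of any library, and the return values are simply the Composer package names.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper that encodes the selection rule described above.
 * It only returns a Composer package name; no PDF library is called here.
 */
function choosePdfEngine(bool $designerLayout, bool $canInstallBinaries, bool $tableHeavy): string
{
    // Designed layouts (flexbox, grid, web fonts) need a real browser engine,
    // which is only an option when Node and Chromium can be installed.
    if ($designerLayout && $canInstallBinaries) {
        return 'spatie/browsershot';
    }

    // Table-heavy output with font control, without a Node toolchain.
    if ($tableHeavy) {
        return 'mpdf/mpdf';
    }

    // Plain documents, or hosts where nothing beyond PHP is available.
    return 'dompdf/dompdf';
}
```

Calling `choosePdfEngine(true, true, false)` returns `spatie/browsershot`. Real projects usually have more inputs than three booleans, so treat this as a way to make the tradeoff explicit rather than a selector to ship.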

What does a minimal dompdf example look like?

dompdf is the fastest thing to get running because it has no system dependencies. Install it, hand it an HTML string, get bytes back. This is the entire flow for a simple document. One deliberate choice up front: leave isRemoteEnabled off unless you trust the HTML, because enabling remote assets lets a crafted payload pull in arbitrary URLs, which is an SSRF vector when the HTML is user-influenced.

bash

composer require dompdf/dompdf

app/Services/SimpleInvoicePdf.php

<?php

namespace App\Services;

use Dompdf\Dompdf;
use Dompdf\Options;

class SimpleInvoicePdf
{
    public function render(string $html): string
    {
        $options = new Options();
        $options->set('defaultFont', 'DejaVu Sans');
        // Only enable if you trust the HTML — remote assets are an SSRF risk.
        $options->set('isRemoteEnabled', false);

        $dompdf = new Dompdf($options);
        $dompdf->loadHtml($html);
        $dompdf->setPaper('A4', 'portrait');
        $dompdf->render();

        return $dompdf->output(); // raw PDF bytes
    }
}

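Two gotchas that cost people an afternoon. First: dompdf's built-in core fonts only cover a limited Western character set, so symbols like £ and € and non-Latin scripts can come out as question marks or blank boxes. Setting defaultFont to DejaVu Sans (which ships with dompdf) fixes the common cases. DejaVu Sans does not cover every script, though: for Bengali you need to embed a font that actually contains those glyphs, and dompdf's handling of scripts that need shaping or right-to-left layout, such as Arabic, is limited, which is one more reason to consider mpdf or Browsershot. Second: do not try to lay out an invoice with flexbox. It silently does nothing. Use tables for structure, the way you would have in 2008. If that constraint hurts, that is your signal to move to Browsershot.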
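To make the tables-for-layout advice concrete, here is a rough sketch of an invoice body that sticks to what dompdf handles well. The function name `buildInvoiceHtml` and the `description` and `amount` row keys are assumptions made for illustration; the returned string is the kind of input you would pass to `SimpleInvoicePdf::render()`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: builds a table-based invoice body for dompdf.
 *
 * @param array<int, array{description: string, amount: float}> $rows
 */
function buildInvoiceHtml(array $rows): string
{
    $body = '';
    $total = 0.0;

    foreach ($rows as $row) {
        $total += $row['amount'];
        $body .= sprintf(
            '<tr><td>%s</td><td style="text-align:right">%s</td></tr>',
            htmlspecialchars($row['description'], ENT_QUOTES, 'UTF-8'), // escape user-supplied text
            number_format($row['amount'], 2)
        );
    }

    // A single table carries the structure; no flexbox or grid is involved.
    return '<html><head><meta charset="UTF-8">'
        . '<style>body { font-family: "DejaVu Sans", sans-serif; } '
        . 'table { width: 100%; border-collapse: collapse; }</style>'
        . '</head><body><table>'
        . '<tr><th style="text-align:left">Item</th><th style="text-align:right">Amount</th></tr>'
        . $body
        . sprintf(
            '<tr><td><strong>Total</strong></td><td style="text-align:right"><strong>%s</strong></td></tr>',
            number_format($total, 2)
        )
        . '</table></body></html>';
}
```

Escaping every dynamic value matters more than usual here, because the HTML is about to be interpreted by a renderer that can load fonts and images on your server's behalf.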

How do I generate a high-fidelity PDF with Browsershot?

When the output needs to match a real design — modern CSS, web fonts, a logo that lines up to the pixel — I use spatie/browsershot. It renders your HTML in headless Chromium, so what you see in the browser is what lands in the PDF. The price of admission is the dependency chain: you need Node, Puppeteer, and a working Chromium on the box.

bash

composer require spatie/browsershot
npm install puppeteer

app/Services/InvoicePdf.php

<?php

namespace App\Services;

use Spatie\Browsershot\Browsershot;

class InvoicePdf
{
    public function render(string $html, string $outputPath): void
    {
        Browsershot::html($html)
            ->format('A4')
            ->showBackground()          // honour background colours/images
            ->margins(10, 10, 10, 10)   // top, right, bottom, left (mm)
            ->waitUntilNetworkIdle()    // let web fonts / images finish
            ->noSandbox()               // typically required on a server
            ->timeout(120)
            ->save($outputPath);        // path must end in .pdf; other extensions produce a screenshot
    }
}

Figure: Browsershot renders your HTML in real Chromium; the layout you debug in the browser is the layout you ship in the PDF.

The fidelity is excellent, but be honest about the operational cost. Each render spawns a Chromium process that can hold 150–300 MB of RAM, and the launch alone takes a few hundred milliseconds before any rendering happens. On a server, Chromium usually needs either the --no-sandbox flag or a properly configured sandbox. For the flag, use the dedicated noSandbox() method (or addChromiumArguments(['no-sandbox']), which adds the leading dashes itself) rather than guessing at setOption. You also need the system font packages installed, or your text falls back to ugly defaults. This is the library I trust for correctness and the one I least want to run inside a synchronous web request.

How do I parse text out of an existing PDF?

Generation and extraction are opposite problems, and you cannot solve extraction with a generator. To read text back out of a PDF — searching uploaded documents, pulling totals off a supplier invoice, indexing contracts — I use smalot/pdfparser. It is pure PHP, handles most text-based PDFs, and gives you both whole-document text and per-page access.

bash

composer require smalot/pdfparser

app/Services/PdfTextExtractor.php

<?php

namespace App\Services;

use Smalot\PdfParser\Parser;

class PdfTextExtractor
{
    public function extract(string $path): string
    {
        $parser   = new Parser();
        $document = $parser->parseFile($path);

        return $document->getText();
    }

    /** @return array<int,string> text keyed by page number (1-based) */
    public function perPage(string $path): array
    {
        $document = (new Parser())->parseFile($path);
        $out = [];

        foreach ($document->getPages() as $i => $page) {
            $out[$i + 1] = $page->getText();
        }

        return $out;
    }
}

The hard limit to internalise: smalot/pdfparser only reads PDFs that contain a real text layer. Feed it a scanned document — which is just an image wrapped in a PDF — and getText() returns nothing, because there is no text to extract. That is not a bug; OCR is a different job entirely (Tesseract territory). Also expect spacing and column order to come back imperfect on complex multi-column layouts; a PDF stores positioned glyphs, not paragraphs, so reconstructing reading order is genuinely hard. Plan to normalise whitespace and validate with a regex rather than assuming clean output.
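As a rough illustration of that normalise-then-validate step, the sketch below cleans up extracted text and pulls a total out of it. Both function names are hypothetical, and the regex assumes a label of the form "Total", an optional colon, an optional £, € or $ sign, and a number with optional thousands separators; your supplier's invoices will almost certainly need their own pattern.

```php
<?php

declare(strict_types=1);

/** Collapse the irregular spacing that PDF text extraction tends to produce. */
function normaliseExtractedText(string $text): string
{
    // Non-breaking spaces are common in extracted text; treat them as plain spaces.
    $text = str_replace("\u{00A0}", ' ', $text);
    // Collapse runs of spaces and tabs, but keep line breaks.
    $text = preg_replace('/[ \t]+/u', ' ', $text) ?? $text;
    // Strip spaces around line breaks and unify the line-ending style.
    $text = preg_replace('/ *\R */u', "\n", $text) ?? $text;
    // Drop blank lines.
    $text = preg_replace('/\n{2,}/', "\n", $text) ?? $text;

    return trim($text);
}

/** Return the total as a plain numeric string, or null if nothing plausible is found. */
function extractTotal(string $text): ?string
{
    $pattern = '/\bTotal\b\s*:?\s*[£€$]?\s*(\d{1,3}(?:,\d{3})+(?:\.\d{2})?|\d+(?:\.\d{2})?)/iu';

    if (preg_match($pattern, $text, $matches) !== 1) {
        return null; // fail loudly upstream instead of guessing
    }

    return str_replace(',', '', $matches[1]); // remove thousands separators
}
```

In practice I would call `extractTotal(normaliseExtractedText($document->getText()))` and treat a `null` result as "needs a human", not as zero. The word boundary before "Total" is deliberate, so that a "Subtotal" line is not picked up by mistake.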

"A PDF generator and a PDF parser solve opposite problems: never expect one library to do both, and never expect a parser to read a scanned image." (Md Raihan Hasan)

How do I keep PDF generation from killing my server?

This is where teams get hurt. Every one of these libraries is memory-hungry, and the cost scales with document size. A 3-page invoice is fine inline; a 300-row report rendered through mpdf or Browsershot inside a web request will spike memory, blow past your php-fpm timeout, and return a 504 to the user while the process keeps churning. The fix is never to bump memory_limit — it is to get the work off the request entirely.

Generate PDFs in a queued job, not in the controller. The request returns immediately and the heavy render happens on a worker — see my walkthrough on running Laravel queue workers under Supervisor in production so those jobs survive deploys and reboots.
For Browsershot, each job should launch and tear down its own Chromium so a leaked process does not pin RAM. Watch for orphaned chrome processes; they are the classic silent memory leak on a worker box.
For mpdf, memory grows roughly linearly with row count. For genuinely large reports, paginate the source data and write the PDF in chunks rather than building one giant HTML string.
Cap concurrency. Four Browsershot jobs running at once is four Chromiums — size your worker count against available RAM, not CPU.
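The chunking advice for mpdf can be sketched roughly as follows. The helper names `reportChunks` and `writeReport` are hypothetical, and the two-column row shape is an assumption. Each chunk is emitted as its own complete table so that every `WriteHTML()` call receives well-formed HTML. Note that this bounds the size of each HTML parse, not the size of the finished PDF, which mPDF still assembles before output.

```php
<?php

declare(strict_types=1);

/**
 * Yield one self-contained HTML table per chunk of rows, so the full report
 * never has to exist as a single HTML string.
 *
 * @param iterable<array{0: string, 1: string}> $rows e.g. a lazy database cursor
 * @return Generator<int, string>
 */
function reportChunks(iterable $rows, int $rowsPerChunk = 100): Generator
{
    $buffer = '';
    $count = 0;

    foreach ($rows as [$label, $value]) {
        $buffer .= sprintf(
            '<tr><td>%s</td><td>%s</td></tr>',
            htmlspecialchars($label, ENT_QUOTES, 'UTF-8'),
            htmlspecialchars($value, ENT_QUOTES, 'UTF-8')
        );

        if (++$count % $rowsPerChunk === 0) {
            yield '<table>' . $buffer . '</table>';
            $buffer = '';
        }
    }

    if ($buffer !== '') {
        yield '<table>' . $buffer . '</table>'; // flush the final partial chunk
    }
}

/** Feed the chunks to mPDF one WriteHTML() call at a time, then write the file. */
function writeReport(\Mpdf\Mpdf $mpdf, iterable $rows, string $outputPath): void
{
    foreach (reportChunks($rows) as $html) {
        $mpdf->WriteHTML($html);
    }

    $mpdf->Output($outputPath, \Mpdf\Output\Destination::FILE);
}
```

This belongs inside the queued job, not the controller. Passing a lazy cursor instead of a fully loaded array keeps the source data out of memory as well.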

If your PDFs are long-lived background work, the same memory discipline that applies to any long-running PHP process applies here — bounding worker lifetime and watching resident memory matters as much as the library choice. I go deeper on that in my notes on PHP memory management in long-running scripts.
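Bounding worker lifetime can be as simple as a check after every job. The sketch below is an assumption-laden illustration: the function name `shouldRecycleWorker` and both default thresholds are made up, and the right numbers depend on your documents and your box.

```php
<?php

declare(strict_types=1);

/**
 * Decide whether a long-running worker should exit so that its process
 * manager can start a fresh one. Default thresholds are illustrative only.
 */
function shouldRecycleWorker(
    int $jobsProcessed,
    int $memoryBytes,
    int $maxJobs = 50,
    int $maxMemoryBytes = 256 * 1024 * 1024
): bool {
    return $jobsProcessed >= $maxJobs || $memoryBytes >= $maxMemoryBytes;
}

// Usage inside a hypothetical worker loop, after each PDF job:
//
//     if (shouldRecycleWorker($jobsProcessed, memory_get_usage(true))) {
//         exit(0); // the process manager restarts the worker with a clean heap
//     }
```

If you run Laravel's queue worker you do not need to hand-roll this, since `queue:work` exposes the same idea through its `--max-jobs` and `--memory` options.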

What about filling existing PDF forms?

If your task is not 'render HTML to PDF' but 'take a fixed government or bank PDF and fill in its fields', none of these generators is the right tool. AcroForm field-filling is its own technique with its own library, and I cover the full approach — flattening, field mapping, and the FDF gotchas — separately in filling PDF forms in PHP with AcroForm. Reach for that when you are populating a pre-built template, not designing one.

Choose by constraint, not by hype. No binaries allowed and the document is simple: dompdf, and live with tables-for-layout. Table-heavy output with good fonts and no Node: mpdf. A design that has to look right: Browsershot, and treat the Chromium dependency as a first-class operational concern — install fonts, call noSandbox() correctly, and queue the work. Reading text out of existing PDFs: smalot/pdfparser, knowing it cannot touch scans. Get the generation off the web request and into a queued job from day one, and the library you picked stops being the thing that wakes you up at 3am.


Md Raihan Hasan

May 24, 2026
