How PDF compression actually works (and why your scans barely shrink) | TheStuHub Blog

What is inside a PDF file?

A PDF is a container. Inside it you will find page descriptions (positioned text runs plus drawing operators), embedded fonts (or font subsets), images (stored as compressed streams, usually JPEG data under the DCTDecode filter or raw pixel data deflated with FlateDecode, the same algorithm PNG uses), metadata (title, author, creation date), and bookkeeping structures (the cross-reference table, object streams). The total file size is the sum of all of these.
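One rough way to see what a particular file contains is to count the stream filter names that appear in its raw bytes. The sketch below is a heuristic, not a PDF parser: `count_filters` is a hypothetical helper, the `sample` bytes are a made-up fragment rather than a valid PDF, and a real file can occasionally contain these byte sequences by coincidence inside compressed data.

```python
import re
from collections import Counter

# Stream filter names defined by the PDF specification.
FILTERS = (b"DCTDecode", b"FlateDecode", b"JPXDecode",
           b"CCITTFaxDecode", b"JBIG2Decode", b"LZWDecode")
FILTER_RE = re.compile(rb"/(" + b"|".join(FILTERS) + rb")\b")


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count how often each filter name occurs in the raw file bytes."""
    return Counter(m.group(1).decode("ascii")
                   for m in FILTER_RE.finditer(pdf_bytes))


# Made-up fragment for illustration only (not a complete PDF).
sample = (
    b"1 0 obj << /Type /XObject /Subtype /Image /Filter /DCTDecode /Length 0 >>\n"
    b"stream\nendstream endobj\n"
    b"2 0 obj << /Filter /FlateDecode /Length 0 >>\nstream\nendstream endobj\n"
    b"3 0 obj << /Filter /FlateDecode /Length 0 >>\nstream\nendstream endobj\n"
)

counts = count_filters(sample)
print(counts)
```

On a real file you would pass the result of `open(path, "rb").read()`. Many DCTDecode hits suggest a JPEG-heavy document where image recompression is the lever that matters.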

Compression works by making each of those categories smaller — but the maximum savings per category depends on what is already inside. A PDF that is mostly embedded JPG photos is already image-compressed; a PDF that is mostly text has tiny page-description streams but big font tables. Each file has its own ceiling.
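To make the idea of a per-file ceiling concrete, here is a small worked calculation. Every number in it is invented for illustration: the byte counts per category and the assumed achievable reduction for each are hypothetical, not measurements from any real file or tool.

```python
# Hypothetical breakdown of a 10 MB PDF:
# category -> (bytes in that category, assumed achievable reduction in percent)
breakdown = {
    "images": (7_000_000, 50),
    "fonts": (1_500_000, 30),
    "page_descriptions": (1_000_000, 10),
    "metadata_and_bookkeeping": (500_000, 20),
}


def savings_ceiling(categories: dict) -> float:
    """Return the best-case overall saving as a fraction of total file size."""
    total = sum(size for size, _ in categories.values())
    saved = sum(size * percent // 100 for size, percent in categories.values())
    return saved / total


ceiling = savings_ceiling(breakdown)
print(f"Best-case overall saving: {ceiling:.1%}")
```

With these made-up inputs the image category contributes 3.5 MB of the 4.15 MB saved, which is why the mix of content matters more than the compressor's settings.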

The three compression levers

Three independent techniques account for 95% of the byte reduction any PDF compressor can achieve:

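Object-stream compression. Since PDF 1.5, the spec allows small non-stream objects (dictionaries, arrays, and similar structures) to be packed into an object stream and compressed together instead of being stored as plain uncompressed text. This alone can shrink a text-heavy PDF by 10–25%.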
Font subsetting. If a document only uses 47 glyphs from a 200-glyph font, you only need those 47. Aggressive subsetting strips the rest. Savings range from negligible (for heavily-used system fonts) to very large (for ornate display fonts used once).
Image recompression. The single biggest lever for image-heavy PDFs. Every embedded JPG can be re-encoded at a lower quality; every PNG-stored photo can be swapped for a JPG if the document does not need transparency. 50–80% of file size often lives in the embedded images.
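The first lever is easy to demonstrate with nothing but `zlib`, the same deflate algorithm behind FlateDecode. The sketch below is a toy: the dictionaries are invented PDF-like text, and a real object stream also carries an index of object numbers and offsets that is left out here. It only shows why compressing many small, similar objects as one block beats handling them one at a time.

```python
import zlib

# Invented PDF-like dictionaries that differ only in a few numbers.
objects = [
    (f"<< /Type /Annot /Subtype /Link /Rect [72 {n * 3} 300 {n * 3 + 12}] "
     f"/Border [0 0 0] /Dest [{n} 0 R /Fit] >>\n").encode("ascii")
    for n in range(1, 201)
]

raw_size = sum(len(obj) for obj in objects)
individual_size = sum(len(zlib.compress(obj)) for obj in objects)
grouped_size = len(zlib.compress(b"".join(objects)))

print(f"stored as plain text:     {raw_size} bytes")
print(f"compressed one at a time: {individual_size} bytes")
print(f"compressed as one group:  {grouped_size} bytes")
```

The grouped result is much smaller because the compressor can refer back to repeated keys such as `/Type /Annot` across objects, which it cannot do when each object is compressed in isolation.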

Why some PDFs barely shrink

Three situations produce disappointing results, and if you know why, you can often work around them:

First, PDFs exported by a modern pipeline (Google Docs, recent versions of Microsoft Word, InDesign) already have object-stream compression and aggressive font subsetting applied. There is no fat left. A second pass produces under 5% savings.
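The same effect shows up in miniature with any general-purpose compressor: once data has been compressed, a second pass finds almost nothing left to remove. The sketch below uses `zlib` on randomly generated word text as a stand-in for a page-description stream. The vocabulary and the seed are arbitrary choices for illustration.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so repeated runs behave the same way
vocabulary = ["invoice", "total", "page", "contract", "signature", "amount",
              "date", "client", "report", "summary", "section", "appendix",
              "figure", "table", "draft", "final"]
text = " ".join(rng.choice(vocabulary) for _ in range(40_000)).encode("ascii")

first_pass = zlib.compress(text, 9)
second_pass = zlib.compress(first_pass, 9)

print(f"original:    {len(text)} bytes")
print(f"first pass:  {len(first_pass)} bytes")
print(f"second pass: {len(second_pass)} bytes")
```

The first pass removes most of the redundancy. The second pass has only near-random bytes to work with, so its output is about the same size as its input.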

Second, scanned documents are essentially one embedded JPG per page, and the scanner app has usually already chosen a reasonable quality level. The PDF compressor can pull its image-recompression lever, but the gains are modest: the scan is already lossy-compressed, so re-encoding at a slightly lower quality saves little, and going much lower visibly degrades the text.

Third, PDFs with a lot of vector graphics (maps, blueprints, icon-heavy reports) have large page-description streams that barely compress because the path operators are already terse. You will see savings only on the font + metadata axes.

Strategy for stubborn files

When structural compression fails to hit your target size, the right move is to break the PDF open and compress the images manually:

Extract every page as a high-DPI JPG using PDF to JPG.
Run each image through Image Compressor at 75–85% quality.
Reassemble them into a new PDF with Image to PDF.
The resulting file is often 20–40% smaller than any in-place compressor could achieve.
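For the extraction step, it helps to know what "high-DPI" costs before any JPG compression is applied. The arithmetic below assumes a US Letter page (8.5 by 11 inches) and 3 bytes per pixel for uncompressed RGB. The helper names are hypothetical and the DPI values are just common choices.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_rgb_bytes(width_px: int, height_px: int) -> int:
    """Uncompressed size of an RGB image at 3 bytes per pixel."""
    return width_px * height_px * 3


for dpi in (150, 200, 300):
    w, h = page_pixels(8.5, 11.0, dpi)
    print(f"{dpi} DPI: {w} x {h} px, {raw_rgb_bytes(w, h):,} raw bytes")
```

Doubling the DPI quadruples the pixel count, so the render resolution you pick in the first step sets the starting point that the JPG quality setting in the second step then has to work from.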

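The trade-off is that every page becomes a flat image, so you lose the original text layer: text can no longer be selected or searched unless you run OCR on the result. For anything you need to search or copy text from, keep the original.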

Why browser-based PDF compression is a good fit

A privacy-first pipeline — the kind that runs entirely in your browser tab — has one hidden advantage over cloud compressors: it can iterate. If the first compression pass does not hit your target, you can immediately try a different setting without re-uploading. The file is already in memory; re-running the algorithm is free.
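That kind of iteration can be automated as a binary search over a quality setting. The sketch below is one possible shape for it, not a description of any particular tool. `highest_quality_under` is a hypothetical helper, and `toy_compress` is a deliberately fake stand-in whose output size grows linearly with quality. A real compressor would replace it, and the search assumes that higher quality never produces a smaller file.

```python
from typing import Callable, Optional, Tuple


def highest_quality_under(
    data: bytes,
    target_bytes: int,
    compress: Callable[[bytes, int], bytes],
    lo: int = 1,
    hi: int = 95,
) -> Optional[Tuple[int, bytes]]:
    """Binary-search the highest quality whose output fits in target_bytes.

    Returns (quality, output), or None if even the lowest quality is too big.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        out = compress(data, mid)
        if len(out) <= target_bytes:
            best = (mid, out)   # fits: remember it and try a higher quality
            lo = mid + 1
        else:
            hi = mid - 1        # too big: try a lower quality
    return best


def toy_compress(data: bytes, quality: int) -> bytes:
    # Fake stand-in for a real compressor: keeps quality percent of the bytes.
    return data[: len(data) * quality // 100]


original = bytes(1_000_000)  # the "file" already sitting in memory
result = highest_quality_under(original, 400_000, toy_compress)
print(result[0], len(result[1]))
```

Because the input stays in memory, each probe costs only compute time. Covering the 1 to 95 range takes at most seven compression passes.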

Among web tools, browser-based pipelines are also the only kind that can honestly claim "your document never leaves this device." For confidential contracts, tax forms, or medical records, that guarantee is worth more than a few percent of extra byte-reduction.

Compression is, at its heart, negotiation with the container format. The more you know about what is inside the container, the better deal you can strike.

TheStuHub engineering note, 2026