PDF OCR
Scanned PDF → searchable text
What is PDF OCR?
Run optical character recognition (OCR) on a PDF or image entirely in your browser, with no upload and no third-party API. The tool uses Tesseract.js: it downloads the selected language model once (~10 MB, cached for future runs) and then works fully offline. PDFs are rasterised at 2× for accuracy, progress is shown per page, and the output is plain text ready to copy or download.
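To make the 2× rasterisation concrete, here is a minimal sketch of the pixel size a page ends up at before OCR. It assumes "2×" means a scale factor of 2 over the page size in PDF points (1 pt = 1/72 inch); the `rasterSize` helper and its types are hypothetical names for illustration, not this tool's actual code.

```typescript
// Page size in PDF points (1 pt = 1/72 inch).
interface PageSize {
  widthPt: number;
  heightPt: number;
}

interface RasterSize {
  widthPx: number;
  heightPx: number;
  dpi: number;
}

// Assumption: a scale of 2 maps one point to two pixels,
// which is 72 * 2 = 144 DPI.
function rasterSize(page: PageSize, scale: number = 2): RasterSize {
  return {
    widthPx: Math.ceil(page.widthPt * scale),
    heightPx: Math.ceil(page.heightPt * scale),
    dpi: 72 * scale,
  };
}

// US Letter is 612 x 792 pt.
const letter = rasterSize({ widthPt: 612, heightPt: 792 });
console.log(letter); // { widthPx: 1224, heightPx: 1584, dpi: 144 }
```

Under this assumption, doubling the scale quadruples the pixel count per page, which is the trade-off between OCR accuracy and memory use.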
How do I use PDF OCR?
- Drop an image or PDF.
- Pick the language (English, Spanish, French, German, Italian, Portuguese).
- Wait while the model downloads on first use, then watch per-page progress.
- Copy or download the extracted text.
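The per-page progress in the steps above can be sketched as follows. Tesseract.js reports progress messages with a `status` string and a `progress` value between 0 and 1 through its `logger` option; the helpers `overallProgress` and `progressLabel` are hypothetical, and the assumption that only the `"recognizing text"` status advances a page is mine, not something this page states.

```typescript
// Shape of the fields used from a Tesseract.js logger message.
interface OcrMessage {
  status: string;
  progress: number; // 0..1 within the current step
}

// Hypothetical helper: fraction of the whole document that is done,
// given the zero-based index of the page currently being processed.
function overallProgress(pageIndex: number, pageCount: number, msg: OcrMessage): number {
  if (pageCount <= 0) return 0;
  // Assumption: model download and initialisation steps count as 0 for the page.
  const within =
    msg.status === "recognizing text"
      ? Math.min(Math.max(msg.progress, 0), 1)
      : 0;
  return (pageIndex + within) / pageCount;
}

// Hypothetical helper: text for a progress indicator.
function progressLabel(pageIndex: number, pageCount: number, msg: OcrMessage): string {
  const pct = Math.round(overallProgress(pageIndex, pageCount, msg) * 100);
  return `Page ${pageIndex + 1} of ${pageCount} (${pct}%)`;
}
```

For example, halfway through the first of four pages, `overallProgress(0, 4, { status: "recognizing text", progress: 0.5 })` evaluates to `0.125`.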
When should I use PDF OCR?
OCR is for getting text out of pictures of text. For rendering PDF pages as images (without extracting text), use PDF to JPG. Most digitally created PDFs already contain a real text layer, and reading that layer directly (for example with pdf.js) is faster and more exact than OCR; note that `pdf-lib` itself does not provide text extraction. OCR is the fallback for scans.
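One way to decide between the two paths is to check how much text the existing text layer yields for each page. The sketch below is a heuristic under stated assumptions: the threshold of 20 visible characters and the names `needsOcr` and `pagesNeedingOcr` are hypothetical, and the per-page strings are assumed to come from whatever text-layer extraction you already have.

```typescript
// Hypothetical threshold: below this many non-whitespace characters,
// the page is treated as a scan with no usable text layer.
const MIN_TEXT_CHARS = 20;

function needsOcr(textLayer: string, minChars: number = MIN_TEXT_CHARS): boolean {
  const visible = textLayer.replace(/\s+/g, "");
  return visible.length < minChars;
}

// Returns the 1-based numbers of pages that should fall back to OCR.
function pagesNeedingOcr(pageTexts: string[]): number[] {
  const result: number[] = [];
  for (let i = 0; i < pageTexts.length; i++) {
    if (needsOcr(pageTexts[i])) {
      result.push(i + 1);
    }
  }
  return result;
}
```

A page-level check like this also handles mixed documents, where only some pages are scans.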
What is OCR?
Optical Character Recognition — extracting machine-readable text from images or scanned documents. Turns a picture of text into text you can search, copy, or edit.
Which languages are supported?
English, Spanish, French, German, Italian, and Portuguese out of the box. Other languages can be added via the Tesseract language pack picker (downloaded on demand).
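Tesseract identifies language models by three-letter codes, and joins several codes with `+` when more than one language should be recognised together. The mapping below covers the six built-in languages; the helper name `toTesseractLangs` is hypothetical, and whether this tool lets you combine languages in one run is an assumption.

```typescript
// Tesseract language codes for the six built-in languages.
const LANGUAGE_CODES: Record<string, string> = {
  English: "eng",
  Spanish: "spa",
  French: "fra",
  German: "deu",
  Italian: "ita",
  Portuguese: "por",
};

// Hypothetical helper: turn display names into a Tesseract language string,
// dropping duplicates and rejecting names that are not in the table.
function toTesseractLangs(names: string[]): string {
  const codes: string[] = [];
  for (const name of names) {
    if (!Object.prototype.hasOwnProperty.call(LANGUAGE_CODES, name)) {
      throw new Error(`Unsupported language: ${name}`);
    }
    const code = LANGUAGE_CODES[name];
    if (codes.indexOf(code) === -1) {
      codes.push(code);
    }
  }
  return codes.join("+");
}

console.log(toTesseractLangs(["English", "German"])); // eng+deu
```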
How accurate is it?
Very accurate on clean, high-contrast scans (95%+). Handwriting, low-resolution scans, and heavily rotated pages reduce accuracy significantly; deskew rotated scans before running OCR for best results.
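This page does not say how the accuracy figure is measured. A common convention is character-level accuracy: one minus the edit distance between the OCR output and a known-correct transcript, divided by the transcript length. The sketch below assumes that convention; `editDistance` and `characterAccuracy` are illustrative names, and you need a trusted transcript of your own to use them.

```typescript
// Levenshtein edit distance: minimum number of single-character
// insertions, deletions, and substitutions to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy in the range 0..1, relative to the reference transcript.
function characterAccuracy(reference: string, ocr: string): number {
  if (reference.length === 0) {
    return ocr.length === 0 ? 1 : 0;
  }
  return Math.max(0, 1 - editDistance(reference, ocr) / reference.length);
}
```

For instance, reading "hello world" as "hello w0rld" is one substitution in eleven characters, which gives an accuracy of about 0.91.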
Does it run on-device?
Yes. Tesseract.js runs the full OCR pipeline in your browser. The language model downloads once (~10 MB per language) and is cached locally.
Is my file uploaded anywhere?
No. Everything runs in your browser. Your files never leave your device, and there is no server component for this tool.