A scanned PDF looks like a document but behaves like a photograph. You can’t select the text, you can’t search it, you can’t paste a paragraph into an email. The pixels say “INVOICE 2026-0481” but the computer sees an image and shrugs.
OCR — Optical Character Recognition — fixes that by reading the pixels and writing the recognized text back into the file as a hidden text layer. The visible page doesn’t change. What changes is that the document becomes searchable, selectable, and machine-readable.
This guide covers how OCR actually works under the hood, why running it in your browser (via Tesseract.js + WebAssembly) is a meaningful privacy improvement over cloud OCR services, what accuracy you can realistically expect, when in-browser OCR is enough and when you need cloud OCR despite the trade-offs, and the 30-megabyte bundle-size catch we don’t hide from you.
How OCR actually works
OCR isn’t one algorithm — it’s a pipeline of distinct stages. Understanding the pipeline helps you understand which stage is failing when the output is wrong.
Stage 1: Preprocessing. The raw image rarely arrives in a state OCR can work with directly. The preprocessing stage runs several normalization steps:
- Binarization: convert from grayscale or color to pure black-and-white. The simplest approach is global thresholding; modern engines use adaptive thresholding (different threshold per region) to handle uneven lighting.
- Deskewing: detect the dominant text-line angle and rotate the image until lines are horizontal. A 2-degree skew is invisible to a human and catastrophic for line-detection algorithms.
- Despeckling: remove isolated dark pixels (paper grain, scanner noise) that the recognizer would otherwise try to interpret as punctuation.
- Resolution normalization: most recognition models expect input at a specific DPI (~300 for Tesseract). Inputs that are too low are upsampled; inputs that are too high may be downsampled for speed.
Bad preprocessing is the most common cause of bad OCR results. A document that reaches only 88% accuracy with poor preprocessing often hits 96% with good preprocessing, using the same recognition engine.
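To make the binarization step concrete, here is a minimal sketch of global thresholding using Otsu's method, a classic way to pick one threshold for the whole image. It is an illustration only, not a description of what Tesseract does internally; the function names and the flat grayscale array input are assumptions made for this example.

```typescript
// Otsu's method: pick the global threshold that maximizes the variance
// between the "dark" and "light" pixel classes of a grayscale image.
function otsuThreshold(gray: Uint8Array): number {
  // Histogram of the 256 possible gray levels.
  const hist = new Array<number>(256).fill(0);
  for (let i = 0; i < gray.length; i++) hist[gray[i]]++;

  const total = gray.length;
  let sumAll = 0;
  for (let t = 0; t < 256; t++) sumAll += t * hist[t];

  let sumDark = 0;   // weighted sum of levels <= t
  let countDark = 0; // number of pixels with level <= t
  let bestVariance = -1;
  let bestThreshold = 0;

  for (let t = 0; t < 256; t++) {
    countDark += hist[t];
    if (countDark === 0) continue; // no dark pixels yet
    const countLight = total - countDark;
    if (countLight === 0) break;   // everything is already "dark"

    sumDark += t * hist[t];
    const meanDark = sumDark / countDark;
    const meanLight = (sumAll - sumDark) / countLight;
    const variance = countDark * countLight * (meanDark - meanLight) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      bestThreshold = t;
    }
  }
  return bestThreshold;
}

// Pixels at or below the threshold become ink (true), the rest paper (false).
function binarize(gray: Uint8Array): boolean[] {
  const threshold = otsuThreshold(gray);
  return Array.from(gray, (v) => v <= threshold);
}
```

A single global threshold like this is exactly what struggles with uneven lighting. An adaptive variant would compute a separate threshold per image region instead of one for the whole page.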
Stage 2: Layout analysis. Before recognizing characters, the engine has to find them. Layout analysis segments the page into regions (text blocks, images, tables, headers, footers), each region into lines, each line into words, each word into characters. Mistakes here cascade — a missed word boundary produces a run-on word the recognizer can’t handle.
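One classic, simple way to split a binarized page into text lines is a horizontal projection profile: count the ink pixels in each row and treat every run of non-empty rows as a line. The sketch below assumes a page represented as a grid of booleans and uses hypothetical names. Real engines use considerably more sophisticated layout analysis.

```typescript
// A page is a grid of booleans: true = ink, false = paper.
type BinaryImage = boolean[][];

// Inclusive row indices of one detected text line.
interface RowRange {
  top: number;
  bottom: number;
}

// Horizontal projection profile: count ink pixels per row, then treat each
// run of rows that contain ink as one text line.
function findTextLines(page: BinaryImage, minInkPerRow = 1): RowRange[] {
  const lines: RowRange[] = [];
  let start = -1; // -1 means "currently in a gap between lines"

  for (let y = 0; y < page.length; y++) {
    const ink = page[y].filter(Boolean).length;
    const hasText = ink >= minInkPerRow;
    if (hasText && start === -1) start = y; // a line begins
    if (!hasText && start !== -1) {
      lines.push({ top: start, bottom: y - 1 }); // a line ends
      start = -1;
    }
  }
  // The last line may run to the bottom edge of the page.
  if (start !== -1) lines.push({ top: start, bottom: page.length - 1 });
  return lines;
}
```

This also shows why skew matters so much: on a tilted page, ink from neighboring lines lands in the same pixel rows, the empty gaps disappear, and the profile can no longer separate the lines.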
Stage 3: Character recognition. This is the part people mean when they say “OCR”. The engine matches the segmented characters against a learned model and outputs the most probable text. Classical Tesseract used per-character feature matching; modern Tesseract (since v4) uses an LSTM neural network that recognizes whole text lines in context, which significantly improves accuracy.
Stage 4: Text reconstruction. The raw character output is reassembled into coherent text: word spacing restored, line breaks preserved or removed based on layout, paragraph structure detected, hyphenated line-end words rejoined. For PDF output, the recognized text is written as a transparent text layer positioned over the original image at the same coordinates — so when you select text in the result, the selection box aligns with the visible glyphs.
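Positioning the text layer comes down to a unit conversion. OCR engines report bounding boxes in image pixels with the origin at the top-left, while PDF's default coordinate space uses points (1/72 inch) with the origin at the bottom-left. The sketch below shows that mapping. It assumes the scanned image fills the whole page and that the page has no rotation or offset; the type and function names are made up for this example.

```typescript
// Bounding box in image pixels: origin top-left, y grows downward.
interface PixelBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// Rectangle in PDF points: origin bottom-left, y grows upward.
interface PdfRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

const POINTS_PER_INCH = 72; // size of the default PDF user-space unit

function pixelBoxToPdfRect(box: PixelBox, pageHeightPx: number, dpi: number): PdfRect {
  const toPoints = (px: number): number => (px * POINTS_PER_INCH) / dpi;
  return {
    x: toPoints(box.x0),
    // Flip vertically: measure from the bottom of the page to the box's lower edge.
    y: toPoints(pageHeightPx - box.y1),
    width: toPoints(box.x1 - box.x0),
    height: toPoints(box.y1 - box.y0),
  };
}
```

For a US Letter page scanned at 300 DPI (3300 pixels tall), a word whose box spans x 300 to 900 and y 300 to 375 maps to a rectangle 144 points wide and 18 points high, starting 72 points from the left edge and 702 points from the bottom.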
Stage 5 (optional): Post-processing. Spell-check against a dictionary, language-specific corrections (a Spanish OCR pass can use a Spanish dictionary to disambiguate similar characters), entity extraction (dates, currency amounts, names). Tesseract.js doesn’t ship strong post-processing; most cloud OCR services do.
Why Tesseract.js is special
Tesseract is one of the oldest and most respected open-source OCR engines. It originated at HP Labs in 1985 and was open-sourced in 2005; Google sponsored its development from 2006, and the project is now maintained by the open-source community. The C++ engine powers a significant portion of the world’s open-source OCR work.
Tesseract.js is the JavaScript/WebAssembly port. It compiles the Tesseract C++ engine to WebAssembly using Emscripten, exposes a JavaScript API, and runs in any browser that supports WASM (every browser from the last 5+ years). According to the Tesseract.js project page, it supports more than 100 languages with automatic text orientation and script detection.
What’s special about it from a privacy perspective: the recognition runs entirely in the browser tab. There’s no server call. The WASM binary executes on your CPU. The image data lives in your tab’s memory. The recognized text is returned to JavaScript and written into the output PDF. At no point does the file touch a network connection.
This is structurally different from how most online OCR tools work. iLovePDF’s OCR, Smallpdf’s OCR, Adobe’s online OCR, and OCR.space all upload your file to their servers, run the recognition there, and return the result. The upload step is unavoidable in their architecture. In Tesseract.js, the upload step is replaced by a one-time download of the WASM engine and the language model files. Once those are cached, all subsequent OCR happens locally with no network activity.
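The local flow can be sketched as a loop that hands each page image to an OCR worker and collects the text. To keep the example self-contained, the worker is described by a small interface declared here; it mirrors the recognize() and terminate() methods that Tesseract.js workers expose, but the interface, the PageImage type, and the ocrPages function are assumptions for illustration, not the tool's actual code.

```typescript
// A rendered page, e.g. an object URL or data URL created in the browser.
type PageImage = string;

// Minimal shape of the worker this sketch relies on (declared locally so
// the example stands alone).
interface OcrWorker {
  recognize(image: PageImage): Promise<{ data: { text: string } }>;
  terminate(): Promise<unknown>;
}

// Recognize pages one at a time and always release the worker afterwards.
async function ocrPages(worker: OcrWorker, pages: PageImage[]): Promise<string[]> {
  const texts: string[] = [];
  try {
    for (const page of pages) {
      const result = await worker.recognize(page);
      texts.push(result.data.text);
    }
  } finally {
    // Free the WASM memory even if recognition failed part-way through.
    await worker.terminate();
  }
  return texts;
}

// In a browser app the worker would typically come from the library:
//   import { createWorker } from "tesseract.js";
//   const worker = await createWorker("eng");
// That wiring depends on your setup and is not part of this sketch.
```

Nothing in this loop makes a network request. The only network traffic in the whole flow is the initial fetch of the engine and the language data.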
Tesseract.js v6.0.0 (released in 2025) introduced significant improvements in memory management and runtime performance — fixing memory leaks for stable long-running sessions and reducing memory usage for faster recognition. This makes browser OCR practical for multi-page documents in a way that earlier versions weren’t.
Accuracy: what determines it
OCR accuracy is not a single number — it depends on the input. The same engine on a clean printed page hits 99%; on a phone photo of a coffee-stained receipt under bad lighting, it might hit 60%. Industry research consistently puts modern OCR at 95-99% accuracy on clean printed text and significantly lower on degraded inputs.
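When a figure like 99% or 60% is quoted, it helps to know how it could be measured. One common definition of character accuracy is one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below implements that definition. It is one reasonable convention rather than the only one, and the function names are chosen for this example.

```typescript
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy = 1 - (edit distance / reference length), floored at 0.
function characterAccuracy(reference: string, ocrOutput: string): number {
  if (reference.length === 0) return ocrOutput.length === 0 ? 1 : 0;
  return Math.max(0, 1 - editDistance(reference, ocrOutput) / reference.length);
}
```

For example, reading “INVOICE 2026-0481” as “INV0ICE 2O26-0481” swaps two characters out of 17, which gives roughly 88% character accuracy. Two wrong characters in an invoice number is also a reminder that a high percentage does not guarantee the errors land somewhere harmless.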
Five factors dominate:
1. Resolution (DPI). Tesseract works best at 300 DPI. Inputs below 150 DPI produce visible accuracy loss. Inputs above 600 DPI add no benefit and increase processing time. If you control the scan, target 300 DPI.
2. Contrast and lighting. Even illumination, dark text on light background, no glare. Phone photos of paper are common OCR inputs and also common OCR failure cases — uneven phone-camera lighting and shadow gradients reduce accuracy by 10-30 percentage points compared to a flatbed scan of the same document.
3. Image quality. Sharp focus, no motion blur, minimal JPEG compression. A 95-quality JPEG of a 300 DPI scan is fine. A 30-quality JPEG of the same scan introduces blocky artifacts that confuse character segmentation.
4. Language and script. Tesseract.js accuracy on Latin scripts (English, Spanish, French, German, Italian, Portuguese, Indonesian) is the highest because the training data is the largest. Accuracy on CJK scripts (Chinese, Japanese, Korean) is good but more sensitive to input quality. Right-to-left scripts (Arabic, Hebrew) work but the layout analysis is harder. Multi-language documents (e.g. Indonesian text with English technical terms) require running the engine with multiple languages enabled, which slows recognition and sometimes reduces accuracy on each language individually.
5. Document type. Printed text in a clean book or report layout: 95-99%. Forms with hand-printed entries: 70-90%. Tables with merged cells: 60-85%. Handwritten cursive: 30-60%. Receipts with mixed printing, thermal print, and stamps: highly variable, often 75-90% on the printed portions.
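If you are unsure what resolution a scan really has, you can estimate it from the image's pixel dimensions and the physical paper size. The sketch below does that arithmetic and sorts the result into rough bands using the 150, 300, and 600 DPI figures from the list above. The band labels and function names are made up for illustration.

```typescript
// Effective scan resolution: pixels along one edge divided by that edge's
// physical length in inches.
function effectiveDpi(widthPx: number, pageWidthInches: number): number {
  if (pageWidthInches <= 0) throw new RangeError("pageWidthInches must be positive");
  return widthPx / pageWidthInches;
}

type DpiVerdict = "too low" | "usable" | "good" | "no extra benefit";

// Bands based on the figures quoted in this article (150 / 300 / 600 DPI).
function classifyDpi(dpi: number): DpiVerdict {
  if (dpi < 150) return "too low";
  if (dpi < 300) return "usable";
  if (dpi <= 600) return "good";
  return "no extra benefit";
}
```

A US Letter page (8.5 inches wide) that is 2550 pixels across was scanned at 300 DPI. The same page at 1000 pixels across is under 120 DPI and is likely to lose accuracy.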
If the result on your specific document is below 95% and you’ve already optimized DPI and contrast, the realistic options are: (a) accept it and clean up manually, (b) try a different language model if you have a multilingual document, (c) escalate to cloud OCR, or (d) retype the affected sections. Switching to a different in-browser OCR tool rarely helps because most are wrapping the same Tesseract engine.
The 12 languages we support
We currently ship trained models for 12 languages: English, Spanish, French, German, Italian, Portuguese, Indonesian, Japanese, Korean, Simplified Chinese, Arabic, and Hindi.
The set is deliberate. Each language model adds 5-15 MB to the cached payload, so packaging all 100+ Tesseract languages by default would make the first-load experience unreasonable for users who only need English. Instead we ship the most-requested set and load additional languages on demand when the user picks them. Once a language model is cached, it stays cached and doesn’t re-download.
If your language isn’t in this list and you’d find the tool useful, let us know — we add languages on request, and the cost of adding one is low.
The models we ship come from the open-source tessdata_best set, which produces the highest accuracy. We deliberately do not use the smaller tessdata_fast models. Those are 3-4x smaller and run 2x faster but lose 2-5 percentage points of accuracy, which is the wrong trade for a privacy-first tool where the user is already accepting slower local processing in exchange for not uploading.
When in-browser OCR is enough — and when it isn’t
In-browser OCR is enough when:
- The document is in a supported language and uses printed text
- The DPI is reasonable (200+) and the lighting is even
- 95-99% accuracy is acceptable (it usually is — humans tolerate occasional OCR errors better than they tolerate uploading sensitive documents)
- The document is sensitive (medical, financial, legal, personal) and uploading it is the wrong trade
- You’re processing a modest number of pages at a time (say, 1-50 — beyond that, batch processing on a server starts to be meaningfully faster)
You probably want cloud OCR when:
- The document is handwritten and you need high accuracy
- The document contains complex tables that require structured extraction (cell-by-cell, with proper row/column alignment)
- The document is in a language outside the 12 we support
- You’re processing hundreds or thousands of documents and want a server-side queue
- The document quality is poor (low DPI, low contrast, photo-of-paper, partially obscured)
- You need entity extraction (line items, totals, dates parsed into structured fields) — this is where Azure Document Intelligence and AWS Textract pull significantly ahead of bare OCR
For most personal and small-business scanning workflows, in-browser OCR is fine. For invoice-extraction pipelines, healthcare document intake, or research-grade transcription, cloud OCR remains the practical answer despite the privacy implications.
The 30 MB bundle-size trade-off
We’re transparent about a real cost of in-browser OCR: the WebAssembly engine and the language model are larger than a typical web page.
On first use of the OCR tool:
- Tesseract.js core (engine WASM + JS wrapper): ~10 MB
- Default English language model (eng.traineddata from tessdata_best): ~15 MB
- Combined first-load payload for English OCR: ~25-30 MB
Subsequent visits hit the browser cache. The same is true for additional languages: each language model is downloaded once when first requested and cached thereafter.
This is the cost of doing the work in your browser. A cloud OCR service imposes none of this download cost on the user, but you pay for that many times over in upload time, every time you process a document. A 30 MB one-time download is roughly the same data transfer as uploading three 10 MB scanned PDFs to a server.
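The comparison is simple arithmetic, sketched below so you can plug in your own numbers. It deliberately ignores the size of the result a cloud service sends back and any difference between upload and download speeds. The function names are hypothetical.

```typescript
// How many uploads it takes before a cloud service has moved as much data
// as the one-time local engine download.
function breakEvenDocuments(engineDownloadMb: number, averageDocumentMb: number): number {
  if (averageDocumentMb <= 0) throw new RangeError("averageDocumentMb must be positive");
  return Math.ceil(engineDownloadMb / averageDocumentMb);
}

// Total data moved after processing `documents` files under each approach.
function dataTransferredMb(
  documents: number,
  averageDocumentMb: number,
  engineDownloadMb: number,
): { inBrowser: number; cloud: number } {
  return {
    inBrowser: engineDownloadMb,          // downloaded once, then served from cache
    cloud: documents * averageDocumentMb, // uploaded again for every document
  };
}
```

With a 30 MB engine and 10 MB scans, the break-even point is three documents. After ten documents the cloud route has moved 100 MB against the browser's 30 MB.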
We mitigate the first-load cost in a few ways:
- The OCR engine and language models are lazy-loaded: nothing downloads until you actually click into the OCR tool. If you’re just browsing imisspdf or using a different tool, the OCR payload never touches your connection.
- The download happens after the page is interactive — the tool UI appears immediately and the engine warms up in the background.
- We surface a progress indicator so you can see the download status, rather than hanging on a blank screen.
- Subsequent uses are instant — the WASM and language data are served from your browser’s cache.
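The lazy-loading idea in the list above can be sketched as a small wrapper that runs an expensive loader at most once and shares the result with every later caller. This is a generic pattern, not the tool's actual implementation, and the loader passed in (for example a function that fetches and initializes the OCR engine) is hypothetical.

```typescript
// Wrap an expensive async loader so it runs at most once. Callers that
// arrive while the load is in flight share the same promise.
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = load().catch((err: unknown) => {
        pending = undefined; // allow a retry after a failed download
        throw err;
      });
    }
    return pending;
  };
}
```

Nothing runs when the wrapper is created, so simply visiting the page costs nothing; the download starts on the first call. The browser's HTTP cache then covers later visits, while this wrapper covers repeated use within one session.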
If you regularly OCR documents, the first-load cost is a small one-time investment. If you OCR a document once a year, cloud OCR’s “no setup” experience may feel lighter — but you’re paying the upload cost every single time, and you’re uploading your document.
Step-by-step: OCR a scanned PDF with imisspdf
This is the workflow. Total time: about 30 seconds on a modern laptop for a 5-page scanned document, plus the one-time engine download on first visit.
1. Open the tool. Go to imisspdf OCR PDF. The page loads, the UI appears, and the engine begins downloading in the background. On first visit, this takes 10-30 seconds depending on your connection. On subsequent visits, the engine is cached and the tool is ready instantly.
2. Drop your scanned PDF. Drag-and-drop onto the page or use the file picker. The file goes from your disk into your browser’s memory. Nothing leaves your device.
3. Select language. Default is English. If your document is in Spanish, French, German, Italian, Portuguese, Indonesian, Japanese, Korean, Simplified Chinese, Arabic, or Hindi, pick it from the dropdown. For multilingual documents, you can select multiple languages — accuracy on each may drop slightly but the engine will recognize text in any of the selected scripts.
4. Process. Click run OCR. The engine processes each page sequentially. On a modern laptop, that’s about 3-8 seconds per page for English. Phones are slower (10-20 seconds per page). The progress indicator shows current page and estimated time remaining.
5. Download. The result PDF has the original image preserved exactly, with an invisible text layer overlaid. You can now select text, search the document, and copy passages. File size is typically 1-3% larger than the input.
6. Optional: compress. Once OCR is done, you can run the compress tool on the result. The text layer is preserved while the image layer is downsampled — this often produces a smaller file than the original because the OCR step lets the compressor be more aggressive about image resampling. See Compress PDF online free without losing quality.
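The estimated time remaining in step 4 can be approximated from the pages already processed. How the tool computes its own estimate is not described here; a running average, as in this hypothetical sketch, is one simple approach.

```typescript
// Estimate remaining time from the durations (in seconds) of the pages
// processed so far. Returns null until at least one page has finished.
function estimateRemainingSeconds(pageDurations: number[], totalPages: number): number | null {
  if (pageDurations.length === 0) return null; // nothing measured yet
  const done = pageDurations.length;
  const average = pageDurations.reduce((sum, s) => sum + s, 0) / done;
  return Math.max(0, totalPages - done) * average;
}
```

If the first two pages of a five-page document took 4 and 6 seconds, the average is 5 seconds per page and the estimate for the remaining three pages is 15 seconds.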
When the result is wrong
OCR mistakes are not random — they cluster around specific input issues. If you see persistent errors, the fix is usually upstream of the OCR itself:
- Garbled text on every page: the input is rotated 90 or 180 degrees. Rotate before running OCR (some tools auto-detect orientation; if accuracy looks worse than expected, check this first).
- Numbers wrong but letters right: low-DPI input where the digit shapes degrade faster than letter shapes. Re-scan at 300 DPI if possible.
- Some pages worse than others: uneven lighting in a multi-page document. Pre-process each page individually or re-scan with even lighting.
- All quotes look like primes, all em-dashes look like hyphens: this is normal Tesseract behavior; the model is conservative about punctuation. Often easier to fix with find-and-replace than with model changes.
- Tables are nonsense: column-to-column recognition without preserved structure is a known limitation. For structured table extraction, use a specialist tool (Azure Document Intelligence, AWS Textract, or commercial PDF-to-Excel tools).
- Handwriting not recognized: expected — Tesseract is a printed-text engine. For handwriting, use a handwriting-specific service.
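For the punctuation case, a find-and-replace pass can be a few regular expressions. The sketch below is a heuristic, not a complete typography fixer: it assumes straight double quotes come in opening and closing pairs and that a hyphen with a space on each side was meant as a dash, which will be wrong for text such as "5 - 3". The function name is made up for this example.

```typescript
// Heuristic cleanup of punctuation that OCR tends to flatten.
function restoreTypography(text: string): string {
  return text
    // A straight quote at the start of the text, or after whitespace or an
    // opening bracket, is treated as an opening quote (U+201C).
    .replace(/(^|[\s(\[{])"/g, "$1\u201C")
    // Any straight quote still left is treated as a closing quote (U+201D).
    .replace(/"/g, "\u201D")
    // A hyphen with a space on each side is treated as an em dash (U+2014).
    .replace(/ - /g, " \u2014 ");
}
```

Run it on a copy of the extracted text and review the result, since rules like these cannot tell a dash from a minus sign.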
Try in-browser OCR
If you have a scanned PDF that you want searchable without uploading it to a third party, try imisspdf OCR PDF. The first run downloads the engine (~25-30 MB); every subsequent run is instant and stays entirely in your browser. No signup, no upload, no watermark, no per-page limit.
For documents where 95%+ in-browser accuracy isn’t enough — handwriting, complex tables, very poor scans — the cloud route remains a legitimate option and we’d rather you use it than fight a tool that isn’t the right fit. The framing that works best: decide per document. Confidential records that need to be searchable: in-browser OCR. Public archives that need structured extraction: cloud OCR.
Frequently asked questions
The FAQ block at the end of this article covers the most common OCR questions. For related coverage, see How to OCR a scanned PDF and How to OCR scanned PDF online free.
Sources
- Tesseract.js project page
- Tesseract.js GitHub repository
- Tesseract.js v6.0.0 release notes
- robertknight/tesseract-wasm — alternative WASM build
- Transloadit: Integrating OCR in the browser with Tesseract.js
- MDN: MediaDevices and WebAssembly capabilities
- ScanLens: On-Device vs Cloud OCR — Privacy, Speed, Accuracy
Frequently asked questions
What is Tesseract.js, and how does it differ from server-based OCR?
Tesseract.js is a JavaScript/WebAssembly port of the open-source Tesseract OCR engine. It runs entirely in your browser: the WASM binary executes locally, reads pixels from your file, and outputs recognized text without any server round-trip. Server-based OCR (Google Cloud Vision, AWS Textract, Azure Document Intelligence) uploads your image to the cloud, runs proprietary recognition models, and returns the result. Tesseract.js is privacy-preserving by design; cloud OCR is typically more accurate on degraded or complex inputs, especially handwriting and tables.
Is in-browser OCR as accurate as cloud OCR?
Close, but not identical. On clean, high-DPI printed text in supported languages, Tesseract.js reaches 95-99% character accuracy, the same range as Google Cloud Vision or Azure Document Intelligence for that kind of input. The accuracy gap widens on degraded inputs: low-light photos, low-DPI scans, mixed-language documents, complex tables, and handwriting. Cloud OCR providers maintain larger proprietary recognition models trained on more diverse data, so they're more robust to bad inputs. For most office documents and clean scans, the in-browser engine produces results that are functionally equivalent for searching, copying, and indexing.
Which languages are supported?
We currently package 12 high-quality language models: English, Spanish, French, German, Italian, Portuguese, Indonesian, Japanese, Korean, Simplified Chinese, Arabic, and Hindi. The underlying Tesseract project supports over 100 languages; we limit the default set to 12 to keep the download size reasonable and add languages on user request. Right-to-left scripts (Arabic, Hebrew) and CJK scripts (Chinese, Japanese, Korean) work but accuracy varies more than for Latin scripts because the training data for these languages is smaller in the open Tesseract model.
Can it recognize handwriting?
Not reliably. Tesseract was trained on printed text and its handwriting recognition accuracy is poor, typically 30-60% character accuracy on neat printing and much lower on cursive. For handwriting OCR you want a specialized service: Google Cloud Vision's handwriting mode, Microsoft Azure Document Intelligence, or Apple's Vision framework on iOS/macOS all perform significantly better. If you only have a few handwritten documents, retyping them is often faster than fighting with OCR output. For a large volume of handwriting, the cloud route is the practical answer despite the privacy trade-off.
Does OCR increase the file size?
Slightly. OCR adds an invisible text layer on top of the image data: the image stays the same, and the text layer is overlaid for selection and search. The text layer is small, typically 1-3% of the original file size. The bigger size impact comes from any preprocessing the tool does (contrast enhancement, deskewing, resampling to optimal DPI), which can slightly change the embedded image data. If file size matters, run compression after OCR. The searchable text layer is preserved while the image layer can be downsampled aggressively, often producing a smaller result than the original.