How to Extract Text from a Scanned PDF or Image (OCR Guide)
You have a scanned contract, a photo of a whiteboard, or a screenshot of a recipe — and you need the text from it. You could retype it word by word, or you could let OCR do the work in seconds.
OCR (Optical Character Recognition) is the technology that reads text from images and scanned documents. In this guide, we will explain how OCR works, when you need it, and how to extract text from a scanned PDF or image using PDFFlare's free OCR tool.
What Is OCR and How Does It Work?
OCR analyzes the shapes and patterns in an image to identify letters, numbers, and symbols. Modern OCR engines like Tesseract (version 4 and later) use neural network models trained on large collections of text samples to recognize characters in more than 100 languages and across a wide range of fonts.
Here is how the process works under the hood:
- Image preprocessing: The engine converts the image to grayscale, adjusts contrast, and removes noise to make text stand out from the background.
- Text detection: The engine locates regions in the image that contain text — paragraphs, lines, and individual characters.
- Character recognition: Each detected character, or in newer engines each whole line of text, is passed through trained models to determine which letters and symbols it contains.
- Post-processing: The engine applies language-specific rules and dictionary lookups to correct common misreadings and improve accuracy.
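To make the preprocessing step more concrete, here is a minimal Python sketch of grayscale conversion followed by binarization (forcing every pixel to pure black or pure white). It is an illustration only, not how Tesseract or PDFFlare implements it: the function names, the tiny sample image, and the fixed threshold of 128 are all assumptions, and real engines typically pick the threshold adaptively.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def to_grayscale(image: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to brightness values using the common BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map each pixel to 0 (ink) or 255 (background).

    The fixed threshold is an assumption made for this sketch.
    """
    return [[0 if value < threshold else 255 for value in row] for row in gray]


# A hypothetical 2x3 image: a dark vertical stroke on a light background.
sample = [
    [(250, 250, 250), (30, 30, 30), (245, 240, 235)],
    [(200, 200, 210), (20, 25, 30), (255, 255, 255)],
]

print(binarize(to_grayscale(sample)))
# [[255, 0, 255], [255, 0, 255]]
```

After binarization the stroke in the middle column is pure black and everything else is pure white, which is the kind of clean separation the later detection and recognition steps rely on.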
When Do You Need OCR?
If you work with documents regularly, you probably need OCR more often than you realize. Here are the most common situations:
Extracting Text from Scanned PDFs
A scanned PDF looks like a normal document, but it is actually just a picture of each page. You cannot select text, search for a word, or copy a paragraph. This is the most common use case for OCR — turning scanned documents back into editable, searchable text.
This includes old contracts, archived paperwork, signed agreements sent by email, and any document that was photocopied or scanned.
Converting Screenshots to Text
Need to grab text from an error message, a social media post, a website screenshot, or a chat conversation? Instead of retyping it manually, run OCR on the screenshot and copy the text in seconds. This is especially useful for extracting text from images on iPhone and Android when the built-in text selection does not work.
Digitizing Receipts and Invoices
Freelancers, small business owners, and accountants often need to extract amounts, dates, and vendor names from paper receipts for expense tracking. OCR turns a photo of a receipt into text you can paste into a spreadsheet — saving hours of manual data entry every month.
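If you process many receipts, you can go one step further and pull the amounts and dates out of the OCR text automatically. The Python sketch below shows one way to do that with regular expressions. The sample receipt, the function name, and the patterns are assumptions: they expect dollar amounts with two decimals and dates written as MM/DD/YYYY, so you would need to adapt them to your own receipts.

```python
import re
from typing import Dict, List

# Assumed formats: "$4.50" or "$1,234.50" for amounts, "03/14/2024" for dates.
AMOUNT_RE = re.compile(r"\$\s?(\d[\d,]*\.\d{2})")
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")


def parse_receipt(text: str) -> Dict[str, List[str]]:
    """Collect every amount and date found in OCR output, in reading order."""
    return {
        "amounts": AMOUNT_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


# Hypothetical OCR output from a photo of a receipt.
receipt_text = """CORNER CAFE
Date: 03/14/2024
Latte        $4.50
Bagel        $3.25
TOTAL        $7.75
"""

print(parse_receipt(receipt_text))
# {'amounts': ['4.50', '3.25', '7.75'], 'dates': ['03/14/2024']}
```

OCR can misread digits, so it is worth checking the extracted values against the original receipt before they go into your books.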
Extracting Text from Photos of Books or Notes
Students and researchers regularly photograph textbook pages, lecture slides, and handwritten notes. OCR extracts the printed text so you can paste it into your notes app, search for specific terms, or translate the content into another language.
Making Old Documents Searchable
If you have a folder full of scanned PDFs — tax documents, medical records, insurance paperwork — you cannot use Ctrl+F to find anything. Running OCR on these files gives you searchable text, making it dramatically easier to find what you need.
How to Extract Text from a Scanned PDF (Step by Step)
PDFFlare's OCR tool runs entirely in your browser using Tesseract.js. Your files never leave your device — everything is processed locally. Here is how to use it:
- Go to PDFFlare's OCR tool — no signup or account needed.
- Upload your file: Drag and drop a scanned PDF or image file (JPG, PNG, WebP, BMP, or TIFF). Files up to 50 MB are supported.
- Select the document language: Choose the language of the text in your document. English is selected by default, but PDFFlare supports over 100 languages including Spanish, French, German, Arabic, Chinese, Japanese, and Korean.
- Click "Extract Text": The OCR engine will process your document page by page. A progress bar shows the current status.
- Copy or download the result: Once processing is complete, the extracted text appears in a text box. Click "Copy" to copy it to your clipboard, or "Download .txt" to save it as a text file.
For multi-page scanned PDFs, PDFFlare processes each page separately and combines the output with page markers so you know which text came from which page.
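As a rough illustration of what combining per-page output looks like, here is a short Python sketch. The marker format ("--- Page N ---") and the function name are assumptions made for this example; PDFFlare's actual markers may look different.

```python
from typing import List


def combine_pages(page_texts: List[str]) -> str:
    """Join per-page OCR text into one string, labeling each page.

    The "--- Page N ---" marker is a hypothetical format for this sketch.
    """
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(sections)


print(combine_pages(["First page text. ", "Second page text."]))
# --- Page 1 ---
# First page text.
#
# --- Page 2 ---
# Second page text.
```

Keeping a marker per page means you can still trace a sentence in the combined text back to the page it came from.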
How to Extract Text from an Image
The process is identical for images. Upload a JPG, PNG, WebP, BMP, or TIFF file and PDFFlare runs OCR on it directly. This works for:
- Screenshots from any device
- Photos of documents, receipts, or business cards
- Photos of whiteboards or notes (clearly printed text only; handwriting is unreliable)
- Scanned pages saved as image files
- Infographics and images with embedded text
For the best results, use clear, high-resolution images. Blurry photos, extreme angles, and very small text reduce OCR accuracy.
Tips for Getting the Best OCR Results
OCR accuracy depends heavily on the quality of your input. Here are practical tips to get the cleanest text possible:
1. Use High-Resolution Scans
Scan at 300 DPI or higher. Low-resolution scans (72-150 DPI) make characters blurry, and the OCR engine struggles to distinguish between similar letters like "l" and "1" or "O" and "0".
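A quick calculation shows why resolution matters so much. One typographic point is 1/72 of an inch, so the nominal height of a line of text in pixels is the font size multiplied by the DPI and divided by 72. The Python sketch below works this out; the function names are made up for this example, and the result is the nominal font size, so the actual letter shapes are somewhat smaller.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def text_height_px(font_size_pt: float, dpi: int) -> float:
    """Nominal height in pixels of text at a given font size and scan resolution."""
    return font_size_pt * dpi / POINTS_PER_INCH


def page_size_px(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a scanned page of the given physical size."""
    return round(width_in * dpi), round(height_in * dpi)


print(text_height_px(12, 300))     # 50.0
print(text_height_px(12, 72))      # 12.0
print(page_size_px(8.5, 11, 300))  # (2550, 3300)
```

At 300 DPI, 12pt text gets about 50 pixels of height to work with. At 72 DPI it gets only 12, which leaves very few pixels to tell an "l" from a "1".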
2. Ensure Good Contrast
Dark text on a white background produces the best results. Faded documents, colored backgrounds, and light gray text reduce accuracy. If your scan is faded, try adjusting brightness and contrast in an image editor before running OCR.
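If you are curious what a contrast adjustment actually does to a faded scan, the following Python sketch shows a simple linear contrast stretch: the darkest pixel is mapped to black, the lightest to white, and everything in between is spread across the full range. This is one common technique, shown here as an assumption-laden illustration with a made-up function name and sample data, not necessarily what your image editor does internally.

```python
from typing import List


def stretch_contrast(gray: List[List[int]]) -> List[List[int]]:
    """Linearly rescale brightness so the darkest pixel becomes 0 and the lightest 255."""
    flat = [value for row in gray for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A completely flat image has no contrast to stretch; return a copy.
        return [row[:] for row in gray]
    scale = 255 / (hi - lo)
    return [[round((value - lo) * scale) for value in row] for row in gray]


# Hypothetical faded scan: "dark" text is mid-gray (120), paper is light gray (200).
faded = [
    [120, 180],
    [150, 200],
]

print(stretch_contrast(faded))
# [[0, 191], [96, 255]]
```

The text pixel that started at 120 ends up fully black and the paper at 200 ends up fully white, so the gap between ink and background grows from 80 brightness levels to 255.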
3. Keep the Page Straight
Skewed or rotated text confuses the OCR engine. If your scanned PDF has rotated pages, use PDFFlare's Rotate PDF tool to fix the orientation before running OCR.
4. Select the Correct Language
The OCR engine loads language-specific models. Selecting the wrong language can cause misreadings — especially for non-Latin scripts like Arabic, Chinese, or Korean. If your document contains multiple languages, run OCR separately for each language section.
5. Clean Up Noise
Coffee stains, wrinkles, stamps, and other marks over text can confuse OCR. If possible, scan a clean copy of the document. For photos, ensure even lighting without shadows.
OCR Limitations: What It Cannot Do
OCR is powerful, but it has real limitations you should be aware of:
- Handwriting: Current OCR engines work best with printed text. Handwritten text recognition is unreliable, especially for cursive writing. Block letters in neat handwriting may work, but do not count on it.
- Complex layouts: Tables, multi-column text, and heavily formatted documents may produce jumbled output. OCR reads text in the order it detects it, which may not match the visual reading order of complex layouts.
- Very small text: Text below 8pt in the original document often gets misread, especially in low-resolution scans.
- Decorative fonts: Ornamental, script, and highly stylized fonts reduce accuracy. Standard serif and sans-serif fonts work best.
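To see why multi-column pages can come out jumbled, consider this small Python sketch. It compares a naive top-to-bottom, left-to-right ordering of detected text lines with a column-aware ordering. The line positions, the names, and the hand-supplied column boundary are all assumptions for illustration; real layout analysis has to detect columns on its own, which is where it can go wrong.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    x: int     # left edge of the detected line, in pixels
    y: int     # top edge of the detected line, in pixels
    text: str


def naive_order(lines: List[Line]) -> List[str]:
    """Read strictly top to bottom, then left to right, ignoring columns."""
    return [line.text for line in sorted(lines, key=lambda l: (l.y, l.x))]


def column_order(lines: List[Line], column_split_x: int) -> List[str]:
    """Read the whole left column first, then the right column.

    column_split_x is supplied by hand here, which is an assumption of this sketch.
    """
    left = sorted((l for l in lines if l.x < column_split_x), key=lambda l: l.y)
    right = sorted((l for l in lines if l.x >= column_split_x), key=lambda l: l.y)
    return [line.text for line in left + right]


# Hypothetical two-column page: column A on the left, column B on the right.
page = [
    Line(50, 100, "A1"), Line(400, 100, "B1"),
    Line(50, 130, "A2"), Line(400, 130, "B2"),
]

print(naive_order(page))        # ['A1', 'B1', 'A2', 'B2']
print(column_order(page, 300))  # ['A1', 'A2', 'B1', 'B2']
```

The naive ordering interleaves the two columns line by line, which is exactly the kind of jumbled output you may see from complex layouts.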
Is It Safe to Use Online OCR Tools?
Most online OCR tools upload your files to a server for processing. If your document contains sensitive information — contracts, medical records, financial statements — this is a legitimate privacy concern.
PDFFlare is different. Your files never leave your device. The OCR engine (Tesseract.js) runs entirely in your browser. No upload, no server processing, no data collection. This makes it safe for confidential documents that you would not want stored on someone else's server.
OCR vs. Copy-Paste: Why You Cannot Just Select Text
If you have ever tried to select text in a scanned PDF and nothing happened, you have encountered the fundamental problem OCR solves. Regular PDFs store text as actual character data: each character has a defined position, font, and encoding. Scanned PDFs store pages as flat images. To the computer, a scanned page is just a picture, the same as a photo of a landscape. There are no text characters to select.
OCR bridges this gap by analyzing the image, finding text-like patterns, and converting them into actual characters you can select, copy, and search.
What to Do After Extracting Text
Once you have the extracted text, here are some common next steps:
- Paste into a Word document: Create an editable version of the scanned document. Use PDFFlare's Word to PDF tool to convert it back to PDF when done editing.
- Search for keywords: Use Ctrl+F in the text file to find specific names, dates, or amounts in long documents.
- Translate the content: Paste extracted text into Google Translate or DeepL for instant translation.
- Import into a spreadsheet: Paste receipt or invoice data into Excel or Google Sheets for accounting.
- Archive and index: Save the text alongside the original scan for full-text search capability.
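For the archiving and keyword search steps, a few lines of Python are enough to search across many extracted text files at once. The sketch below keeps the texts in a dictionary for simplicity; the file names, contents, and function name are hypothetical, and in practice you would read the .txt files you saved from disk.

```python
from typing import Dict, List, Tuple


def search_archive(archive: Dict[str, str], keyword: str) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, line) for every line containing the keyword.

    Matching is case-insensitive.
    """
    needle = keyword.casefold()
    hits = []
    for name, text in archive.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if needle in line.casefold():
                hits.append((name, line_no, line.strip()))
    return hits


# Hypothetical OCR output saved alongside two scanned documents.
archive = {
    "tax_2022.txt": "Form 1040\nTotal income: 52,000\n",
    "insurance.txt": "Policy number: 12345\nTotal premium: 900\n",
}

for hit in search_archive(archive, "total"):
    print(hit)
# ('tax_2022.txt', 2, 'Total income: 52,000')
# ('insurance.txt', 2, 'Total premium: 900')
```

Because each hit carries the file name and line number, you can jump straight from a search result to the matching scan.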
Wrapping Up
OCR turns images and scanned documents into usable text — saving you from retyping content that is already right there on the page. Whether you are digitizing old paperwork, extracting data from receipts, or grabbing text from a screenshot, OCR handles it in seconds.
PDFFlare's OCR tool is free, runs in your browser, and supports 100+ languages. No signup, no upload, no privacy concerns. Try it with your next scanned document and see how much time it saves.
Related Tools
- PDF to Word — convert the OCR output into an editable Word file
- PDF to JPG — extract scanned pages as images
- Edit PDF — add searchable text annotations