How to Compare a Scanned PDF Against a Word Contract
You have a Word document on your screen. It is the version you sent to the other side three weeks ago: the agreed-upon contract, carefully negotiated over four rounds of redlines. What you got back is a scanned PDF. Maybe it is the executed copy, signed and returned as a scan. Maybe the counterparty's office only had a paper copy and scanned it before sending. Maybe the other side's legal team converted to PDF, printed, signed, and scanned the result back to you.
However the scan came about, you need to confirm that the scanned version matches what you agreed to. And you have a problem: a scanned PDF is not a document. It is a picture of a document. Your comparison tool cannot read it. Your eyes can read it, but not reliably enough to catch every difference across 30 pages of dense legal text.
This guide walks through how to compare a scanned contract against a Word original, step by step. We will cover the tools, the process, the things that go wrong, and the limitations you need to understand before you trust the result.
Why comparing a scanned PDF is harder than comparing two Word files
When you compare two Word documents, the comparison tool reads the underlying text, formatting, and structure directly from the file. A .docx file is a structured XML archive. The comparison engine knows exactly what every word says, where every paragraph starts, and how every table is structured. The comparison is deterministic: same inputs, same output, every time.
A scanned PDF is fundamentally different. It contains images, not text. Each page is a photograph of a printed page. The PDF viewer renders these images so they look like a document, but there is no text layer underneath. If you try to select text in a scanned PDF, nothing highlights. If you try to search for a word, the search finds nothing.
This means a comparison tool cannot work with a scanned PDF directly. There is nothing to compare. The file does not contain text; it contains pixels arranged in patterns that a human eye interprets as text. To make a comparison possible, you need an intermediate step: optical character recognition, or OCR.
OCR is software that looks at an image of text and attempts to recognize the characters. It converts the image into machine-readable text. Modern OCR is remarkably good, but it is not perfect. And those imperfections are exactly what makes scanned-contract comparison tricky. Every OCR error shows up as a "difference" in your comparison output, even though nothing actually changed in the contract.
Step-by-step: from scanned PDF to usable comparison
The process has four stages. Each one introduces potential issues, and understanding those issues is as important as knowing the steps.
Step 1: OCR the scanned PDF
The first step is to run OCR on the scanned PDF to extract the text. You have several options, ranging from free to professional-grade.
Adobe Acrobat Pro is the most common choice in legal practice. Open the scanned PDF in Acrobat Pro, go to "Scan & OCR" (or "Recognize Text" in older versions), and select "Recognize Text in This File." Acrobat will process each page, identify the text, and add a text layer underneath the scanned images. The result is a "searchable PDF" where the visual appearance is unchanged but the text is now selectable and extractable. Acrobat's OCR is solid on clean scans of standard legal documents.
ABBYY FineReader is generally considered the most accurate OCR engine available. It handles complex layouts better than Acrobat, particularly tables, multi-column text, headers, and footers. It can also export directly to Word format (.docx), which is useful for the next step. If you regularly work with scanned contracts, ABBYY is worth the investment. It costs more than Acrobat, but for legal work where accuracy matters, the difference in recognition quality is noticeable.
Free alternatives exist but come with tradeoffs. Google Drive can perform OCR: upload a PDF to Google Drive, right-click it, and select "Open with Google Docs." Google will attempt to OCR the content and present it as a Google Doc. The results are adequate for simple, clean scans but degrade quickly with tables, columns, or poor image quality. Microsoft OneNote can also perform basic OCR by pasting an image into a note and using "Copy Text from Picture." The open-source Tesseract engine, available through front-ends like gImageReader, is powerful but requires some technical comfort to set up and use effectively.
Whichever tool you use, the key setting is language. Make sure the OCR engine is set to the correct language for your document. An OCR engine set to German will struggle with English legal text, and vice versa. If the contract contains terms in multiple languages (common in international agreements), some OCR engines allow multi-language recognition, though accuracy for the secondary language will be lower.
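As a concrete illustration of the language setting, here is how a multi-language recognition run would be invoked with the open-source Tesseract engine mentioned above. Tesseract joins language codes with "+", so an English contract with German passages uses "eng+deu". The snippet below only constructs the command line; actually running it assumes the tesseract binary and the relevant language packs are installed, and the file names are placeholders.

```python
def tesseract_cmd(image: str, out_base: str, languages: list[str]) -> list[str]:
    """Build a Tesseract invocation that produces a searchable PDF.

    Tesseract accepts one or more language codes joined with '+',
    e.g. 'eng+deu' for multi-language recognition. The 'pdf' config
    at the end tells Tesseract to emit a searchable PDF.
    """
    return ["tesseract", image, out_base, "-l", "+".join(languages), "pdf"]

# Hypothetical file names, for illustration only:
print(tesseract_cmd("page_01.png", "page_01", ["eng", "deu"]))
# ['tesseract', 'page_01.png', 'page_01', '-l', 'eng+deu', 'pdf']
```

Recognition quality for the secondary language is still lower than for the primary one, as noted above; multi-language mode widens the character set the engine considers, which can slightly increase ambiguity.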
Step 2: Convert the OCR output to a format your comparison tool can use
After OCR, you have text. But the format of that text matters for comparison.
Option A: Export to .docx. If your comparison tool works with Word files (which most do), exporting the OCR result to .docx is the most practical path. ABBYY FineReader does this natively with good formatting preservation. Adobe Acrobat Pro can export a searchable PDF to Word via "Export PDF" under the File menu, though the formatting fidelity varies. The resulting .docx will not be a pixel-perfect replica of the original Word document. Tables may be reconstructed imperfectly. Headers and footers may end up in the body. Page breaks may shift. These formatting differences will show up in your comparison, so be prepared for noise.
Option B: Extract plain text. If you only care about whether the words match and do not need formatting comparison, you can extract the OCR text as plain text. This simplifies the comparison but loses all structural information: tables become jumbled text, numbered lists lose their hierarchy, and columns merge into a single stream. Plain text extraction is useful as a quick sanity check but is not sufficient for a thorough comparison of a complex contract.
Option C: Use the searchable PDF directly. Some comparison tools (Draftable, for example) can accept PDF input and will extract text from the PDF's text layer for comparison. If you used Acrobat to create a searchable PDF, this lets you skip the Word conversion step entirely. The comparison quality depends on how well the tool handles PDF text extraction, which varies. Tables and multi-column layouts often suffer.
Step 3: Run the comparison against your Word document
With your OCR output now in a usable format, you can run the comparison. If you exported to .docx, compare that file against your original Word document using whatever comparison tool you normally use: Word Compare, Clausul, Draftable, or Litera Compare.
Before you open the comparison output, set your expectations. You will see more differences than a normal comparison between two Word documents. Many of those differences are OCR artifacts, not actual changes. This is normal and unavoidable. The goal is not a clean comparison with zero noise. The goal is to identify any real differences that indicate the scanned version does not match your Word original.
A useful approach: if your comparison tool supports it, set the comparison to content-only mode (ignoring formatting). This reduces the noise from formatting differences introduced by the OCR-to-Word conversion, which are never meaningful in this context. You already know the formatting will differ because the document went through an OCR-and-conversion pipeline. What you care about is whether the words match.
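The idea behind a content-only comparison can be sketched with Python's standard difflib: tokenize both texts into words, discarding line breaks and spacing, and diff the word streams. This is a minimal illustration of the principle, not a substitute for a real comparison tool; the sample strings are invented.

```python
import difflib
import re

def word_diff(original: str, ocr_text: str):
    """Word-level diff that ignores line breaks and spacing differences.

    Returns a list of (operation, original_words, ocr_words) tuples for
    every span where the two texts disagree.
    """
    a = re.findall(r"\S+", original)   # split on any whitespace
    b = re.findall(r"\S+", ocr_text)
    changes = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

# Line breaks and double spaces are ignored; only the word change surfaces.
original = "The Party shall pay $30,000\nwithin 30 days."
ocr_text = "The Party sha11 pay  $30,000 within 30 days."
print(word_diff(original, ocr_text))
# [('replace', 'shall', 'sha11')]
```

Note how the only flagged difference is "shall" vs "sha11" — a classic OCR misread of the kind discussed in the next step — while the reflowed line break and extra space produce no noise at all.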
Step 4: Account for OCR artifacts in the comparison output
This is the step that separates a useful comparison from a frustrating one. The comparison output will contain two kinds of differences:
- Real differences: the scanned version actually contains different text than the Word original. A clause was added, removed, or modified. A number changed. A party name is different. These are what you are looking for.
- OCR artifacts: the scanned version contains the same text, but the OCR engine misread one or more characters. These look like changes in the comparison output but are not real differences in the contract.
Learning to tell these apart quickly is the core skill of scanned-contract comparison. OCR artifacts have recognizable patterns. Once you know what to look for, you can scan through them much faster.
Common OCR artifact patterns:
- Character substitution: "l" (lowercase L) read as "1" (the digit one). "O" (capital O) read as "0" (zero). "rn" (r followed by n) read as "m." "cl" read as "d." These are the most frequent OCR errors and account for the majority of false differences in a comparison.
- Punctuation errors: periods read as commas. Semicolons read as colons. Hyphens dropped or added. Quotation marks changed from straight to curly or vice versa. Parentheses read as brackets.
- Spacing errors: words run together ("theParty" instead of "the Party"). Extra spaces inserted mid-word ("indem nification"). These are especially common near the edges of scanned pages where the text curves into the binding.
- Line break artifacts: hyphens at the end of a line (from justified text) either retained or dropped, breaking a word into two fragments or merging it with the next line.
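These substitution patterns are regular enough that a rough triage filter can be scripted. The helper below (illustrative, not exhaustive) normalizes both sides of a flagged difference so that the most common confusions — l/1, O/0, rn/m, and spacing errors — compare as equal; differences that survive normalization deserve a closer look. The normalization is deliberately lossy, so it can also mask a genuine single-character change: use it to prioritize review, never to dismiss a difference outright.

```python
def ocr_normalize(text: str) -> str:
    """Collapse the most common OCR confusions to a canonical form.

    Lossy by design: '1' and 'l' become indistinguishable, as do 'O'/'0'
    and 'rn'/'m', and all whitespace is removed to absorb spacing errors.
    """
    text = text.replace("1", "l")      # digit one vs lowercase L
    text = text.replace("0", "O")      # zero vs capital O
    text = text.replace("rn", "m")     # 'rn' misread as 'm'
    return "".join(text.split())       # drop all whitespace

def likely_ocr_artifact(original: str, ocr: str) -> bool:
    """True if a flagged difference disappears under OCR normalization."""
    return ocr_normalize(original) == ocr_normalize(ocr)

print(likely_ocr_artifact("liability", "liabi1ity"))   # True  -> probably noise
print(likely_ocr_artifact("the Party", "theParty"))    # True  -> spacing artifact
print(likely_ocr_artifact("$30,000", "$50,000"))       # False -> investigate
```

The lossiness cuts both ways: "modern" vs "modem" would also normalize to equal, even though that could be a real word change. That is exactly the symmetry problem discussed later in this guide, and why ambiguous cases still need a look at the scanned image.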
When you encounter a difference in the comparison output, ask: does this look like an OCR misread, or does it look like a real textual change? A "1" where you expected an "l" in the word "liability" is almost certainly OCR noise. A different dollar amount in a payment clause is almost certainly a real difference that requires investigation.
For ambiguous cases where you cannot tell whether a difference is OCR noise or a real change, go back to the scanned PDF and read the original image. Zoom in on the relevant text. If the scanned image clearly shows the same text as your Word document, the difference is an OCR error. If the scanned image shows different text, you have found a real discrepancy.
OCR quality issues and how they affect your comparison
The quality of the comparison is only as good as the quality of the OCR. And OCR quality depends on several factors, most of which are determined before you even start.
Scan resolution
The single biggest factor in OCR accuracy is the resolution of the scan. A 300 DPI (dots per inch) scan produces significantly better OCR results than a 150 DPI scan. At 150 DPI, the OCR engine is working with blurry character shapes that are ambiguous. At 300 DPI, the characters are crisp enough for reliable recognition. If you have any control over the scanning process, insist on 300 DPI minimum. For documents with small fonts (footnotes, table text), 400 DPI is better.
If you receive a low-resolution scan and cannot request a better one, your OCR results will be worse. You will see more artifacts in the comparison. This does not make the comparison useless, but it means you need to treat the output with more skepticism and verify more differences manually against the scanned image.
Scan quality and physical condition
Beyond resolution, the physical quality of the scan matters. Pages that were scanned at an angle produce skewed text that OCR engines struggle with. Pages scanned from a book or bound document have curved text near the binding that degrades recognition. Photocopies of photocopies (which still happen, especially with older agreements) have degraded character quality. Coffee stains, pen marks, sticky-note shadows, and highlighter marks all interfere with character recognition.
Fax transmissions are a particular problem. Faxed documents have low resolution (typically 200 DPI) and lossy compression that degrades fine details. A contract that was faxed, printed from the fax, and then scanned has gone through three rounds of quality degradation. The OCR output from such a document will be noticeably worse than from a clean scan of a laser-printed page.
Tables and structured content
Tables are the hardest content type for OCR. The OCR engine needs to recognize not just the characters but the table structure: where cells begin and end, which text belongs to which cell, how rows and columns align. Even the best OCR engines struggle with complex tables, especially when cell borders are thin or when text wraps within cells.
For contracts with important tabular content (pricing schedules, service level matrices, milestone tables), the OCR output is least reliable exactly where it matters most. You may find that the OCR engine merges adjacent cells, splits a single cell into two, reorders columns, or drops row headers entirely. When the comparison flags differences in table content, always verify against the scanned image. Do not assume the OCR got the table right.
Headers, footers, and page numbers
Repeating headers and footers create a predictable source of noise. The OCR engine processes them on every page, and if it misreads a character in the header on page 12 but reads it correctly on page 13, the comparison may flag inconsistencies within the OCR output itself. When comparing against your Word original, every header and footer instance that OCR misreads will appear as a difference. If your contract has detailed headers (many law firm templates include the document title, date, and draft number in headers), this noise adds up.
A practical approach is to ignore header and footer differences in the comparison entirely, unless you have specific reason to believe the scanned version has different header or footer content. Focus your review on the body text.
Handwritten annotations and signatures
Standard OCR engines are designed for printed text. They do not reliably read handwriting. If the scanned contract includes handwritten margin notes, initials next to specific clauses, or handwritten fill-ins on blank lines, the OCR engine will either ignore them entirely or produce garbled output. Handwritten annotations will not appear in your comparison at all if the OCR engine ignores them, or they will appear as nonsensical text insertions if the engine attempts to read them.
This is a significant limitation. If someone hand-annotated the contract before scanning, those annotations may represent agreed-upon modifications. They will not show up in your text comparison. You need to review the scanned images visually for any handwritten marks, separately from the text comparison.
When scanned contract comparison matters most
Not every scanned PDF needs to be compared against a Word original. But several common scenarios make this comparison essential.
Verifying that the executed copy matches the agreed version
This is the most common scenario. You negotiated a contract in Word, agreed on the final language, and the other side sent back a signed copy as a scanned PDF. You need to confirm that what they signed is what you agreed to. Maybe they printed the wrong version. Maybe they made last-minute handwritten changes. Maybe the version that went to the printer had a stale paragraph from an earlier draft.
This is essentially the same verification step as comparing the final clean copy before signing, but harder because you are working with a scan instead of a .docx file. The stakes are the same: you are about to rely on this document as the binding agreement. If it does not match what was negotiated, you need to know before it goes into your records as the executed version.
Reviewing contracts from counterparties who only send scans
Some counterparties, particularly in international transactions or in industries where digital workflows are less established, send scanned PDFs as their standard document format. They may not have a Word version to share, or they may be reluctant to share the editable file. When you receive a scanned contract that you need to compare against your own version, OCR is the only way to get to a text-based comparison.
This scenario is frustrating because the ideal solution is simple: ask for the Word file. But when that is not possible (because the counterparty will not provide it, or because the document genuinely only exists as a scan), OCR-based comparison is your best available option.
Legacy contract comparison
Older agreements often exist only as scans. The firm may have paper originals in storage and scanned copies in the document management system, but no Word file. When a new transaction requires comparing the current deal terms against a legacy agreement (for example, when negotiating a renewal or amendment), OCR is the only way to get the legacy text into a format that can be compared against the new draft.
Legacy comparison has an additional challenge: older scans tend to be lower quality. Documents scanned ten or fifteen years ago were often scanned at 150 DPI or less, and the physical originals may have degraded. Expect more OCR noise with older scans and budget additional time for manual verification.
Audit and compliance reviews
Regulatory audits sometimes require comparing executed contracts against approved templates or standard terms. If the executed contracts are stored as scans (common in older filing systems), the comparison requires OCR. Compliance teams may need to verify that hundreds of executed agreements conform to approved language, which makes the scale of the OCR challenge much larger.
What OCR-based comparison cannot reliably catch
OCR-based comparison is a useful tool, but it has real limitations. Understanding them is important because overconfidence in the comparison output can be worse than not running it at all.
Single-character changes that look like OCR errors
This is the fundamental problem. If someone changed "30" to "3O" (the digit three followed by the capital letter O) in a dollar amount, and the OCR engine reads the scanned "3O" as "30" (correcting the visual ambiguity), the comparison will show no difference. Conversely, if the scanned version legitimately says "30" and the OCR misreads the "0" as "O," the comparison will flag a difference that does not exist.
The problem is symmetrical: OCR errors can both hide real differences and create false ones. You cannot distinguish the two without going back to the scanned image. For high-stakes comparisons, this means you cannot fully trust either the presence or absence of a flagged difference without visual verification.
Formatting and layout changes
OCR extracts text but does not reliably preserve formatting intent. Bold, italic, and underline are sometimes detected, sometimes missed. Indentation levels are approximate. Paragraph spacing is reconstructed from image analysis and may not match the original Word formatting. As a result, any formatting comparison between a scanned document and a Word original is essentially meaningless. If the other side reformatted the contract (different fonts, different margins, different template), you will not be able to tell from the OCR output whether those were intentional formatting changes or OCR reconstruction artifacts.
This matters for contracts where formatting carries meaning. Numbered lists, indentation levels, and section hierarchy can affect interpretation. If a sub-clause was promoted to a main clause (changing its legal scope), the OCR output may or may not reflect the structural change, depending on how the OCR engine reconstructed the document hierarchy.
Images, stamps, and watermarks
If the scanned contract includes stamps (notary stamps, company seals), watermarks ("DRAFT," "CONFIDENTIAL"), or embedded images (logos, diagrams), the OCR engine will either ignore them or attempt to read them as text, producing noise. These elements will not appear in your comparison output in any meaningful way. If a stamp or watermark conveys important information (for example, "EXECUTED" vs. "DRAFT" watermark), you need to check it visually.
Page ordering and completeness
OCR processes pages sequentially. If the scanned PDF is missing a page (it happens, especially with multi-page scans where a page did not feed through the scanner), the OCR output will be shorter than the original. The comparison will flag this as a deletion, which is correct. But if the scanning order was wrong (pages 14 and 15 swapped), the comparison output will show a confusing pattern of deletions and insertions that may be hard to interpret.
Before running the comparison, check the page count. If your Word original is 32 pages and the scanned PDF is 31 pages, investigate the missing page before spending time on a text comparison.
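Once both versions are in text form, a cheap supplementary completeness check is to compare overall word counts: a missing page in a 32-page contract shows up as a deficit of several hundred words. The 2% tolerance below is an illustrative guess to absorb normal OCR noise, not a calibrated value.

```python
def possible_missing_content(original_words: int, ocr_words: int,
                             tolerance: float = 0.02) -> bool:
    """Flag the scan for manual page-count review if the OCR output's
    word count deviates from the original by more than the tolerance.

    The 2% default is an assumed threshold: small deviations are normal
    OCR noise, while a missing page in a ~30-page contract is roughly a
    3% deficit.
    """
    if original_words == 0:
        return ocr_words != 0
    return abs(original_words - ocr_words) / original_words > tolerance

print(possible_missing_content(15_000, 14_800))  # minor OCR noise: False
print(possible_missing_content(15_000, 14_500))  # roughly a page short: True
```

This is a coarse signal only — it cannot tell a missing page from a dropped table, and it says nothing about swapped page order — but it costs seconds and catches the most common failure.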
Best practices: avoiding the problem altogether
The best way to handle scanned-contract comparison is to avoid needing it in the first place. This is not always possible, but several practices significantly reduce how often you end up in the "I have a Word file and a scanned PDF and I need to compare them" situation.
Always keep the Word original
This sounds obvious, but it is worth stating explicitly. Every contract your team drafts or negotiates should have the final Word version stored in your document management system, clearly labeled as the agreed version. If the executed copy is a scan, the Word version is your baseline for comparison. If you do not have the Word original, you are comparing a scan against nothing, which means you are reading the scan and hoping you remember what it should say.
Request Word versions whenever possible
When the counterparty sends a scanned PDF, ask for the Word version. Many parties will provide it on request. Some will not, either because they do not have it (unlikely for recently drafted documents) or because they prefer not to share editable files (more common). But the request costs nothing, and receiving a Word file eliminates the entire OCR pipeline and all its associated uncertainty.
Make this a standard part of your process: when you receive a signed contract as a scanned PDF, reply with "Thanks. Can you also send the Word version of the executed document for our records?" Many counterparties will comply without question.
Compare before printing and scanning
If your own firm is preparing the execution copy, run the comparison while you still have both documents in Word format. The time to verify that the execution copy matches the agreed version is before it goes to the printer, not after it comes back from the scanner. This gives you a clean, high-confidence comparison with none of the OCR complications.
Scan at high resolution
When you do need to scan contracts, use 300 DPI minimum. Set the scanner to black-and-white or grayscale mode (not color, which produces larger files without improving OCR accuracy). Use a flatbed scanner rather than a sheet feeder for important documents, as sheet feeders can produce slight skew that degrades OCR quality. If your office scanner has an "OCR optimization" or "text enhancement" setting, enable it.
Run OCR immediately after scanning
Do not wait until you need to compare a scanned document to OCR it. OCR the document as part of the scanning workflow, so the searchable PDF is the version that goes into your document management system. This saves time later and also makes the document searchable for other purposes (full-text search, matter indexing, knowledge management).
Use comparison tools that handle PDF input
If scanned contract comparison is a regular part of your workflow (not a once-a-year occurrence), consider using a comparison tool that accepts PDF input and handles OCR internally. This streamlines the process by combining the OCR and comparison steps. Some tools, like Draftable, accept PDFs directly. Others require .docx input, which means you need the separate OCR-and-convert step described above.
Clausul currently works with .docx files. If you have two Word documents to compare, upload them to Clausul for a comparison that separates substantive changes from formatting noise. For scanned PDFs, you would need to convert to .docx first using the OCR process described in this guide.
Putting it together: a realistic workflow
Here is the complete workflow for comparing a scanned contract against a Word original, from start to finish.
- Receive the scanned PDF. Check the page count against your Word original. If pages are missing, request a complete scan before proceeding.
- Quick visual scan. Flip through the scanned PDF looking for obvious issues: missing pages, rotated pages, handwritten annotations, stamps, or significant visual differences (different template, different formatting).
- OCR the scan. Use Adobe Acrobat Pro, ABBYY FineReader, or your preferred OCR tool. Verify the language setting matches the contract language.
- Convert to .docx. Export the OCR result to Word format. Open the resulting .docx and do a quick sanity check: does the text look reasonable? Are tables roughly intact? Are there obvious OCR failures (large blocks of garbled text)?
- Run the comparison. Compare the OCR-generated .docx against your original Word document. Use content-only comparison mode if available.
- Review with OCR awareness. Go through the comparison output. For each flagged difference, determine whether it is an OCR artifact (character substitution, spacing error, punctuation misread) or a real textual difference.
- Verify real differences. For any difference that looks real (not an OCR artifact), go back to the scanned PDF image and verify visually. Zoom in on the relevant text. Confirm whether the scanned version actually differs from the Word original.
- Check handwritten elements separately. If the scanned contract has any handwritten text, annotations, or fill-ins, review those visually. They will not appear in the text comparison.
- Document your findings. Note any confirmed differences between the scanned version and the Word original. Flag them for follow-up with the counterparty or with the responsible attorney on the matter.
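The quick sanity check in the workflow above — spotting blocks of garbled OCR output before running the comparison — can be roughed out as a heuristic that flags lines dominated by characters unlikely to appear in legal prose. Both the threshold and the allowed-punctuation set are illustrative assumptions, not tuned values.

```python
import re

# Punctuation commonly found in legal text; anything outside this set
# and non-alphanumeric counts as "junk". An assumed set, for illustration.
ALLOWED_PUNCT = set(".,;:()'\"-$%&/")

def garbled_lines(text: str, threshold: float = 0.4) -> list[int]:
    """Return 1-based line numbers where the share of junk characters
    exceeds the threshold -- a rough signal of failed OCR on that line."""
    flagged = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        chars = re.sub(r"\s", "", line)  # ignore whitespace
        if not chars:
            continue
        junk = sum(1 for c in chars if not c.isalnum() and c not in ALLOWED_PUNCT)
        if junk / len(chars) > threshold:
            flagged.append(lineno)
    return flagged

sample = "Section 4.2: Payment terms.\n~*^|}{~*^|}{\nThe Party shall pay."
print(garbled_lines(sample))  # [2]
```

A handful of flagged lines usually means a stain, stamp, or skewed region on those pages; large runs of flagged lines mean the OCR failed badly enough that re-scanning at higher resolution is faster than fighting the noise.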
Total time for a clean comparison (the scanned version matches the Word original): 15 to 30 minutes, depending on document length and scan quality. Most of that time is the OCR processing and the artifact review. If you find real differences, add whatever time is needed to investigate and resolve them.
That is significantly longer than the two-minute .docx-to-.docx comparison you would do with two Word files. This is exactly why the best practice is to avoid the scanned-PDF scenario whenever possible. But when you cannot avoid it, this process gives you reasonable confidence that the documents match, or flags the differences that need attention.
Frequently asked questions
Can I compare a scanned PDF directly against a Word document without OCR?
No. A scanned PDF contains images of pages, not text. There is no text layer for a comparison tool to work with. You must run OCR first to extract the text from the scanned images before any comparison is possible. Without OCR, the comparison tool has nothing to compare against your Word document.
How accurate is OCR for legal documents?
Modern OCR engines achieve 95-99% character accuracy on clean, high-resolution scans of printed text. That sounds high, but a 30-page contract with roughly 15,000 words runs to something like 80,000 characters, so even 99% accuracy leaves several hundred misread characters. Those errors show up as false differences in your comparison output. Accuracy drops significantly with poor scan quality, small fonts, tables, and handwritten annotations. For legal verification purposes, OCR-based comparison is useful for catching large-scale differences but should not be your sole method of verification.
What is the best OCR tool for legal contracts?
Adobe Acrobat Pro is the most widely used OCR tool in legal practice and handles most contracts well. ABBYY FineReader consistently produces the highest accuracy, particularly on complex layouts with tables and columns. For free alternatives, the built-in OCR in newer versions of Microsoft OneNote or the open-source Tesseract engine (via a front-end like PDF OCR X or gImageReader) can work for simple, clean scans. The best tool depends on your scan quality and volume. If you regularly compare scanned contracts, investing in ABBYY or Acrobat Pro pays for itself quickly.
What scan resolution do I need for reliable OCR?
At minimum, 300 DPI. For best results, especially with smaller fonts or detailed tables, 400-600 DPI is recommended. Many office scanners default to 150 or 200 DPI, which produces noticeably worse OCR results. If you are scanning contracts specifically for comparison purposes, set your scanner to 300 DPI black-and-white or grayscale. Color scanning rarely improves OCR accuracy and produces much larger files.
Can Clausul compare a scanned PDF against a Word document?
Clausul currently compares .docx files. It does not perform OCR or accept scanned PDFs as input. If you have a scanned PDF, you would need to OCR it and convert it to .docx first using a tool like Adobe Acrobat Pro or ABBYY FineReader, then upload the resulting .docx to Clausul for comparison. We are transparent about this limitation because OCR-based comparison introduces a layer of uncertainty that users should be aware of.
Is OCR-based comparison reliable enough to verify an executed contract?
It depends on what you mean by "verify." OCR-based comparison is good enough to catch large-scale discrepancies: missing clauses, changed dollar amounts, different party names, added or deleted sections. It is not reliable enough to catch every single-character difference, because OCR errors will generate false positives that make it hard to distinguish real changes from recognition artifacts. For high-stakes verification of an executed copy, OCR-based comparison is a useful first pass, but it should be supplemented by a careful read of any sections where the comparison flags differences, and ideally by requesting the Word version of the executed document.