fix: OCR scanned pages when mixed with text pages by Brumbelow · Pull Request #1821 · microsoft/markitdown

Brumbelow · 2026-04-22T17:42:43Z

When a multi-page PDF mixed selectable text with scanned pages, the converter only emitted content for pages that had either an image object or an extractable text layer. A scanned page with neither (no page.images, no page.extract_text()) silently fell through, and the whole-document _ocr_full_pages fallback only runs when every page is empty. So pages 2..N just disappeared.
This adds a small _ocr_page helper that renders a single pdfplumber page to PNG and runs the configured OCR service on it, and calls it from the per-page loop when the page yields nothing the existing branches can use. The output format matches what _ocr_full_pages already emits, so existing snapshot tests are unaffected.
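
For readers without the diff at hand, the per-page fallback described above could look roughly like the sketch below. It is not the code from this PR. The names `ocr_page`, `convert_pages`, and the `run_ocr` callable (PNG bytes in, text out) are hypothetical stand-ins for the helper and the configured OCR service, and the existing image-object branch is omitted. The only pdfplumber calls assumed are `page.extract_text()` and `page.to_image(resolution=...)`, whose `.original` attribute is a PIL image.

```python
import io
from typing import Callable, Iterable, List


def ocr_page(page, run_ocr: Callable[[bytes], str], resolution: int = 200) -> str:
    """Render one pdfplumber-style page to PNG bytes and OCR it.

    `run_ocr` is a hypothetical stand-in for the configured OCR service.
    """
    buf = io.BytesIO()
    # PageImage.original is the underlying PIL image; save it as PNG in memory.
    page.to_image(resolution=resolution).original.save(buf, format="PNG")
    return run_ocr(buf.getvalue()).strip()


def convert_pages(pages: Iterable, run_ocr: Callable[[bytes], str]) -> List[str]:
    """Return one text chunk per page that produced any content."""
    parts: List[str] = []
    for page in pages:
        # extract_text() may return None on some versions, so normalise it.
        text = (page.extract_text() or "").strip()
        if text:
            parts.append(text)
            continue
        # No usable text layer: OCR this single page instead of dropping it.
        # (The real converter also handles embedded image objects first;
        # that branch is left out of this sketch.)
        ocr_text = ocr_page(page, run_ocr)
        if ocr_text:
            parts.append(ocr_text)
    return parts
```

The point of the per-page fallback is that the decision is made page by page, so a single scanned page no longer depends on every other page in the document being empty.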

Test: new pdf_text_then_blank.pdf fixture (~1.7 KB, page 1 has selectable text, page 2 is blank) plus a snapshot test that asserts both pages survive the round-trip.

Brumbelow · 2026-04-22T17:43:53Z

@microsoft-github-policy-service agree

fix: OCR scanned pages when mixed with text pages

9e525dd

Brumbelow changed the title ~~fix: OCR scanned pages when mixed with text pages~~ fix: OCR scanned pages when mixed with text pages (#1791) Apr 22, 2026

Brumbelow changed the title ~~fix: OCR scanned pages when mixed with text pages (#1791)~~ fix: OCR scanned pages when mixed with text pages Apr 22, 2026

Brumbelow wants to merge 1 commit into microsoft:main from Brumbelow:fix/ocr-multipage-scanned-pdf
