fix: OCR scanned pages when mixed with text pages #1821

Open
Brumbelow wants to merge 1 commit into microsoft:main from Brumbelow:fix/ocr-multipage-scanned-pdf

@Brumbelow

Fixes #1791

  • When a multi-page PDF mixed selectable-text pages with scanned pages, the converter only emitted content for pages that had either an image object or an extractable text layer. A scanned page with neither (empty page.images and nothing returned by page.extract_text()) silently fell through, and the whole-document _ocr_full_pages fallback only runs when every page is empty. So pages 2..N were simply missing from the output.

  • This PR adds a small _ocr_page helper that renders a single pdfplumber page to PNG and runs the configured OCR service on it. The per-page loop calls it whenever a page yields nothing the existing branches can use. The output format matches what _ocr_full_pages already emits, so existing snapshot tests are unaffected.
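
The per-page fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ocr_page`, `page_needs_ocr`, `convert_pages`, and the `ocr` callable are hypothetical names, and the real `_ocr_page` helper and output format may differ. The only pdfplumber calls assumed are `page.images`, `page.extract_text()`, and `page.to_image(...).save(...)`.

```python
import io
from typing import Callable, Iterable, List


def page_needs_ocr(page) -> bool:
    """True when a page has neither a text layer nor image objects."""
    has_text = bool((page.extract_text() or "").strip())
    has_images = bool(page.images)
    return not has_text and not has_images


def ocr_page(page, ocr: Callable[[bytes], str], resolution: int = 300) -> str:
    """Render one pdfplumber-style page to PNG bytes and OCR it.

    `ocr` is a hypothetical stand-in for the configured OCR service.
    """
    buf = io.BytesIO()
    page.to_image(resolution=resolution).save(buf, format="PNG")
    return ocr(buf.getvalue()).strip()


def convert_pages(pages: Iterable, ocr: Callable[[bytes], str]) -> List[str]:
    """Return one text entry per page, OCR-ing pages that would fall through."""
    out: List[str] = []
    for page in pages:
        if page_needs_ocr(page):
            # Previously this case emitted nothing for the page.
            out.append(ocr_page(page, ocr))
        else:
            # Simplified stand-in for the existing text/image branches.
            out.append((page.extract_text() or "").strip())
    return out
```

Because the helpers only rely on duck typing, they can be exercised with small fake page objects as well as with pages from `pdfplumber.open(...)`. The decision is made per page, so it no longer depends on every page in the document being empty.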

Test: new pdf_text_then_blank.pdf fixture (~1.7 KB, page 1 has selectable text, page 2 is blank) plus a snapshot test that asserts both pages survive the round-trip.

@Brumbelow
Author

@microsoft-github-policy-service agree

@Brumbelow changed the title from "fix: OCR scanned pages when mixed with text pages" to "fix: OCR scanned pages when mixed with text pages (#1791)" on Apr 22, 2026
@Brumbelow changed the title from "fix: OCR scanned pages when mixed with text pages (#1791)" to "fix: OCR scanned pages when mixed with text pages" on Apr 22, 2026
Development

Successfully merging this pull request may close these issues.

[Bug] Vision-LLM conversion only processes the first page of scanned PDFs
