feat: Add batch conversion API and CLI support #1825

Open
jacken wants to merge 9 commits into microsoft:main from jacken:feat/batch-conversion

jacken commented Apr 23, 2026

Adds convert_batch() to the MarkItDown class and corresponding CLI flags for converting multiple files concurrently.

Library API

A new convert_batch() method on MarkItDown accepts an iterable of sources and yields BatchConversionResult objects in completion order:

from markitdown import MarkItDown, BatchConversionResult

md = MarkItDown()
for result in md.convert_batch(["file1.pdf", "file2.docx", "file3.html"]):
    if result.success:
        print(result.result.markdown)
    else:
        print(f"Failed: {result.source}: {result.error}")

Each BatchConversionResult has:

  • source — the original input
  • result — a DocumentConverterResult, or None on failure
  • error — the exception, or None on success
  • success — True when no error is set
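
The field list above could be modelled roughly as follows. This is a minimal sketch, not the PR's code: the class name `BatchResultSketch` is hypothetical, and deriving `success` from `error` as a property is an assumption about how the real BatchConversionResult keeps the two consistent.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BatchResultSketch:
    """Illustrative stand-in for BatchConversionResult (hypothetical name)."""

    source: Any                            # the original input, e.g. a path or URL
    result: Optional[Any] = None           # conversion output, or None on failure
    error: Optional[BaseException] = None  # the exception, or None on success

    @property
    def success(self) -> bool:
        # Derived rather than stored, so it can never disagree with `error`.
        return self.error is None
```

Computing `success` on demand means callers only ever set `result` or `error`, and the flag follows from them.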

Error handling is configurable via on_error:

  • "collect" (default) — wraps errors and continues
  • "raise" — re-raises the first error immediately

Concurrency uses ThreadPoolExecutor. Threads help most where a converter waits on I/O or runs native code that releases the GIL (lxml does so while parsing); pure-Python converters such as pdfminer stay GIL-bound and gain less. A pluggable executor parameter lets callers supply their own executor; if omitted, one is created and shut down automatically. The workers parameter controls thread count when using the default executor.
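
One way the behaviour described above might fit together is sketched below. It is an illustration under assumptions, not the implementation in this PR: `convert_batch_sketch` and its `convert` callable are hypothetical, and results are plain `(source, result, error)` tuples rather than BatchConversionResult objects. Only the parameter names `workers`, `executor`, and `on_error` come from the description.

```python
from concurrent.futures import Executor, ThreadPoolExecutor, as_completed
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple

# (source, result, error): a hypothetical stand-in for BatchConversionResult.
Outcome = Tuple[Any, Optional[Any], Optional[BaseException]]


def convert_batch_sketch(
    convert: Callable[[Any], Any],
    sources: Iterable[Any],
    *,
    workers: Optional[int] = None,
    executor: Optional[Executor] = None,
    on_error: str = "collect",
) -> Iterator[Outcome]:
    """Run `convert` over `sources` concurrently, yielding in completion order."""
    if on_error not in ("collect", "raise"):
        raise ValueError(f"unknown on_error mode: {on_error!r}")

    # Only shut down an executor this function created itself.
    owns_executor = executor is None
    pool = executor if executor is not None else ThreadPoolExecutor(max_workers=workers)
    try:
        futures = {pool.submit(convert, src): src for src in sources}
        for future in as_completed(futures):
            src = futures[future]
            try:
                yield src, future.result(), None
            except Exception as exc:
                if on_error == "raise":
                    raise  # propagate the first failure to the caller
                yield src, None, exc  # "collect": record the error and continue
    finally:
        if owns_executor:
            pool.shutdown(wait=True)
```

Because this is a generator, nothing is submitted until the caller starts iterating, and the `finally` clause releases the default executor even when `on_error="raise"` aborts the loop.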

BatchConversionResult is exported from the top-level markitdown package.

CLI

# Convert multiple files to stdout (separated by banners)
markitdown file1.pdf file2.docx

# Write each file to an output directory
markitdown file1.pdf file2.docx --output-dir ./output/

# Control parallelism
markitdown file1.pdf file2.docx --workers 4

# Stop on first error instead of collecting all errors
markitdown file1.pdf file2.docx --fail-fast

Single-file and stdin modes are unchanged.
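
For orientation, the flags above could be parsed along these lines. This is a sketch with argparse under assumptions: `build_parser`, `on_error_mode`, and the program name are hypothetical, the defaults are guesses, and the mapping of `--fail-fast` onto `on_error="raise"` is inferred from the description rather than confirmed by the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the multi-file flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    # Zero files is allowed so that stdin mode can remain the fallback.
    parser.add_argument("files", nargs="*", help="input files to convert")
    parser.add_argument("--output-dir", default=None, help="write one output per input here")
    parser.add_argument("--workers", type=int, default=None, help="thread count")
    parser.add_argument("--fail-fast", action="store_true", help="stop on first error")
    return parser


def on_error_mode(args: argparse.Namespace) -> str:
    # Assumed mapping: --fail-fast selects "raise", otherwise errors are collected.
    return "raise" if args.fail_fast else "collect"
```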

Tests

  • tests/test_batch.py — unit and integration tests for BatchConversionResult and convert_batch()
  • tests/test_cli_misc.py — CLI tests for multi-file stdout, --output-dir, --fail-fast, --workers, and error collection

jacken and others added 9 commits April 23, 2026 00:34
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds concurrent batch conversion via ThreadPoolExecutor with configurable
on_error handling (collect/raise), custom executor support, and workers param.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ut of loop, fix fail-fast test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jacken changed the title from "## Add batch conversion API and CLI support" to "Add batch conversion API and CLI support" on Apr 23, 2026
jacken commented Apr 23, 2026

@microsoft-github-policy-service agree

jacken changed the title from "Add batch conversion API and CLI support" to "Feat: Add batch conversion API and CLI support" on Apr 24, 2026
jacken changed the title from "Feat: Add batch conversion API and CLI support" to "feat: Add batch conversion API and CLI support" on Apr 24, 2026