feat: Add batch conversion API and CLI support #1825
Open
jacken wants to merge 9 commits into microsoft:main
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds concurrent batch conversion via ThreadPoolExecutor with configurable on_error handling (collect/raise), custom executor support, and workers param. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ut of loop, fix fail-fast test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jacken (Author) commented: @microsoft-github-policy-service agree
Adds `convert_batch()` to the `MarkItDown` class and corresponding CLI flags for converting multiple files concurrently.

Library API

A new `convert_batch()` method on `MarkItDown` accepts an iterable of sources and yields `BatchConversionResult` objects in completion order.

Each `BatchConversionResult` has:
- `source`: the original input
- `result`: a `DocumentConverterResult`, or `None` on failure
- `error`: the exception, or `None` on success
- `success`: `True` when no error is set

Error handling is configurable via `on_error`:
- `"collect"` (default): wraps errors and continues
- `"raise"`: re-raises the first error immediately

Concurrency uses `ThreadPoolExecutor` (appropriate since heavy converters like pdfminer and lxml release the GIL). A pluggable `executor` parameter lets callers supply their own executor; if omitted, one is created and shut down automatically. The `workers` parameter controls thread count when using the default executor.

`BatchConversionResult` is exported from the top-level `markitdown` package.

CLI
Single-file and stdin modes are unchanged.
Tests
- `tests/test_batch.py`: unit and integration tests for `BatchConversionResult` and `convert_batch()`
- `tests/test_cli_misc.py`: CLI tests for multi-file stdout, `--output-dir`, `--fail-fast`, `--workers`, and error collection
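
Illustrative sketch

The semantics described under Library API (completion-order results, `on_error` set to `"collect"` or `"raise"`, an optional caller-supplied executor, and a `workers` count for the default pool) can be illustrated with a small standalone sketch. This is not the code in this PR: `BatchResult`, `convert_batch_sketch`, and `fake_convert` are hypothetical names used only for illustration, and the conversion callable is passed in explicitly so the example runs without `markitdown` installed. It assumes Python 3.9+ for `cancel_futures`.

```python
from concurrent.futures import Executor, ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Optional


@dataclass
class BatchResult:
    """Hypothetical stand-in for BatchConversionResult."""

    source: Any
    result: Optional[Any] = None
    error: Optional[Exception] = None

    @property
    def success(self) -> bool:
        # Successful when no error was recorded.
        return self.error is None


def convert_batch_sketch(
    convert: Callable[[Any], Any],
    sources: Iterable[Any],
    *,
    workers: Optional[int] = None,
    on_error: str = "collect",
    executor: Optional[Executor] = None,
) -> Iterator[BatchResult]:
    """Yield one BatchResult per source, in completion order."""
    if on_error not in ("collect", "raise"):
        raise ValueError(f"unknown on_error mode: {on_error!r}")

    # Only shut the pool down if it was created here.
    owns_pool = executor is None
    pool = ThreadPoolExecutor(max_workers=workers) if owns_pool else executor
    try:
        futures = {pool.submit(convert, src): src for src in sources}
        for future in as_completed(futures):
            src = futures[future]
            try:
                outcome = BatchResult(source=src, result=future.result())
            except Exception as exc:
                if on_error == "raise":
                    raise  # fail fast on the first error observed
                outcome = BatchResult(source=src, error=exc)
            yield outcome
    finally:
        if owns_pool:
            # Drop work that has not started, then wait for running tasks.
            pool.shutdown(wait=True, cancel_futures=True)


def fake_convert(path: str) -> str:
    """Hypothetical converter: fails for '.bad' inputs."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot convert {path}")
    return f"converted {path}"


results = list(
    convert_batch_sketch(fake_convert, ["a.pdf", "b.bad", "c.docx"], workers=2)
)
failed = [r.source for r in results if not r.success]
```

Because results arrive in completion order, callers that need input order would have to match on `source` themselves. In `"collect"` mode the failure is carried in `error` instead of interrupting the loop, which is why `failed` can be computed after the whole batch has finished.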