feat: add disabled_converters parameter, converters property, and CLI flags (closes #1665) #1813

Open
RhythrosaLabs wants to merge 1 commit into microsoft:main from RhythrosaLabs:feat/disabled-converters

@RhythrosaLabs

Summary

Closes #1665

Resolves the friction points raised in #1665 ("No way to disable specific built-in converters without subclassing") with a clean, backward-compatible API addition.


Problem

When MarkItDown() is constructed, all built-in converters are registered unconditionally. There is currently no public surface to:

  1. Security-harden a deployment by excluding dangerous converters (e.g. ZipConverter for zip-bomb protection, AudioConverter to prevent implicit Whisper API network calls on untrusted input).
  2. Slim down a deployment where optional extras are intentionally not installed and the offending converter would immediately raise MissingDependencyException.
  3. Inspect which converters are active without reaching into the private _converters list.

Changes

packages/markitdown/src/markitdown/_markitdown.py

| What | Details |
| --- | --- |
| MarkItDown.__init__(disabled_converters=…) | New keyword-only parameter. Accepts any iterable of DocumentConverter subclasses. Each listed type is skipped during enable_builtins(). Validation is eager: raises TypeError if any element is not a DocumentConverter subclass. |
| MarkItDown.disabled_converters property | Returns the frozenset of types that were excluded at construction time. |
| MarkItDown.converters property | Returns a read-only tuple[ConverterRegistration, …] snapshot sorted by priority, so callers no longer need to reach into _converters. |
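
The behaviour described in the table can be sketched roughly as follows. This is a self-contained illustration, not the code in this PR: `MiniMarkItDown`, `validate_disabled`, the `BUILTINS` tuple, and the priority values are hypothetical stand-ins, and the converter classes are empty placeholders rather than the real markitdown types.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


class DocumentConverter:
    """Stand-in for markitdown's converter base class (not the real one)."""


class PlainTextConverter(DocumentConverter):
    """Placeholder converter."""


class ZipConverter(DocumentConverter):
    """Placeholder converter."""


class AudioConverter(DocumentConverter):
    """Placeholder converter."""


@dataclass(frozen=True)
class ConverterRegistration:
    """Stand-in that pairs a converter instance with its priority."""
    converter: DocumentConverter
    priority: float


def validate_disabled(disabled: Optional[Iterable[type]]) -> frozenset:
    """Eagerly check that every element is a DocumentConverter subclass."""
    if disabled is None:
        return frozenset()
    checked = set()
    for item in disabled:
        # Instances, None, and unrelated types are all rejected up front.
        if not (isinstance(item, type) and issubclass(item, DocumentConverter)):
            raise TypeError(
                f"disabled_converters must contain DocumentConverter subclasses, got {item!r}"
            )
        checked.add(item)
    return frozenset(checked)


class MiniMarkItDown:
    # Hypothetical built-in list; priorities are illustrative values only.
    BUILTINS = ((PlainTextConverter, 10.0), (ZipConverter, 0.0), (AudioConverter, 0.0))

    def __init__(self, *, disabled_converters: Optional[Iterable[type]] = None):
        self._disabled = validate_disabled(disabled_converters)
        # Disabled types are skipped, so they are never registered.
        self._converters = [
            ConverterRegistration(converter=cls(), priority=priority)
            for cls, priority in self.BUILTINS
            if cls not in self._disabled
        ]

    @property
    def disabled_converters(self) -> frozenset:
        return self._disabled

    @property
    def converters(self) -> tuple:
        # A fresh, priority-sorted tuple on every access.
        return tuple(sorted(self._converters, key=lambda reg: reg.priority))
```

Validating before any registration happens means a bad argument fails at construction time instead of silently leaving a converter enabled.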

packages/markitdown/src/markitdown/__main__.py

| What | Details |
| --- | --- |
| --disable-converter CONVERTER | Repeatable flag; resolves class names against markitdown.converters. Unknown names exit non-zero with a clear message. |
| --list-converters | Prints all built-in converter class names and exits 0. |
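
One way such flags could be wired with `argparse` is sketched below. It is an assumption-laden illustration rather than the PR's `__main__.py`: the `BUILTIN_CONVERTERS` dictionary is a hypothetical stand-in for the names resolved against `markitdown.converters`, and `build_parser`, `resolve`, and `main` are names invented for this example.

```python
import argparse

# Hypothetical registry; the real CLI resolves names against markitdown.converters.
BUILTIN_CONVERTERS = {
    "AudioConverter": type("AudioConverter", (), {}),
    "PlainTextConverter": type("PlainTextConverter", (), {}),
    "ZipConverter": type("ZipConverter", (), {}),
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("filename", nargs="?")
    # action="append" makes the flag repeatable; each use adds one name.
    parser.add_argument(
        "--disable-converter",
        action="append",
        default=[],
        dest="disabled",
        metavar="CONVERTER",
    )
    parser.add_argument("--list-converters", action="store_true")
    return parser


def resolve(names):
    """Map class names to classes; exit non-zero on unknown names."""
    unknown = [name for name in names if name not in BUILTIN_CONVERTERS]
    if unknown:
        # SystemExit with a string prints it to stderr and exits with status 1.
        raise SystemExit(f"Unknown converter name(s): {', '.join(unknown)}")
    return [BUILTIN_CONVERTERS[name] for name in names]


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    if args.list_converters:
        for name in sorted(BUILTIN_CONVERTERS):
            print(name)
        return 0
    disabled = resolve(args.disabled)
    # A real CLI would pass `disabled` to MarkItDown(disabled_converters=...).
    print(f"would convert {args.filename!r} with {len(disabled)} converter(s) disabled")
    return 0
```

Resolving every name before doing any work keeps a typo in a flag from silently leaving a converter enabled.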

packages/markitdown/src/markitdown/__init__.py

ConverterRegistration is now exported from the top-level package so callers can type-annotate against it.

packages/markitdown/tests/test_disabled_converters.py (new file)

22 tests covering:

  • Default construction registers all built-ins
  • Single / multiple / empty / frozenset disabling
  • TypeError on non-subclass, instance, or None
  • disabled_converters and converters properties
  • Snapshot immutability of converters
  • Behavioural: disabled converters are not invoked
  • enable_builtins=False + manual enable_builtins() pattern
  • CLI --list-converters, --disable-converter, unknown name detection, multi-flag

Usage examples

Python API

from markitdown import MarkItDown
from markitdown.converters import ZipConverter, AudioConverter

# Security hardening: never process zips or audio on this endpoint
md = MarkItDown(disabled_converters=[ZipConverter, AudioConverter])

# Inspect what is active
for reg in md.converters:
    print(type(reg.converter).__name__, reg.priority)

# What was excluded?
print(md.disabled_converters)
# frozenset({<class 'ZipConverter'>, <class 'AudioConverter'>})

CLI

# See all built-in converter names
markitdown --list-converters

# Disable ZipConverter and AudioConverter for this conversion
markitdown --disable-converter ZipConverter --disable-converter AudioConverter document.pdf

Before / After

| Scenario | Before | After |
| --- | --- | --- |
| Disable ZipConverter | Subclass MarkItDown, override enable_builtins() | MarkItDown(disabled_converters=[ZipConverter]) |
| Know what is registered | md._converters (private) | md.converters (public, sorted tuple) |
| Know what was excluded | Not possible | md.disabled_converters (frozenset) |
| CLI disable | Not possible | --disable-converter ZipConverter |
| CLI list names | Not possible | --list-converters |

Checklist

  • Backward-compatible — disabled_converters defaults to None / empty frozenset; all existing code works unchanged
  • enable_builtins=False + later enable_builtins() pattern also respects the disabled set
  • Eager input validation with a descriptive TypeError
  • 22 new tests; all existing tests pass (pre-existing test_speech_transcription failure is unrelated, requires local Whisper setup)
  • No changes to converter internals or converter-selection logic
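
The deferred-registration item in the checklist can be pictured with the sketch below. It assumes a simplified model: `DeferredMarkItDown` and its `_BUILTIN_TYPES` tuple are hypothetical, the converter classes are empty placeholders, and treating a repeated `enable_builtins()` call as a no-op is an assumption of this example rather than a statement about the library.

```python
class DocumentConverter:
    """Stand-in for markitdown's converter base class (not the real one)."""


class HtmlConverter(DocumentConverter):
    """Placeholder converter."""


class ZipConverter(DocumentConverter):
    """Placeholder converter."""


class DeferredMarkItDown:
    # Hypothetical built-in list used only for this illustration.
    _BUILTIN_TYPES = (HtmlConverter, ZipConverter)

    def __init__(self, *, enable_builtins=True, disabled_converters=()):
        # Stored once at construction; a frozenset cannot be mutated afterwards.
        self._disabled_converter_types = frozenset(disabled_converters)
        self._converters = []
        self._builtins_enabled = False
        if enable_builtins:
            self.enable_builtins()

    def enable_builtins(self):
        if self._builtins_enabled:
            return  # assumption: a second call does nothing
        for converter_type in self._BUILTIN_TYPES:
            # The disabled set is consulted here, so it applies whether this
            # runs from __init__ or from a later manual call.
            if converter_type in self._disabled_converter_types:
                continue
            self._converters.append(converter_type())
        self._builtins_enabled = True
```

Because the filter lives inside `enable_builtins()` rather than in the constructor, both construction paths share the same exclusion logic.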

Closes microsoft#1665

Resolves two friction points raised in issue microsoft#1665:

1. disabled_converters -- a new keyword-only parameter on
   MarkItDown.__init__() that accepts any iterable of
   DocumentConverter subclasses.  Each listed type is skipped during
   enable_builtins(), so it is never registered and therefore never
   invoked.  This enables:

   * Security hardening (e.g. disabling ZipConverter on a service that
     handles untrusted uploads, or AudioConverter to avoid implicit
     Whisper API network calls).
   * Leaner deployments where optional extras are not installed and the
     converter would raise MissingDependencyException anyway.

   Example usage:
     from markitdown import MarkItDown
     from markitdown.converters import ZipConverter, AudioConverter
     md = MarkItDown(disabled_converters=[ZipConverter, AudioConverter])

2. converters property -- a read-only tuple snapshot of all
   currently registered ConverterRegistration objects, sorted by
   priority, so callers can inspect what is active without touching
   the private _converters list.

3. disabled_converters property -- mirrors the frozenset passed at
   construction time for easy introspection / logging.

4. CLI: --disable-converter / --list-converters -- command-line
   equivalents for shell scripts and container entry-points.
   Unknown names exit non-zero with a clear message.

5. ConverterRegistration is now exported from the top-level
   markitdown package so users can type-annotate against it.

Implementation notes:
* Validation is eager: TypeError raised at construction if any element
  of disabled_converters is not a DocumentConverter subclass.
* The frozenset is immutable, preventing accidental post-construction
  mutation.
* enable_builtins() respects _disabled_converter_types, so the
  MarkItDown(enable_builtins=False) + later enable_builtins() pattern
  also honours the disabled set.
* No changes to existing converter internals; fully backward-compatible
  (disabled_converters defaults to None / empty frozenset).
* 22 new unit + integration tests added in test_disabled_converters.py.