fix: handle case-mismatched zip entry names in .docx files #1820

Open
octo-patch wants to merge 1 commit into microsoft:main from octo-patch:fix/issue-1812-badzip-case-mismatch

@octo-patch

Fixes #1812

Problem

Some .docx files produced by certain tools (e.g. legal document systems) have inconsistent casing between the zip central directory and the local file headers. For example, the central directory lists customXml/item2.xml but the local file header contains customXML/item2.xml. This technically violates the zip spec, but such files are produced by certain versions of Microsoft Word and by third-party tools.

Python's zipfile module checks that the two names match when an entry is opened and raises BadZipFile if they differ:

markitdown._exceptions.FileConversionException: File conversion failed after 1 attempts:
 - DocxConverter threw BadZipFile with message: File name in directory 'customXml/item2.xml' and header b'customXML/item2.xml' differ.

Solution

Add _fix_zip_name_casing() to converter_utils/docx/pre_process.py. This function:

  1. Reads the raw zip bytes into a bytearray
  2. Opens the zip using zipfile.ZipFile (which reads the central directory first — this succeeds even with mismatched local headers)
  3. For each central directory entry, finds the corresponding local file header at header_offset
  4. If the local filename differs from the central directory filename only in case (same byte length), patches the local header bytes in-place
  5. Returns a patched BytesIO if any headers were fixed, otherwise returns the original stream unchanged

The central directory is authoritative per the zip spec. The patch is safe: it only applies to case-only differences (same byte length in ASCII paths), so no offset recalculation is needed.
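The five steps above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the name fix_zip_name_casing_sketch and the constants are hypothetical, and it assumes the whole archive fits in memory and that the stream really is a zip (otherwise zipfile.ZipFile raises BadZipFile on open).

```python
import io
import struct
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"  # local file header signature
LOCAL_HEADER_SIZE = 30            # fixed-size part of a local file header
NAME_LEN_OFFSET = 26              # 2-byte little-endian filename length


def fix_zip_name_casing_sketch(stream):
    """Hypothetical sketch: patch case-only name mismatches in local headers."""
    raw = bytearray(stream.read())
    patched = False
    # Opening the archive only parses the central directory, so this
    # succeeds even when local headers disagree with it.
    with zipfile.ZipFile(io.BytesIO(bytes(raw))) as zf:
        for info in zf.infolist():
            start = info.header_offset
            if raw[start:start + 4] != LOCAL_HEADER_SIG:
                continue  # not a local header; leave it alone
            (name_len,) = struct.unpack_from("<H", raw, start + NAME_LEN_OFFSET)
            name_start = start + LOCAL_HEADER_SIZE
            local_name = bytes(raw[name_start:name_start + name_len])
            # Flag bit 0x800 marks UTF-8 names; otherwise zip uses cp437.
            encoding = "utf-8" if info.flag_bits & 0x800 else "cp437"
            central_name = info.orig_filename.encode(encoding)
            # Patch only same-length, ASCII-case-only differences, so no
            # offsets in the archive need to change.
            if (
                local_name != central_name
                and len(local_name) == len(central_name)
                and local_name.lower() == central_name.lower()
            ):
                raw[name_start:name_start + name_len] = central_name
                patched = True
    if not patched:
        stream.seek(0)
        return stream  # nothing to fix: hand back the original stream
    return io.BytesIO(bytes(raw))
```

Because bytes.lower() only folds ASCII letters, names that differ in anything other than ASCII case are left untouched and still fail in zipfile as before.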

pre_process_docx() now calls _fix_zip_name_casing() as its first step, before opening the zip for math pre-processing.

Testing

Added test_fix_zip_name_casing() in test_module_misc.py that:

  • Builds a valid two-entry zip in memory
  • Deliberately corrupts the local file header of one entry to introduce a case mismatch
  • Asserts that reading the corrupted entry raises BadZipFile (reproduces the bug)
  • Applies _fix_zip_name_casing() and asserts both entries are now readable without error
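For reference, the first three bullets (building the archive, corrupting one local header, reproducing the error) might look like the sketch below. The helper names build_mismatched_zip and read_entry are hypothetical and are not taken from test_module_misc.py; the final step of the real test would pass the corrupted bytes through _fix_zip_name_casing() and read both entries again.

```python
import io
import zipfile


def build_mismatched_zip() -> bytes:
    """Build a two-entry zip, then change the case of one local header name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("customXml/item2.xml", "<item/>")
        zf.writestr("word/document.xml", "<document/>")
    data = bytearray(buf.getvalue())
    good = b"customXml/item2.xml"
    bad = b"customXML/item2.xml"
    # The first occurrence of the name sits in the local file header;
    # the central directory copy comes later in the archive.
    pos = data.find(good)
    data[pos:pos + len(good)] = bad
    return bytes(data)


def read_entry(data: bytes, name: str) -> bytes:
    """Read one entry by its central directory name."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read(name)


if __name__ == "__main__":
    corrupted = build_mismatched_zip()
    try:
        read_entry(corrupted, "customXml/item2.xml")
    except zipfile.BadZipFile as exc:
        print("reproduced:", exc)
```

The untouched entry (word/document.xml) stays readable, which confirms that the failure comes from the single patched header rather than from a broken archive.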

…rosoft#1812)

Some .docx files produced by certain tools have inconsistent casing between
the zip central directory and local file headers (e.g. 'customXml/item2.xml'
vs 'customXML/item2.xml'). Python's zipfile raises BadZipFile on this mismatch,
causing DocxConverter to throw FileConversionException.

Add _fix_zip_name_casing() to pre_process.py that patches local file headers
in memory to match the authoritative central directory before the zip is opened.
Add a unit test that verifies the fix resolves the BadZipFile error.

Development

Successfully merging this pull request may close these issues.

BadZipFile crash on .docx files with case-mismatched zip entry names