
fix: write UTF-8 bytes directly to stdout to avoid encoding errors #1800

Open
fuleinist wants to merge 1 commit into microsoft:main from fuleinist:fix/stdout-encoding-utf8
@fuleinist

Fix: stdout Unicode encoding errors on Windows

On Windows (and other platforms) where sys.stdout.encoding is limited (e.g., cp1252, gbk), piping markitdown output to a file causes UnicodeEncodeError for characters outside the target encoding.

Problem

The previous workaround, an encode/decode round trip with errors='replace', still failed when sys.stdout.encoding was None (piped stdout on some platforms), and it did not address the root issue: stdout's limited text encoding.
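
Both failure modes can be reproduced without a Windows console. The sketch below is illustrative only: `write_via_text_stream`, `old_workaround`, and `failure_kind` are hypothetical helpers, and `old_workaround` is a guess at the shape of the earlier encode/decode code, not a copy of it.

```python
import io

# "café → 日本語", written with escapes so this file stays ASCII-only
SAMPLE = "caf\u00e9 \u2192 \u65e5\u672c\u8a9e"


def write_via_text_stream(text, encoding):
    """Simulate a text-mode stdout restricted to `encoding`."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors="strict")
    stream.write(text)
    stream.flush()
    return raw.getvalue()


def old_workaround(text, encoding):
    """Hypothetical reconstruction of the encode/decode round trip."""
    return text.encode(encoding, errors="replace").decode(encoding)


def failure_kind(func, *args):
    """Return the exception class name raised by func(*args), or None."""
    try:
        func(*args)
    except (UnicodeEncodeError, TypeError) as exc:
        return type(exc).__name__
    return None


# A cp1252 text stream cannot represent the arrow or the CJK characters.
print(failure_kind(write_via_text_stream, SAMPLE, "cp1252"))
# The round trip has no codec name to work with when encoding is None.
print(failure_kind(old_workaround, SAMPLE, None))
```

Even when the round trip succeeds, it replaces every unencodable character with `?`, so the output is lossy rather than correct.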

Solution

Write UTF-8 bytes directly to sys.stdout.buffer, which:

  • Bypasses stdout's text encoding limitation
  • Works reliably when stdout is piped or redirected
  • Handles all Unicode characters correctly
  • Falls back to print() with encoding='utf-8' for unusual cases
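
A minimal sketch of the approach described above, not the code from this PR: `write_utf8` is a hypothetical helper name. One assumption to flag: `print()` itself has no `encoding` parameter, so the fallback here simply prints to the stream when no binary buffer is available. How the PR handles that case is not shown in this description.

```python
import io
import sys


def write_utf8(text, stream=None):
    """Write `text` as UTF-8 bytes, bypassing the stream's text encoding.

    Hypothetical helper: uses the stream's binary buffer when it has one,
    otherwise falls back to a plain print() on the stream.
    """
    stream = sys.stdout if stream is None else stream
    buffer = getattr(stream, "buffer", None)
    if buffer is not None:
        # Flush pending text first so output order is preserved.
        stream.flush()
        buffer.write(text.encode("utf-8"))
        buffer.flush()
    else:
        # Streams such as io.StringIO expose no binary buffer.
        print(text, file=stream, end="")


# Usage: a cp1252 text stream still receives valid UTF-8 bytes.
raw = io.BytesIO()
limited = io.TextIOWrapper(raw, encoding="cp1252")
write_utf8("\u2192", limited)
```

Writing bytes means the consumer of the pipe or file always sees UTF-8, regardless of the console code page.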

Issues Fixed

Testing

  • Verified CLI works with Unicode input via stdin
  • Import test passes
  • CLI version check passes
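
A regression check along these lines could exercise the piped case on any platform. This is a sketch under stated assumptions: it does not invoke the markitdown CLI, `run_child` is a hypothetical helper, and the `PYTHONIOENCODING` environment variable is used to force a cp1252 stdout in a child interpreter.

```python
import os
import subprocess
import sys


def run_child(script, stdout_encoding="cp1252"):
    """Run `script` in a child Python whose stdio uses `stdout_encoding`."""
    env = dict(os.environ, PYTHONIOENCODING=stdout_encoding)
    return subprocess.run(
        [sys.executable, "-c", script], env=env, capture_output=True
    )


# Text-mode print of U+2192, which cp1252 cannot encode.
PRINT_SCRIPT = r"print('\u2192')"
# The same character written as UTF-8 bytes to the binary buffer.
BUFFER_SCRIPT = r"import sys; sys.stdout.buffer.write('\u2192'.encode('utf-8'))"

text_result = run_child(PRINT_SCRIPT)
bytes_result = run_child(BUFFER_SCRIPT)

print("text-mode exit code:", text_result.returncode)
print("buffer-mode exit code:", bytes_result.returncode)
```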

Fixes: microsoft#1788
Fixes: microsoft#1597

Development

Successfully merging this pull request may close these issues.

UnicodeEncodeError Unicode error converting Microsoft Documentation PDF to markdown file.
