QUERY · ISSUE

68c28174 breaks utf8 decode with ignore

open · by ryannathans · opened 2017-12-05 · updated 2026-01-16

bug · py-core · unicode

b'\xff\xfe'.decode('utf8', 'ignore')

used to work, but now raises a UnicodeError exception because of commit https://github.com/micropython/micropython/commit/68c28174d0e0ec3f6b1461aea3a0b6a1b84610bb

Tested on STM32

On CPython this still works as expected:

>>> b'\xff\xfe'.decode('utf8', 'ignore')
''

CANDIDATE · ISSUE

UnicodeDecodeError not raised when expected in bytes.decode()

closed · by hiway · opened 2016-12-28 · updated 2017-09-06

bug

What is expected:

Python 3.5.2 / 3.6.0:

>>> bytes.decode(b"\xa1\x80", 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte

What is happening:

MicroPython v1.8.6-260-gafc5063-dirty on 2016-12-28; darwin version:

>>> bytes.decode(b"\xa1\x80", 'utf-8')
'\u0840'

I am porting umsgpack (a small pure-Python msgpack library; the 'u' is not related to upy) to MicroPython, and this particular test fails because MicroPython behaves differently from CPython. The same difference may cause surprises elsewhere, since programs will continue running where they should fail.

Assessment

Sonnet · high

These are opposite bugs in the same code path. Issue #2734 reports that strict mode (default) silently accepted malformed UTF-8 instead of raising UnicodeDecodeError. Issue #3469 reports that commit 68c28174 — which was almost certainly the fix for #2734 — caused 'ignore' mode to raise UnicodeError instead of silently skipping invalid bytes. Both involve bytes.decode() with invalid UTF-8 in py/objstr, but #2734 is about strict mode being too lenient and #3469 is about ignore mode being too strict. The fix for #2734 introduced the regression in #3469; they are causally linked but describe distinct, opposing defects.
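The two reports reduce to a pair of expectations on the same call, which can be written down as a small check. This is a sketch rather than an official test: `decode_outcome` is a hypothetical helper introduced here for illustration, and the expected values are the CPython results quoted in the two issues.

```python
def decode_outcome(data, errors="strict"):
    """Decode `data` as UTF-8 and report what happened instead of raising.

    Hypothetical helper for illustration; it is not part of MicroPython
    or CPython.
    """
    try:
        return ("ok", data.decode("utf-8", errors))
    except UnicodeError as exc:
        # UnicodeDecodeError is a subclass of UnicodeError on CPython.
        return ("raised", type(exc).__name__)


# Strict-mode report: malformed input must be rejected in strict mode.
strict_result = decode_outcome(b"\xa1\x80")

# Ignore-mode report: invalid bytes must be dropped silently.
ignore_result = decode_outcome(b"\xff\xfe", "ignore")

print(strict_result)
print(ignore_result)
```

Catching `UnicodeError` rather than `UnicodeDecodeError` keeps the helper usable on builds that only raise the base class, which is the exception named in the ignore-mode report.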

Suggested action

Keep the two issues separate rather than treating either as a duplicate; #2734 is already closed, and #3469 should stay open. When fixing #3469, verify that strict mode (the original #2734 concern) still raises UnicodeDecodeError. The fix should thread the error mode ('strict', 'ignore', 'replace') through the UTF-8 decoder so that each mode behaves as it does in CPython.
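One way to picture threading the error mode through a decoder is the pure-Python sketch below. It is an illustration only: MicroPython's actual decoder is written in C and is not reproduced here, and `decode_utf8` and `_decode_one` are hypothetical names. The sketch validates one sequence at a time and consults `errors` only at the single point where a malformed byte is found. Its 'replace' handling emits one U+FFFD per rejected byte, which is a simplification and may differ from CPython for truncated multi-byte sequences.

```python
def _decode_one(data, i):
    """Return (code_point, length) for the sequence at data[i], or None if malformed."""
    b = data[i]
    if b < 0x80:
        return b, 1
    if 0xC2 <= b <= 0xDF:
        need, cp, minimum = 1, b & 0x1F, 0x80
    elif 0xE0 <= b <= 0xEF:
        need, cp, minimum = 2, b & 0x0F, 0x800
    elif 0xF0 <= b <= 0xF4:
        need, cp, minimum = 3, b & 0x07, 0x10000
    else:
        return None  # stray continuation byte or invalid lead byte
    if i + need >= len(data):
        return None  # sequence is truncated
    for k in range(1, need + 1):
        c = data[i + k]
        if (c & 0xC0) != 0x80:
            return None  # not a continuation byte
        cp = (cp << 6) | (c & 0x3F)
    if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return None  # overlong encoding, out of range, or surrogate
    return cp, need + 1


def decode_utf8(data, errors="strict"):
    """Decode UTF-8 bytes, applying `errors` wherever a byte is rejected."""
    if errors not in ("strict", "ignore", "replace"):
        raise ValueError("unsupported error mode: %r" % (errors,))
    out = []
    i = 0
    while i < len(data):
        result = _decode_one(data, i)
        if result is None:
            if errors == "strict":
                raise UnicodeDecodeError(
                    "utf-8", bytes(data), i, i + 1, "invalid byte sequence"
                )
            if errors == "replace":
                out.append("\ufffd")
            i += 1  # 'ignore' and 'replace' both skip the offending byte
            continue
        cp, length = result
        out.append(chr(cp))
        i += length
    return "".join(out)
```

Because validation and error policy are separate, strict mode cannot become lenient when 'ignore' is repaired: both modes share `_decode_one` and differ only in the branch taken after it returns `None`.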
