QUERY · ISSUE

68c28174 breaks utf8 decode with ignore

open · by ryannathans · opened 2017-12-05 · updated 2026-01-16

bug · py-core · unicode

b'\xff\xfe'.decode('utf8', 'ignore')

used to work, but now raises a UnicodeError exception since commit https://github.com/micropython/micropython/commit/68c28174d0e0ec3f6b1461aea3a0b6a1b84610bb

Tested on STM32

On CPython this still works as expected:

>>> b'\xff\xfe'.decode('utf8', 'ignore')
''
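
For comparison, here is a minimal sketch that runs the same bytes through the three common error handlers. The helper name `decode_all_modes` is made up for illustration. The results described in the note after the code are what CPython produces; per this report, MicroPython builds that include the commit above may raise for 'ignore' as well.

```python
def decode_all_modes(data):
    """Decode data as UTF-8 under each error handler (hypothetical helper).

    Returns a dict mapping the handler name to either the decoded string
    or the name of the exception class that was raised.
    """
    results = {}
    for errors in ("strict", "ignore", "replace"):
        try:
            results[errors] = data.decode("utf8", errors)
        except UnicodeError as exc:
            # UnicodeDecodeError is a subclass of UnicodeError on CPython.
            results[errors] = type(exc).__name__
    return results


print(decode_all_modes(b"\xff\xfe"))
```

On CPython, 'strict' raises UnicodeDecodeError, 'ignore' yields the empty string shown above, and 'replace' yields one U+FFFD per invalid byte.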

CANDIDATE · ISSUE

str() will convert invalid utf8 from bytes object

closed · by ddiminnie · opened 2018-11-20 · updated 2018-11-26

bug

I've reproduced this behavior on the Windows, pyboard (via the JavaScript emulator), and CircuitPython 'atmel-samd' ports. The output of the 'windows' port is shown below.
If a bytes object contains values that represent invalid UTF-8 (more specifically, invalid continuation bytes), CPython raises an appropriate exception:

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> b = b"\xf0\xe0\xed\xe8"
>>> s = str(b, "utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte

However, MicroPython (at least, the MicroPython ports mentioned above) will happily perform the conversion, with 'interesting' results:

MicroPython v1.9.4 on 2018-11-19; win32 version
Use Ctrl-D to exit, Ctrl-E for paste mode
>>> b = b"\xf0\xe0\xed\xe8"
>>> s = str(b, "utf8")
>>> len(s)
4
>>> s[0]
'\x00\x00\r\x08'
>>> s[1]
'\x00\r\x08'
>>> s[2]
'\r\x08\x00'
>>> s[3]
'\x08\x00\x16'

What's somewhat disturbing is that the values of 's[2]' and 's[3]' in the example above appear to contain the contents of memory outside of the original bytes object (vague flashbacks to the now infamous 'Heartbleed' bug come to mind). At any rate, this (and similar) example(s) almost certainly ought to trigger an appropriate exception...
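
As a possible workaround until the conversion itself validates its input, a caller could check the bytes before passing them to str(). The sketch below is only an illustration: `check_utf8` and `safe_str` are hypothetical names, not MicroPython APIs, and the check is deliberately simplified. It verifies start bytes, sequence lengths, and continuation bytes, but does not reject every overlong or surrogate encoding.

```python
def check_utf8(data):
    """Raise ValueError if data is not structurally valid UTF-8.

    Hypothetical helper; a simplified check, not a full validator.
    """
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # Work out how many continuation bytes the start byte announces.
        if lead < 0x80:
            need = 0
        elif 0xC2 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF4:
            need = 3
        else:
            raise ValueError("invalid start byte 0x%02x at position %d" % (lead, i))
        if i + need >= n and need:
            raise ValueError("truncated sequence at position %d" % i)
        # Every continuation byte must look like 0b10xxxxxx.
        for k in range(1, need + 1):
            if data[i + k] & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte at position %d" % (i + k))
        i += need + 1


def safe_str(data):
    """Validate first, then convert (hypothetical wrapper around str())."""
    check_utf8(data)
    return str(data, "utf8")


try:
    safe_str(b"\xf0\xe0\xed\xe8")
except ValueError as exc:
    print("rejected:", exc)
```

With the bytes from the example above, the start byte 0xf0 announces three continuation bytes, and the check fails on 0xe0 at position 1, so the conversion is never attempted.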
