68c28174 breaks utf8 decode with ignore
b'\xff\xfe'.decode('utf8', 'ignore')
used to work, but now raises UnicodeError since commit https://github.com/micropython/micropython/commit/68c28174d0e0ec3f6b1461aea3a0b6a1b84610bb
Tested on STM32
On CPython this still works as expected:
>>> b'\xff\xfe'.decode('utf8', 'ignore')
''
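For comparison, here is a minimal sketch of how the three standard error handlers treat this input on CPython. The variable names are only for illustration; a port that follows CPython here would presumably be expected to give the same results.

```python
data = b'\xff\xfe'  # neither byte can start a valid UTF-8 sequence

# 'ignore' drops undecodable bytes, leaving an empty string
ignored = data.decode('utf8', 'ignore')

# 'replace' substitutes U+FFFD for each undecodable byte
replaced = data.decode('utf8', 'replace')

# 'strict' (the default) raises UnicodeDecodeError, a subclass of UnicodeError
try:
    data.decode('utf8')
    strict_raised = False
except UnicodeError:
    strict_raised = True

print(repr(ignored), repr(replaced), strict_raised)
```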
str() will convert invalid utf8 from bytes object
I've reproduced this behavior on the 'windows' port, the pyboard (via the JavaScript emulator), and the CircuitPython 'atmel-samd' port. The output of the 'windows' port is shown below.
If a bytes object contains values that represent invalid UTF-8 (more specifically, invalid continuation bytes), CPython raises an appropriate exception:
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> b = b"\xf0\xe0\xed\xe8"
>>> s = str(b, "utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte
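To make it concrete why this input is invalid, here is a small sketch that inspects the bit patterns directly. The variable names are illustrative only. The lead byte 0xf0 announces a 4-byte sequence, but none of the following bytes has the 10xxxxxx form required of a continuation byte.

```python
b = b"\xf0\xe0\xed\xe8"

# 0xf0 = 0b11110000 announces a 4-byte sequence, so the next three
# bytes must each look like 0b10xxxxxx (a continuation byte).
lead = b[0]
expects_three_continuations = (lead & 0xF8) == 0xF0

# Check the top two bits of each following byte.
is_continuation = [(x & 0xC0) == 0x80 for x in b[1:]]

print(expects_three_continuations)  # True
print(is_continuation)              # [False, False, False]
```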
However, MicroPython (at least, the MicroPython ports mentioned above) will happily perform the conversion, with 'interesting' results:
MicroPython v1.9.4 on 2018-11-19; win32 version
Use Ctrl-D to exit, Ctrl-E for paste mode
>>> b = b"\xf0\xe0\xed\xe8"
>>> s = str(b, "utf8")
>>> len(s)
4
>>> s[0]
'\x00\x00\r\x08'
>>> s[1]
'\x00\r\x08'
>>> s[2]
'\r\x08\x00'
>>> s[3]
'\x08\x00\x16'
What's somewhat disturbing is that the values of 's[2]' and 's[3]' in the example above appear to contain the contents of memory outside of the original bytes object (vague flashbacks to the now infamous 'Heartbleed' bug come to mind). At any rate, this and similar examples almost certainly ought to raise an appropriate exception.
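Until the decoder rejects such input itself, a caller could check the bytes before converting them. The following is only a rough sketch: `is_valid_utf8` is a hypothetical helper, not part of MicroPython, and it checks structure only (lead and continuation bytes, plus truncation at the end of the buffer). It does not reject overlong encodings or surrogates.

```python
def is_valid_utf8(data):
    """Hypothetical helper: structural UTF-8 check only."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            need = 0          # single-byte (ASCII) character
        elif (lead & 0xE0) == 0xC0:
            need = 1          # 2-byte sequence
        elif (lead & 0xF0) == 0xE0:
            need = 2          # 3-byte sequence
        elif (lead & 0xF8) == 0xF0:
            need = 3          # 4-byte sequence
        else:
            return False      # stray continuation byte or invalid lead byte
        # Reject a lead byte that announces more bytes than remain.
        if i + need >= n:
            return False
        # Every following byte must have the form 0b10xxxxxx.
        for j in range(i + 1, i + need + 1):
            if (data[j] & 0xC0) != 0x80:
                return False
        i += need + 1
    return True


print(is_valid_utf8(b"\xf0\xe0\xed\xe8"))  # False
print(is_valid_utf8(b"\xe2\x82\xac"))      # True
```

The explicit length check is the relevant part for the out-of-bounds concern above: a lead byte near the end of the buffer is rejected instead of being followed past the end.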