MicroPython allows creation of non-UTF-8 identifiers
Port, board and/or hardware
unix, standard build
MicroPython version
MicroPython v1.28.0-preview.18.g6341258207 on 2025-12-26; linux [GCC 14.2.0] version
Reproduction
In the REPL, call exec() on a bytes object containing invalid UTF-8:
>>> exec(b"a\xff = None")
>>> dir()
['__name__', 'a\x00']
Expected behaviour
An exception should be raised, as https://github.com/micropython/micropython/pull/17862 aimed to do.
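Until the compiler rejects such input itself, a caller could validate the bytes before handing them to exec(). The following is only a minimal sketch of that idea: `checked_exec` is a hypothetical helper, not part of MicroPython, and it assumes that `bytes.decode("utf-8")` raises a `UnicodeError` on malformed input (true on CPython; on MicroPython this may depend on build options).

```python
def checked_exec(source, namespace):
    """Hypothetical wrapper: refuse byte sources that are not valid UTF-8."""
    if isinstance(source, (bytes, bytearray)):
        try:
            # Decoding up front surfaces invalid bytes before compilation.
            source = bytes(source).decode("utf-8")
        except UnicodeError:
            raise SyntaxError("source is not valid UTF-8")
    exec(source, namespace)


ns = {}
checked_exec(b"a = 1", ns)           # valid source runs normally
try:
    checked_exec(b"a\xff = None", ns)  # the input from the reproduction
except SyntaxError as exc:
    print("rejected:", exc)
```

This only guards the call site; it does not change what the lexer itself accepts.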
Observed behaviour
An identifier whose content is not valid UTF-8 text is created and is visible, e.g., in dir().
Additional Information
This is separate from exactly following CPython's rules for which Unicode code points can form identifiers. For example, MicroPython accepts 💡= True but CPython rejects it.
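To illustrate that these are two independent checks, here is a small sketch intended for CPython (`str.isidentifier()` may not be available on MicroPython builds). The helper names are made up for this example. The light-bulb character encodes to perfectly valid UTF-8 yet is not a valid identifier under CPython's rules, whereas the byte 0xff fails at the encoding level before identifier rules even apply.

```python
def is_valid_utf8(data):
    """Byte-level check: does the data decode as UTF-8 at all?"""
    try:
        data.decode("utf-8")
    except UnicodeError:
        return False
    return True


def is_cpython_identifier(text):
    """Code-point-level check, using CPython's identifier rules."""
    return text.isidentifier()


bulb = "\U0001F4A1"  # the light-bulb character from the example above
print(is_valid_utf8(bulb.encode("utf-8")), is_cpython_identifier(bulb))
print(is_valid_utf8(b"a\xff"))
```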
Code of Conduct
Yes, I agree
Crash printing exception detail when source code is not valid UTF-8
Port, board and/or hardware
unix port, coverage build, x86_64 linux
MicroPython version
MicroPython v1.26.0-preview.524.g255d74b5a8 on 2025-08-06; linux [GCC 12.2.0] version
Reproduction
# Slight changes (like removing the derived exception type) move the misbehavior
# around. For instance, in my local build, not having this triggers
# 'NotImplementedError: opcode' instead.
class Dummy(BaseException):
    pass
# Smuggle invalid UTF-8 string into decompress_error_text_maybe
# This invalid UTF-8 string matches the MP_IS_COMPRESSED_ROM_STRING test
# This can also happen if the input file is not a valid UTF-8 file.
b = eval(b"'\xff" + b"\xfe" * 4096 + b"'")
try:
    raise BaseException(b)
except BaseException as good:
    print(type(good), good.args[0])
Expected behaviour
CPython fails the eval() with SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte.
Observed behaviour
The invalid UTF-8 string is created successfully. When fetching the exception's args attribute, a crash occurs inside mp_decompress_rom_string (which should never have been called). The call occurs because the first byte of the invalid UTF-8 string is \xff, the marker for compressed ROM strings.
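The collision can be modelled in a few lines of Python. This is a sketch of the idea only, not MicroPython's C code: `looks_compressed` is a hypothetical stand-in for the first-byte test described above. Since the byte 0xff never occurs in well-formed UTF-8, a string that has passed UTF-8 validation cannot be mistaken for a compressed ROM string.

```python
COMPRESSED_MARKER = 0xFF  # marker byte as described in this report


def looks_compressed(data):
    """Hypothetical model of the first-byte test for compressed ROM strings."""
    return len(data) > 0 and data[0] == COMPRESSED_MARKER


def is_valid_utf8(data):
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeError:
        return False
    return True


payload = b"\xff" + b"\xfe" * 8   # shortened version of the reproduction string
print(looks_compressed(payload), is_valid_utf8(payload))

text = "caf\u00e9".encode("utf-8")  # ordinary valid UTF-8 never starts with 0xff
print(looks_compressed(text), is_valid_utf8(text))
```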
Additional Information
Found via fuzzer, manually minimized.
Code of Conduct
Yes, I agree