MicroPython allows creation of non-UTF-8 identifiers
Port, board and/or hardware
unix, standard build
MicroPython version
MicroPython v1.28.0-preview.18.g6341258207 on 2025-12-26; linux [GCC 14.2.0] version
Reproduction
In the REPL, call exec() on a byte string that contains invalid UTF-8:
>>> exec(b"a\xff = None")
>>> dir()
['__name__', 'a\x00']
Expected behaviour
An exception should be raised, as https://github.com/micropython/micropython/pull/17862 intended.
Observed behaviour
An identifier whose content is not valid UTF-8 text is created and can be seen e.g., in dir().
Additional Information
This is separate from exactly following CPython's rules for which Unicode code points can form identifiers. For example, MicroPython accepts 💡= True, but CPython rejects it.
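To make the distinction concrete, here is a minimal sketch written for CPython. The helper name classify is hypothetical, and it relies on str.isidentifier(), which may not be available in MicroPython. It separates the byte-level question (is the source valid UTF-8?) from the identifier-rule question (does CPython accept the name?).

```python
def classify(source_bytes):
    """Classify the assignment target in a one-line 'name = value' source.

    Hypothetical helper for illustration only; not part of MicroPython.
    """
    try:
        text = source_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Byte-level problem: this is what the issue is about.
        return "invalid UTF-8"
    name = text.split("=")[0].strip()
    if name.isidentifier():
        return "valid identifier"
    # Well-formed UTF-8, but the code point is not allowed by CPython's
    # identifier rules. This is the separate question mentioned above.
    return "valid UTF-8, not a CPython identifier"


print(classify(b"a\xff = None"))
print(classify("💡= True".encode("utf-8")))
print(classify(b"a = None"))
```

Only the first case is in scope here: the bytes a\xff cannot be decoded at all, whereas 💡 is well-formed UTF-8 that CPython rejects for a different reason.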
various: Don't allow creation of invalid UTF8 strings or identifiers
Summary
Fuzz testing found that it was possible to create invalid UTF-8 strings when the program input was not UTF-8. This could occur because a disk file was not UTF-8, or because a byte string passed to eval()/exec() was not UTF-8.
Besides causing the problems that utf8_check was introduced to fix (#9044), such strings triggered an actual crash, found by the fuzzer, when the first byte was \xff and the string was used as an exception argument (#17855).
I also noticed that the check could be generalized a little to avoid constructing non-UTF-8 identifiers, which could also lead to problems.
I reorganized the code to offset the code-size cost of the new check in the lexer.
Testing
I added a new test that calls eval() and exec() on byte strings containing invalid UTF-8, to ensure that these cases are caught by the lexer.
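A test along these lines might look like the following sketch. It is not the test added in this PR: the helper name raises_on, the sample inputs, and the accepted exception types (SyntaxError or UnicodeError) are all assumptions made for illustration.

```python
def raises_on(func, source):
    """Return True if func(source, {}) rejects the source.

    Hypothetical helper; the exception types are an assumption, since
    the description does not say which exception the lexer raises.
    """
    try:
        func(source, {})
    except (SyntaxError, UnicodeError):
        return True
    return False


# Byte strings that are not valid UTF-8 should be rejected.
for func, src in ((exec, b"a\xff = None"), (eval, b"'\xff'")):
    print(func.__name__, src, "rejected" if raises_on(func, src) else "ACCEPTED")

# Well-formed input must still be accepted.
for func, src in ((exec, b"a = None"), (eval, b"'ok'")):
    print(func.__name__, src, "rejected" if raises_on(func, src) else "accepted")
```

Passing a fresh dict as the namespace keeps any identifier that does get created out of the caller's globals.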
Trade-offs and Alternatives
An alternative would be to check up front that the whole code buffer is valid UTF-8, instead of checking in the lexer.
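As a rough Python-level illustration of that alternative (the real check would live in the C core, and the names is_valid_utf8 and checked_exec are hypothetical), the whole buffer is validated once before any of it reaches the compiler. The sketch is written for CPython, where bytes.decode("utf-8") performs strict validation.

```python
def is_valid_utf8(buf):
    """Return True if the entire buffer decodes as UTF-8 (hypothetical helper)."""
    try:
        bytes(buf).decode("utf-8")
    except UnicodeError:
        return False
    return True


def checked_exec(source, namespace):
    """exec() wrapper that validates byte sources up front (illustration only)."""
    if isinstance(source, (bytes, bytearray)) and not is_valid_utf8(source):
        raise ValueError("source is not valid UTF-8")
    exec(source, namespace)


ns = {}
checked_exec(b"a = 1", ns)       # accepted, defines ns["a"]
try:
    checked_exec(b"a\xff = None", ns)
except ValueError as exc:
    print("rejected:", exc)
```

The trade-off is one extra pass over the whole buffer, in exchange for a single check that covers strings, identifiers, and comments alike.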