QUERY · ISSUE

Micropython allows creation of non UTF-8 identifiers

openby jepleropened 2025-12-26updated 2026-01-05

bugunicode

Port, board and/or hardware

unix, standard build

MicroPython version

MicroPython v1.28.0-preview.18.g6341258207 on 2025-12-26; linux [GCC 14.2.0] version

Reproduction

In the repl, use exec() on a bytestring with invalid UTF-8:

>>> exec(b"a\xff = None")
>>> dir()
['__name__', 'a\x00']

Expected behaviour

An exception should be issued like https://github.com/micropython/micropython/pull/17862 wanted to do.

Observed behaviour

An identifier whose content is not valid UTF-8 text is created and can be seen e.g., in dir().

Additional Information

This is separate from exactly following CPython rules for which Unicode code points can form identifiers. E.g., micropython accepts 💡= True but CPython rejects it.

Code of Conduct

Yes, I agree

CANDIDATE · PULL REQUEST

various: Don't allow creation of invalid UTF8 strings or identifiers

closedby jepleropened 2025-08-07updated 2025-12-26

py-coreunicode

Summary

Fuzz testing found that it was possible to create invalid UTF-8 strings when the program input was not UTF-8. This could occur because a disk file was not UTF-8, or because a byte string passed to eval()/exec() was not UTF-8.

Besides leading to the problems that the introduction of utf8_check was intended to fix (#9044), the fuzzer found an actual crash when the first byte was \xff and the string was used as an exception argument (#17855).

I also noticed that the check could be generalized a little to avoid constructing non-UTF-8 identifiers, which could also lead to problems.

I re-organized the code to pay for the size cost of the new check in the lexer.

Testing

I added a new test, using eval() and exec() of byte strings, to ensure that these cases are caught by the lexer.

Trade-offs and Alternatives

Could check that the whole code buffer is UTF-8 instead.