MicroPython allows creation of non-UTF-8 identifiers
Port, board and/or hardware
unix, standard build
MicroPython version
MicroPython v1.28.0-preview.18.g6341258207 on 2025-12-26; linux [GCC 14.2.0] version
Reproduction
In the REPL, call exec() on a bytes object containing invalid UTF-8:
>>> exec(b"a\xff = None")
>>> dir()
['__name__', 'a\x00']
Expected behaviour
An exception should be raised, as https://github.com/micropython/micropython/pull/17862 intended.
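To illustrate the kind of check being asked for, here is a minimal sketch in plain Python. The helper name `exec_validated` is hypothetical and is not part of MicroPython or of the linked PR; it only assumes that strict UTF-8 decoding is an acceptable way to reject the source before it reaches the compiler.

```python
def exec_validated(source: bytes, namespace: dict) -> None:
    """Hypothetical helper: refuse byte source that is not valid UTF-8."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        text = source.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise SyntaxError(f"source is not valid UTF-8: {exc.reason}") from exc
    exec(text, namespace)


good_ns = {}
exec_validated(b"a = None", good_ns)  # valid source runs normally

try:
    exec_validated(b"a\xff = None", {})  # the reproduction above
    rejected = False
except SyntaxError:
    rejected = True
```

With a check like this, the reproduction would fail up front instead of creating an identifier with invalid contents.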
Observed behaviour
An identifier whose content is not valid UTF-8 text is created and can be seen e.g., in dir().
Additional Information
This is a separate issue from whether MicroPython follows CPython's exact rules for which Unicode code points can form identifiers. For example, MicroPython accepts 💡= True but CPython rejects it.
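The distinction can be seen from the CPython side with `str.isidentifier()`. This is only a sketch of the CPython rules, run under CPython; the candidate names are chosen for illustration. Every candidate is valid UTF-8 text, yet only some are valid identifiers.

```python
# All three strings encode to valid UTF-8; only the first two are
# identifiers under CPython's rules.
candidates = ["a", "\u00e9", "\U0001F4A1"]  # 'a', 'é', '💡'

results = {name: name.isidentifier() for name in candidates}

# Encoding succeeds for every candidate, so none of them is "invalid UTF-8".
encoded = {name: name.encode("utf-8") for name in candidates}
```

So rejecting 💡 is a question of identifier rules, while rejecting `a\xff` is a question of whether the source is text at all.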
formatting character values >= 128 gives unexpected results, can crash
$ ./build-coverage/micropython
MicroPython 9c7067d9ad on 2023-11-28; linux [GCC 12.2.0] version
Use Ctrl-D to exit, Ctrl-E for paste mode
>>> s = f"{160:c}"
>>> len(s)
0
>>> s
' '
>>> print(s)
�
The same happens with s = "%c" % 160.
I think this is because the 'c' formatter in objstr.c doesn't handle non-ASCII characters properly. It also looks like another way to get invalid UTF-8 into a str object.
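A small model of the suspected problem, written for CPython. Both helper names are hypothetical, and the "raw byte" behaviour is an assumption based on the output above, not a reading of objstr.c. The length count also assumes that MicroPython counts characters by skipping UTF-8 continuation bytes.

```python
def format_char_expected(code_point: int) -> bytes:
    """Hypothetical: bytes a correct 'c' conversion would append."""
    return chr(code_point).encode("utf-8")


def format_char_raw(code_point: int) -> bytes:
    """Hypothetical model of the bug: the value is emitted as one raw byte."""
    return bytes([code_point & 0xFF])


def count_non_continuation(data: bytes) -> int:
    """Count bytes that are not UTF-8 continuation bytes (0b10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


expected = format_char_expected(160)  # two-byte encoding of U+00A0
raw = format_char_raw(160)            # a lone continuation byte
```

Under these assumptions the raw byte 0xA0 is a continuation byte with no lead byte, which would explain both the length of 0 and the replacement character printed by the terminal.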
This can lead to a crash when an invalid string beginning with the byte 255 is generated, just like #17855:
MicroPython v1.27.0-preview.95.g9939565d50 on 2025-09-03; linux [GCC 14.2.0] version
Use Ctrl-D to exit, Ctrl-E for paste mode
>>> raise ValueError(f"{255:c}" + f"{254:c}" * 4096)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Segmentation fault
Field widths and precisions are applied in bytes, not code points, so you can also produce invalid UTF-8 with print('%.1s' % chr(233)), which cuts the two-byte encoding of chr(233) after its first byte.
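For comparison, this sketch shows the difference between truncating by code point (what CPython does for `%.1s`) and truncating the UTF-8 encoding by byte. It runs under CPython and models the byte-wise case with an explicit slice, so it is an illustration rather than MicroPython's actual code path.

```python
text = chr(233)                 # 'é', one code point
encoded = text.encode("utf-8")  # two bytes: 0xC3 0xA9

# CPython applies the precision to code points, so the character survives.
by_code_point = "%.1s" % text

# Truncating the encoded form to one byte leaves a lead byte on its own.
by_byte = encoded[:1]

try:
    by_byte.decode("utf-8")
    truncation_is_valid = True
except UnicodeDecodeError:
    truncation_is_valid = False
```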