
Unicode support and PEP 393

closed · by Rosuav · opened 2014-06-03 · updated 2014-06-28

Opening this as a discussion issue, so it can all be kept track of.

Python 3.3's str type supports the full Unicode range, with semantics defined by PEP 393 (http://www.python.org/dev/peps/pep-0393/), although some of the details there are CPython-specific. Currently, micropython treats strings as C-style bytes and writes them to a console without modification, so, for instance, a UTF-8 Unix console will interpret "\xC3\xBD" as U+00FD LATIN SMALL LETTER Y WITH ACUTE. (I have no idea what embedded devices do, but presumably it's ASCII-compatible or this issue would have come up long ago.)
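To make the byte/character distinction concrete, here is a small sketch in CPython 3 (not micropython code; the variable names are just for illustration). It shows the same two bytes counted as bytes and as one decoded character, assuming the data is UTF-8.

```python
import unicodedata

# Two bytes that form a single character when read as UTF-8.
raw = b"\xC3\xBD"

# Decoding yields one code point rather than two.
text = raw.decode("utf-8")

print(len(raw))                # 2
print(len(text))               # 1
print(hex(ord(text)))          # 0xfd
print(unicodedata.name(text))  # LATIN SMALL LETTER Y WITH ACUTE
```

A byte-based str reports a length of 2 here, whereas a Unicode-aware str reports 1.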

Ideally and ultimately, micropython should support all of Unicode. The advantages to the language are huge (if you need me to elaborate, I can do so); in brief, Python 3 forces everyone to be correct. Correctness in Unicode is on par with correctness in memory management; it has some costs, but we willingly pay those costs as the price of guaranteeing that we won't leak memory or have buffer overruns.

But if that can't be done, or can't be done immediately, I'd like to see some means of catching problems before they happen; for instance, documenting that all encodings used MUST be ASCII-compatible, and raising an exception if a str has any character >127 in it.
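As a rough illustration of that stopgap, the check could look something like the sketch below. It is written in Python for readability; a real version would live in the C code. The function name require_ascii and the choice of ValueError are assumptions for the example, not anything micropython currently provides.

```python
def require_ascii(s):
    """Return s unchanged, or raise ValueError if it has a character above U+007F.

    Hypothetical helper: the name and exception type are illustrative only.
    """
    for index, ch in enumerate(s):
        if ord(ch) > 127:
            raise ValueError(
                "non-ASCII character U+%04X at index %d" % (ord(ch), index)
            )
    return s
```

With this, require_ascii("spam") passes the string through, while require_ascii("sp\xfdm") raises instead of silently emitting a byte that the console may misinterpret.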

I've had a bit of a look at objstr.c, and it seems that the character/byte equivalence is, unfortunately, endemic. Not only is the representation all byte-based, but helpers like is_ws() are defined by ASCII. (In CPython, "spam\xA0spam\u3000spam".split() == ["spam","spam","spam"], because U+00A0 and U+3000 are flagged whitespace.) This could be changed, but it will likely mean significant changes, and will almost certainly result in code size increases. That said, one of the beauties of PEP 393 strings is that, for ASCII-only strings (and even Latin-1 strings), the string in memory is no larger than it would be if stored as bytes (modulo the two-bit flag in the header, stating what the size is).
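The width selection that makes this possible can be sketched as follows. This is a simplified Python model of the idea, not CPython's actual implementation: the function name pep393_width is made up for the example, and it ignores header overhead and the separate ASCII flag.

```python
def pep393_width(s):
    """Return the bytes per character a PEP 393 style layout would pick for s.

    Simplified model: the width depends only on the highest code point.
    """
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1  # ASCII and Latin-1 fit in one byte each
    if highest < 0x10000:
        return 2  # the rest of the Basic Multilingual Plane
    return 4      # anything in the supplementary planes
```

Under this model "spam" and "sp\xfdm" both use one byte per character, "spam\u3000" uses two, and a string containing "\U0001F600" uses four, so only strings that actually contain wide characters pay for them.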

The most important question is, how much do other parts of the code dip into strings, and therefore how much impact will a change of internal representation have? I tried adding an arbitrary member to the structure, and it seemed to compile okay, and there don't seem to be any other files referencing the structure directly.

How do you feel about me implementing some approximation of PEP 393 in objstr.c? It'd be a fairly significant change.
