How Python does Unicode
As we all (hopefully) know by now, Python 3 made a significant change to how strings work in the language. I’m on the record as being strongly in favor of this change, and I’ve written at length about why I think it was the right thing to do. But for those who’ve been living under a rock the past ten years or so, here’s a brief summary, because it’s relevant to what I want to go into today:
In Python 2, two types could be used to represent strings. One of them,
str, was a “byte string” type; it represented a sequence of bytes in some particular text encoding, and defaulted to ASCII. The other,
unicode, was (as the name implies) a Unicode string type. Thus it did not represent any particular encoding (or did it? Keep reading to find out!). In Python 2, many operations allowed you to use either type, many comparisons worked even on strings of different types, and
unicode were both subclasses of a common base class,
basestring. To create a
str in Python 2, you can use the
str() built-in, or string-literal syntax, like so:
my_string = 'This is my string.'. To create an instance of
unicode, you can use the
unicode() built-in, or prefix a string literal with a
u, like so:
my_unicode = u'This is my Unicode string.'.
In Python 3, there is one and only one string type. Its name is
str and it’s Unicode. Sequences of bytes are now represented by a type called
bytes instance — if it happens to represent ASCII-encoded text — does still support some operations which let you use it as ASCII, but things that accept strings will generally not accept
bytes anymore, and comparisons and other operations no longer work between
str instances. There are even command-line flags to the Python interpreter which can promote failed mixing of
str into an exception that crashes your program, if you want.
This is a good thing. Unicode is the correct string type, and encoding/decoding should be handled at boundaries (like network interfaces or the filesystem), and should not be things you constantly have to worry about every time you’re dealing with a string. And as a side effect, the
bytes type is actually a lot more useful for working with bytes than Python 2’s
str type ever was.
But there’s another change in the Python 3 series — one which happened in Python 3.3 — that’s not as well known but also important. And that’s what I want to talk about today, but first we need one more bit of background.
Unicode transformation (not the Michael Bay kind)
If you already know how encodings of Unicode — specifically, UTF-8, UTF-16, and UTF-32 — work, you can skip this section. If you don’t know about that, keep reading!
As you may know, Unicode is not an encoding. It makes sense to talk about a string that’s encoded as ASCII, or encoded as ISO-8859-2, or encoded as Shift JIS, because those are all encodings. It makes no sense to talk about a string encoded as Unicode, because Unicode is not an encoding. Unicode is more like a big database of characters and properties and rules, and on its own says nothing about how to encode a sequence of characters into a sequence of bytes for actual use by a computer.
And when you get right down to it, Unicode doesn’t work in terms of characters; it works in terms of things called code points. If you were to render a Unicode sequence for display, say, on your screen, you’d discover there’s not a clean mapping between code points and what you consider to be “characters”; sometimes a “character” might be a single code point, sometimes it might be made up of multiple code points. Or something that looks to you like multiple “characters” might be a single code point!
Here are some examples:
- Code point
U+0041, LATIN CAPITAL LETTER A, is, well, a Latin capital letter A. It looks like this: A
- Code point
U+0327COMBINING CEDILLA, on the other hand, really needs something else to go with it. Put it after a
U+0063LATIN SMALL LETTER C, and you get a “character” made up of multiple code points: ç, which is also equivalent to the single code point
U+00E7LATIN SMALL LETTER C WITH CEDILLA.
U+00E7is what Unicode calls a “composed form” and
U+0063 U+0327is a “decomposed form”. Unicode defines rules for how to compose and decompose sequences of code points into each other, and when different sequences should be considered equivalent or normalized to the same thing.
- So how about a single code point that looks like multiple “characters”? Well, let’s turn to
U+FDFAARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM, which looks like this (assuming you have a font that supports it): ﷺ, and — if you follow Unicode’s rules for how to decompose it — is nearly twenty “characters” in a single code point. Its presence as a single code point is due to its frequent use; it represents the phrase “peace be upon him”.
If you want to learn how to speak Unicode, by the way, the word you want to use for “character” is usually going to be “grapheme”. In Unicode, a “grapheme” is defined as a “minimally distinctive unit of writing in the context of a particular writing system”, with the Unicode glossary suggesting “b” and “d” as an example for English, where changing one to the other produces a different word (as in “big” versus “dig”).
So Unicode is code points, which are usually given as hexadecimal representations of integers. And Unicode organizes these into “planes” of 216 (that’s 65,536) code points each, broken up into “blocks” of 28 (256) code points each. There are currently 17 planes defined in Unicode, though only a few have anything assigned to their constituent code points.
How might we go about representing this in sequences of bytes? One idea might be to just spit out a sequence of bytes representing the integer values of the Unicode code points; the smallest number of bytes capable of representing any arbitrarily-chosen Unicode code point is 3 (the highest possible code point requires 21 bits to represent as an integer), but ideally we want a nice power of two, so we could represent the code points as 4-byte integers, and call it a day.
And in fact this is pretty much the definition of UTF-32 (that’s “Unicode Transformation Format, 32-bit”). UTF-32 is also sometimes referred to as UCS-4 (“Universal Coded Character Set, 4-byte” — UCS is an ISO/IEC standard similar to Unicode, and sharing the assignments of code points, but without the full set of properties and rules defined by Unicode). So that’s an option, but it does eat up a lot of space; most real-world text data doesn’t use code points high enough to need this many bytes, so it can seem — and be — wasteful to insist on using four bytes even when it’s not necessary.
Once upon a time, when the set of Unicode code points hadn’t grown past the limit of a 16-bit integer, there was an encoding called UCS-2, where each code point was represented literally as a 16-bit/two-byte integer. Today, though, Unicode contains code points outside the range of a 16-bit integer (in other words, outside of the lowest-numbered plane, officially known as the Basic Multilingual Plane), so that no longer works quite as elegantly. However, with a bit of shoehorning, it’s still possible to use an encoding based on two-byte integer values; that system is called UTF-16 (Unicode Transformation Format, 16-bit). The way UTF-16 works is:
- If a code point can be represented as a 16-bit integer, encode it as that 16-bit integer and you’re done.
- If not, subtract
0x010000from the code point, then take the first ten bits of the result and add
0xD800, and take the last ten bits and add
0xDC00. This will produce two 16-bit values, the first in the range
0xD800-0xDBFFand the second in the range
0xDC00-0xDFFF. These two values are used to represent the code point (which now is encoded by four bytes total).
The two 16-bit values produced for code points that can’t be cleanly handled in a single 16-bit unit are known as “surrogate pairs”. Using surrogate pairs has some advantages: it’s immediately obvious when you’re looking at them, since there’s a particular range that they — and only they — fall into, and since the first and second parts of the surrogate pair fall into different ranges, you can also figure out, if you’re dumped at a random point in a UTF-16 stream, whether you’re looking at a surrogate pair and whether you’re looking at the first or the second half of one.
However, this did require some changes to Unicode itself, because it means UTF-16 can only handle code points representable by 21 bits or fewer (that’s as much as can be encoded using the ranges the surrogate pairs fall into), and there’s a range of code point in the lowest-numbered plane which can never be assigned, since UTF-16 is using them for surrogate pairs. Unicode itself now guarantees that the ranges used for surrogate pairs will never be assigned, and that no code points larger than 21 bits will ever be assigned, in order to ensure UTF-16 will always be able encode all of Unicode.
It’s worth noting, however, that because of this little workaround UTF-16 is a variable-width encoding. Where UTF-32 always uses four bytes, and where UCS-2 always used two (since at the time two bytes was enough), UTF-16 sometimes uses two bytes and sometimes uses four, and you have to pay attention to the values you’re consuming in order to know which is which.
UTF-16 also makes use of the code point
ZERO WIDTH NO-BREAK SPACE, as a byte-order mark to distinguish endianness (on a big-endian system it will read as
FE FF, while on little-endian it reads as
FF FE). UTF-16-encoded streams of text usually start with this value and it’s an easy way to detect files containing UTF-16-encoded contents. UTF-32 also can use the byte-order mark, but real-world use of UTF-32 for storage and transmission of data is rare.
However, this is still slightly wasteful, because — at least in the English-speaking world — a lot of text can be represented with code points that fit in one byte, and a lot of systems were originally built on ASCII, which didn’t even use the full 8 bits of a single byte. Unicode thoughtfully put the 128 characters of ASCII into the first 128 code points of the first block of the first plane, but to take advantage of that requires an encoding that makes those code points come out exactly equivalent to a sequence of ASCII bytes. That encoding is UTF-8.
In UTF-8, a single code point may require anywhere from 1-4 bytes to encode. The way this works is:
- If the code point is less than or equal to
0x7F, encode it as-is, in a single byte, leaving the high (first, if you’re looking at the output of, say, Python’s
bin()built-in) bit set to zero.
- If the code point is greater than
0x7Fbut less than or equal to
0x07FF, encode it as two bytes. The first byte begins
110, followed by the first five bits of the code point, and the second byte begins
10followed by the remaining bits of the code point.
- If the code point is greater than
0x07FFbut less than or equal to
0xFFFF, encode it as three bytes. The first byte begins with
1110followed by the first four bits of the code point, and the second and third bytes each begin with
10followed by six bits from the code point.
- For all other code points, encode as four bytes. The first byte begins with
11110followed by the first three bits of the code point, then each of the other three bytes begins
10followed by six bits of the code point.
This can encode any 21-bit code point in at most four bytes (since it uses at most 11 bits to signal how the code point is spread across multiple bytes, and 32 - 11 bits leaves 21 bits to store the value of the code point). UTF-8 can theoretically be extended out to encode larger code points if Unicode ever were to assign them, but since Unicode itself will now never go over 21 bits, UTF-8 can handle any code point in four bytes or fewer. Though, again, UTF-8 is a variable-width encoding; any given code point might require 1, 2, 3, or 4 bytes.
But since the first 128 code points are encoded exactly as themselves, in one byte with the high bit set to zero, a UTF-8 sequence which contains only code points in that range is identical to the equivalent ASCII, which is something a lot of people like.
Also, for reasons known only to Microsoft, many Windows programs insert the byte-order mark in UTF-8-encoded files. The code point
U+FEFF comes out as the three-byte sequence
EF BB BF in UTF-8, just in case you ever need to check for it.
In the real world, UTF-8 and UTF-16 are common choices of encoding for systems which use Unicode; Windows uses UTF-16 for many APIs (though technically not for filesystem paths, which — although they’re made up of 16-bit units — are not specified as needing to be valid UTF-16, a common pitfall that, along with other misuses, inspired an alternate encoding mechanism called “WTF-8”), while modern Unix/Linux systems mostly seem to go UTF-8.
What does this have to do with Python?
And now we can talk about Python strings. In Python 2, and in Python 3 prior to 3.3, Python had exactly two options for how Unicode strings (
unicode on Python 2,
str on Python 3) would be stored in memory. The choice was made at the time your Python interpreter was compiled, and would produce either a “narrow” or a “wide” build of Python. In a “narrow” build, Python would internally store Unicode in a two-byte encoding with surrogate pairs. In a “wide” build, Python would internally store Unicode in a four-byte encoding.
This can lead to some unexpected outcomes. For example, consider the following (on a pre-3.3 “narrow” build of Python):
>>> s = u"\U0001F4A9" >>> len(s) 2
What’s worse, iterating over it will yield two “characters”, which Python will tell you have code points
DCA9. If you haven’t worked it out yet, that’s the big clue, because both of those values are in the surrogate-pair range. Since this Python interpreter was compiled with “narrow” Unicode, it stores the string using two-byte encoding with surrogate pairs, and does nothing to warn you when you encounter one of them. It just gives you the two code points
D83D DCA9, which are the surrogate pair for
(and of course code point
U+1F4A9 is everyone’s favorite, 💩, also known as PILE OF POO)
But how to do this? The only fixed-width encoding available is UTF-32, which wastes a lot of space on most text. And while it might be possible to just have Python transparently convert surrogate pairs into code points for the high-level API, it’d still be an issue for anything using lower-level (i.e., C) APIs to work with Python strings.
Python 3.3 solved all of these problems by taking a new approach to internal storage of Unicode. Gone are the “wide” and “narrow” builds, gone are the messy leaks of surrogate pairs into high-level code. In Python 3.3 and later, the internal storage of Unicode is now dynamic and chosen on a per-string basis. Here’s how it works:
- Python parses source code on the assumption that it’s UTF-8.
- When it needs to create string objects, Python determines the highest code point in the string, and looks at the size of the encoding needed to store that code point as-is.
- Python then chooses that encoding — which will be one of latin-1, UCS-2, or UCS-4 — to store the string.
All the gory details, if you want to read about them, are in PEP 393.
One obvious advantage of this is that Python strings now use less memory on average; many strings consist solely of code points which can be encoded to one byte, and now they will be encoded to just one byte each (as opposed to previously, when they’d use at least two).
More importantly, a string consisting solely of
U+1F4A9 PILE OF POO is now a string of length 1, as it should be. And the API is cleaner, too:
- Iterating a Python string is iteration over code points.
- Indexing into a Python string yields whatever corresponds to the code point at the index.
- The length of a Python string is the number of code points it contains.
This is all possible — or at least much easier to do — because Python strings are now, internally, always fixed-width. It also nicely mirrors the behavior of
bytes objects, where iteration and indexing yield the integer values of the bytes.
So what’s the point?
This change may not seem like such a big deal, especially when compared to the upheaval involved in moving from Python 2 to Python 3. And in terms of its visible impact on end users of Python, it probably isn’t; unless you’re carefully measuring your Python programs’ memory use and that memory use is dominated by string storage, you won’t notice much advantage from the fact that Python can now store many strings using only one byte per code point. And unless you were routinely working with code points outside the Basic Multilingual Plane, you probably never would have noticed that length and iteration and indexing for them was weird.
But it is still an important change. And it’s one that makes Python more right. I know it’s popular these days to promote UTF-8 as the One True Encoding, and that several popular new-ish languages are using UTF-8 as the backing storage of their string type (and as a result, exposing some of the quirks of variable-width encoding to users of those languages). But for a high-level language, I am increasingly of the opinion that Python’s approach is correct.
Unicode, for people who don’t already have at least a passing familiarity with how it already works, and especially if they’re coming from a world of ASCII or one of the popular 8-bit encodings, can seem very weird, and a lot of programmers already fall into one of those categories. Adding in the quirks of how Unicode actually gets represented as bytes in memory, for a language where manual memory management and other lower-level programming tasks are rare, is probably imposing too much cognitive load.
So abstracting away the storage, and providing a single clear API in terms of Unicode code points, is the right thing to do. And PEP 393’s change to the internal storage of strings was another step down the path of Doing the Right Thing™ in Python’s history, and I think more people should know about it.