On Python 3, again

June 10, 2016 Django, Pedantics, Programming, Python, Unicode

A while back, Brett Cannon went into some detail on why Python 3 happened (that is, why it was backwards-incompatible with Python 2 and why it was backwards-incompatible in the particular ways it was). Python 3 has been pretty controversial, with people I respect chiming in on both sides of the good idea/bad idea debate. And the transition (which Brett has also taken a recent look at) has indeed been slow, but that at least was fully expected — at the time, I remember announcements coming from the Python core team talking about a transition period of at least five years. Currently, Python 2.7 is scheduled to receive official security support until 2020, with Red Hat and probably other vendors providing third-party support for even longer.

But although a lot of things changed going from Python 2 to Python 3, most of them have been accepted as positive minor cleanups. The standard library got a bit better organized. Python 2.2’s “new-style” classes, which are what you want even if you don’t know what that means, are now the default. The print statement became the print() function (OK, some people gripe about that, but it’s mostly just sensible consistency).

The big one, the one that people argue endlessly about, the one that causes endless consternation, is the change to strings.

I’d been bitten more than once by the weirdness of Python 2’s string handling, so I was in favor of cleaning that up. But for a long time I was mostly on the sidelines of the debate; my previous job was Python 2 only, and I didn’t have a ton of free time to look into porting open-source stuff I’d released. That changed this past summer, when I suddenly had (thanks to being job-hunting) the time to start porting my personal projects, and more recently when I started my new job, where I’m using Python 3 every day. So I figure it’s a good time to look over this again and say… that I’m still 100% in favor of Python 3, and the string changes.

Secure the perimeter

Are there things that are harder to do in Python 3 because of the string changes? Absolutely. People who write command-line Unix utilities, some types of file handling applications, and code in some other domains suddenly have a tougher time writing code Python will accept than they did when bytes was named str. I’m not going to downplay that at all, and in fact it’s the biggest relevant-to-actually-writing-code complaint there is about the change. But as with so many things, it’s a trade-off, and programming in Python didn’t, as a whole, get harder. What happened was more of a shuffling around of where the hard bits are: quite a lot of people who used to have to worry constantly about strings and bytes and encodings and the whole mess of baggage that comes with them now don’t have to worry about that as much. In exchange, some people who used to rarely have to worry about it when writing their code now have to worry about it all the time.

Of course, if that was all there was to the change, then it would have been bad: arbitrarily shifting difficulty from one group of programmers to another isn’t really a good reason for a major change in a language. Even if we can quantify how many people have it easier vs. how many have it harder, it still might be difficult to justify depending on the exact numbers and which groups got disproportionately hit (for example, making life much harder for brand-new programmers in exchange for ease for experienced people might not be a great trade to make, but dumping all the difficulty on the developers of vital low-level tools for everyone else is also a problem).

But the more I’ve explored and thought about it, the more I think arguing about particular domains or numbers of people is actually the wrong way to think about this. The real questions that matter are:

At what point should string encoding be an issue a developer has to think about?
Is that the point where developers now have to think about it?

And I think Python 3 produced the right answers to those questions, and baked them into the implementation. The place where developers should have to think about strings is at boundaries, at the points where data leaves or comes under the control of the developer (or the framework or the library being used by the developer).

Digging back into history a bit, this was the policy Django settled on, well before Python 3 came along: in code that you, an application developer, will write with Django, you’ll be working with Unicode everywhere. Text data coming in through HTTP requests gets turned into Unicode. Text data coming out of your database gets turned into Unicode. Functions and methods that work on text are built for Unicode. And when you push a response back out, or send data back to your database, or do whatever thing your application will actually do, it gets turned back into the appropriate bytes on the way out the door.

This was incredibly difficult to set up initially, was backwards-incompatible with how Django had worked previously, and had a large cost in developer time and sanity to rebuild the framework around Unicode. But it was worth it, because it was the right thing. These boundaries simply are the places where encoding and bytes-vs.-Unicode issues should happen. Everywhere else, string data should simply be Unicode, period, and tools should always be built to ensure that.
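To make the boundary idea concrete, here is a minimal sketch of the decode-on-the-way-in, encode-on-the-way-out pattern. The handle_request() function and its charset parameter are hypothetical names invented for illustration; this is not how Django implements it, just the general shape of the policy.

```python
def handle_request(raw_body, charset="utf-8"):
    """Hypothetical handler: bytes in, bytes out, text in between."""
    # Inbound boundary: bytes from the outside world become text exactly once.
    text = raw_body.decode(charset)

    # Interior: everything here operates on str (Unicode), never on bytes.
    greeting = "Hello, " + text.strip().title() + "!"

    # Outbound boundary: text becomes bytes again just before it leaves.
    return greeting.encode(charset)


response = handle_request("  jos\u00e9 ".encode("utf-8"))
print(response)
print(response.decode("utf-8"))
```

The only two places that mention an encoding are the first and last lines of the function; nothing in the middle needs to know what the bytes looked like.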

Unicode and its discontents

And for the most part, if you look at complaints coming from people who say Python 3’s string handling made their lives harder, you’ll see that what really happened was exposure of a boundary they previously weren’t thinking about.

Now, I should point out here that I’m not really knocking the people who were writing, say, command-line and file-handling utilities in Python. For years, Python sort of accepted the status quo of the Unix world, which was mostly to stick its fingers in its ears and shout LA LA LA I CAN’T HEAR YOU I’M JUST GOING TO SET LC_CTYPE TO C AGAIN AND GO BACK TO MY HAPPY PLACE. A bit later on it changed to “just use UTF-8 everywhere, UTF-8 is perfectly safe”, which really meant “just use UTF-8 everywhere because we can continue pretending it’s ASCII up until the time someone gives us a non-ASCII or multi-byte character, at which point do the fingers-in-ears-can’t-hear-you thing again”.

So a lot of what you’ll see in terms of complaints about string handling are really complaints that Unix’s pretend-everything-is-ASCII-until-it-breaks approach was never very good to begin with and just gets worse with every passing year.

As of Python 3, suddenly you do have to think about the boundary between your program and the filesystem (or the socket, or other Unix constructs) and think about what encoding text will be in when it arrives and what encoding it needs to be as it departs. To people who are used to just arbitrarily grabbing a file handle and doing read() and write() on it this is kind of a rude awakening. But it’s also the right place to require thinking about that. I’m aware some smart people disagree with me on this; Armin seems to think Python is fighting a losing battle by trying to ask Unix to become Unicode-aware, for example.
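As a rough illustration of what that boundary looks like for files, the sketch below writes text with an explicit encoding and then reads it back both as bytes and as text. The temporary path and the sample word are arbitrary choices for the example.

```python
import os
import tempfile

text = "ma\u00f1ana"  # six characters, one of them non-ASCII

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Outbound boundary: name the encoding instead of trusting the locale default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Binary mode hands back bytes; no decoding has happened yet.
with open(path, "rb") as f:
    raw = f.read()

# Text mode with an explicit encoding hands back str.
with open(path, "r", encoding="utf-8") as f:
    decoded = f.read()

print(len(text), len(raw), decoded == text)

# Guessing the wrong encoding at the boundary fails loudly instead of silently.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

The character count and the byte count differ because the one non-ASCII character takes two bytes in UTF-8, which is exactly the detail the pretend-it’s-ASCII approach hides.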

But even if we set aside the Unix problem, we’re increasingly forced to accept that Unicode isn’t going away, and that acting like everything is ASCII until we learn otherwise doesn’t work as a strategy.

A phone-screen question

When I was job hunting last year, I did a lot of phone screens (and toward the end I adopted a policy, for my own sanity, of hanging up on certain types of phone screens, but that’s another story). There really aren’t that many phone-screen problems out there, so I saw some of them multiple times, including the palindrome question. It comes in a number of variations, but most of them center around proving that you know how to detect a palindrome — a string which reads the same forwards and backwards. Some of the utility here is in getting the candidate to ask questions about things like spacing and punctuation and capitalization, since some famous multi-word palindromes only work when ignoring those (like “Able was I ere I saw Elba” or “A man, a plan, a canal — Panama!”).

On about the third phone screen I’d gotten with a palindrome problem, I decided to throw in a wrinkle none of the interviewers had ever mentioned, asked about or expected to be asked about: OK, so what about Unicode?

The naïve way to check for a palindrome is simply to compare the sequence of characters in both directions. For example:

def is_palindrome(s):
    return s == s[::-1]

And that’s the sort of solution most phone-screen palindrome questions are going for (you can do it a few different ways, but the slice-with-negative-step version is the most concise and “native” implementation in Python). But let’s throw a wrench in this. Suppose I have two strings I’ve named palindrome1 and palindrome2, and throw them through the function above:

>>> print(palindrome1)
aña
>>> is_palindrome(palindrome1)
True
>>> print(palindrome2)
aña
>>> is_palindrome(palindrome2)
False

Oops.

These strings are both palindromes when considered in terms of graphemes, but only one of them is a palindrome when considered as a sequence of Unicode code points (which is what Python treats as the characters of a string). Here’s what those strings actually are:

>>> palindrome1 = u"a\u00f1a"
>>> palindrome2 = u"an\u0303a"

The first one is three Unicode characters long. The second character in it is U+00F1, officially known as “LATIN SMALL LETTER N WITH TILDE”. The second string is four characters long; its second character is just a lowercase “n”, but its third character is U+0303 — that’s “COMBINING TILDE” to you. The combining tilde goes with the character it comes after, so when we reverse that string we get:

>>> print(palindrome2[::-1])
ãna

since the combining tilde came after the “a” this time, not after the “n”.
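If you want to see this for yourself, a small helper can list what is actually inside each string. The describe() function below is a hypothetical name for this example; it only uses ord() and unicodedata.name() from the standard library.

```python
import unicodedata


def describe(s):
    """Return a (code point, official name) pair for each character in s."""
    return [("U+%04X" % ord(c), unicodedata.name(c)) for c in s]


palindrome1 = "a\u00f1a"
palindrome2 = "an\u0303a"

for label, value in (("palindrome1", palindrome1), ("palindrome2", palindrome2)):
    print(label, len(value))
    for code_point, name in describe(value):
        print("   ", code_point, name)
```

Two strings that render identically turn out to have different lengths and different contents, which is why reversing them gives different results.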

If you want to impress your next phone screener, here’s a better (but still not perfect) version of is_palindrome():

import unicodedata

def is_palindrome(s):
    if any(unicodedata.combining(c) for c in s):
        s = unicodedata.normalize('NFC', s)
    return s == s[::-1]

Yes, writing a correct palindrome checker requires (at least) knowledge of Unicode character classes and normalization. Getting it as correct as possible requires more (especially since the solution above only works for combining characters which reliably have a composed form), but this is a good start and will probably distract any interviewer you’re dealing with long enough to let you move on to something actually relevant to your job qualifications. Just make sure you can answer or bluff your way through any questions you’re asked about it, like why it uses normalization form NFC instead of NFKC.
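For the curious, here is one possible sketch of a version that also copes with combining characters lacking a composed form: group each base character with the combining marks that follow it, then compare the groups. The names clusters() and is_palindrome_clusters() are made up for this example, and the grouping rule is a deliberate simplification; real grapheme segmentation (Unicode Standard Annex #29) has more rules than this. The last few lines hint at the NFC-versus-NFKC answer, since NFKC also folds compatibility characters such as ligatures.

```python
import unicodedata


def clusters(s):
    """Group each base character with the combining marks that follow it.

    Simplifying assumption: any character with a nonzero canonical
    combining class attaches to the character before it.
    """
    out = []
    for c in unicodedata.normalize("NFC", s):
        if out and unicodedata.combining(c):
            out[-1] += c   # attach the mark to the preceding base character
        else:
            out.append(c)  # start a new cluster
    return out


def is_palindrome_clusters(s):
    parts = clusters(s)
    return parts == parts[::-1]


# q followed by COMBINING TILDE has no precomposed form, so NFC leaves it alone.
no_composed_form = "aq\u0303a"
print(no_composed_form == no_composed_form[::-1])
print(is_palindrome_clusters(no_composed_form))

# NFC keeps the "fi" ligature as one character; NFKC rewrites it as two letters.
ligature = "\ufb01"
print(len(unicodedata.normalize("NFC", ligature)))
print(len(unicodedata.normalize("NFKC", ligature)))
```

This still gets plenty of cases wrong (emoji sequences, for one), so treat it as a slightly less naïve starting point rather than a finished answer.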

Yeah, but everybody around here speaks English!

You probably already know that if your company ever offers something for sale in Canada, you’ll need to get bilingual in a hurry. Or that if any part of your infrastructure or supply chain passes through Mexico you’re going to need the ability to work in Spanish.

But even if you live in the United States, even if your full supply chain is in the United States, even if all your co-workers, customers and users are in the United States, and even if you never plan to expand beyond the United States, you’ll still run into some inconvenient facts:

One out of every five people in the US speaks a language other than English at home.
In at least five states I know of — two of which might surprise you — languages other than English have official or otherwise specially-recognized legal status.
Of languages with more than one million speakers in the US, two don’t use any variation of Latin script, and one uses Latin characters but with so many diacritics your head will swim.
And even English text often contains loanwords or intersperses names or words from other languages, requiring non-ASCII characters to represent.

And there are entire industries in the US which are required by law to be accessible to their customers in the customers’ preferred language. I work in one of them!

Plus, of course, there’s this thing you may have heard of, called “emoji”. Emoji are the real torture test since even some vaguely Unicode-aware approaches fall over when they encounter characters outside the Basic Multilingual Plane, or which make use of some of the combining features required to make modern emoji work correctly.
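A quick sketch of why emoji are such a good stress test, assuming Python 3.3 or later (where a string is always a sequence of whole code points). The two sample emoji are arbitrary picks for the example.

```python
grin = "\U0001F600"  # GRINNING FACE, outside the Basic Multilingual Plane

print(len(grin))                           # code points
print(ord(grin) > 0xFFFF)                  # too big for a single 16-bit unit
print(len(grin.encode("utf-8")))           # bytes in UTF-8
print(len(grin.encode("utf-16-le")) // 2)  # 16-bit units in UTF-16

# One visible emoji built from three code points:
# WOMAN + ZERO WIDTH JOINER + PERSONAL COMPUTER
technologist = "\U0001F469\u200D\U0001F4BB"

print(len(technologist))
print(technologist == technologist[::-1])  # reversing scrambles the sequence
```

Code that assumes one character per 16-bit unit trips over the first example, and code that assumes one code point per visible symbol trips over the second.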

The result is that every piece of code you write which handles text but doesn’t think up-front about character sets and encoding is a ticking time bomb. Every piece of code you write that assumes everything is ASCII, or assumes everything is UTF-8 staying in the one-byte-per-character range, is a ticking time bomb.

And that’s without getting into the security implications of handling Unicode incorrectly. You have read Unicode Technical Report 36, right? Right?

Rant over

I could write a lot more about the adventures I’ve had with text handling over the course of my career. So could many of my colleagues and co-workers, both present and former (ask Jannis about the time he got to set up and install a new MySQL collation! On second thought, don’t — just buy him a drink or give him a hug). But there’s a common theme to most of those adventures, and that theme is: not asking questions up-front about how text was really being handled in code I was working on or with. I got into those situations, over and over again, because I was working in Python 2, and Python 2 didn’t make me ask those questions.

Python 3 does. Text-handling problems in Python 3 are up-front and in my face. And in your face. When you’re doing something with text that could be unsafe, Python 3 often tells you right away instead of letting you get away with it for the moment, meaning you discover many of those vectors for bugs immediately instead of at 3AM on a Sunday when your pager goes off.
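Here is a small sketch of what “tells you right away” looks like in practice. The try_concat() helper is a hypothetical wrapper that just captures the error so the example runs to completion.

```python
def try_concat(left, right):
    """Attempt left + right and report the TypeError instead of crashing."""
    try:
        return left + right
    except TypeError as exc:
        return "TypeError: %s" % exc


# Mixing text and bytes fails immediately, even when the bytes are pure ASCII.
print(try_concat("caf\u00e9", b" au lait"))

# Text and bytes never compare equal, even when they look the same.
print("abc" == b"abc")

# The fix is to decide, explicitly, what the bytes mean.
print(try_concat("caf\u00e9", b" au lait".decode("ascii")))
```

In Python 2 the equivalent concatenation would quietly succeed for ASCII bytes and blow up only when non-ASCII data eventually arrived, which is the 3AM-pager scenario.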

Does being forced to think about these problems up-front make the process of writing code harder? Yes. Are some domains now objectively more difficult to write code for due to the requirement to think about these problems up-front? Yes. But is it better to think about and solve these problems up-front? Yes. Do I think the tradeoff of having to handle text correctly, at the cost of more up-front work, is worth making? Absolutely. Would I ever willingly go back to how I used to write code, littered with text-handling land mines that I might not find until weeks, months or even years later, usually at the least convenient time possible? Not a chance.

So yeah. I’m still in favor of Python 3, string changes and all.