
Compare strings the right way

Published on: December 23, 2023    Categories: Django, Python, Unicode

This is part of a series of posts I’m doing as a sort of Python/Django Advent calendar, offering a small tip or piece of information each day from the first Sunday of Advent through Christmas Eve. See the first post for an introduction.

Unicode’s unique complexity

It is the year 2023 (almost 2024!), and hopefully you’re using a programming language that is fully Unicode-aware. Python is: its string type is a sequence of Unicode code points, and has been since Python 3.3. (Yes, 3.3; see this older post for details of why 3.3 and not 3.0, but the short version, for those familiar with Unicode, is that in Python 3.0, 3.1, and 3.2 the string type, like the unicode type of Python 2, was a sequence of code units, usually UTF-16 code units but sometimes UTF-32, and didn’t change to code points until Python 3.3.)

So let’s get some terminology down right away: a “code point” is the atomic unit of Unicode, an abstract entity of which properties can be asserted. It is not the same as a “character”, at least not in the way people typically use that word. The closest Unicode equivalent of a “character” is a “grapheme”, which might be one or multiple code points (or one code point might be multiple graphemes; it can get complicated, because human language and writing are complicated, and the older post linked above has examples).
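To make the distinction concrete, here is a quick illustrative snippet (the specific characters are just examples) showing that Python’s len() counts code points, not graphemes:

```python
# One grapheme is not necessarily one code point.
flag = "\U0001F1FA\U0001F1F8"  # the US flag emoji: two "regional indicator" code points
accented = "e\u0301"           # "é" written as "e" plus a combining acute accent

print(len(flag))      # 2
print(len(accented))  # 2
```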

But to be useful to a computer, these abstractions need to be encoded, transforming them into sequences of bytes, typically using a Unicode Transformation Format, abbreviated “UTF”. Common transformation formats are UTF-8, UTF-16, and UTF-32. A transformation format in turn consists of “code units”, the native-sized units of the format. UTF-16’s code units, for example, are two bytes (16 bits). But UTF-16 is also a variable-width encoding, so some code points use one code unit while others use two, according to a somewhat complex scheme that was retrofitted on when Unicode first expanded past the number of code points representable by a 16-bit integer.
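As a sketch of how code points map to code units, compare the encoded byte lengths of a couple of arbitrary example characters:

```python
# "é" (U+00E9) is one code point, but its encoded size varies by format.
print("\u00e9".encode("utf-8"))           # b'\xc3\xa9': two UTF-8 code units
print(len("\u00e9".encode("utf-16-le")))  # 2 bytes: one 16-bit code unit
print(len("\u00e9".encode("utf-32-le")))  # 4 bytes: one 32-bit code unit

# U+1D11E (a musical G clef symbol) is outside the 16-bit range, so
# UTF-16 needs two code units for it (a "surrogate pair").
print(len("\U0001D11E".encode("utf-16-le")))  # 4 bytes: two code units
```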

And Unicode often provides more than one way to write the same thing. Mostly this is because Unicode tries to maintain round-trip compatibility with older character sets and encoding systems; the idea is that you should be able to convert from the old system to Unicode and back again and get exactly the input you started with. But this means, for example, that if you want to write “é” you can either write it as a “decomposed” sequence made up of an “e” and a combining accent mark, or as a single “composed” code point which exists for compatibility with older character sets in which each character/accent combination got its own separate representation.
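You can see this directly in Python: the composed and decomposed spellings of “é” are different strings as far as == is concerned.

```python
composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

print(composed, decomposed)    # both display as "é"
print(composed == decomposed)  # False: different code-point sequences
```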

And in fact there are two types of equivalence in Unicode: canonical equivalence and compatibility equivalence. Canonical equivalence applies to, say, the two different ways of writing “é”, because both produce the same visual result and are always interchangeable with one another. Compatibility equivalence applies to things which don’t necessarily look identical and aren’t always interchangeable; for example, the three-code-point sequence 1⁄2 and the single code point ½ have compatibility equivalence.

Unicode provides a concept of normalization for dealing with this, making sequences that “should be” equivalent actually equivalent, and defines rules for four normalization forms (NFC, NFD, NFKC, and NFKD), depending on which type of equivalence you want to use and whether you want the result in composed or decomposed form.
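A brief sketch of both kinds of equivalence, using the standard library’s unicodedata.normalize() (the same characters as in the examples above):

```python
from unicodedata import normalize

composed, decomposed = "\u00e9", "e\u0301"

# Canonical equivalence: NFC composes, NFD decomposes.
print(normalize("NFC", decomposed) == composed)  # True
print(normalize("NFD", composed) == decomposed)  # True

# Compatibility equivalence: NFKC rewrites ½ (U+00BD) as the
# three-code-point sequence "1⁄2" (with U+2044 FRACTION SLASH).
print(normalize("NFKC", "\u00bd") == "1\u20442")  # True
```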

There are also multiple ways of handling case in Unicode; I’ve written in detail about case in Unicode before, but the gist is that there are several definitions of it which can give different answers to questions like “is this uppercase?” or “is this lowercase?”, depending on what you need (for example, you might want to always get an answer even if a given code point is not really “cased”). Unicode defines case mappings, and in particular defines one called “case folding”, which has the useful property that two sequences of code points which differ only in case will be equal after both have been case folded. Don’t fall into the trap of thinking case folding is lowercasing: although many case-folded strings look lowercase to speakers of many European languages, there are scripts, Cherokee for example, in which case folding produces an uppercase-looking result.
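A small sketch of the difference between lowercasing and case folding; German “ß” is the classic example, and the Cherokee letters shown (assuming a Python whose Unicode database includes Cherokee lowercase, i.e. 3.5+) illustrate that folded strings aren’t always lowercase:

```python
# casefold() is not lower(): German ß folds to "ss".
print("Straße".lower())     # 'straße'
print("Straße".casefold())  # 'strasse'
print("STRASSE".casefold() == "Straße".casefold())  # True

# Cherokee: the upper- and lowercase letters "A" (U+13A0, U+AB70)
# still fold to the same string, even though the folded form is the
# uppercase-looking letter.
print("\u13a0".casefold() == "\uab70".casefold())  # True
```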

Getting to the (code) point

All of which means that, unfortunately, comparing two strings in a Unicode-aware world isn’t as simple as using the == operator: you might be dealing with strings that ought to be considered equivalent but aren’t identical code-point sequences, or you might need the comparison to ignore case, and so on.

The unicodedata module in the Python standard library provides an implementation of Unicode’s normalization rules, and Python’s str type provides a casefold() method in addition to the lower() and upper() (and title()) case transformations. If you’re unsure what to pick, a good general piece of advice is to use normalization form NFKC unless and until you know you want something else, and to case fold after normalizing for comparisons which need to be case-insensitive. So you could do something like this:

from unicodedata import normalize

def compare(s1: str, s2: str) -> bool:
    # Normalize both strings to the same Unicode normalization form,
    # then case fold for a case-insensitive comparison.
    return normalize("NFKC", s1).casefold() == normalize("NFKC", s2).casefold()

This is about as safe as you can get, within just the Python standard library, for comparing things like identifiers (say, usernames) without running into problems with case or equivalence. Values which will be used for comparison should also generally be stored in normalized form — the default user model in django.contrib.auth, for example, will normalize the username, and the AbstractBaseUser class provides a normalize_username() method which performs the correct normalization (though it does not case-fold the username).
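For example, a compatibility-equivalent, differently-cased pair comes out equal under this scheme (the compare() function from above is re-declared here so the snippet runs on its own):

```python
from unicodedata import normalize

def compare(s1: str, s2: str) -> bool:
    return normalize("NFKC", s1).casefold() == normalize("NFKC", s2).casefold()

print("\u00bd KG" == "1\u20442 kg")         # False: plain equality
print(compare("\u00bd KG", "1\u20442 kg"))  # True: normalized and case folded
```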