Let’s talk about usernames

February 11, 2018 Django, Pedantics, Programming, Python, Security, Unicode

A few weeks ago I released django-registration 2.4.1. The 2.4 series is the last in the django-registration 2.x line, and from here on out it’ll only get bugfixes. The master branch is now prepping for 3.0, which will remove a lot of the deprecated cruft that’s accumulated over the past decade of maintaining it, and try to focus on best practices for modern Django applications.

I’ll write more about that sometime soon, but right now I want to spend a little bit of time talking about a deceptively hard problem django-registration has to deal with: usernames. And while I could write this as one of those “falsehoods programmers believe about X” articles, my personal preference is to actually explain why this is trickier than people think, and offer some advice on how to deal with it, rather than just provide mockery with no useful context.

Aside: the right way to do identity

Usernames — as implemented by many sites and services, and by many popular frameworks (including Django) — are almost certainly not the right way to solve the problem they’re often used to solve. What we really want in terms of identifying users is some combination of:

System-level identifier, suitable for use as a target of foreign keys in our database
Login identifier, suitable for use in performing a credential check
Public identity, suitable for displaying to other users

Many systems ask the username to fulfill all three of these roles, which is probably wrong. A better approach is the tripartite identity pattern, in which each identifier is distinct, and multiple login and/or public identifiers may be associated with a single system identifier.
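As a rough illustration of the pattern, here is a minimal sketch in Python. The class and field names (Account, LoginIdentifier, PublicIdentity) are made up for this example and do not come from Django or any other library; a real system would back these with database tables and constraints.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Account:
    """System-level identifier: the only thing foreign keys point at."""
    id: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class LoginIdentifier:
    """Something a person presents at login (a username, an email address)."""
    account_id: uuid.UUID
    value: str


@dataclass
class PublicIdentity:
    """A name displayed to other users."""
    account_id: uuid.UUID
    display_name: str


# One account can have several login identifiers and a separate public name.
account = Account()
logins = [
    LoginIdentifier(account.id, "jane@example.com"),
    LoginIdentifier(account.id, "jane_doe"),
]
public = PublicIdentity(account.id, "Jane D.")
```

Because nothing but Account.id is referenced elsewhere, a login identifier or display name can be changed or added without touching the rest of the data.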

Many of the problems and pains I’ve seen with people trying to build and scale account systems have come down to ignoring this pattern. An unfortunate number of hacks have been built on top of systems which don’t have this pattern, in order to make them look or sort-of act as if they do.

So if you’re building an account system from scratch today in 2018, I would suggest reading up on this pattern and using it as the basis of your implementation. The flexibility it will give you in the future is worth a little bit of work, and one of these days someone might even build a good generic reusable implementation of it (I’ve certainly given thought to doing this for Django, and may still do it one day).

For the rest of this post, though, I’ll be assuming that you’re using a more common implementation where a unique username serves as at least a system and login identifier, and probably also a public identifier. And by “username” I mean essentially any string identifier; you may be using usernames in the sense that, say, Reddit or Hacker News do, or you might be using email addresses, or you might be using some other unique string. But no matter what, you’re probably using some kind of single unique string for this, and that means you need to be aware of some issues.

Uniqueness is harder than you think

You might be thinking to yourself, how hard can this be? We can just create a unique column and we’re good to go! Here, let’s make a user table in Postgres:

CREATE TABLE accounts (
    id SERIAL PRIMARY KEY,
    username TEXT UNIQUE,
    password TEXT,
    email_address TEXT
);

There’s our user table, there’s our unique username column. Easy!

Well, it’s easy until we start thinking about case. If you’re registered as john_doe, what happens if I register as JOHN_DOE? It’s a different username, but could I cause people to think I’m you? Could I get people to accept friend requests or share sensitive information with me because they don’t realize case matters to a computer?

This is a simple thing that a lot of systems get wrong. While researching this post, I discovered that Django’s auth system doesn’t enforce case-insensitive uniqueness of usernames, despite getting quite a lot of other things generally right in its implementation. There is a ticket for making usernames case-insensitive, but it’s WONTFIX now because making usernames case-insensitive would be a massive backwards-compatibility break and nobody’s sure whether or how we could actually do it. I’ll probably look at enforcing it in django-registration 3.0, but I’m not sure it’ll be possible to do even there: any site with existing case-sensitive accounts that bolts on a case-insensitive solution is asking for trouble.

So if you’re going to build a system from scratch today, you should be doing case-insensitive uniqueness checks on usernames from day one; john_doe, John_Doe, and JOHN_DOE should all be the same username in your system, and once one of them is registered, none of the others should be available.

But that’s just the start; we live in a Unicode world, and determining if two usernames are the same in a Unicode world is more complex than just doing username1 == username2. For one thing, there are composed and decomposed forms which are distinct when compared as sequences of Unicode code points, but render on-screen as visually identical to each other. So now you need to talk about normalization, pick a normalization form, and then normalize every username to your chosen form before you do any uniqueness checks.
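A small sketch of the composed/decomposed problem, using only the standard library. The name "josé" is just an illustrative value, and NFC is used here only to show the effect of normalizing.

```python
import unicodedata

composed = "jos\u00e9"     # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "jose\u0301"  # ends in "e" plus U+0301 COMBINING ACUTE ACCENT

# The two strings render the same but are different code point sequences.
same_before = composed == decomposed

# After normalizing both to one form, they compare equal.
same_after = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_before, same_after)  # False True
```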

You also need to consider non-ASCII characters when thinking about how to do your case-insensitive checks. Is StraßburgJoe the same user as StrassburgJoe? The answer you get will often depend on whether you do your check by normalizing to lowercase or uppercase. And then there are the different ways of decomposing Unicode; you can and will get different results for many strings depending on whether you use canonical equivalence or compatibility equivalence.
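Both effects are easy to see in a Python 3 shell. This is only a demonstration; the ligature U+FB01 is an arbitrary example of a character that canonical and compatibility normalization treat differently.

```python
import unicodedata

a = "Stra\u00dfburgJoe"  # contains U+00DF LATIN SMALL LETTER SHARP S
b = "StrassburgJoe"

lower_equal = a.lower() == b.lower()        # False: lower() leaves the sharp s alone
upper_equal = a.upper() == b.upper()        # True: upper() turns the sharp s into "SS"
fold_equal = a.casefold() == b.casefold()   # True: casefold() maps the sharp s to "ss"

ligature = "\ufb01"  # U+FB01 LATIN SMALL LIGATURE FI
nfc = unicodedata.normalize("NFC", ligature)    # canonical: ligature is kept
nfkc = unicodedata.normalize("NFKC", ligature)  # compatibility: becomes "fi"

print(lower_equal, upper_equal, fold_equal)  # False True True
print(nfc == ligature, nfkc)                 # True fi
```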

If all this is confusing — and it is, even if you’re a Unicode geek! — my recommendation is to follow the advice of Unicode Technical Report 36 and normalize usernames using NFKC. If you’re using Django’s UserCreationForm or a subclass of it (django-registration uses subclasses of UserCreationForm), this is already done for you. If you’re using Python but not Django (or not using UserCreationForm), you can do this in one line using a helper from the standard library:

import unicodedata

username_normalized = unicodedata.normalize('NFKC', username)

On Python 3, strings also have the casefold() method, which is what you want to use instead of lower() or upper() for doing case normalization.
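Putting the two steps together, a uniqueness check might look something like the sketch below. The names canonical_username and register are hypothetical, and the in-memory set stands in for what would normally be a unique database column holding the canonical form. Normalizing a second time after case folding is a precaution taken here because folding can, for some characters, produce text that is no longer in NFKC form.

```python
import unicodedata


def canonical_username(username: str) -> str:
    """Return a form of the username suitable for uniqueness comparison."""
    normalized = unicodedata.normalize("NFKC", username)
    folded = normalized.casefold()
    # Re-normalize: case folding may undo normalization for some characters.
    return unicodedata.normalize("NFKC", folded)


taken = set()  # stand-in for a unique column in a real database


def register(username: str) -> bool:
    """Claim the username; return False if an equivalent one already exists."""
    key = canonical_username(username)
    if key in taken:
        return False
    taken.add(key)
    return True


print(register("john_doe"))       # True
print(register("JOHN_DOE"))       # False
print(register("\uff4aohn_doe"))  # False: fullwidth "j" normalizes to "j"
```

In this sketch the canonical form is used only for comparison; the name as the user typed it can be stored separately for display.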

For other languages, look up a good Unicode library.

No, really, uniqueness is harder than you think

Unfortunately, that’s not the end of it. Case-insensitive uniqueness checks on normalized strings are a start, but won’t catch all the cases you probably need to catch. For example, consider the following username: jane_doe. Now consider another username: jаne_doe. Are these the same username?

In the typeface I’m using as I write this, and in the typeface my blog uses, they appear to be. But to software, they’re very much not the same, and still aren’t the same after Unicode normalization and case-insensitive comparison (whether you go to upper- or lower-case doesn’t matter).

To see why, pay attention to the second code point. In one of the usernames above, it’s U+0061 LATIN SMALL LETTER A. But in the other, it’s U+0430 CYRILLIC SMALL LETTER A. And no amount of Unicode normalization or case insensitivity will make those be the same code point, even though they’re often visually indistinguishable.
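You can check this yourself with the standard library. The snippet below writes the Cyrillic letter as an escape so it survives copy/paste, and the variable names are just for illustration.

```python
import unicodedata

latin = "jane_doe"
mixed = "j\u0430ne_doe"  # second code point is the Cyrillic letter

# Print the code point and Unicode name of the second character in each.
for name in (latin, mixed):
    ch = name[1]
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0061 LATIN SMALL LETTER A
# U+0430 CYRILLIC SMALL LETTER A


def fold(s: str) -> str:
    """NFKC-normalize and case-fold a string."""
    return unicodedata.normalize("NFKC", s).casefold()


# Normalization and case folding do not make the two strings equal.
still_different = fold(latin) != fold(mixed)
print(still_different)  # True
```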

This is the basis of the homograph attack, which first gained widespread notoriety in the context of internationalized domain names. And solving it requires a bit more work.

For network host names, one solution is to represent names in Punycode, which is designed to head off precisely this issue, and also provides a way to represent a non-ASCII name using only ASCII characters. Returning to our example usernames above, this makes the distinction between the two obvious. If you want to try it yourself, it’s a one-liner in Python. Here it is on the version which includes the Cyrillic ‘а’:

>>> 'jаne_doe'.encode('punycode')
b'jne_doe-2fg'

(if you have difficulty copy/pasting the non-ASCII character, you can also express it in a string literal as j\u0430ne_doe)

But this isn’t a real solution for usernames; sure, you could use Punycode representation whenever you display a name, but it will break display of a lot of perfectly legitimate non-ASCII names, and what you probably really want is to reject the above username during your signup process. How can you do that?

Well, this time we open our hymnals to Unicode Technical Report 39, and begin reading sections 4 and 5. Sets of code points which are distinct (even after normalization) but visually identical or at least confusingly similar when rendered for display are called (appropriately) “confusables”, and Unicode does provide mechanisms for detecting the presence of such code points.

The example username we’ve been looking at here is what Unicode terms a “mixed-script confusable”, and this is what we probably want to detect. In other words: an all-Latin username containing confusables is probably fine, and an all-Cyrillic username containing confusables is probably fine, but a username containing mostly Latin plus one Cyrillic code point which happens to be confusable with a Latin one… is not.

Unfortunately, Python doesn’t provide the necessary access to the full set of Unicode properties and tables in the standard library to be able to do this. But a helpful developer named Victor Felder has written a library which provides what we need, and released it under an open-source license. Using the confusable_homoglyphs library, we can detect the problem:

>>> from confusable_homoglyphs import confusables
>>> s1 = 'jane_doe'
>>> s2 = 'j\u0430ne_doe'
>>> bool(confusables.is_dangerous(s1))
False
>>> bool(confusables.is_dangerous(s2))
True

The actual output of is_dangerous(), for the second username, is a data structure containing detailed information about the potential problems, but what we care about is that it detects a mixed-script string containing code points which are confusable, and that’s what we want.

Django allows non-ASCII in usernames, but does not check for homograph problems. Since version 2.3, though, django-registration has had a dependency on confusable_homoglyphs, and has used its is_dangerous() function as part of the validation for usernames and email addresses. If you need to do user signups in Django (or generally in Python), and can’t or don’t want to use django-registration, I encourage you to make use of confusable_homoglyphs in the same way.
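If adding a dependency is not an option, a much cruder check can be improvised from the standard library. To be clear about what this is: the sketch below is not what confusable_homoglyphs or django-registration does. It guesses a character’s script from the first word of its Unicode name, which is an assumption that holds for Latin, Cyrillic and Greek letters but is not the real Unicode Script property, and it flags any mix of scripts whether or not the characters are actually confusable. It will, for example, flag legitimate Japanese names that combine kana and kanji.

```python
import unicodedata


def rough_scripts(text: str) -> set:
    """Guess the scripts in text from the first word of each letter's name.

    This is a heuristic, not the Unicode Script property.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_mixed_script(text: str) -> bool:
    """True if the letters in text appear to come from more than one script."""
    return len(rough_scripts(text)) > 1


print(looks_mixed_script("jane_doe"))                  # False
print(looks_mixed_script("j\u0430ne_doe"))             # True
print(looks_mixed_script("\u0438\u0432\u0430\u043d"))  # False: all Cyrillic
```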

Have I mentioned that uniqueness is hard?

Once we’re dealing with Unicode confusables, it’s worth also asking whether we should deal with single-script confusables. For example, paypal and paypa1 may be difficult to distinguish from one another, depending on your choice of typeface. So far, everything I’ve suggested is good general-purpose advice, but this is starting to get into things which are specific to particular languages, scripts or geographic regions, and should only be done with care and with the potential tradeoffs in mind (forbidding confusable Latin characters may end up with a higher false-positive rate than you’d like, for example). But it is something worth thinking about. The same goes for usernames which are distinct but still very similar to each other; you can check this at the database level in a variety of ways (Postgres, for example, ships with support for Soundex and Metaphone, as well as Levenshtein distance and trigram fuzzy matching), but again it’s going to be something you do on a case-by-case basis, rather than just something you should generally always do.
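For a feel of what a similarity check looks like outside the database, here is a small sketch using difflib from Python’s standard library. This swaps in difflib’s ratio as a stand-in; it is not Levenshtein distance or trigram matching, and the 0.8 threshold and the function names are arbitrary choices for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two usernames."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def similar_names(candidate: str, existing, threshold: float = 0.8):
    """Return the existing usernames that look too much like the candidate."""
    return [name for name in existing if similarity(candidate, name) >= threshold]


existing = ["paypal", "jane_doe"]
print(similar_names("paypa1", existing))  # ['paypal']
```

Comparing a candidate against every existing username does not scale, which is one reason to push this kind of check into the database when you need it at all.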

There is one more uniqueness issue I want to mention, though, and it primarily affects email addresses, which often get used as usernames these days (especially in services which rely on a third-party identity provider and use OAuth or similar protocols). So assume you’ve got a case for enforcing uniqueness of email addresses. How many distinct email addresses are listed below?

johndoe@example.com
johndoe+yoursite@example.com
john.doe@example.com

The answer is “it depends”. Most MTAs have long ignored anything after a + in the local-part when determining recipient identity, which in turn has led to many people using text after a + as a sort of ad hoc tagging and filtering system. And Gmail famously ignores dot (.) characters in the local-part, including in their custom-domain offerings, so it’s impossible without doing DNS lookups to figure out whether someone’s mail provider actually thinks johndoe and john.doe are distinct.

So if you’re enforcing unique email addresses, or using email addresses as a uniqueness-enforced user identifier, you need to be aware of this and you probably need to strip all dot characters from the local-part, along with + and any text after it, before doing your uniqueness check. Currently django-registration doesn’t do this, but I have plans to add it in the 3.x series.
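A minimal sketch of that stripping step follows. The function name canonical_email is hypothetical, and lowercasing the whole address is an extra assumption on top of the advice above. The result is meant only as a key for the uniqueness check; mail should still be sent to the address exactly as the user entered it.

```python
def canonical_email(address: str) -> str:
    """Return a comparison key for an email address.

    Drops "+tag" suffixes and dots from the local-part, then lowercases.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    local = local.split("+", 1)[0]  # drop "+" and anything after it
    local = local.replace(".", "")  # drop dots
    return f"{local.casefold()}@{domain.casefold()}"


addresses = [
    "johndoe@example.com",
    "johndoe+yoursite@example.com",
    "john.doe@example.com",
]
keys = {canonical_email(a) for a in addresses}
print(keys)  # {'johndoe@example.com'}
```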

Also, for dealing with Unicode confusables in email addresses: apply that check to the local-part and the domain separately. People don’t always have control over the script used for the domain, and shouldn’t be punished for choosing something that causes the local-part to be in a single script distinct from the domain; as long as neither the local-part nor the domain, considered in isolation, is a mixed-script confusable, the address is probably OK (and this is what django-registration’s validator does).
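The shape of that check can be sketched as follows. The function address_is_dangerous is hypothetical and takes the actual confusables test as a parameter; with confusable_homoglyphs installed you could pass something like lambda s: bool(confusables.is_dangerous(s)). The mixes_latin_and_cyrillic predicate is only a toy stand-in so the example runs on its own, and this is not a copy of django-registration’s code.

```python
from typing import Callable


def address_is_dangerous(address: str, is_dangerous: Callable[[str], bool]) -> bool:
    """Apply a confusables check to the local-part and domain separately."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"not an email address: {address!r}")
    return is_dangerous(local) or is_dangerous(domain)


def mixes_latin_and_cyrillic(text: str) -> bool:
    """Toy stand-in predicate: ASCII letters together with Cyrillic ones."""
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in text)
    return has_latin and has_cyrillic


cyrillic_local = "\u0438\u0432\u0430\u043d@example.com"  # all-Cyrillic local-part
spoofed_local = "j\u0430ne@example.com"                  # Latin plus one Cyrillic letter

print(mixes_latin_and_cyrillic(cyrillic_local))                        # True
print(address_is_dangerous(cyrillic_local, mixes_latin_and_cyrillic))  # False
print(address_is_dangerous(spoofed_local, mixes_latin_and_cyrillic))   # True
```

Checking the whole address at once would reject the first address, even though each part on its own is in a single script.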

There are a lot of other concerns you can have about usernames which are too similar to each other to be considered “distinct”, but once you deal with case-insensitivity, normalization, and confusables, you start getting into diminishing-returns territory pretty quickly, especially since many rules start being language-, script-, or region-specific. That doesn’t mean you shouldn’t think about them, just that it’s difficult to give general-purpose advice.

So let’s switch things up a bit and consider a different category of problem.

You should have reservations about some names

Many sites use the username as more than just a field in the login form. Some will create a profile page for each user, and put the username in the URL. Some might create email addresses for each user. Some might create subdomains. So here are some questions:

If your site puts the username in the URL of the user’s profile page, what would happen if I created a user named login? If I were to populate my profile with the text “Our log-in page has moved, please click here to log in”, with a link to my credential-harvesting site, how many of your users do you think I could fool?
If your site creates email addresses from usernames, what happens if I sign up as a user named webmaster or postmaster? Will I get email directed to those addresses for your domain? Could I potentially obtain an SSL certificate for your domain with the right username and auto-created email address?
If your site creates subdomains from usernames, what happens if I sign up as a user named www? Or smtp or mail?

If you think these are just silly hypotheticals, well, some of them have actually happened. And not just once, but multiple times. No really, these things have happened multiple times.

You can — and should — be taking some precautions to ensure that, say, an auto-created subdomain for a user account doesn’t conflict with a pre-existing subdomain you’re actually using or that has a special meaning, or that auto-created email addresses can’t clash with important/pre-existing ones.

But to really be careful, you should probably also just disallow certain usernames from being registered. I first saw this suggestion — and a list of names to reserve, and the first two articles linked above — in this blog post by Geoffrey Thomas. Since version 2.1, django-registration has shipped a list of reserved names, and the list has grown with each release; it’s now around a hundred items.

The list in django-registration breaks names down into a few categories, which lets you compose subsets of them based on your needs (the default validator combines all of them, but lets you override with your own preferred set of reserved names):

Hostnames used for autodiscovery/autoconfig of some well-known services
Hostnames associated with common protocols
Email addresses used by certificate authorities to verify domain ownership
Email addresses listed in RFC 2142 that don’t appear in any other subset of reserved names
Common no-reply email addresses
Strings which match sensitive filenames (like cross-domain access policies)
A laundry list of other potentially-sensitive names like contact and login

The validator in django-registration will also reject any username which begins with .well-known, to protect anything which uses the RFC 5785 system for “well-known locations”.
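A reserved-name check along these lines can be very small. The sketch below is an illustration only: the RESERVED set contains just the handful of names mentioned in this post, not django-registration’s actual list, and is_reserved is a made-up function name.

```python
# Illustrative subset only; a real list would be much longer.
RESERVED = frozenset({
    "www", "smtp", "mail",      # hostnames tied to common protocols
    "webmaster", "postmaster",  # well-known mailbox names
    "login", "contact",         # other potentially sensitive names
})


def is_reserved(username: str) -> bool:
    """True if the username should not be available for registration."""
    name = username.casefold()
    return name in RESERVED or name.startswith(".well-known")


print(is_reserved("Login"))             # True
print(is_reserved(".well-known/test"))  # True
print(is_reserved("jane_doe"))          # False
```

Folding case before the lookup keeps names like WWW or Postmaster from slipping past the list.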

As with confusables in usernames, I encourage you to copy from and improve on django-registration’s list, which in turn is based on and expanded from Geoffrey Thomas’ list.

It’s a start

The ideas above are not an exhaustive list of all the things you could or should do to validate usernames in sites and services you build, because if I started trying to write an exhaustive list, I’d be here forever. They are, though, a good baseline of things you can do, and I’d recommend you do most or all of them. And hopefully this has provided a good introduction to the lurking complexity of something as seemingly “simple” as user accounts with usernames.

As I’ve mentioned, Django and/or django-registration already do most of these, and the ones that they don’t are likely to be added at least to django-registration in 3.0; Django itself may not be able to adopt some of them soon, if ever, due to stronger backwards-compatibility concerns. All the code is open source (BSD license) and so you should feel free to copy, adapt or improve it.

And if there’s something important I’ve missed, please feel free to let me know about it; you can file a bug or pull request to django-registration on GitHub, or just get in touch with me directly.