Unicode in the real world

Published: November 10, 2007. Filed under: Django.

Django’s Unicode branch was merged back over the summer, but I imagine a lot of people are still using pre-Unicode versions of Django and so haven’t started updating their code yet (it was several months before I had fully updated this site, for example). For the most part, updating to work with post-Unicode Django is pretty easy (and made even easier by Malcolm’s handy checklist), but there are still some things you’ll need to be careful about. One in particular is how to handle external APIs you need to work with, which aren’t guaranteed to support or work properly with Django’s all-Unicode internals.

Unicode and bytestrings

First off, it’s important to understand the difference between the two types of strings Python currently uses: bytestrings, which are of type str, and Unicode strings, which are of type unicode. (This distinction will change in Python 3.0, where the current unicode type will become the only string type and raw bytes will get a separate bytes type.) Both inherit from a common parent, basestring, which gives you an easy way to identify things that are strings, regardless of which type of string they are.

Python bytestrings, as the name implies, correspond to a series of bytes which, in some particular encoding, represent a sequence of characters. The default most people will get is ASCII, an encoding which handles exactly 128 (English) characters, but you can create Python bytestrings in all sorts of encodings: ISO-8859-1 for a broader set of Western European characters, KOI8-R for Russian, GB2312 for Chinese, etc. Python bytestrings can also be converted between compatible encodings with some ease, by decoding from one and encoding to the other.
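As a rough illustration of moving a bytestring between encodings, here is a minimal sketch. It is written in Python 3 notation, where bytes plays the role of the bytestring type and str is the Unicode type; the variable names are made up for the example.

```python
# "café" as an ISO-8859-1 bytestring: one byte per character.
raw_latin1 = b"caf\xe9"

# Decode to a Unicode string, then encode again in a different encoding.
text = raw_latin1.decode("iso-8859-1")
raw_utf8 = text.encode("utf-8")

# The same four characters now occupy five bytes, because UTF-8
# uses two bytes for the accented "e".
same_length = len(raw_latin1) == len(raw_utf8)

# Not every encoding can represent every character: ASCII has no "é".
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The bytes differ from one encoding to the next even though the text they represent is the same, which is why you always need to know which encoding a given bytestring is in.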

Unicode strings, on the other hand, correspond to sequences of Unicode characters and, contrary to popular belief, Unicode is not an encoding; sequences of Unicode characters, by themselves, don’t translate into any particular set of bytes you could send over a network connection, for example. In order to pass Unicode text out of your programs (and, potentially, into other programs or services), you need to encode it into a series of bytes, using a “Unicode transformation format” such as UTF-8. One of the best introductions to this sort of thing for programmers is Joel Spolsky’s “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”, which I highly recommend you read if you’re not that familiar with how Unicode works in programming languages.
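To make the "Unicode is not an encoding" point concrete, the sketch below (again in Python 3 notation, with illustrative variable names) takes one Unicode string and encodes it with two different transformation formats. The characters are the same in both cases; only the bytes differ.

```python
# Two Chinese characters, written as escapes so the source stays ASCII.
text = "\u4f60\u597d"

# A Unicode string is a sequence of code points, not of bytes.
code_points = [ord(ch) for ch in text]

# The same code points produce different bytes depending on the format chosen.
as_utf8 = text.encode("utf-8")        # three bytes per character here
as_utf16 = text.encode("utf-16-be")   # two bytes per character here

# Decoding with the matching format recovers the original text either way.
round_trip_ok = (as_utf8.decode("utf-8") == text
                 and as_utf16.decode("utf-16-be") == text)
```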

Internally, Django works with Unicode strings; data coming in from HTTP requests, databases or files will be converted to unicode before you ever see it, and data going out will be converted to an appropriate encoding, so that you largely don’t have to worry about character-encoding issues. But this is something you need to be aware of and, since your application may be doing other types of input or output or working with other software, you’ll need to know how to work around things that absolutely need one particular type of string.

Working with third-party software

Unfortunately, not all the Python software you’ll encounter in the real world can deal gracefully with Unicode strings (some can’t even deal gracefully with non-ASCII bytestrings, but that’s another matter). So you’ll sometimes need to manually convert the Unicode strings Django deals with into bytestrings before you pass them into functions from third-party Python modules you’re using.

A good example of this is the Python port of the Textile text-to-HTML converter, which (as of the time I’m writing this) really doesn’t like being given Unicode strings. It will, however, deal well with UTF-8-encoded bytestrings, which means that if you convert to UTF-8 you’ll be fine. The “textile” template filter in django.contrib.markup does this in a pretty robust way, and in doing so demonstrates the use of a couple of handy utility functions Django gives you for converting between bytestrings and Unicode strings:

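from django import template
from django.conf import settings
from django.utils.encoding import smart_str, force_unicode

def textile(value):
    try:
        import textile
    except ImportError:
        # Fail loudly during development; fall back to the raw text otherwise.
        if settings.DEBUG:
            raise template.TemplateSyntaxError("Error in {% textile %} filter: The Python textile library isn't installed.")
        return force_unicode(value)
    else:
        # Textile wants bytestrings: encode on the way in, decode on the way out.
        return force_unicode(textile.textile(smart_str(value), encoding='utf-8', output='utf-8'))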

The smart_str() and force_unicode() functions both come from the module django.utils.encoding (covered in Django’s Unicode documentation), and do pretty much what you’d expect: smart_str() converts its input to a bytestring, and force_unicode() converts its input to a Unicode string. What the Textile filter does is use smart_str() to create a UTF-8-encoded bytestring which is fed into Textile, then use force_unicode() on the result to get it back to a Unicode string again.

In general, this is a useful pattern for dealing with third-party software that doesn’t handle Unicode gracefully.
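Stripped of the Textile specifics, the pattern looks something like the sketch below. To keep it runnable without Django installed, to_bytes() and to_text() are hypothetical stand-ins for smart_str() and force_unicode() (they are not Django's implementations), and legacy_paragraph() is a made-up function playing the part of a third-party library that only accepts bytestrings. It uses Python 3 notation, where bytes is the bytestring type.

```python
def to_bytes(value, encoding="utf-8"):
    """Hypothetical stand-in for smart_str(): always return a bytestring."""
    if isinstance(value, bytes):
        return value
    return str(value).encode(encoding)


def to_text(value, encoding="utf-8"):
    """Hypothetical stand-in for force_unicode(): always return Unicode text."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


def legacy_paragraph(data):
    """Made-up third-party function that refuses anything but bytestrings."""
    if not isinstance(data, bytes):
        raise TypeError("bytestring required")
    return b"<p>" + data + b"</p>"


def call_with_bytes(func, value, encoding="utf-8"):
    """Encode the input, call the bytes-only function, decode the result."""
    return to_text(func(to_bytes(value, encoding)), encoding)


html = call_with_bytes(legacy_paragraph, "na\u00efve caf\u00e9")
```

The important detail is using the same encoding in both directions; if the library returns bytes in a different encoding from the one you passed in, the decode step has to match what it returns.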

Using web-based APIs

Regardless of the software being used to implement them, web-based APIs need to work with bytestrings: since Unicode is not an encoding, and doesn’t translate naturally into a series of bytes to be sent over the wire, you can’t just fire off an HTTP request to an API and hand it Unicode strings. So it’s a good idea to convert to a bytestring before calling any function which talks to a web-based API; the function may or may not already be smart enough to convert to a bytestring, so doing the conversion yourself ensures things will work.
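As a minimal sketch of what "convert before calling" means for an HTTP POST, the example below builds a form-encoded request body from Unicode values using only the standard library. It is in Python 3 notation, and build_post_body() and the field name are hypothetical, chosen just for illustration.

```python
from urllib.parse import urlencode


def build_post_body(fields, encoding="utf-8"):
    """Turn a dict of Unicode strings into bytes suitable for an HTTP POST body."""
    # Encode every value explicitly, so nothing downstream has to guess.
    encoded = [(key, value.encode(encoding)) for key, value in fields.items()]
    # urlencode() percent-escapes the bytes; the result is pure ASCII.
    return urlencode(encoded).encode("ascii")


body = build_post_body({"comment": "caf\u00e9 au lait"})
```

Here body is the bytestring comment=caf%C3%A9+au+lait, which can go over the wire as-is. The receiving service still has to know the values were UTF-8 before escaping, which is usually stated in the API's documentation or in the request's Content-Type header.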

One example I’ve frequently worked with is the Python interface to the Akismet spam-filtering service, which I use (via support for it built into comment_utils) to fight comment spam here. Since an incoming comment will be a Unicode string, comment_utils converts to a bytestring before handing off to Akismet:

if self.akismet:
    from akismet import Akismet
    from django.utils.encoding import smart_str
    akismet_api = Akismet(key=settings.AKISMET_API_KEY,
                          blog_url='http://%s/' % Site.objects.get_current().domain)
    if akismet_api.verify_key():
        akismet_data = { 'comment_type': 'comment',
                         'referrer': '',
                         'user_ip': comment.ip_address,
                         'user_agent': '' }
        if akismet_api.comment_check(smart_str(comment.comment), data=akismet_data, build_data=True):
            return True

This uses smart_str() to get a UTF-8 bytestring suitable for the HTTP POST to the Akismet service and — since we’re expecting to get back a boolean from the API and don’t care about reading strings back from it — doesn’t need to worry about converting the response back to Unicode.

Another example is posting to del.icio.us, using pydelicious; every link I post here is also submitted to my del.icio.us account via this API and, again, I just use smart_str() to convert the various values (three of them in this case: the link’s title, its URL and a list of tags).

Other useful bits

Django’s Unicode documentation covers Unicode handling in detail, and provides full information on the utility functions above, as well as some other useful functions (like iri_to_uri(), useful for generating URLs with escaped non-ASCII characters). If you haven’t already, you should give it a read.
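To give a feel for what escaping non-ASCII characters in a URL involves, here is a rough approximation using only the standard library. It is not Django's iri_to_uri() and makes no attempt to match its exact rules; escape_non_ascii() and its list of safe characters are assumptions made for the sake of the example (Python 3 notation).

```python
from urllib.parse import quote


def escape_non_ascii(iri):
    """Percent-escape the UTF-8 bytes of anything outside a small safe set."""
    # Characters listed in `safe` keep their meaning as URL delimiters;
    # "%" is included so already-escaped sequences are not escaped twice.
    return quote(iri, safe="/:?=&#%")


uri = escape_non_ascii("/blog/caf\u00e9/?q=na\u00efve")
```

With this input, uri comes out as /blog/caf%C3%A9/?q=na%C3%AFve: each non-ASCII character is replaced by the percent-escaped form of its UTF-8 bytes, and the result is plain ASCII.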