Media moves

February 7, 2008 Django, Meta

As of sometime this weekend when the DNS change propagated, all static/media files for this site migrated over to Amazon’s S3 storage service; when I relaunched back in September, media was coming off a Joyent BingoDisk account. While a change in media serving technology really isn’t all that sexy or exciting compared to some aspects of web development (or as popular as writing about whatever Microsoft is doing today), I’d like to take a moment to talk about this move, why I decided to go with S3, and point out a few tricks which, hopefully, might be useful to you at some point.

Refresher: separate media servers are good

Officially, the Django documentation recommends that, in production deployment situations, you have one web server handling requests which will be processed by Django and another handling requests for “media”: static files like images, stylesheets, JavaScript, etc. This doesn’t mean you need to have two physically separate computers, just that you should have two separate HTTP daemons running, with one serving Django and one serving your media.

This is mostly because the preferred deployment situation for Django over its history has been Apache and mod_python, which means that each Apache process is carrying around an embedded Python interpreter, plus an in-memory copy of Django, your applications and any other Python code you happen to be using. In that situation, taking one of those processes and using it to simply read a file off disk and shove it down the wire is an incredible waste; you’re using up valuable system resources to get Django running inside your web server, so you want those processes to be actually doing things with Django as often as possible.

Of course, under alternative deployment setups such as FastCGI or mod_wsgi (which is likely to become the “winner” in terms of deploying Python applications behind Apache) this isn’t necessarily a problem: under FastCGI, or under mod_wsgi in daemon mode, all that code that used to be embedded inside Apache is instead running in a separate external process. In this case the hit isn’t so bad, though it’s still present: you’re still tying up a process which could be talking to your external Python/Django processes.

And even if that doesn’t bother you too much, it’s still worth noting that, typically, the types of web server configurations which give the best performance for Django are not necessarily the same as the configurations which give the best performance for straight-up file serving; keeping the two separate means they can be tuned separately for the best possible results.

On BingoDisk

But regardless of how sensible it is or how much of an impact it’ll have on performance, the recommendation to use a separate media server has always been a bit of a stumbling block for first-time Django deployment. When I was setting this site up on its new server, I knew I didn’t want to deal with configuring and maintaining a separate media server, so I started shopping around for a good file-hosting service. Given the well-known fact that I am Joyent’s bitch, I naturally gravitated first to their offering: BingoDisk, which seemed like the ideal option.

And I liked it at first; Bingo had a couple of “quick win” features which initially won me over:

It supports pointing any domain you like at your BingoDisk account via a DNS CNAME record; I set up media.b-list.org as a CNAME to my BingoDisk account, and everything just worked.
You don’t pay for bandwidth on Bingo, only for storage (generally speaking, bandwidth is cheap these days, but quality storage isn’t and probably never will be).
The upload/download interface is WebDAV, which is supported out of the box in most operating systems and file managers these days.

For the most part I’ve been happy with Bingo (even with the recent downtime; I was able to quickly set up an alternate media server to keep things chugging along while that was happening). But there are a few things that I kept tripping over which ultimately led me to migrate away:

Everything on your Bingo account is locked away behind HTTP authentication, except for one automatically-created folder called public. So anything you want to serve to the general public has to go in there, and even that arrangement sometimes causes HTTP login boxes to appear where they aren’t wanted (largely because browsers want to request /favicon.ico which, if you even create it, would live in the non-public area).
WebDAV starts out looking neat, and quickly becomes a pain. It’s an extremely chatty protocol, and though it’s supported by modern file-management tools it’s often unbearably slow; getting a useful listing (including file descriptions and metadata) from a directory with a lot of files can take several minutes in OS X’s Finder, for example.
Bingo — so far as I can tell — doesn’t really offer much control over how your files are served, which can impair some useful optimization tricks.

I still like the idea of Bingo, especially as a way for non-technical people to easily share files with each other or just between multiple computers (which, to be fair, is really what it’s best suited for), but a little while back I decided it was time to look elsewhere for a media-serving solution.

On S3

The solution I settled on, in the end, was S3. The factors which won me over were:

The CNAME trick works just as well on S3 as it does on Bingo (and, in fact, I think “point a CNAME at us” is a must-have feature for web-based services these days), though it actually points to an S3 “bucket” (essentially a named container into which you can put things) of the same name as the domain you’re pointing at it.
The pricing policy is about the best you’ll find: instead of a flat subscription rate, you only pay for what you’re actually using (and at quite reasonable prices, too).
Access control is fine-grained, and it’s easy to make everything public.
It’s extremely easy to specify how you want the file to be served, and the REST API does a lot of things with nothing but standard HTTP headers (allowing you to set things like Expires, Content-Type, Content-Encoding, etc.).
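
To make the CNAME arrangement a bit more concrete, here is a rough sketch of how a bucket named after a domain maps onto URLs. The helper name s3_urls and the img/logo.png key are made up for illustration; the only thing assumed about S3 itself is the bucket-name.s3.amazonaws.com hostname convention.

```python
def s3_urls(bucket, key):
    """Return (cname_url, s3_url) for an object in a bucket named after a domain.

    Hypothetical helper: it only builds strings and never talks to S3.
    """
    key = key.lstrip("/")  # S3 keys do not start with a slash
    # URL visitors see once the domain's CNAME points at S3.
    cname_url = f"http://{bucket}/{key}"
    # The same object addressed through S3's own hostname.
    s3_url = f"http://{bucket}.s3.amazonaws.com/{key}"
    return cname_url, s3_url


print(s3_urls("media.b-list.org", "img/logo.png"))
```

Because the bucket has to share its name with the domain, moving to a different hostname later would presumably mean creating a new bucket rather than just changing DNS.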

The downside is that all interaction with the service is through an HTTP-based API, so you need to write a little code to automate the process. Fortunately, Amazon provides a Python library which speaks S3’s REST API, and from there it’s a simple matter to write an uploader script; Adrian has posted the one he used to upload chicagocrime.org’s media, if you’re looking for an example.

I ended up writing my own upload script (like weblog applications, I have a feeling that S3 uploaders are a sort of “rite of passage” script that every web developer will want to write in a slightly different way), and ended up with a pretty simple process. My script takes a filename, an optional key prefix (S3 has no concept of directories, but you can specify a “key” for a file which looks like a directory path + filename) and an optional bucket name (defaulting to the one for this site’s media). I’ve got local copies of everything arranged correctly, so I could drop into a particular directory (say, img) on my laptop, and run a quick bash one-liner (you do know your way around a Unix shell, right? That’s one of the most underappreciated skills a web developer can have):

for filename in *.*; do s3_upload -f "$filename" -k img/; done

And voilà: all my images upload to S3.
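
Since S3 keys only look like paths, the prefix handling boils down to string concatenation. The following is a minimal sketch of how a prefix and a local filename might be combined; build_key is a hypothetical name, and this is not the actual s3_upload script.

```python
import os


def build_key(filename, prefix=""):
    """Join an optional key prefix and a local filename into an S3 key.

    Hypothetical helper for illustration only.
    """
    name = os.path.basename(filename)  # drop any local directory part
    if prefix and not prefix.endswith("/"):
        prefix += "/"  # make "img" and "img/" behave the same
    return prefix + name


print(build_key("logo.png", "img/"))
```

Keeping the local directory layout identical to the key layout is what makes the one-liner above workable: the directory you are sitting in tells you which prefix to pass.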

Tips and tricks

While I was migrating, I decided to take advantage of the opportunity to do a little performance tuning, with some help from YSlow; it had been giving this site a “B” grade for months, and a lot of its complaints (OK, all of its complaints) were related to the way static files were being served.

One thing YSlow had been griping at me about for a while was the lack of an HTTP Expires header on some files. S3 allows you to send an Expires header along with an upload through the REST API, so I simply built that into my uploader: everything I upload to S3 gets an Expires date set two years in the future. So, for example, if you do a curl -I on my stylesheet, you’ll see something like Expires: Thu, 15 Apr 2010 20:00:00 GMT come back in the response headers.
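
For the curious, producing a correctly formatted Expires value is only a few lines with the standard library. This is a sketch assuming "two years" means 730 days; far_future_expires is a made-up name, and the resulting string would still need to be passed along as a header by whatever code performs the upload.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def far_future_expires(now=None, days=730):
    """Return an HTTP-date string roughly two years after `now`.

    Hypothetical helper; 730 days is an assumption standing in for "two years".
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # usegmt=True gives the "GMT" suffix HTTP expects; it requires a UTC datetime.
    return format_datetime(now + timedelta(days=days), usegmt=True)


print(far_future_expires())
```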

If this seems like a bad idea — stylesheets have a tendency to change more often than every two years, after all — take a look at the URL of the stylesheet: the filename is b-list-20080206.css. It was preceded by b-list-20080203.css, which was preceded by b-list-20071218.css, and so on; each revision of the stylesheet is actually a unique file with a datestamped filename. This strikes a nice balance between being able to specify a date far in the future for the file to expire from caches (not that any browser actually would cache a stylesheet for two years) and having an easy way to force an update for browsers which exhibit strange caching behavior (I’m looking at you, Safari; if any browser ever does respect that two-year Expires header, it’ll be Safari). The logo up in the header and a few other stylistic images work the same way.
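
Generating those datestamped names is easy to automate. Here is a small sketch (datestamped_name is a hypothetical helper, not part of the uploader described above) that inserts a YYYYMMDD stamp before the extension:

```python
import os
from datetime import date


def datestamped_name(filename, when):
    """Insert a YYYYMMDD stamp before the file extension.

    Hypothetical helper for illustration.
    """
    root, ext = os.path.splitext(filename)
    return f"{root}-{when:%Y%m%d}{ext}"


print(datestamped_name("b-list.css", date(2008, 2, 6)))
```

The catch is that whatever references the file (a template, in this case) has to be updated to the new name each time, which is the price of never having to worry about stale caches.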

I don’t honestly remember where I first saw that trick, but I’ve used it off and on now for a while and highly recommend it or a variation; my favorite is probably Ryan Tomayko’s implementation, which adds a querystring to his stylesheet’s URL, containing the Unix timestamp of the last change (which I assume is generated by checking the mtime of the file, since it matches up with the Last-Modified header his server is sending out).
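
A guess at what the querystring variation might look like in Python follows. To be clear, this is not Ryan's actual code, just a sketch of the mtime idea; versioned_url and the example URL are invented for illustration.

```python
import os


def versioned_url(url, path):
    """Append the file's last-modified time (Unix timestamp) as a querystring.

    Hypothetical helper: `path` is the local copy of the file served at `url`.
    """
    mtime = int(os.path.getmtime(path))
    return f"{url}?{mtime}"


# os.curdir is used here only because it is guaranteed to exist.
print(versioned_url("/css/site.css", os.curdir))
```

Unlike datestamped filenames, this keeps a single file on disk; the URL changes automatically whenever the file does.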

Another useful thing YSlow kindly reminded me about is compressed file serving; all of my Django-generated content is already being gzipped server-side, and adding that to my stylesheet (the only static file I serve where compression makes sense) was something I’d initially overlooked. Serving gzipped files with S3 is pretty easy, as it turns out; you just gzip the file contents before uploading, and send along a Content-Encoding header with a value of gzip.

Python offers two ways to gzip something programmatically: a low-level interface in the “zlib” module, and a higher-level file-like interface in the “gzip” module. If you’re unsure which way to go, you probably want to use the gzip module; the interface is a lot easier to work with. And if you’re lazy like me, you can copy/paste this seven-line function from django.utils.text, which is used by Django’s GZipMiddleware:

def compress_string(s):
    import cStringIO, gzip
    zbuf = cStringIO.StringIO()
    zfile = gzip.GzipFile(mode='wb', compresslevel=6, fileobj=zbuf)
    zfile.write(s)
    zfile.close()
    return zbuf.getvalue()
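
The function above relies on cStringIO, which only exists in Python 2. On Python 3, a roughly equivalent sketch works on bytes and uses io.BytesIO; compress_bytes is a name chosen here for illustration and is not part of Django.

```python
import gzip
import io


def compress_bytes(data, level=6):
    """Gzip a bytes object in memory and return the compressed bytes.

    A Python 3 sketch of the same idea, not Django's implementation.
    """
    buf = io.BytesIO()
    # Closing the GzipFile flushes the gzip trailer but leaves buf open.
    with gzip.GzipFile(mode="wb", compresslevel=level, fileobj=buf) as zfile:
        zfile.write(data)
    return buf.getvalue()


css = b"body { color: red; }" * 50
print(len(css), "->", len(compress_bytes(css)))
```

Note that zlib.compress() on its own produces a zlib-wrapped stream rather than the gzip format, which is one more reason to reach for the gzip module when the result will be served with Content-Encoding: gzip.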

Since the stylesheet is the only thing I want to compress, I ended up special-casing this in my S3 uploader; like Adrian, I’m using mimetypes.guess_type() to work out the correct Content-Type for each uploaded file, so I just have my uploader watch for a value of text/css, which it will take as a cue to compress the file and add a Content-Encoding header.
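
Put together, the special-casing might look something like the sketch below. It is written for Python 3, prepare_upload is a hypothetical name, and the fallback to application/octet-stream for unknown types is an assumption; the returned body and headers would then be handed to whatever code actually talks to S3.

```python
import gzip
import mimetypes


def prepare_upload(filename, data):
    """Return (body, headers) for an upload, gzipping stylesheets only.

    Hypothetical helper; it does not perform the upload itself.
    """
    content_type, _ = mimetypes.guess_type(filename)
    # Assumed fallback for files whose type cannot be guessed.
    headers = {"Content-Type": content_type or "application/octet-stream"}
    if content_type == "text/css":
        data = gzip.compress(data)
        headers["Content-Encoding"] = "gzip"
    return data, headers


body, headers = prepare_upload("b-list-20080206.css", b"body { margin: 0; }")
print(headers)
```

One thing to keep in mind with this approach: the compressed copy is the only copy S3 has, so every client gets gzipped content whether or not it asked for it.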

And now (with a little tweaking of a preference value so it won’t complain that I’m not using a CDN; see the YSlow documentation for details), thanks to those fairly-minor tweaks, YSlow gives me an “A” grade; the only thing it still complains about is the non-gzipped, non-ETagged, lacking-an-Expires-header JavaScript include from the stats tracker I use, and there’s not much I can do about that (Google Analytics passes YSlow with flying colors, but doesn’t offer nice real-time stats). There’s even a nice psychological illusion that the site feels faster.

The future: file uploads

The only thing missing from S3 right now is actually something missing from Django: the ability to have file uploads go from Django straight to S3, instead of requiring them to be stored locally (right now, your options for storing files on a different machine basically consist of NFS, NFS, and… well, NFS).

Fortunately, Marty has been working steadily on a plan to allow Django to use pluggable file-storage backends, much as it already allows pluggable authentication and session backends, and the logical first use for that would be an S3 backend. As soon as that lands in trunk, S3 (and services like it) will constitute an easy, affordable and full-featured option for Django media serving.