As of sometime this weekend when the DNS change propagated, all static/media files for this site migrated over to Amazon’s S3 storage service; when I relaunched back in September, media was coming off a Joyent BingoDisk account. While a change in media serving technology really isn’t all that sexy or exciting compared to some aspects of web development (or as popular as writing about whatever Microsoft is doing today), I’d like to take a moment to talk about this move, why I decided to go with S3, and point out a few tricks which, hopefully, might be useful to you at some point.
Officially, the Django documentation recommends that, in production deployment situations, you have one web server handling requests which will be processed by Django and another handling requests for “media”: static files like images, stylesheets, JavaScript, etc. This doesn’t mean you need to have two physically separate computers, just that you should have two separate HTTP daemons running, with one serving Django and one serving your media.
This is mostly because the preferred deployment situation for Django over its history has been Apache and mod_python, which means that each Apache process is carrying around an embedded Python interpreter, plus an in-memory copy of Django, your applications and any other Python code you happen to be using. In that situation, taking one of those processes and using it to simply read a file off disk and shove it down the wire is an incredible waste; you’re using up valuable system resources to get Django running inside your web server, so you want those processes to be actually doing things with Django as often as possible.
Of course, under alternative deployment setups such as FastCGI or mod_wsgi (which is likely to become the “winner” in terms of deploying Python applications behind Apache) this isn’t necessarily a problem: under FastCGI, or under mod_wsgi in daemon mode, all that code that used to be embedded inside Apache is instead running in a separate external process. In this case the hit isn’t so bad, though it’s still present: you’re still tying up a process which could be talking to your external Python/Django processes.
And even if that doesn’t bother you too much, it’s still worth noting that, typically, the types of web server configurations which give the best performance for Django are not necessarily the same as the configurations which give the best performance for straight-up file serving; keeping the two separate means they can be tuned separately for the best possible results.
But regardless of how sensible it is or how much of an impact it’ll have on performance, the recommendation to use a separate media server has always been a bit of a stumbling block for first-time Django deployment. When I was setting this site up on its new server, I knew I didn’t want to deal with configuring and maintaining a separate media server, so I started shopping around for a good file-hosting service. Given the well-known fact that I am Joyent’s bitch, I naturally gravitated first to their offering: BingoDisk, which seemed like the ideal option.
And I liked it at first; Bingo had a couple of “quick win” features which initially won me over:
media.b-list.org as a CNAME to my BingoDisk account, and everything just worked.
For the most part I’ve been happy with Bingo (even with the recent downtime; I was able to quickly set up an alternate media server to keep things chugging along while that was happening). But there are a few things that I kept tripping over which ultimately led me to migrate away:
public. So anything you want to serve to the general public has to go in there, and even that arrangement sometimes causes HTTP login boxes to appear where they aren’t wanted (largely because browsers want to request /favicon.ico which, if you even create it, would live in the non-public area).
I still like the idea of Bingo, especially as an easy way for non-technical people to share files amongst each other or just between multiple computers (which, to be fair, is really what it’s best suited for), but a little while back I decided it was time to look elsewhere for a media-serving solution.
The solution I settled on, in the end, was S3. The factors which won me over were:
CNAME trick works just as well on S3 as it does on Bingo (and, in fact, I think “point a CNAME at us” is a must-have feature for web-based services these days), though it actually points to an S3 “bucket” (essentially a named container into which you can put things) of the same name as the domain you’re pointing at it.
Expires, Content-Type, Content-Encoding, etc.).
The downside is that all interaction with the service is through an HTTP-based API, so you need to write a little code to automate the process. Fortunately, Amazon provides a Python library which speaks S3’s REST API, and from there it’s a simple matter to write an uploader script; Adrian has posted the one he used to upload chicagocrime.org’s media, if you’re looking for an example.
I ended up writing my own upload script (like weblog applications, I have a feeling that S3 uploaders are a sort of “rite of passage” script that every web developer will want to write in a slightly different way), and ended up with a pretty simple process. My script takes a filename, an optional key prefix (S3 has no concept of directories, but you can specify a “key” for a file which looks like a directory path + filename) and an optional bucket name (defaulting to the one for this site’s media). I’ve got local copies of everything arranged correctly, so I can drop into a particular directory on my laptop, say img, and run a quick bash one-liner (you do know your way around a Unix shell, right? That’s one of the most underappreciated skills a web developer can have):
for filename in `ls *.*`; do s3_upload -f $filename -k img/; done
And voilà: all my images upload to S3.
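As a rough illustration of how a filename and a key prefix combine into an S3 key, here is a minimal sketch in Python 3. It is not the actual uploader script: the function names and the `media.example.org` bucket are hypothetical, and the network call to S3 is left out entirely.

```python
import os

# Hypothetical default; stands in for the bucket that holds the site's media.
DEFAULT_BUCKET = "media.example.org"


def build_key(filename, prefix=""):
    """Join an optional key prefix and the file's base name into an S3 key."""
    # S3 keys are flat strings; the slash only looks like a directory separator.
    return prefix + os.path.basename(filename)


def public_url(key, bucket=DEFAULT_BUCKET):
    """URL a browser would request, assuming the bucket name is also a
    hostname that has been CNAME'd to S3."""
    return "http://%s/%s" % (bucket, key)
```

With these helpers, `build_key("logo.png", "img/")` gives the key `img/logo.png`, and `public_url` turns that key into the address the templates would link to.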
While I was migrating, I decided to take advantage of the opportunity to do a little performance tuning, with some help from YSlow; it had been giving this site a “B” grade for months, and a lot of its complaints (OK, all of its complaints) were related to the way static files were being served.
One thing YSlow had been griping at me about for a while was the lack of an HTTP Expires header on some files. S3 allows you to send an Expires header along with an upload through the REST API, so I simply built that into my uploader: everything I upload to S3 gets an Expires date set two years in the future. So, for example, if you do a curl -I on my stylesheet, you’ll see something like Expires: Thu, 15 Apr 2010 20:00:00 GMT come back in the response headers.
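Generating the header value is straightforward with the standard library. The following is a sketch in Python 3, assuming a "year" of 365 days; the function name is made up for illustration, and how the value is attached to the upload depends on whichever S3 library is in use.

```python
import time
from email.utils import formatdate

# Two years in seconds; leap days are deliberately ignored.
TWO_YEARS = 2 * 365 * 24 * 60 * 60


def expires_header(now=None, lifetime=TWO_YEARS):
    """Return an HTTP-date string `lifetime` seconds after `now`."""
    if now is None:
        now = time.time()
    # usegmt=True produces the "GMT" suffix that HTTP dates require.
    return formatdate(now + lifetime, usegmt=True)
```

The result has the same shape as the header shown above, for example `Sat, 01 Jan 1972 00:00:00 GMT` when `now` is the Unix epoch.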
If this seems like a bad idea — stylesheets have a tendency to change more often than every two years, after all — take a look at the URL of the stylesheet: the filename is b-list-20080206.css. It was preceded by b-list-20080203.css, which was preceded by b-list-20071218.css, and so on; each revision of the stylesheet is actually a unique file with a datestamped filename. This strikes a nice balance between being able to specify a date far in the future for the file to expire from caches (not that any browser actually would cache a stylesheet for two years) and having an easy way to force an update for browsers which exhibit strange caching behavior (I’m looking at you, Safari; if any browser ever does respect that two-year Expires header, it’ll be Safari). The logo up in the header and a few other stylistic images work the same way.
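If you would rather not type the datestamp by hand, a small helper can build the name. This is only a sketch in Python 3 with a hypothetical function name; nothing here suggests the filenames on this site are generated this way.

```python
import datetime


def datestamped_name(stem, ext, day=None):
    """Build a name like 'b-list-20080206.css' from a stem, extension and date."""
    if day is None:
        day = datetime.date.today()
    return "%s-%s.%s" % (stem, day.strftime("%Y%m%d"), ext)
```

Each new revision then gets a new URL, so a far-future Expires header on the old file never gets in the way.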
I don’t honestly remember where I first saw that trick, but I’ve used it off and on now for a while and highly recommend it or a variation; my favorite is probably Ryan Tomayko’s implementation, which adds a querystring to his stylesheet’s URL, containing the Unix timestamp of the last change (which I assume is generated by checking the mtime of the file, since it matches up with the Last-Modified header his server is sending out).
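The querystring variation can be sketched in a few lines of Python 3. This follows the guess above (the timestamp comes from the file's mtime) and is not Ryan's actual code; the function name and arguments are hypothetical.

```python
import os


def versioned_url(url, path):
    """Append the file's last-modified time (whole seconds) as a querystring."""
    # `path` is the local copy of the file that `url` points at.
    mtime = int(os.path.getmtime(path))
    return "%s?%d" % (url, mtime)
```

The URL only changes when the file does, so there is no filename to rename and no template to edit by hand, at the cost of relying on caches to treat different querystrings as different resources.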
Another useful thing YSlow kindly reminded me about is compressed file serving; all of my Django-generated content is already being gzipped server-side, and adding that to my stylesheet (the only static file I serve where compression makes sense) was something I’d initially overlooked. Serving gzipped files with S3 is pretty easy, as it turns out; you just gzip the file contents before uploading, and send along a Content-Encoding header with a value of gzip.
Python offers two ways to gzip something programmatically: a low-level interface in the “zlib” module, and a higher-level file-like interface in the “gzip” module. If you’re unsure which way to go, you probably want to use the gzip module; the interface is a lot easier to work with. And if you’re lazy like me, you can copy/paste this seven-line function from django.utils.text, which is used by Django’s GZipMiddleware:
    def compress_string(s):
        import cStringIO, gzip
        zbuf = cStringIO.StringIO()
        zfile = gzip.GzipFile(mode='wb', compresslevel=6, fileobj=zbuf)
        zfile.write(s)
        zfile.close()
        return zbuf.getvalue()
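That function relies on cStringIO, which exists only in Python 2. A roughly equivalent sketch for Python 3, where the input and output are bytes and `io.BytesIO` takes the place of cStringIO, might look like this (the name `compress_bytes` is just for illustration):

```python
import gzip
import io


def compress_bytes(data):
    """Gzip a bytes object in memory and return the compressed bytes."""
    zbuf = io.BytesIO()
    # Closing the GzipFile (here via the with block) flushes the gzip trailer
    # into zbuf; zbuf itself stays open so its contents can be read afterwards.
    with gzip.GzipFile(mode="wb", compresslevel=6, fileobj=zbuf) as zfile:
        zfile.write(data)
    return zbuf.getvalue()
```

The output starts with the gzip magic bytes and round-trips through `gzip.decompress`, which is a quick way to check a file before uploading it.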
Since the stylesheet is the only thing I want to compress, I ended up special-casing this in my S3 uploader; like Adrian, I’m using mimetypes.guess_type() to work out the correct Content-Type for each uploaded file, so I just have my uploader watch for a value of text/css, which it will take as a cue to compress the file and add a Content-Encoding header.
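The special-casing described above can be sketched as a small header-building function in Python 3. The names are hypothetical and the fallback to `application/octet-stream` is an assumption; the real uploader may differ, and the caller is still responsible for gzipping the body whenever Content-Encoding is set.

```python
import mimetypes

# Content types whose bodies should be gzipped before upload (assumption:
# only stylesheets, as described in the text).
COMPRESSIBLE_TYPES = ("text/css",)


def upload_headers(filename):
    """Work out the HTTP headers to send along with an uploaded file."""
    content_type, _ = mimetypes.guess_type(filename)
    # guess_type returns None for unknown extensions; fall back to a generic type.
    headers = {"Content-Type": content_type or "application/octet-stream"}
    if content_type in COMPRESSIBLE_TYPES:
        headers["Content-Encoding"] = "gzip"
    return headers
```

Keeping the list of compressible types in one tuple makes it easy to add, say, JavaScript later without touching the rest of the uploader.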
And now (with a little tweaking of a preference value so it won’t complain that I’m not using a CDN; see the YSlow documentation for details), thanks to those fairly-minor tweaks, YSlow gives me an “A” grade; the only thing it still complains about is the non-gzipped, non-ETagged, lacking-an-Expires-header JavaScript include from the stats tracker I use, and there’s not much I can do about that (Google Analytics passes YSlow with flying colors, but doesn’t offer nice real-time stats). There’s even a nice psychological illusion that the site feels faster.
The only thing missing from S3 right now is actually something missing from Django: the ability to have file uploads go from Django straight to S3, instead of requiring them to be stored locally (right now, your options for storing files on a different machine basically consist of NFS, NFS, and… well, NFS).
Fortunately, Marty has been working steadily on a plan to allow Django to use pluggable file-storage backends, much as it already allows pluggable authentication and session backends, and the logical first use for that would be an S3 backend. As soon as that lands in trunk, S3 (and services like it) will constitute an easy, affordable and full-featured option for Django media serving.
Comments for this entry are closed. If you'd like to share your thoughts on this entry with me, please contact me directly.
Hmm. Surely your CSS approach prevents you from applying new styling to old content from the perspective of people who are seeing cached content? Not that it’s necessarily a big deal (who cares about the chrome) but it’s slightly irksome.
The actual HTML content doesn’t have an extreme Expires header, so no worries on that front; also, typically it’s stylesheets that are the most problematic things in terms of browsers caching longer than you want them to, not HTML.
Dave, you usually don’t set Expires headers on HTML, just on CSS, Javascript and images, so the HTML is always retrieved so if you update CSS, you also need to update HTML.
Would using something along the lines of s3fuse work for the problem of having files go directly to s3? Once mounted, the s3fuse mount would appear as a local storage to Django and any copies get reflected in S3.
I haven’t thought through all pros, cons, and logistics of this, but at first glance this might work and could possibly tide some people over until Marty’s pluggable backend system hits trunk.
I am not sure if you also considered Transmit as a means of uploading to S3. If you don’t already have it then it might not be worth purchasing Transmit, but if you do have needs for an excellent FTP app with S3 support, you might want to give it a shot.
You do use OS X, right? ;)
@brosner: Can Transmit set the various bits that James is talking about, like a two-year Expires or the gzipped Content-Encoding?
I wrote an S3 upload utility using boto (http://code.google.com/p/boto/), but it lacks these features, or a feature to sync a folder with what is up on S3.
Nice post, James. I didn’t realize the expires and gzipping features existed on S3.
The only concern I have for S3 billing models is someone “digg”ing an image off your website with a direct link. I think it’s more on implementation “Here’s the 50 bucks I’m willing to spend, break when I quit paying you”.
“you do know your way around a Unix shell, right?”
Yeah, and if you’d really known your way around the shell, you’d known that it automatically expands filename patterns for you, and designed your upload script to accept multiple filenames:
(I’d probably add a “.s3” file to the local root directory, and teach the uploader to look upwards until it finds the root, so I can just do
and have the tool figure out the -k setting all by itself…)
There was supposed to be a smiley in the first part of the previous post, btw ;-)
Adrian’s script does multiple filenames, I believe (and is designed to get piped input from find), but I initially needed mine to do one file at a time and then was too lazy to build in the logic to figure out how many filenames were passed in.
Nice post. I didn’t realize the S3 support for GZipped content. Does that work with non-GZip supporting browsers (the few that remain)? Do Amazon’s servers automatically un-GZip it for those?
@Chris: Amazon has already addressed your concern.
So, you’re worried about someone linking to an image you put on S3 rather than to your site. Well, S3 supports ACLs (access control lists). So, you basically tell Amazon, no one can see this image unless my site says it’s ok. How does that work? Well, someone requests the_cool_image.php or whatever. Your app then says, OK, generate the URL for that asset and include a hash code that authenticates the reader and only works for the next 60 seconds or whatever time period.
The S3 libraries handle this, but basically the link becomes:
s3.aws.com/bucket/image.jpg?time=xxx&key=yyy and the key is a hash that is created based on your secret key. If someone changes the time, it doesn’t hash right. If the time has expired, it’s an old link.
I’ve really bastardized the way that it works, but suffice it to say, your app can generate time-sensitive URLs for an image and since the browser should be requesting the image within a second or so of the page loading, if you have it time out a minute or so later, you’re golden. No bandwidth hijacking and still access for all your users.
And tomorrow S3 decides that they don’t have a good model… And dump their thing. You have all your sites to convert then.
Yes, and your non-Amazon hosting company might go out of business tomorrow. Or the colo for your dedicated server. Or a backbone provider might de-peer them. Or… well, a lot of things up to and including asteroids striking the Earth. If you don’t trust a company you deal with to be in business tomorrow, or you don’t think it’s a worthwhile risk to take, then don’t. But have fun building your own parallel infrastructure without trusting any existing providers.
Like Nolan I kind of like the idea of mounting the s3 space using fuse.
I like the elegance of moving the problem of uploading to different hosts to the system level rather than the web framework level. With a mounted s3 disc I can use the space without changing anything in Django except for the media path.
I don’t know if s3fuse lets you specify the upload headers to use (like gzip), but I imagine it’s not an unrealistic possibility.
Brian, I do use OS X, but I don’t use Transmit; I’ve already got a perfectly good (and portable) file-transfer client that I know and love.
another option for uploads would be some type of FUSE interface to s3, then you just mount the directory via fuse, and let your django write to it (blissfully ignorant about what is happening at the lower level).
I like the idea of timestamped file names. Are you hand-editing your templates to update them to the new file name every time you make a css change? If so, this seems like a headache for a complex site that has a lot of different compartmentalized css files (some of which might be shared amongst several base_*.html files, but not in base.html). Am I missing the obvious solution to keep things DRY? (I’m assuming the answer is ‘yes’, but please enlighten me)
I continue to swear by nginx for serving static files - stick it in front of a mod_python / mod_wsgi Apache and it’ll handle static media just fine. Added bonus: it can serve content directly from memcached (falling back to proxy through to your Django app on a cache miss) which enables a bunch of really interesting caching tricks.
James,
Do you install lftp through MacPorts or by hand? If I install it manually it compiles fine and everything but there are some 100% CPU issues among other things. I would like to use it on OS X and preferably compile it without MacPorts (I love vanilla manual compilation where I am in control of everything). Is there some patch for it under Leopard?
Excellent point, eliot! And very original too… ;-)
Šime, I compiled it myself. See here for how I did it — there is indeed a bug involving high CPU usage that you’ll need to work around.
I recommend taking a look at the excellent boto (http://code.google.com/p/boto/) library. It provides a Python API for SQS, EC2 and S3.
If you’re dealing with a program that only works on one file at a time, ls | xargs is a lot less typing than a for loop. :)
Specifically,
ls | xargs -n 1 s3_upload -k img/ -f
For some reason I’ve never felt comfortable with xargs. Dunno why, it just always feels wrong ;)
At least on Linux, only one Python interpreter should reside in memory (theoretically). What I remember from operating systems class is that Linux copies memory pages on write only (other OSes should have copy-on-write too). So, in theory, if one server process forks from another, they should share all pages that are never modified, which, in the case of the Python interpreter, should be close to all of them. Memory consumption as shown in ‘top’, for instance, does not reflect this, since it appears to each process as if it has the whole lot. It can still be a good idea to have a very lightweight web server to serve media, but I think the memory usage argument could be wrong. Could someone with deeper insight into the Linux kernel please step in? I for one have no idea how to prove my point at the moment.
cheers tback
tback, the point is not that the server uses more memory when it handles both Django and media; the point is that a single server process (or thread, in a multithreaded setup) can only handle one request at a time. If it’s a process that could be routing to Django for a dynamic request but isn’t because it’s serving a file off disk instead, then that process/thread represents a scarce resource (you can only run so many processes/threads at once on a given machine) which isn’t being utilized efficiently.
sorry james, i missed that