“Server startup” and its discontents

Published November 5, 2007. Filed under: Django.

A fairly common question people ask on the Django mailing lists and in the IRC channel is “how do I have something run when the server starts up?” This is a somewhat tricky question to answer, because in many ways there’s no concept of “server start” in Django, and because the reasons behind the question vary from one use case to the next. So let’s take a look at just what people are wanting to do at “server start”, what that means (and, just as importantly, what it doesn’t mean) and how to handle it in a Django-based application.

The protocol without a state

Every so often someone will complain about frameworks like Django or Rails, alleging that they’re “just” light wrappers around HTTP (for example) and implying that this is a bug which needs to be fixed. Personally, I think this overlooks the power of HTTP, and that staying close to the nature of the underlying protocol is really a strength. But it does require a different way of thinking about things.

HTTP, as everyone hopefully knows, is a stateless protocol; this means that HTTP, by definition, has no “memory” of what’s gone on in previous requests, and won’t “remember” the current request later on. The earliest standard for web programming — CGI — emulated this by being designed in such a way that a new, separate copy of the application was launched for every request and shut down after every response, which meant there was no possible way for the application to persist (unaided) between requests and maintain some notion of what state it was in or what it had previously done.

Of course, there are some ways to maintain application state in HTTP; cookies, for example, let you serialize information out to a client (typically a web browser, though not always) and get that information back, which lets you do some “where was I when we last left off” calculations. But the existence of a limited loophole doesn’t really change the statelessness of HTTP as a whole; the cookie information still has to be retransmitted with every single request, because the server isn’t going to remember it.

Django can be run under CGI, but it’s horribly inefficient to do so; instead, some mechanism (mod_python, mod_wsgi, FastCGI, etc.) is typically used to load the code and keep it resident in memory over the lifetime of a server process, which means that, just as with cookies at the HTTP level, there are some loopholes in the “no persistent state” rule. This still means you get a completely separate copy of the application in every server process, though, so that there’s still no concept of a global “application state”, and Django deliberately initializes certain resources (database connections, for example) at the start of a request and releases them when the response has been sent out, which means that the opportunities for persistent state are somewhat limited.

So the first thing to do is let go of the notion of “run this code when the server starts”, and instead start thinking in terms closer to what’s actually happening. For that you have two choices:

  1. If you’re not running under CGI (and you really don’t want to run Django under plain CGI) you can work in terms of server processes, and accept that each server process is completely independent of the others and will have its own copy of any code and data you load.
  2. You can work purely in terms of request/response cycles, which is (mostly) what Django does.

Use cases

Broadly speaking, there are three classes of use cases where people end up asking how to do something at “server start” or to otherwise have something happen once and then persist across request/response cycles:

  1. Initialization routines which have to happen before an application can begin processing requests properly.
  2. Large or expensive data sets which need to be loaded once and then made accessible from each request.
  3. External resources, such as database connections, which need to be set up and then preserved between requests for performance reasons.

And (again, broadly speaking) there are two major ways of dealing with these sorts of cases:

  1. Execute something when a relevant module is first imported, and have it set up global state for the module which will persist over the life of the server process (if you don’t already understand how Python’s import process works, you really should go learn about it).
  2. Use an external service which does have some concept of state, or at least has a persistence mechanism, to maintain needed data or resources and provide them to your application on demand.
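To make the first approach concrete, here’s a minimal sketch showing that a module’s top-level code runs once per process and that its global state sticks around afterward. The module name `startup_demo_state`, its `REGISTRY` dictionary and the `import_twice()` helper are all made up for the demonstration; the only real machinery involved is Python’s own import system.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module name and contents, used only for this demonstration.
MODULE_NAME = "startup_demo_state"
MODULE_SOURCE = 'REGISTRY = {"loaded": True}\n'


def import_twice():
    """Import a throwaway module twice and report what the second import sees."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, MODULE_NAME + ".py").write_text(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        try:
            first = importlib.import_module(MODULE_NAME)
            # Mutate the module-global state after the first import.
            first.REGISTRY["hits"] = 1
            # The second import is served from sys.modules; the module body
            # is not executed again, so the mutation is still visible.
            second = importlib.import_module(MODULE_NAME)
            return first is second, dict(second.REGISTRY)
        finally:
            # Clean up so the demonstration leaves no trace in this process.
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)
```

Both imports hand back the same module object, so anything stored on it the first time is still there the second time. That only holds within a single process, of course; a second server process would run the module body again and get its own copy.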

So let’s look at some examples and see how to put these ideas to use.

Maintaining stable registries

In order to make its ORM work, Django needs to know which applications are installed in your project and what models those applications provide. The list of applications comes from your INSTALLED_APPS setting, but looping through the applications, reading their model modules and instantiating the necessary model classes can be a somewhat expensive operation; there may be a lot of imports and a lot of files to read, and the process of setting up a Django model class can be fairly complex. If you’ve ever noticed a slight lag between the moment you type python manage.py shell and the moment you actually get a Python prompt, you’ll know what I mean.

We’ve already seen a hint of how Django deals with this, when we looked at the model-loading machinery:

all of the functions listed above are actually implemented as methods on a class called AppCache, which serves as a registry of your project’s available models

The module where the model-loading code lives — django.db.models.loading — creates a module-global instance of AppCache when it’s first imported, and sets up get_model() and friends as aliases to methods of that instance. This instance of AppCache will persist for as long as the module stays in memory (so, for the lifetime of a server process), which means that all of the model loading only has to be done once per process. Of course, in a CGI setup where everything is unloaded as soon as the response goes out, this wouldn’t help any, but in a normal use case under a server setup where the code stays resident in memory for an extended time it’ll yield a pretty good performance boost.

Django’s newforms-admin branch, which will eventually replace the admin application, uses a similar trick: in the default setup, you’ll access a single (per-process) instance of a class called AdminSite, and register models with it so the admin will know which models it needs to handle. Two of my own apps also use this technique.

Handling initialization

The technique of a module-global variable works for more than just persistent state; if your application needs to run some code once to set itself up, it can make use of a module-global flag to figure out whether that’s happened yet. For example, you might have some code which looks like this:

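# Module-level flag: stays False until setup has run once in this process.
setup_has_run = False

def get_config_value(value):
    global setup_has_run
    if not setup_has_run:
        # First call in this server process: do the one-time setup.
        run_setup()
        setup_has_run = True
    # ...get and return the configuration value...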

Django uses a similar trick to avoid constantly re-importing your settings (and to avoid, in some cases, the need to specify a settings module until the settings are actually needed): django.conf exports the variable settings, which is an instance of a class called LazySettings; when you try to access an attribute of it, it checks to see if it’s loaded your settings module yet, does so only if needed, and then keeps the loaded settings in an attribute of the LazySettings instance. Since Django has to do a bit of work to actually set up your settings (including filling in default values for a lot of settings if you don’t specify them), this generally leads to improved performance.
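A stripped-down version of the lazy-loading idea might look like the following. To be clear, this is a sketch of the technique and not how LazySettings is actually written; `LazyConfig`, `load_settings()`, `DEFAULTS` and `load_calls` are names invented for the example.

```python
class LazyConfig:
    """Defers loading until the first attribute access (hypothetical class)."""

    def __init__(self, loader):
        self._loader = loader
        self._wrapped = None

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails,
        # so _loader and _wrapped themselves never end up here.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._wrapped is None:
            # First access in this process: do the expensive load and keep it.
            self._wrapped = self._loader()
        try:
            return self._wrapped[name]
        except KeyError:
            raise AttributeError(name) from None


DEFAULTS = {"DEBUG": False, "TIME_ZONE": "UTC"}
load_calls = []  # records each load, purely for demonstration


def load_settings():
    load_calls.append(1)
    user_settings = {"DEBUG": True}  # stand-in for reading a settings module
    # Fill in defaults for anything the user did not specify.
    return {**DEFAULTS, **user_settings}


settings = LazyConfig(load_settings)
```

Creating `settings` costs nothing; the loader runs on the first attribute access (say, `settings.DEBUG`) and never again in that process.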

Expensive data

Another common case where people want to have something run on “server start” is a situation where there’s some chunk of data that’s expensive to calculate or load (possibly because there’s a lot of data), but which will be needed in the course of serving requests; one example I’ve actually seen was an online spell-checking application which needed to load a large dictionary of words, common typos and pre-calculated Levenshtein distances.

Depending on the nature of the data you need to have access to, using Django’s cache framework to load and store it can be an effective option; you can check the cache on a per-request or per-process basis to see if the data has already been loaded, then read it out if it has or load it and stick it into the cache (with a suitably large timeout) if it hasn’t.

Of course, this still may not be good enough; some things are simply too expensive to keep generating or reading into/out of a cache, and for those cases you really need to turn to an external service. For example, most search systems which have Python bindings will need to have specialized index structures available, and it’s almost always more effective to either have an external service generating index files to be read off disk when needed, or to have the whole search system run as a service and maintain its own state (Grassy Knoll is an example of this; it runs as a web service, and exposes a REST interface you can send requests to).

Maintaining resources

The big use case here is keeping database connections alive between request/response cycles; on some database systems, obtaining a connection is expensive enough that it can drag down a high-traffic site’s performance. The module-global trick to preserve the resource for the life of a server process doesn’t really work here, because Django explicitly closes its database connection when it’s done processing a request, and no amount of trickery from within Django will let you preserve it.

So the solution for this case is to turn to an external service; for example, at work we use PostgreSQL, and we use pgpool to keep a pool of database connections; Django can connect to pgpool in exactly the same way it would connect directly to the PostgreSQL database, and pgpool maintains a pool of connections so they don’t have to be constantly created and destroyed, resulting in (usually) a net performance gain.
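In settings terms, the change amounts to pointing Django at the pooler’s host and port instead of the database’s. The fragment below uses the single-database settings style Django had at the time of writing. The database name, user, host and port are placeholders I’ve assumed for illustration; in particular, check your own pgpool configuration for the port it actually listens on.

```python
# Settings fragment (illustrative values only).
DATABASE_ENGINE = "postgresql_psycopg2"
DATABASE_NAME = "mysite"         # hypothetical database name
DATABASE_USER = "mysite_user"    # hypothetical database user
DATABASE_PASSWORD = ""           # fill in for your own setup
DATABASE_HOST = "127.0.0.1"      # assumed: the machine where pgpool is listening
DATABASE_PORT = "9999"           # assumed pgpool port; PostgreSQL itself usually listens on 5432
```

Nothing else in the application has to change, since from Django’s side it looks like an ordinary PostgreSQL server.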

Pooling connection managers are available for most of the databases Django supports, and in some cases they also provide a first step toward replication or failover and so provide extra utility on top of the pure connection-pooling functionality. And best of all, using an external service for this means that your overall application stack stays loosely coupled; so long as the application layer (in this case, Django) doesn’t need to know the details of things like how many database servers it’s “actually” talking to, swapping them in and out, or increasing or decreasing the number of them, is fairly easy.

There are other cases, though, where the module-global approach works just fine; for example, if you need to maintain a network connection to a particular service, and that connection is somewhat expensive to set up, assigning it to a global variable in the relevant module and letting it persist over a server process’ lifetime is generally a good solution.
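A minimal sketch of that approach follows. `ExpensiveConnection` is a made-up stand-in for whatever client object your service needs, and the lock is my own addition on the assumption that more than one thread in the process might ask for the connection at the same moment.

```python
import threading


class ExpensiveConnection:
    """Stand-in for a connection that is slow to establish (hypothetical)."""

    instances_created = 0  # counts constructions, purely for demonstration

    def __init__(self):
        ExpensiveConnection.instances_created += 1

    def send(self, message):
        return "sent: " + message


# Module-global slot for the connection; it lives as long as the process does.
_connection = None
_connection_lock = threading.Lock()


def get_connection():
    """Return the per-process connection, creating it on first use."""
    global _connection
    with _connection_lock:
        if _connection is None:
            _connection = ExpensiveConnection()
        return _connection
```

Every caller in the same process gets the same object back from `get_connection()`, so the setup cost is paid once per server process.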

What about sessions?

I’ve deliberately avoided suggesting Django’s session framework as an option here, although it does implement the primary HTTP-level method of persistent state (cookies). That’s because sessions really aren’t suitable and are an extremely unreliable method of handling this sort of thing; by their nature, they’re tied to an individual client (meaning you get to do your setup step once for each client), and there’s no guarantee that they’ll work at all (since some clients will have cookies disabled or won’t support cookies at all) or that the necessary data will persist long enough to be useful (since cookies expire and can be deleted client-side, and since some of the storage options on the server side aren’t necessarily long-lived). So while you could use sessions to handle some use cases for things which need to be initialized once and then persisted for future use, it’s really not a good idea.