“Server startup” and its discontents
A fairly common question people ask on the Django mailing lists and in the IRC channel is “how do I have something run when the server starts up?” This is a somewhat tricky question to answer, because in many ways there’s no concept of “server start” in Django, and because the reasons behind the question vary from one use case to the next. So let’s take a look at just what people are wanting to do at “server start”, what that means (and, just as importantly, what it doesn’t mean) and how to handle it in a Django-based application.
The protocol without a state
Every so often someone will complain about frameworks like Django or Rails, alleging that they’re “just” light wrappers around HTTP (for example) and implying that this is a bug which needs to be fixed. Personally, I think this overlooks the power of HTTP, and that staying close to the nature of the underlying protocol is really a strength. But it does require a different way of thinking about things.
HTTP, as everyone hopefully knows, is a stateless protocol; this means that HTTP, by definition, has no “memory” of what’s gone on in previous requests, and won’t “remember” the current request later on. The earliest standard for web programming — CGI — emulated this by being designed in such a way that a new, separate copy of the application was launched for every request and shut down after every response, which meant there was no possible way for the application to persist (unaided) between requests and maintain some notion of what state it was in or what it had previously done.
Of course, there are some ways to maintain application state in HTTP; cookies, for example, let you serialize information out to a client (typically a web browser, though not always) and get that information back, which lets you do some “where was I when we last left off” calculations. But the existence of a limited loophole doesn’t really change the statelessness of HTTP as a whole; the cookie information still has to be retransmitted with every single request, because the server isn’t going to remember it.
Django can be run under CGI, but it’s horribly inefficient to do so; instead, some mechanism (mod_python, mod_wsgi, FastCGI, etc.) is typically used to load the code and keep it resident in memory over the lifetime of a server process, which means that, just as with cookies at the HTTP level, there are some loopholes in the “no persistent state” rule. This still means you get a completely separate copy of the application in every server process, though, so that there’s still no concept of a global “application state”, and Django deliberately initializes certain resources (database connections, for example) at the start of a request and releases them when the response has been sent out, which means that the opportunities for persistent state are somewhat limited.
So the first thing to do is let go of the notion of “run this code when the server starts”, and instead start thinking in terms closer to what’s actually happening. For that you have two choices:
- If you’re not running under CGI (and you really don’t want to run Django under plain CGI) you can work in terms of server processes, and accept that each server process is completely independent of the others and will have its own copy of any code and data you load.
- You can work purely in terms of request/response cycles, which is (mostly) what Django does.
Use cases
Broadly speaking, there are three classes of use cases where people end up asking how to do something at “server start” or to otherwise have something happen once and then persist across request/response cycles:
- Initialization routines which have to happen before an application can begin processing requests properly.
- Large or expensive data sets which need to be loaded once and then made accessible from each request.
- External resources, such as database connections, which need to be set up and then preserved between requests for performance reasons.
And (again, broadly speaking) there are two major ways of dealing with these sorts of cases:
- Execute something when a relevant module is first imported, and have it set up global state for the module which will persist over the life of the server process (if you don’t already understand how Python’s import process works, you really should go learn about it).
- Use an external service which does have some concept of state, or at least has a persistence mechanism, to maintain needed data or resources and provide them to your application on demand.
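The first of those two approaches rests on Python running a module's top-level code only on the first import within a process. Here is a minimal sketch of that behaviour; the module name `startup_demo` and its contents are made up for the illustration, and the throwaway file exists only so there is something to import.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module body: the list stands in for expensive per-process setup.
MODULE_SOURCE = "print('setting up...')\nSTATE = ['loaded once']\n"


def import_twice():
    """Write a throwaway module to disk, import it twice, return both results."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "startup_demo.py").write_text(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            # First import: the module body runs and STATE is created.
            first = importlib.import_module("startup_demo")
            # Second import: served from sys.modules, the body does not run again.
            second = importlib.import_module("startup_demo")
        finally:
            sys.path.remove(tmp)
    return first, second


first, second = import_twice()
```

Both names end up bound to the same module object, so `first.STATE` and `second.STATE` are one list, and it stays around for as long as the process does.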
So let’s look at some examples and see how to put these ideas to use.
Maintaining stable registries
In order to make its ORM work, Django needs to know which applications are installed in your project and what models those applications provide. The list of applications comes from your INSTALLED_APPS setting, but looping through the applications, reading their model modules and instantiating the necessary model classes can be a somewhat expensive operation; there may be a lot of imports and a lot of files to read, and the process of setting up a Django model class can be fairly complex. If you’ve ever noticed a slight lag between the moment you type python manage.py shell and the moment you actually get a Python prompt, you’ll know what I mean.
We’ve already seen a hint of how Django deals with this, when we looked at the model-loading machinery:
“all of the functions listed above are actually implemented as methods on a class called AppCache, which serves as a registry of your project’s available models.”
The module where the model-loading code lives — django.db.models.loading — creates a module-global instance of AppCache when it’s first imported, and sets up get_model() and friends as aliases to methods of that instance. This instance of AppCache will persist for as long as the module stays in memory (so, for the lifetime of a server process), which means that all of the model loading only has to be done once per process. Of course, in a CGI setup where everything is unloaded as soon as the response goes out, this wouldn’t help any, but in a normal use case under a server setup where the code stays resident in memory for an extended time it’ll yield a pretty good performance boost.
Django’s newforms-admin branch, which will eventually replace the admin application, uses a similar trick: in the default setup, you’ll access a single (per-process) instance of a class called AdminSite, and register models with it so the admin will know which models it needs to handle. Two of my own apps also use this technique:
- Comment moderation in comment_utils is handled by instantiating a Moderator and registering models with it, and a module-global instance of Moderator is provided for the default use case, which means that only one object has to hold all the necessary dispatcher connections to handle moderation.
- Text-to-HTML conversion in template_utils is handled by a MarkupFormatter class, which (again) has a module-global instance you can register formatting functions with, so that you only need to register them once and can then use them by importing that instance of MarkupFormatter.
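To give a feel for the "register once, import the instance everywhere" style, here is a rough sketch. It is not the real template_utils API: the class name `Formatter`, its `register()` method and the two sample filters are all invented for the illustration.

```python
class Formatter:
    """Hypothetical registry of text-formatting functions."""

    def __init__(self):
        self._filters = {}

    def register(self, name, func):
        # Remember a formatting function under a short name.
        self._filters[name] = func

    def __call__(self, text, filter_name):
        try:
            func = self._filters[filter_name]
        except KeyError:
            raise ValueError("no formatter registered as %r" % filter_name)
        return func(text)


# The module-global instance: other code imports this object rather than
# creating its own, so registrations made here are visible everywhere
# within the same server process.
formatter = Formatter()
formatter.register("upper", str.upper)
formatter.register("paragraph", lambda text: "<p>%s</p>" % text)
```

Any module that imports `formatter` can then call `formatter(some_text, "paragraph")` without registering anything itself.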
Handling initialization
The technique of a module-global variable works for more than just persistent state; if your application needs to run some code once to set itself up, it can make use of a module-global flag to figure out whether that’s happened yet. For example, you might have some code which looks like this:
    setup_has_run = False

    def get_config_value(value):
        global setup_has_run
        if not setup_has_run:
            run_setup()
            setup_has_run = True
        # ...get and return the configuration value...
Django uses a similar trick to avoid constantly re-importing your settings (and to avoid, in some cases, the need to specify a settings module until the settings are actually needed): django.conf exports the variable settings, which is an instance of a class called LazySettings. When you try to access an attribute of it, it checks whether it has loaded your settings module yet, does so only if needed, and then keeps the loaded settings in an attribute of the LazySettings instance. Since Django has to do a bit of work to actually set up your settings (including filling in default values for a lot of settings if you don’t specify them), this generally leads to improved performance.
Expensive data
Another common case where people want to have something run on “server start” is a situation where there’s some chunk of data that’s expensive to calculate or load (possibly because there’s a lot of data), but which will be needed in the course of serving requests; one example I’ve actually seen was an online spell-checking application which needed to load a large dictionary of words, common typos and pre-calculated Levenshtein distances.
Depending on the nature of the data you need to have access to, using Django’s cache framework to load and store it can be an effective option; you can check the cache on a per-request or per-process basis to see if the data has already been loaded, then read it out if it has, or load it and stick it into the cache (with a suitably large timeout) if it hasn’t.
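The check-then-load pattern looks roughly like the sketch below. To keep it runnable on its own, it uses a small in-memory stand-in (`DictCache`) with `get` and `set` methods modelled on the ones Django's cache object provides; in a real project you would use Django's cache instead. The key name, the loader and the one-day timeout are all assumptions for the illustration.

```python
import time


class DictCache:
    """In-memory stand-in with get/set in the style of Django's cache API."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        value, expires = self._data.get(key, (default, None))
        if expires is not None and expires < time.time():
            del self._data[key]   # entry has timed out
            return default
        return value

    def set(self, key, value, timeout):
        self._data[key] = (value, time.time() + timeout)


cache = DictCache()
load_calls = []  # records each time the expensive load actually runs


def load_dictionary():
    # Placeholder for reading a large word list from disk.
    load_calls.append(1)
    return {"teh": "the", "recieve": "receive"}


def get_dictionary():
    words = cache.get("spelling_dictionary")
    if words is None:
        # Not cached yet (or expired): pay the cost once and store the result.
        words = load_dictionary()
        cache.set("spelling_dictionary", words, 60 * 60 * 24)
    return words
```

A view would call `get_dictionary()` on every request; only the first call after a cache miss pays for the load.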
Of course, this still may not be good enough; some things are simply too expensive to keep generating or reading into/out of a cache, and for those cases you really need to turn to an external service. For example, most search systems which have Python bindings will need to have specialized index structures available, and it’s almost always more effective to either have an external service generating index files to be read off disk when needed, or to have the whole search system run as a service and maintain its own state (Grassy Knoll is an example of this; it runs as a web service, and exposes a REST interface you can send requests to).
Maintaining resources
The big use case here is keeping database connections alive between request/response cycles; on some database systems, obtaining a connection is expensive enough that it can drag down a high-traffic site’s performance. The module-global trick to preserve the resource for the life of a server process doesn’t really work here, because Django explicitly closes its database connection when it’s done processing a request, and no amount of trickery from within Django will let you preserve it.
So the solution for this case is to turn to an external service; for example, at work we use PostgreSQL, and we use pgpool to keep a pool of database connections; Django can connect to pgpool in exactly the same way it would connect directly to the PostgreSQL database, and pgpool maintains a pool of connections so they don’t have to be constantly created and destroyed, resulting in (usually) a net performance gain.
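On the Django side, the only change is where the database settings point. The fragment below is a sketch using the DATABASE_* setting names; the database name, user, host and port are placeholders, and in particular the port is an assumption, so check what your own pgpool instance is configured to listen on.

```python
# settings.py fragment: point Django at the pooler instead of PostgreSQL itself.
DATABASE_ENGINE = "postgresql_psycopg2"
DATABASE_NAME = "mysite"          # hypothetical database name
DATABASE_USER = "mysite_user"     # hypothetical database user
DATABASE_PASSWORD = ""            # placeholder
DATABASE_HOST = "127.0.0.1"       # assumption: pgpool runs on the same machine
DATABASE_PORT = "9999"            # assumption: pgpool's port, not PostgreSQL's usual 5432
```

Nothing else in the application needs to know that a pooler sits in between.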
Pooling connection managers are available for most of the databases Django supports, and in some cases they also provide a first step toward replication or failover and so provide extra utility on top of the pure connection-pooling functionality. And best of all, using an external service for this means that your overall application stack stays loosely coupled; so long as the application layer (in this case, Django) doesn’t need to know the details of things like how many database servers it’s “actually” talking to, swapping them in and out, or increasing or decreasing the number of them, is fairly easy.
There are other cases, though, where the module-global approach works just fine; for example, if you need to maintain a network connection to a particular service, and that connection is somewhat expensive to set up, assigning it to a global variable in the relevant module and letting it persist over a server process’ lifetime is generally a good solution.
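A minimal sketch of that approach follows. `FakeConnection` and the host name are placeholders for whatever client library and service you actually use, and the lock is my own addition on the assumption that the server process may handle requests in several threads.

```python
import threading


class FakeConnection:
    """Placeholder for a real client connection that is slow to open."""

    opened = 0  # counts how many connections have been created

    def __init__(self, host):
        FakeConnection.opened += 1
        self.host = host

    def send(self, message):
        return "sent %r to %s" % (message, self.host)


_connection = None          # module-global: lives as long as the process does
_lock = threading.Lock()    # guards creation if requests run in threads


def get_connection():
    """Return the process-wide connection, opening it on first use."""
    global _connection
    with _lock:
        if _connection is None:
            _connection = FakeConnection("service.example.com")
    return _connection
```

Every caller in the process gets the same object back from `get_connection()`, so the setup cost is paid once per server process rather than once per request.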
What about sessions?
I’ve deliberately avoided suggesting Django’s session framework as an option here, although it does implement the primary HTTP-level method of persistent state (cookies). That’s because sessions really aren’t suitable and are an extremely unreliable method of handling this sort of thing; by their nature, they’re tied to an individual client (meaning you get to do your setup step once for each client), and there’s no guarantee that they’ll work at all (since some clients will have cookies disabled or won’t support cookies at all) or that the necessary data will persist long enough to be useful (since cookies expire and can be deleted client-side, and since some of the storage options on the server side aren’t necessarily long-lived). So while you could use sessions to handle some use cases for things which need to be initialized once and then persisted for future use, it’s really not a good idea.
November 5, 2007
Not that Java is a nice language, because it’s not, but the guy with the comment about Wicket has a point: component-based webapps are extremely powerful. A page does consist of multiple components, and with a component-based approach each component could have its own state that the rest of the page doesn’t need to know about. In this way template-designed webapps are a mess; of course with Django you could make template tags that would look like a component, but it is really not one, and if you look at Rails they have their partials. Don’t get me wrong here, I love Django, it’s really great, but it’s not because of its template system; it’s more because of newforms and the way Django uses models. But a page to me really is just a placeholder for several components, where each component could have its own templates with HTML, wiki markup or whatever; it’s up to each component to decide. It also makes it very easy for a project with several developers, where each developer could own his own components. I would love to see Django go in this direction, but if you look at mazonHQ, Seaside and those other component-based frameworks out there today, I think they scare most people because HTML and interface “developers” like their HTML… Internationalization in Django is pretty great, but what about localization? With a component-based application you make it dead easy, because you can put different components on the same page in different localizations. There’s one more thing… with the release of Google OpenSocial there should not be any doubt that the future is components… they just call them widgets, mini apps, etc.
November 5, 2007
Hi James, good comprehensive post as always. :)
If you’d like Django to grow a concept of “server start”, I’ve begun fleshing out a proto-PEP at http://www.cherrypy.org/wiki/WSPBSpec. It should allow a future version of Django (and other frameworks) to take a component approach to initialization and shutdown that is more understandable, debuggable, and interoperable.
November 5, 2007
Martin,
This same type of system can be developed via base templates, tags, middleware and context processors. There is even a Google Code project for Django which sets up such a framework (sorry, can’t find the link).
For PyCon-Tech we have a number of components which can be used, and are dropped in in a similar way (feedutil, navbar, rstfile).
The combination of middleware and context processors makes it quite easy to do localization and internationalization. Django already does this for internationalization, as you hint at, and adding a custom pair for django.contrib.localflavor is quite easy as well (and should be added).
Personally I see a component framework to be something developed on top of the django framework, not part of the core. I am not sure I would even like to see it as a contrib app, as there are many, many, many different types of component frameworks, each with their own tradeoffs.
Component frameworks have their advantages, but they are not the be-all and end-all of web development. Also, anyone who thinks that making everything a widget will solve all their problems, allow developers to work in isolation, and ‘just work’ has never dealt with an actual real-world large-scale production site.
November 5, 2007
Doug, I think you’re thinking of Marty Alchin’s django-modular project.
Robert, I’m not necessarily opposed to the idea, but I also am a fan of staying close to the “bare HTTP” conceptually ;)
I’ll give it a read.
November 5, 2007
Martin, my biggest concern about that sort of architecture is that it seems to be too optimized for a specific case: the home page of a portal-style application. Once you’re into a detail page where one type (and generally one piece) of content utterly dominates the results, all the componentization becomes largely useless. And, since the vast majority of things on the Web are pages dominated by one type, and generally one piece, of content, well… ;)
November 5, 2007
Your definition of statelessness might benefit from Roy Fielding’s definition: “communication must be stateless in nature, as in the client-stateless-server (CSS) style of Section 3.4.3 (Figure 5-3), such that each request from client to server must contain all of the information necessary to understand the request, and cannot take advantage of any stored context on the server. Session state is therefore kept entirely on the client.”
The defining element for a request to be stateless is that it is self-contained, i.e. contains all relevant data to understand the request. A cookie in itself doesn’t necessarily violate statelessness per se, that depends on its content and whether the server needs to keep state to process the cookie.
November 6, 2007
Doug, of course a component-based approach is not always the best solution; I usually say Rails is good for backend services and Django for frontend applications. You always need to look at what you need; there’s no super solution for all problems, though this is what most Rails people think.
November 7, 2007
I have some initialization that I want to do when my code is run in a server process but not when loaded in the shell or by a utility program that is using the same Django model as the server. I couldn’t figure out a clean way to do this - I can’t put initialization code into a model or settings module because they are loaded in both cases. Ultimately I created a dummy middleware class which does the required initialization when it is loaded. The server startup ensures that the middleware is loaded before the first request is processed but it is not loaded for non-server use.
One example of the kind of initialization I have in the startup module is initialization of the standard logging module.
November 7, 2007
Kent, that is exactly the sort of problem the WSPB is designed to address. Not that you wouldn’t have to write a class to package up that logging initialization; but the proposal would provide a standard way to write it and register it to run at startup (or not) across frameworks.