Fun with queryset-refactor

Published: June 19, 2008. Filed under: Django.

Mixed in with my recent redesign and server move, I’ve taken the opportunity to update the Django trunk snapshot this site runs on; generally I snapshot a week or two after a big change, once I’ve had time to see any major bugs shake out and update the various applications I use. This time around the recent big change was the queryset-refactor branch landing in trunk. Most people have been focusing obsessively on one single feature QSRF reintroduced, but that’s a bit of a shame because it brought tons of useful improvements to Django’s ORM. So let’s take a look at a couple of neat things you can do now.

Reliable ordering by manual queries

My comment_utils application provides a custom managercomment_utils.managers.CommentedObjectManager — which provides a method named most_commented(). As the name implies, you can attach an instance of this manager to one of your models, and then the most_commented() method will return the instances of that model which have the highest number of comments, in order.

Back when I originally wrote the code, doing this query was a somewhat tricky proposition. Although Django’s ORM has long provided a way to insert arbitrary extra bits into a query, there were two problems with this:

  1. Properly interpolating dynamic parameters into any subqueries was tricky at best, and when it worked it mostly only worked by accident.
  2. Ordering by something selected through a manual addition to the query was unreliable; for the most part, you couldn’t do this.

These issues in turn caused problems for the most_commented() query, which needed to do both of these things: it needed to correctly interpolate parameters into a correlated subquery, and order by the result of that subquery.

So the first version of the most_commented() method ended up taking a slightly ugly path to its results: first it executed a raw query directly with the database cursor to get the primary-key values of the most-commented objects, then it used in_bulk() to fetch the actual objects and read them out in the right order. The returned value was a list, since trying to use a QuerySet would have clobbered the ordering.

But on current Django trunk this is much simpler; using the extra() method, you can now reliably drop parameters into the select portion of a subquery and order by a value added to the query by the extra() method.

This means there’s still a bit of “raw” SQL in the updated method — in order to correctly select the comment count for each object — but the actual work is all done inside one call to the Django ORM, and the return value is a properly-ordered QuerySet. Since this is kind of a neat thing, I’ll walk through how it’s done.

Anatomy of the most_commented() method

First up, querying against comments requires access to the comment models:

from django.contrib.comments import models as comment_models

And since a comment is “attached” to an object by means of two fields — one the primary-key value of the object, the other a foreign key to its content type, this will also need the ContentType model:

from django.contrib.contenttypes.models import ContentType

Finally, we’ll use some of Django’s database helper functions to properly quote values in the subquery, so we’ll grab the necessary module. We’ll also need Django’s SortedDict data structure in order to maintain query fragments and query parameters in the right order:

from django.db import connection
from django.utils.datastructures import SortedDict

Now we can start writing the method. It’ll take two arguments: num will default to 5 and is the number of objects to return, and free will default to True and indicate whether to use the FreeComment model (the default) or the Comment model. The first thing we’ll do is grab a reference to the function django.db.connection.ops.quote_name(), which is used to quote values for the database, and we’ll alias it to a slightly shorter name (in some cases doing this can also offer a slight performance boost, but that’s a topic for another day). And since it’s a manager method, it’ll take self like any other Python instance method:

def most_commented(self, num=5, free=True):
    qn = connection.ops.quote_name

Now we need to get the options for the right comment model (since we’ll need to introspect that model to get some information on the database tables to use), and we need to work out the ContentType of the model this manager is attached to:

comment_opts = free and comment_models.FreeComment._meta or comment_models.Comment._meta
ctype = ContentType.objects.get_for_model(self.model)

And now we need to build the subquery we’ll be using. First let’s see an example of what it ought to look like; if you had, say, a model Entry in an application named blog, with a ContentType id of 12, and wanted to know how many FreeComment instances are attached to the Entry with id 54, the query would look like this:

SELECT COUNT(*)
FROM comments_freecomment
WHERE comments_freecomment.content_type_id = 12
AND comments_freecomment.object_id = 54;

Now, we want this to be a correlated subquery of a query against an arbitrary model, which means some of these parameters will need to be dynamic. Since I’m using comment moderation, I also want to only consider comments whose is_public field is set to True, which introduces one more quirk into the subquery. Here’s what it looks like, with database tables and column names filled in properly, and placeholders for two parameters (the ContentType id and the True value for the is_publiccolumn; this value is used as a parameter to ensure the database backend takes care of converting the boolean value into the right format for the DB in use):

subquery = """SELECT COUNT(*)
FROM %(comment_table)s
WHERE %(comment_table)s.%(content_type_id)s = %%s
AND %(comment_table)s.%(object_id)s = %(self_table)s.%(pk)s
AND %(comment_table)s.%(is_public)s = %%s
""" % { 'comment_table': qn(comment_opts.db_table),
        'content_type_id': qn('content_type_id'),
        'object_id': qn('object_id'),
        'self_table': qn(self.model._meta.db_table),
        'pk': qn(self.model._meta.pk.name),
        'is_public': qn('is_public'),
}

If this looks weird, walk through it a step at a time:

And now it’s almost trivially easy to use extra() to run this query. We need to pass three arguments to it:

  1. select will be a dictionary; each key will be the name of an extra section to add to the SELECT clause of the query, and each value in it will be the actual query to run. In order to preserve the ordering of parameters, this will actually be a SortedDict.
  2. select_params is the list of parameters to be used in the select argument.
  3. order_by is a list of values to order by; this is similar to the order_by() method of a QuerySet, but when used as an argument to extra() it can ensure ordering by a value added by a clause in the extra() method.

Here’s what the call looks like:

return self.extra(select=SortedDict({ 'comment_count': subquery }),
                  select_params=(ctype.id, True,),
                  order_by=['-comment_count'])[:num]

The result of this will be a correctly-ordered QuerySet, limited to the correct number of objects. For example, I can run this against my blog’s database (live is a custom manager on my Entry model which subclasses the CommentedObjectManager from comment_utils):

>>> from coltrane.models import Entry
>>> Entry.live.most_commented()
[<Entry: Django and AJAX>, <Entry: Where is Django's blog application?>, <Entry: Kick me>, <Entry: Minimal>, <Entry: Why HTML>]

One nice bonus from using extra() is that the comment_count value it selects is added to each object as an attribute:

>>> for e in Entry.live.most_commented():
...     print "%s: %s" % (e.title, e.comment_count)
...
Django and AJAX: 52
Where is Django's blog application?: 42
Kick me: 41
Minimal: 36
Why HTML: 34

Go have fun

There are all sorts of useful fixes, improvements and new features in Django’s ORM as a result of QSRF; I’ve already used a few in updates to my own applications, and plan to use a few more in the future (one even made its way into my book). So take some time to look over the database API documentation and, if you’ve got a few minutes to spare, dig into the source code; it’s quite clean and readable, and there are some useful bits of infrastructure lurking in the code now which could support some interesting and long-sought-after features (including things like easy aggregates and multiple-database support).

And if you find something you love, share it with the world over at djangosnippets.