Mixed in with my recent redesign and server move, I’ve taken the opportunity to update the Django trunk snapshot this site runs on; generally I snapshot a week or two after a big change, once I’ve had time to see any major bugs shake out and update the various applications I use. This time around the recent big change was the queryset-refactor branch landing in trunk. Most people have been focusing obsessively on one single feature QSRF reintroduced, but that’s a bit of a shame because it brought tons of useful improvements to Django’s ORM. So let’s take a look at a couple of neat things you can do now.
My comment_utils application provides a custom manager — comment_utils.managers.CommentedObjectManager — which provides a method named most_commented(). As the name implies, you can attach an instance of this manager to one of your models, and then the most_commented() method will return the instances of that model which have the highest number of comments, in order.
Back when I originally wrote the code, doing this query was a somewhat tricky proposition. Although Django’s ORM has long provided a way to insert arbitrary extra bits into a query, there were two problems with this:
These issues in turn caused problems for the most_commented() query, which needed to do both of these things: it needed to correctly interpolate parameters into a correlated subquery, and order by the result of that subquery.
So the first version of the most_commented() method ended up taking a slightly ugly path to its results: first it executed a raw query directly with the database cursor to get the primary-key values of the most-commented objects, then it used in_bulk() to fetch the actual objects and read them out in the right order. The returned value was a list, since trying to use a QuerySet would have clobbered the ordering.
But on current Django trunk this is much simpler; using the extra() method, you can now reliably drop parameters into the select portion of a subquery and order by a value added to the query by the extra() method.
This means there’s still a bit of “raw” SQL in the updated method — in order to correctly select the comment count for each object — but the actual work is all done inside one call to the Django ORM, and the return value is a properly-ordered QuerySet. Since this is kind of a neat thing, I’ll walk through how it’s done.
most_commented() methodFirst up, querying against comments requires access to the comment models:
from django.contrib.comments import models as comment_models
And since a comment is “attached” to an object by means of two fields — one the primary-key value of the object, the other a foreign key to its content type, this will also need the ContentType model:
from django.contrib.contenttypes.models import ContentType
Finally, we’ll use some of Django’s database helper functions to properly quote values in the subquery, so we’ll grab the necessary module. We’ll also need Django’s SortedDict data structure in order to maintain query fragments and query parameters in the right order:
from django.db import connection from django.utils.datastructures import SortedDict
Now we can start writing the method. It’ll take two arguments: num will default to 5 and is the number of objects to return, and free will default to True and indicate whether to use the FreeComment model (the default) or the Comment model. The first thing we’ll do is grab a reference to the function django.db.connection.ops.quote_name(), which is used to quote values for the database, and we’ll alias it to a slightly shorter name (in some cases doing this can also offer a slight performance boost, but that’s a topic for another day). And since it’s a manager method, it’ll take self like any other Python instance method:
def most_commented(self, num=5, free=True): qn = connection.ops.quote_name
Now we need to get the options for the right comment model (since we’ll need to introspect that model to get some information on the database tables to use), and we need to work out the ContentType of the model this manager is attached to:
comment_opts = free and comment_models.FreeComment._meta or comment_models.Comment._meta ctype = ContentType.objects.get_for_model(self.model)
And now we need to build the subquery we’ll be using. First let’s see an example of what it ought to look like; if you had, say, a model Entry in an application named blog, with a ContentType id of 12, and wanted to know how many FreeComment instances are attached to the Entry with id 54, the query would look like this:
SELECT COUNT(*) FROM comments_freecomment WHERE comments_freecomment.content_type_id = 12 AND comments_freecomment.object_id = 54;
Now, we want this to be a correlated subquery of a query against an arbitrary model, which means some of these parameters will need to be dynamic. Since I’m using comment moderation, I also want to only consider comments whose is_public field is set to True, which introduces one more quirk into the subquery. Here’s what it looks like, with database tables and column names filled in properly, and placeholders for two parameters (the ContentType id and the True value for the is_publiccolumn; this value is used as a parameter to ensure the database backend takes care of converting the boolean value into the right format for the DB in use):
subquery = """SELECT COUNT(*) FROM %(comment_table)s WHERE %(comment_table)s.%(content_type_id)s = %%s AND %(comment_table)s.%(object_id)s = %(self_table)s.%(pk)s AND %(comment_table)s.%(is_public)s = %%s """ % { 'comment_table': qn(comment_opts.db_table), 'content_type_id': qn('content_type_id'), 'object_id': qn('object_id'), 'self_table': qn(self.model._meta.db_table), 'pk': qn(self.model._meta.pk.name), 'is_public': qn('is_public'), }
If this looks weird, walk through it a step at a time:
%%s instead of %s in order to keep the interpolation of database table and column names from trying to fill them in. They’ll end up as %s in the final subquery, and thus can be replaced with the final parameters when the query runs.
And now it’s almost trivially easy to use extra() to run this query. We need to pass three arguments to it:
select will be a dictionary; each key will be the name of an extra section to add to the SELECT clause of the query, and each value in it will be the actual query to run. In order to preserve the ordering of parameters, this will actually be a SortedDict.
select_params is the list of parameters to be used in the select argument.
order_by is a list of values to order by; this is similar to the order_by() method of a QuerySet, but when used as an argument to extra() it can ensure ordering by a value added by a clause in the extra() method.
Here’s what the call looks like:
return self.extra(select=SortedDict({ 'comment_count': subquery }), select_params=(ctype.id, True,), order_by=['-comment_count'])[:num]
The result of this will be a correctly-ordered QuerySet, limited to the correct number of objects. For example, I can run this against my blog’s database (live is a custom manager on my Entry model which subclasses the CommentedObjectManager from comment_utils):
>>> from coltrane.models import Entry >>> Entry.live.most_commented() [<Entry: Django and AJAX>, <Entry: Where is Django's blog application?>, <Entry: Kick me>, <Entry: Minimal>, <Entry: Why HTML>]
One nice bonus from using extra() is that the comment_count value it selects is added to each object as an attribute:
>>> for e in Entry.live.most_commented(): ... print "%s: %s" % (e.title, e.comment_count) ... Django and AJAX: 52 Where is Django's blog application?: 42 Kick me: 41 Minimal: 36 Why HTML: 34
There are all sorts of useful fixes, improvements and new features in Django’s ORM as a result of QSRF; I’ve already used a few in updates to my own applications, and plan to use a few more in the future (one even made its way into my book). So take some time to look over the database API documentation and, if you’ve got a few minutes to spare, dig into the source code; it’s quite clean and readable, and there are some useful bits of infrastructure lurking in the code now which could support some interesting and long-sought-after features (including things like easy aggregates and multiple-database support).
And if you find something you love, share it with the world over at djangosnippets.
Comments for this entry are closed. If you'd like to share your thoughts on this entry with me, please contact me directly.
NOTE: you didn’t need to wait for QSR. the only new thing was adding select_params (instead of using params which would often cause wierd errors). And there are still bugs with param_values, so I stay away from it and use % and quotename() to build the end query from the subquery.
extra() has been around for quite some time.
Doug, as I said,
extra()has been around for a long time. But being able to reliably do parameter interpolation and custom ordering fromextra()is new. Hence you couldn’t really make use of it until QSRF.Sorry, yes, the order_by on select values is what is new. previously you had to include the ORDER BY by hand in your generated subquery text.
I still do not recommend using select_params as it is still has some serious bugs. Pinax was getting very strange sql syntax errors against sqlite3 (syntax error near ‘?’…) and the generated ‘str(qs.query)’ looked perfect (and when sent to the cursor worked great). Removing select_params and doing the extra % by hand fixed the problem ‘magicly’. I was also seeing some odd behavior in PyCon-Tech and removing select_params fixed the problem in each and every case.
the additional % is just not that big a deal.
Sorry, looking at your subquery declaration I cannot help but think that something must be horribly wrong here.
An ORM is supposed to make things easier for us, not force us to come up with obfuscated noise like that.
It may feel totally natural to you as a django-developer. To me as a would-be-user there’s way too much futzing with the internals. I’d much rather just drop to plain SQL in that case.
Further I wonder why your schema is not set up to make the counting as simple as:
select count(*) as cnt, articleid from comments group by articleid order by cnt desc
As said, something smells very fishy about this example. Neither should the task at hand (“most commented”) require a subquery, nor should setting one up be such a chore in django.
James White: This will be a pretty trivial query once aggregates are added to the ORM, theoretically in version 1.1
Until then, it requires custom SQL, custom bits of SQL (used above), or the use of another ORM (like SQLAlchemy).
Because Django’s comment system is designed to be generic, and allow one table to represent comments “attached” to any other record. This means there is no direct database-level relation between, say, a row in the comments table and a row in the blog entries table. As a result, the query becomes more complex.
The manual bit of the query is also a bit more complex because I’m deliberately avoiding hard-coding any table names into it and instead using introspection of model classes to find out what they should be, which results in a lot of interpolation of values into the string that becomes the subquery.
Hm, I wonder if one could write the subquery in ORM too, and then pass subquery.query.as_sql() along to extra().