Database heresies

August 4, 2008 Frameworks, Pedantics, Programming

While scanning reddit, I saw an article pop up by Jeff Davis lamenting the way most people interact with databases, particularly when it comes to ORMs. Jeff seems to be pointing out (and, to an extent, conflating) two issues:

At the moment, programming languages and SQL don’t really mesh all that well.
Most people, in Jeff’s opinion, take the wrong approach to working with a database from their programming language of choice.

The first point is one I’m happy to concede; SQL is a fundamentally different beast from, well, pretty much any other language you’re likely to encounter, and as such common/popular programming languages run into a big impedance mismatch the moment they start trying to do SQL. To really expose all the functionality of SQL to your favorite programming language, you need either extensive language-level integration of SQL-like features (see, for example, LINQ on .NET) or you need to spend a lot of time and effort developing a toolkit to bridge the gap.

So I’m in agreement on that point. But the other item — the “wrong” approach to database interaction — is one I’ve got some issues with.

Two ways to do it

Jeff lays out two approaches to database interaction. First, the one he doesn’t like:

The wrong approach, which includes all persistence layers and most ORMs, use object-oriented design to manage data (by generating simple store-and-retrieve SQL code) instead of the relational model; and mash together transient application data with permanent facts that should be made available to other application instances. This approach considerably increases complexity unless the data is highly independent (that is, one application instance doesn’t care what another instance is doing).

Meanwhile, there’s the “right way”:

The idea is to allow the developers to use arbitrary relational expressions in their application language, and cleanly separates transient application data structures from the permanent facts that should be made available to other application instances. The complex relational expressions in the application language are then translated to equivalent SQL — not simple store-and-retrieve, but real SQL expressions — and then passed to the DBMS which can then optimize and execute the expressions, and the results can be returned and understood by the application language.

This dichotomy is somewhat similar to what I’ve been saying, for a couple years now, to anyone who’s willing to listen to me ramble about ORM design. When you get right down to it, I think there are two ways to come at the problem of ORM:

I’ve got some objects in my code, and I’d like to persist them and be able to fetch them back according to their attributes, so I’m using a relational database as the store.
I’ve got a relational database and I need to interface with it from my programming language, so I’m using object-oriented techniques to do that.

These roughly correspond to Jeff’s “wrong” and “right” approaches, respectively, but I’m not convinced that there is a thing as “wrong” or “right” approach here.

Why the RDBMS won

I really think that the success of relational databases is and has been largely an accident of history. If object databases or document databases or RDF stores or anything else had been widely/freely available and “good enough” at certain critical points over the last couple decades, I suspect they’d have taken over instead of relational databases, and we’d be seeing articles from purists in those fields decrying the way programmers just use them as dumb data stores.

This causes no end of annoyance for hardcore database guys, but it’s true: it just so happens that a lot of hype and industry support for “relational” databases popped up at just the right time. Of course, the really hardcore relational purists are looking down their noses at all of us and describing in detail all the ways that SQL and its implementations sold out and compromised Codd’s vision of the relational model, but that’s neither here nor there.

This is a drama we’ve seen play out over and over again in the history of computing: solutions that are good enough at the time, and which can be implemented with the minimum amount of prerequisite knowledge or complexity, win. Ask an old-school Smalltalk programmer about C++ and Java sometime. Ask a trained document specialist what she thinks of HTML. Or ask a relational geek about MySQL. All of these are second-rate, partial implementations of vastly more elegant and powerful ideas, but they all took off and left the purists in the dust.

It turned out that relational databases were in the right place at the right time, and the “good enough” implementations and uses took over the world. Or, in other words, we’re looking at yet another instance of Worse is Better. If you’ve never read that essay, please do so now.

So what do you really need?

The key insight of the “worse is better” philosophy is that it’s often preferable to have a simple, partial implementation now than to have a complex full implementation at some unspecified point in the future. The current crop of web frameworks present a useful microcosm we can study to see this in action.

The “wrong” way to do ORM is to start out with a basic Active Record implementation which maps one class to one table, with fields on the class as columns and instances of the class as rows. Then you layer on more features and stronger SQL support as needed. This is what Rails has done. This is what Django has done. This is what the popular PHP frameworks have done.

Meanwhile there are ORM packages available for modern frameworks which are designed the “right” way, but:

They’re huge and complex, which means that a lot of development effort has to go into building simpler, friendlier interfaces to them.
Because of this, they’re getting their asses kicked by ORMs which started out simple and supported just enough features to be useful.

This means that a lot of people end up using an ORM which does things the “wrong” way, but that’s OK because it turns out there’s a huge market out there full of people who care about the “database” part much more than they care about the “relational” part, if they even care about that at all.

And really, I think a lot of the complaints we see from purists about the “right” way to use databases and do ORM comes down to misunderstanding what that market wants and needs. Yes, there are and always will be places where there’s a single massive corporate database with dozens or hundreds of applications accessing it. Yes, those places need to worry about doing things the “right” way if they want to keep all of that stuff working.

But that’s just one part of a very large world. There are plenty of places — in fact, an ever-increasing number of them — where everything runs in reverse: they start out with an application, and then need a place to keep its data. For nearly all of these folks, the considerations which require doing things the “right” way are simply irrelevant and probably always will be. And the purists are, frankly, out of touch with this market; they’re out on the edge while everybody else is in the big belly.

Why there should be two ways to do ORM

Which gets back to the two ways you can approach ORM. The “right” way, according to the database purists, is to make the database king and think exclusively in relational terms. What matters is implementing the underlying relational model as completely and as correctly as possible. The “wrong” way is to favor the application, and treat the DB as a data store to back the objects which exist in your code.

The “wrong” way has been hugely successful, since it’s easy and it’s good enough for most people in most situations. This means that we need to have ORMs which do things “wrong”; without them, we’d live in a much less interesting world. Of course, most people in most situations are not all people in all situations, which means we also need to have ORMs which do things “right”; without them, we’d once again live in a much less interesting world.

And, of course, both camps end up searching for the ideal happy medium: the “wrong”-style ORMs either start building in more advanced features or they die off, and the “right”-style ORMs either introduce simpler interfaces or they never have users outside of the niche that’s willing to put up with the way they work. To take the canonical ORM debate from the Python world, this is why Django’s bundled ORM got a huge under-the-hood refit from Malcolm that introduced all sorts of interesting new capabilities (more on that some other day), and why so many people are working steadily on Elixir so that SQLAlchemy will be usable by mere mortals.

I’ll admit that, once upon a time, I had some misgivings about the “wrong” sort of ORM, so I have some idea of where the purists are coming from. But after a couple of years working, day in and day out, with one of the alleged worst offenders, I’ve realized that there’s room for libraries that don’t expose everything your database can do, and that debates about who’s “right” and who’s “wrong” take a back seat to the much more important question of “who’s actually getting shit done?” As it turns out, people on both sides.