Database heresies

An entry published by James Bennett on August 4, 2008, Part of the categories Frameworks, Pedantics and Programming. 21 comments posted.

While scanning reddit, I saw an article pop up by Jeff Davis lamenting the way most people interact with databases, particularly when it comes to ORMs. Jeff seems to be pointing out (and, to an extent, conflating) two issues:

  1. At the moment, programming languages and SQL don’t really mesh all that well.
  2. Most people, in Jeff’s opinion, take the wrong approach to working with a database from their programming language of choice.

The first point is one I’m happy to concede; SQL is a fundamentally different beast from, well, pretty much any other language you’re likely to encounter, and as such common/popular programming languages run into a big impedance mismatch the moment they start trying to do SQL. To really expose all the functionality of SQL to your favorite programming language, you need either extensive language-level integration of SQL-like features (see, for example, LINQ on .NET) or you need to spend a lot of time and effort developing a toolkit to bridge the gap.

So I’m in agreement on that point. But the other item — the “wrong” approach to database interaction — is one I’ve got some issues with.

Two ways to do it

Jeff lays out two approaches to database interaction. First, the one he doesn’t like:

The wrong approach, which includes all persistence layers and most ORMs, use object-oriented design to manage data (by generating simple store-and-retrieve SQL code) instead of the relational model; and mash together transient application data with permanent facts that should be made available to other application instances. This approach considerably increases complexity unless the data is highly independent (that is, one application instance doesn’t care what another instance is doing).

Meanwhile, there’s the “right way”:

The idea is to allow the developers to use arbitrary relational expressions in their application language, and cleanly separates transient application data structures from the permanent facts that should be made available to other application instances. The complex relational expressions in the application language are then translated to equivalent SQL — not simple store-and-retrieve, but real SQL expressions — and then passed to the DBMS which can then optimize and execute the expressions, and the results can be returned and understood by the application language.

This dichotomy is somewhat similar to what I’ve been saying, for a couple years now, to anyone who’s willing to listen to me ramble about ORM design. When you get right down to it, I think there are two ways to come at the problem of ORM:

  1. I’ve got some objects in my code, and I’d like to persist them and be able to fetch them back according to their attributes, so I’m using a relational database as the store.
  2. I’ve got a relational database and I need to interface with it from my programming language, so I’m using object-oriented techniques to do that.

These roughly correspond to Jeff’s “wrong” and “right” approaches, respectively, but I’m not convinced that there is a thing as “wrong” or “right” approach here.

Why the RDBMS won

I really think that the success of relational databases is and has been largely an accident of history. If object databases or document databases or RDF stores or anything else had been widely/freely available and “good enough” at certain critical points over the last couple decades, I suspect they’d have taken over instead of relational databases, and we’d be seeing articles from purists in those fields decrying the way programmers just use them as dumb data stores.

This causes no end of annoyance for hardcore database guys, but it’s true: it just so happens that a lot of hype and industry support for “relational” databases popped up at just the right time. Of course, the really hardcore relational purists are looking down their noses at all of us and describing in detail all the ways that SQL and its implementations sold out and compromised Codd’s vision of the relational model, but that’s neither here nor there.

This is a drama we’ve seen play out over and over again in the history of computing: solutions that are good enough at the time, and which can be implemented with the minimum amount of prerequisite knowledge or complexity, win. Ask an old-school Smalltalk programmer about C++ and Java sometime. Ask a trained document specialist what she thinks of HTML. Or ask a relational geek about MySQL. All of these are second-rate, partial implementations of vastly more elegant and powerful ideas, but they all took off and left the purists in the dust.

It turned out that relational databases were in the right place at the right time, and the “good enough” implementations and uses took over the world. Or, in other words, we’re looking at yet another instance of Worse is Better. If you’ve never read that essay, please do so now.

So what do you really need?

The key insight of the “worse is better” philosophy is that it’s often preferable to have a simple, partial implementation now than to have a complex full implementation at some unspecified point in the future. The current crop of web frameworks present a useful microcosm we can study to see this in action.

The “wrong” way to do ORM is to start out with a basic Active Record implementation which maps one class to one table, with fields on the class as columns and instances of the class as rows. Then you layer on more features and stronger SQL support as needed. This is what Rails has done. This is what Django has done. This is what the popular PHP frameworks have done.

Meanwhile there are ORM packages available for modern frameworks which are designed the “right” way, but:

  1. They’re huge and complex, which means that a lot of development effort has to go into building simpler, friendlier interfaces to them.
  2. Because of this, they’re getting their asses kicked by ORMs which started out simple and supported just enough features to be useful.

This means that a lot of people end up using an ORM which does things the “wrong” way, but that’s OK because it turns out there’s a huge market out there full of people who care about the “database” part much more than they care about the “relational” part, if they even care about that at all.

And really, I think a lot of the complaints we see from purists about the “right” way to use databases and do ORM comes down to misunderstanding what that market wants and needs. Yes, there are and always will be places where there’s a single massive corporate database with dozens or hundreds of applications accessing it. Yes, those places need to worry about doing things the “right” way if they want to keep all of that stuff working.

But that’s just one part of a very large world. There are plenty of places — in fact, an ever-increasing number of them — where everything runs in reverse: they start out with an application, and then need a place to keep its data. For nearly all of these folks, the considerations which require doing things the “right” way are simply irrelevant and probably always will be. And the purists are, frankly, out of touch with this market; they’re out on the edge while everybody else is in the big belly.

Why there should be two ways to do ORM

Which gets back to the two ways you can approach ORM. The “right” way, according to the database purists, is to make the database king and think exclusively in relational terms. What matters is implementing the underlying relational model as completely and as correctly as possible. The “wrong” way is to favor the application, and treat the DB as a data store to back the objects which exist in your code.

The “wrong” way has been hugely successful, since it’s easy and it’s good enough for most people in most situations. This means that we need to have ORMs which do things “wrong”; without them, we’d live in a much less interesting world. Of course, most people in most situations are not all people in all situations, which means we also need to have ORMs which do things “right”; without them, we’d once again live in a much less interesting world.

And, of course, both camps end up searching for the ideal happy medium: the “wrong”-style ORMs either start building in more advanced features or they die off, and the “right”-style ORMs either introduce simpler interfaces or they never have users outside of the niche that’s willing to put up with the way they work. To take the canonical ORM debate from the Python world, this is why Django’s bundled ORM got a huge under-the-hood refit from Malcolm that introduced all sorts of interesting new capabilities (more on that some other day), and why so many people are working steadily on Elixir so that SQLAlchemy will be usable by mere mortals.

I’ll admit that, once upon a time, I had some misgivings about the “wrong” sort of ORM, so I have some idea of where the purists are coming from. But after a couple of years working, day in and day out, with one of the alleged worst offenders, I’ve realized that there’s room for libraries that don’t expose everything your database can do, and that debates about who’s “right” and who’s “wrong” take a back seat to the much more important question of “who’s actually getting shit done?” As it turns out, people on both sides.

On August 4, 2008, Chris Hoeppner said:

A great post, like always.

There’s one simple way to go about this. Do you have any data? If the answer is “Aye, loads of”, then you better take a close look at the data store (aka, RDBM), and then build a useable interface on it. If the answer goes like “Data?”, then you’ll be confident the interface is it all, and it won’t make a difference wether you take Oracle or a plain text file on your servers hard disk.

When developing (note I don’t say “application” or “website”), you’ve only got so much of everything, be it time, people, budget, etc. You have to choose where to invest it. If the data itself isn’t as important as the way the user works on it, taking the Full Fledged Flagship approach is going to give the same results as “I don’t know what SQL is”. And what is it all about? Aye, results.

As you say, the point is “who’s actually getting shit done?”.

And anyway, if you do care soooo much about interfacing with your database engine in the most correct way, you’re most likely writing your own sql into the pointer.

On August 4, 2008, Jon Winstanley said:

Thanks, an excellent article.

I have recently thought about moving to using an ORM, as I build a lot of small web apps. But I can’t see much value in it currently.

Simple SQL does the trick and if someone wants to buy my app for $10m, then hire a guy who can switch it to ORM then that’s there problem.

On August 4, 2008, mike bayer said:

Elixir is great but I’d also like to note that SQLAlchemy includes its own declarative layer now as well, documented here: http://www.sqlalchemy.org/docs/05/plugins.html#plugins_declarative

The extension differs from Elixir in that it does not redefine SQLA’s standard configuration style and also does not generate any table or column names, all that still must be explicit.

In particular, Grok has standardized their relational database plugin around SQLA’s declarative extension.

On August 4, 2008, Steve of AR said:

I’m pretty new to SQLAlchemy but from what I’ve seen so far, I like it. And the reason is, unlike most ORM’s it pays proper respect to SQL itself, which really is a pretty powerful language in its own right. Having a toolkit to help build structured SQL without having to bang together strings is a big win. Having traditional ORM on top of that, when it’s wanted, is a win/win…

On August 4, 2008, Marty Alchin said:

Although practicality beats purity.” — Tim Peters

Because somebody’s gonna quote it, why not me?

On August 4, 2008, Jeff Davis said:

Hi James,

and, to an extent, conflating”

I don’t think I conflated those two issues: I defined a problem (we are essentially using “eval” for DBMSs), and described two approaches to solve it, one of which I strongly favor over the other.

The “meshing” is really a separate issue — my point would stand even if the application language were SQL as well. Perhaps I distracted from my point slightly by digressing about the type mismatches, but even after re-reading, it seemed fairly clear to me.

Why the RDBMS won”

I think this should be changed to “why SQL won”.

that’s OK because it turns out there’s a huge market out there full of people who care about the ‘database’ part much more than they care about the ‘relational’ part, if they even care about that at all.”

I agree completely. I pointed out that combining two factors led to the current state of affairs, one of which was the rich SQL relational semantics. If you don’t need those rich relational semantics, there’s no problem, just use a dumb data store. A first draft of my post emphasized this point with a disclaimer to deflect this type of reaction, but all the disclaimers I would need to use to navigate this minefield would make it impossible to read my post ;)

I think you are putting words in my mouth about “right” and “wrong”. When I used those words, they were directed specifically at approaches that are a means to a end, and that end was clearly specified as combining rich relational semantics and centralization. Nothing in your post refutes the argument that ORMs sacrifice substantial amounts of relational expressive power. Therefore, the “wrong” label is apt for that approach to that problem.

If you redefine the problem to be “I need my SQL DBMS to act like a dumb data store”, an ORM may, indeed be the “right” appraoch.

I have long wanted there to be better dumb data stores available, so that people wouldn’t try make a SQL DBMS into a dumb data store just to get some backup capability (or worse, to fool themselves or others into thinking that they have a relational data design).

On August 4, 2008, tim said:

Programmers don’t even like SQL (or even RDBMSs) that much. I’ve used commercial RDBMSs, and commercial OODBs, and OODBs win, hands down, every time. They’re just so much nicer to use!

Of course, everybody starts small, and the established good free databases are virtually all SQL (db4o is relatively new). You can’t blame somebody for picking SQL today.

But bottom line: programmers only like RDBMSs for their ACID. SQL purists might be upset we’re not using SQL “right”, but we don’t even want SQL.

On August 4, 2008, D Nielsen said:

There aren’t just two approaches to data(base) integration with an application, so the choice is more complicated than just right vs. wrong or ORM vs. RDBMS. For a concrete example, we’re building a spatially-enabled web interface that includes interactions with task tracking and budgeting databases as well as a moderately complex environmental database. Django’s ORM is being used for all the data related to the user interface and for the simpler databases. Dynamically created SQL is being used for access to the environmental data (where the database includes custom types, aggregate functions, operators, and other complexities). Some of the spatial data are stored in the environmental database and some are stored as disk files. That’s a practical approach for us. It’s neither right nor wrong; it’s both ORM and RDBMS. And it’s certainly not the only practical approach—there are a million of ‘em.

On August 4, 2008, Frank Schmitt said:

The advantage of the “dumb database” ORM approach is that your business logic (i.e. the model layer in an MVC setup) can be in the same language as your controller and (potentially) view layers.

It’s certainly possible to put that business logic in the database (perhaps even as stored procedures if and when you run up against the limitations of straight SQL), but you’ll probably be writing it in some half-baked proprietary language. And because the stored procedures need to be reloaded every time the business logic or database schema changes, your deployment becomes more cumbersome. But the real kicker is when you inevitably end up having to implement the same functionality in your both your application language and said proprietary language, where you end up with all the usual pitfalls of copy-and-paste programming with the added bonus of incompatible language features.

One of the nicest features of Ruby on Rails is that pretty much everything is in one language, even to some extent that much of your JavaScript can be written in RJS. It’s also a perfectly cromulent scripting language, suitable for cron jobs, unlike the kludgy workarounds needed, say in the pre-cli PHP days.

On August 4, 2008, K Thompson said:

Great article, but you make yourself sound silly when you say “Impedance mismatch”, a term that comes from electrical engineering, describing the situation where consecutive stages of a power amplifier or distribution system don’t have matched impedances (resistances) an thus max power transfer is reduced, and has very little to do with software engineering. Both uses speak to mismatched interfaces, but in completely different realms.

A more apt and perfectly useful phrase might be disparity, incompatibility, mismatched interfaces, or modality, .. or a million others.

For those curious about the electrical engineering term, this is why you must buy speakers that have the same impedance as your amplifier: Think of it like trying to throw various types of baseballs - a wiffle ball is so light (low impedance) that you can barely impart any energy upon it when you try to throw it; a canon ball is much heavier (high impedance), but so much heavier that you have a hard time moving it, never mind throwing it. A regulation baseball is just the right weight for you to transfer as much energy as possible to it.

On August 4, 2008, Mike said:

Great Article, I really enjoyed reading.

@K Thompson: http://en.wikipedia.org/wiki/Object-Relational_impedance_mismatch

I guess the rest is self-explanatory. James did not invent this phrase, he just used it.

On August 4, 2008, Robert said:

Your article was featured on HN: http://news.ycombinator.com/item?id=266396

On August 5, 2008, Greg Jorgensen said:

You’ve glossed over some facts. The success of relational databases can’t be waved away as an accident. There were competing data storage and retrieval tools around before RDBMSs. Relational databases stand almost alone in the software world because the relational model is based on real math: sets, predicate logic, relational algebra. There is no similarly rigorous and useful theoretical model for object oriented databases. RDBMSs rub programmers the wrong way because they aren’t used to right and wrong being anything more than differences of opinion. Relational theory can be used to prove that a data model is flawed, and it’s not just the opinion of the straw-man army of “purists” you poke fun at.

I’ve been programming longer than RDBMSs have been around. There was no marketing hype, certainly nothing even close to the Java or Ruby on Rails hype machine, or the constant “it’s almost ready” hype around OODBMSs. Programmers who had experience with large databases built on non-relational technologies figured out that the relational model solved a lot of pretty big problems. I don’t mean to condescend but the majority of programmers today are too new to know that the alternatives to the relational model are old ideas that died out. ACID properties and Codd’s laws are not mumbo-jumbo that out-of-touch “relational geeks” and “database purists” use to make fun of newbie programmers. They are well-defined and usable rules for maintaining data integrity, identifying problems, and fixing them. They don’t apply to every problem, but it’s lazy and ignorant programmers who are shoehorning every data storage problem into an RDBMS, not the relational elite forcing one solution down everyone’s throat. Excuse me if I chuckle watching someone try to wring ACID out of an ActiveRecord ORM, or do transactions and joins in application code.

The mismatch is not between your imaginary relational database priesthood and ordinary programmers just trying to do real work. The line is between programmers who have actually worked with multiple large-scale applications built around a common database, and those who have only worked on low-volume applications with trivial data models. I have many times replaced hundreds of lines of slow procedural or ORM code with a single SQL statement that was both faster and more likely to stay correct. You won’t persuade me that the programmers who sneer at SQL and RDBMSs and spend days tinkering with joins in an ORM are getting work done faster than someone who actually knows how to query a database.

I wrote my own article on this subject a while back. Chris Date has written a couple of books on the history of the relational model and RDBMSs, and I have links to those at the end of my article.

Greg Jorgensen typicalprogrammer.com

On August 5, 2008, Matthew W said:

I agree with Greg (and disagree with the post) that relational databases were a mere accident. Relational theory is, and always be, the fundamental and essential tool for thinking about and analysing data modelling, database implementation etc.

Relational DBMSes are the closest to this theory, which is both an advantage and a disadvantage.

The problem is that relational database theory essentially aims to abstract away entirely the physical representation of data from the logical representation. This is a Very Hard Problem in its full generality, and one that’s not been addressed fully in available implementations. A true relational database is setting itself a very difficult set of goals in that it needs to:

To achieve all three of these in full generality is a Very Hard Problem, and not one that’s been solved satisfactorily in widely available or widely-taught DBMS products.

Until some very very smart people start innovating in this area and making the results of their work into a robust, scalable, accessible, ideally open-sourced solution - I think there’ll still be a big market for the likes of SimpleDB and clueless ORMs ontop of MySQL. Although this makes me sad. I’m not saying I expect anyone to do that work for free. I’m just saying.

The likes of SimpleDB etc, dumb ‘cloud computing’ distibuted data stores, are trying to address part of this problem. But the lack of abstraction of the physical storage model takes away a lot of the advantages of a relational database.

On August 5, 2008, Matthew W said:

Finally, another important problem for using the relational model in application development - while it attempts (at times somewhat successfully) to abstract away the logical schema from physical storage and indexing at the Database level, at the application level it’s a different story. There are very few (no?) application-development-oriented languages which support the use of a fully normalised logical schema directly in application programming. Instead, application languages (C#, Java, Python, Ruby, C++, pretty much any OO language) all depend upon denormalized data storage and access models in application code. You simply have to worry about the physical data model when writing OO application code.

This could be addressed with a radically different programming paradigm for application development which abstracts away the physical data representation - but again, there’s a shortage of research in the area leading to practical, robust accessible implementations. Again, not saying I expect anyone to do this for free, but we have a problem.

On August 5, 2008, Matt said:

Matthew W: Yeah, ORMs are an attempt to extend that lack-of-worrying-about-the-phyiscal-model a little futher into application development territory - but the mismatch ultimately bites you on the arse at one point or another.

I’ve always been curious about how an application development language built on Logic programming (prolog etc) would fare at integration with a logical data storage engine. I suspect a lot better.

On August 5, 2008, J said:

Have a look at

http://www.readwriteweb.com/archives/amazon_dynamo.php

http://www.readwriteweb.com/images/relational-db-scalability.jpg

which are quite relevant to the debate here.

Greg Jorgensen: Would be interested in your response to these.

On August 5, 2008, Greg Jorgensen said:

J:

Dynamo is not addressing a problem with the relational model; it is addressing what Matthew W called the “very hard problem” or dealing with parallelism and concurrency with distributed databases. Amazon is dealing with server and networking problems, not real or perceived problems with the relational model. I contrast that to OODBMSs, which offer a non-relational data storage and retrieval model that meshes better with OOP (and with programmers who can’t deal with SQL).

I’ll start by pointing out that S3 has already this year suffered outages that would be unacceptable if they happened to a bank or an insurance company. That probably says more about why Amazon is doing so much work in this area than about their ability to deal with the hard problems they encounter at their scale.

The scalability problems sketched out in the diagram are real, but the author glosses over them as if the average programmer’s experience with MySQL is the whole story. Oracle, MS SQL Server, and DB/2 (and others) have had solutions to these problems for a long time. That isn’t to say the existing solutions are perfect, but large-scale distributed databases are not exotic or new in the enterprise world.

On August 5, 2008, Kevin Teague said:

About those “ORM packages available for modern frameworks” regarding the statement “They’re huge and complex, which means that a lot of development effort has to go into building simpler, friendlier interfaces to them.”

This isn’t true for all cases - for example the megrok.rdb package for Grok provides a “simpler, friendlier interface” for SQL Alchemy integration into Grok, but the package itself (http://svn.zope.org/megrok.rdb/trunk/) is less than 200 lines of code, hardly a lot of development effort. This means that in Grok your Model class can now inherit from megrok.rdb.Model and do things the “wrong way as a simple interface on top of the right way” or inherit from grok.Model and simply do persistence the “wrong” way :)

On August 6, 2008, Manakel said:

I really don’t get the point out there.

It’s really a question of tool for the job. If you are alone (in term of data) or with small dataset or with data that can be read meaningfully by an human being to check for inconsistency, then you can go with current “Dumb DataStore”.

If you have “BIG” volumes or small volumes but BIG Number of project s to report accross or if your data is complex then you need a Relational DB or a BI DB.

But it’s true that RDBMS, due to the mathematical theory behind, offer some quick and efficient to detect flaw design. We lack such properties on other models..

On August 8, 2008, Clayton Nash said:

Greg nailed pretty much all my points - if you don’t want ACID then RDBMs are overkill for you and you should just use any old object store you feel like. It’s just that ACID turns out to be exactly what you need a lot of the time. I really don’t understand when it became OK for someone to become a professional in the computer arena without understanding, or being educated in, the basic tools that make up our world. IBM had a rich set of tools for dealing with hierarchical data in the 60s and 70s as did others in the space. Oracle and the RDBMs that followed simply out performed everything else in existence at the time and continue to do so. They solve a set of problems that are important. Amazon and Google data stores are solving a different problem - a very useful one to be fair. You just couldn’t use either system to run a bank or an airline or even a small business accounting system. “We’re sorry we double booked seat 21C on your flight, but you see our database doesn’t guarantee global data coherence” (actually, this would actually be a better excuse than some the airlines are using). Programmers fail not at choosing which tool is appropriate, but in not understanding what they’re giving up by not using an RDBMs.

Comments for this entry are closed. If you'd like to share your thoughts on this entry with me, please contact me directly.

ponybadge