Why HTML

June 18, 2008 Meta, Pedantics, Web standards

So, as I let the dust settle from the most controversial changes I made in the redesign (and tweak some things and watch my stats in response to the constructive feedback I’ve gotten), I’d like to address the other big change that people have been asking about: why I switched (switched back, actually) from XHTML 1.0 to HTML 4.01.

The short and sweet reason is simply this: XHTML offers no compelling advantage — to me — over HTML, but even if it did it would also offer increased complexity and uncertainty that make it unappealing to me.

Bear with me and I’ll explain why.

Why use XHTML?

XHTML really only offers two major advantages — to a document author — which aren’t available in HTML: the ability to mix in content from other markup languages more or less seamlessly (via XML namespaces), and the ability to use standard XML tools to work with your documents.

Note that I’ve left out the big “advantage” (or disadvantage, depending on your point of view) that generates most of the noise in typical HTML-vs.-XHTML debates: XML’s draconian error-handling policy, which promotes nearly all errors to fatal errors. I don’t see any real advantage in applying that model to the Web, and I see a host of potential disadvantages, but the arguments have been so well-trod at this point that I don’t honestly think I’d convince anybody one way or another by bringing them all up again.

Meanwhile, I don’t have any need for namespaced content; I’m not displaying any complex mathematical notation here and don’t plan to, and I don’t use SVG for any images. So that’s one advantage of XHTML out the window. I also don’t have any need for XML tools; all the processing I need to do can be handled by HTML-parsing libraries like BeautifulSoup. That’s the other advantage gone.

So there’s no concrete advantage to be gained, for me, here on this site, by using XHTML.

Why not use XHTML?

Once again, I don’t want to argue for or against XML’s draconian error-handling model, but I do think the consequences of that draconian error handling, combined with some odd quirks of XML and its intersection with HTML in the form of XHTML, result in some significant disadvantages.

First off, let me point out that if you have a need for the features XHTML offers, the other big traditional HTML-vs.-XHTML debate (concerning which media type should be used for XHTML documents) becomes moot. If you want the behavior of XHTML, you have no choice but to handle it as XHTML, using either the recommended XHTML media type of application/xhtml+xml (vastly preferable) or a more general XML media type like application/xml. The reasoning for this is simple: sending a document as text/html, even if its DOCTYPE indicates that it’s XHTML, won’t get you any XML or XHTML behaviors. So if you’re going to do XHTML, you have to actually do XHTML.
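
To make the media-type point concrete, here is a minimal sketch in Python of a consumer that picks its parser from the media type alone. The names handle, TagCollector and XML_TYPES are made up for illustration, and the standard-library parsers only stand in for what a browser does; real browsers are far more elaborate.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Media types this sketch treats as "really XML" (an illustrative list).
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}

class TagCollector(HTMLParser):
    """Forgiving HTML-style parser that just records start tags."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def handle(body, media_type):
    """Return the element names found in body, parsed according to media_type."""
    if media_type in XML_TYPES:
        # XML rules: anything that is not well-formed raises ET.ParseError.
        root = ET.fromstring(body)
        return [el.tag.split("}")[-1] for el in root.iter()]
    # text/html rules: the DOCTYPE or xmlns is irrelevant, errors are tolerated.
    collector = TagCollector()
    collector.feed(body)
    collector.close()
    return collector.tags

# Claims to be XHTML via its namespace, but <br> is never closed.
SLOPPY = ('<html xmlns="http://www.w3.org/1999/xhtml">'
          '<body><p>one<br>two</p></body></html>')
```

Calling handle(SLOPPY, "text/html") quietly returns the tag names, while handle(SLOPPY, "application/xhtml+xml") raises ET.ParseError: the same bytes only get XML behavior when they are labeled as XML.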

But the moment you do that, your job as a document creator gets a lot more complicated. First off are the (relatively) well-known changes to the way CSS and the DOM work in XHTML documents which are actually being handled as such: the rules for applying styles are subtly different (you need to apply your document-wide styles to the html element instead of the body element), and any DOM methods intended to work with or manipulate the XHTML portions of the document must be changed to namespaced versions and fed the XHTML namespace as an argument (since a true XHTML document may well have content from multiple XML namespaces).
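
The namespaced DOM calls can be sketched outside the browser with Python's xml.dom.minidom, which implements the same DOM method names (getElementsByTagNameNS, createElementNS). This is only an illustration of the API shape, not of any browser's behavior, and the sample document is invented for the purpose.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# An XHTML document with an embedded SVG fragment; both contain a <title>.
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Page title</title></head>'
    '<body><svg xmlns="http://www.w3.org/2000/svg">'
    '<title>Drawing title</title></svg></body></html>'
)

dom = minidom.parseString(DOC)

# Name-only lookup cannot tell the two vocabularies apart.
all_titles = dom.getElementsByTagName("title")

# The namespaced lookup takes the XHTML namespace as an argument.
xhtml_titles = dom.getElementsByTagNameNS(XHTML_NS, "title")

# Creating XHTML content likewise goes through the namespaced method.
new_para = dom.createElementNS(XHTML_NS, "p")
new_para.appendChild(dom.createTextNode("Added via the namespaced API"))
dom.getElementsByTagNameNS(XHTML_NS, "body")[0].appendChild(new_para)
```

Here all_titles holds two elements and xhtml_titles holds only the page's own title, which is the reason the namespaced variants exist once a document can mix vocabularies.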

If you have to support Internet Explorer, this is a nightmare; IE has a long history of lackluster support for handling XHTML-as-XHTML, and so you typically have to fall back to text/html. Now your document has to work under either of two different style and object models, and that’s even less fun.
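
The fallback usually takes the form of content negotiation on the request's Accept header. A rough sketch, assuming a hypothetical helper named choose_media_type and deliberately simplified Accept parsing (the header strings in the usage note are illustrative, not captured from real browsers):

```python
def choose_media_type(accept_header):
    """Pick application/xhtml+xml only for clients that explicitly list it.

    Simplified: wildcards such as */* are deliberately NOT treated as
    XHTML support, since that is exactly what legacy clients send.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        q = 1.0  # default quality value when none is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if media_type == "application/xhtml+xml" and q > 0:
            return "application/xhtml+xml"
    return "text/html"
```

A client sending "text/html,application/xhtml+xml,application/xml;q=0.9" gets application/xhtml+xml, while one sending only "image/gif, */*" gets text/html. The negotiation itself is the easy part; the real cost is that the same markup now has to behave under both sets of rules.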

Now, on top of this add the consequences of XML. You don’t just have to ensure your XHTML is well-formed, you also have to be aware of some quirks which can cause a document to be well-formed and properly interpreted in one context, but non-well-formed or misinterpreted in another.

For example, many browsers use non-validating XML parsers, which means they are not required to recognize or correctly handle any named character entities other than the five predefined by XML itself: amp, quot, apos, lt and gt. These days the browsers mostly do some special-case processing to ensure they properly recognize and handle entities from XHTML’s externally-defined entity set, but there was a time when, for example, Firefox would toss an “unrecognized entity” error at you if you used any externally-defined named entities (such as nbsp or copy or other commonly-used entities) in an XHTML document (arguably a bug for any document which was not declared to be standalone).
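
The entity problem is easy to reproduce with any non-validating parser. A small sketch using Python's xml.etree.ElementTree (built on expat); the helper name is_well_formed and the sample strings are made up for the demonstration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """True if a non-validating XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# The five entities predefined by XML itself always work.
predefined = ET.fromstring("<p>&amp; &quot; &apos; &lt; &gt;</p>")

# nbsp comes from XHTML's external entity set, which this parser never reads.
with_nbsp = ('<p xmlns="http://www.w3.org/1999/xhtml">'
             'fish&nbsp;&amp;&nbsp;chips</p>')

# A numeric character reference needs no entity declaration at all.
with_numeric = ('<p xmlns="http://www.w3.org/1999/xhtml">'
                'fish&#160;&amp;&#160;chips</p>')
```

Here is_well_formed(with_nbsp) is False and is_well_formed(with_numeric) is True, so the portable workaround is to write numeric references, at some cost to readability.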

Although popular web browsers can now handle this situation correctly (from the point of view of a Web author), there’s no guarantee that other tools which use non-validating parsers (which includes a vast swathe of popular XML libraries across most mainstream languages) will do the same. This raises a very real risk that an XHTML document, when fed to arbitrary XML tools, will not be interpreted or handled as intended.

Similarly, the Byzantine set of rules which govern the handling of character-encoding information when XHTML documents are sent over HTTP opens up ample opportunities for “quirky” behavior. It’s quite easy, for example, to produce an XHTML document which is well-formed when read from a local disk but not well-formed when served over HTTP using common default settings for web servers. Worse: some XML tools will correctly recognize the well-formedness (or lack thereof) of such a document, while others will not, introducing uncertainty as to how your documents will be handled.
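
As one illustration of how this can happen, the sketch below assumes the RFC 3023 rules as I understand them: a charset parameter on the Content-Type header overrides the document's own encoding declaration, text/xml with no charset is treated as us-ascii, and application/xml (or any +xml type) with no charset defers to the declaration. The function names are hypothetical, and passing an encoding to ET.XMLParser is just a stand-in for "the transport layer decided the encoding".

```python
import xml.etree.ElementTree as ET

# A UTF-8 document that declares its own encoding; fine when read from disk.
DOC = '<?xml version="1.0" encoding="utf-8"?><p>caf\u00e9</p>'.encode("utf-8")

def http_encoding(media_type, charset=None):
    """Encoding imposed by HTTP for an XML media type, or None to defer
    to the document's own declaration (assumed RFC 3023 behavior)."""
    if charset:
        return charset            # an explicit charset parameter always wins
    if media_type.lower().startswith("text/"):
        return "us-ascii"         # text/xml without charset
    return None                   # application/xml, application/xhtml+xml

def parse_as_served(body, media_type, charset=None):
    """Parse body the way a strictly conforming consumer would over HTTP."""
    # A non-None encoding overrides the encoding declared in the document.
    parser = ET.XMLParser(encoding=http_encoding(media_type, charset))
    return ET.fromstring(body, parser=parser)
```

Served as application/xhtml+xml with no charset, the document parses and its text is "café". Served as text/xml with no charset, the same bytes raise ET.ParseError, because they are not valid us-ascii. Served with a server-supplied charset=iso-8859-1, the document still parses but the text comes out as mojibake, which is the "misinterpreted" half of the problem.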

And this is really just the tip of the iceberg; the intersection of HTML, XML, HTTP, CSS and the DOM is a pedant’s paradise, and successfully navigating those waters involves a lot more work than I’m willing to do just to publish a blog.

Why HTML

Since XHTML doesn’t offer any advantages over HTML for what I’m doing here, and since using XHTML brings up a whole host of potential issues, it’s a less than desirable solution. Meanwhile, HTML does everything I need, and it does not fall victim to any of XHTML’s potential problems: it has one style model and one document model, its semantics are not selectively applied according to the type of parser in use, and serving HTML over HTTP involves fewer specifications and thus fewer opportunities for frustrating “quirks”.

So here I am, publishing valid HTML 4.01 Strict. As HTML5 (both the specifications and the implementations) comes to maturity over the next couple of years, I’ll probably migrate in that direction. For now, though, I have HTML 4.01 and I’m happy with it.