Semantics in HTML 5

I’m going to make a bold prediction. Long after you and I are gone, HTML will still be around. Not just in billions of archived pages from our era, but as a living, breathing entity. Too much effort, energy, and investment has gone into developing the web’s tools, protocols, and platforms for it to be abandoned lightly, if indeed at all.

Let’s stop to consider our responsibility. By an accident of history, we are associated with the development of an important tool our civilization will use to communicate for decades to come. So, when we turn our minds, idly or in earnest, to improving HTML, we must understand just how far-reaching the ramifications of today’s decisions may be.

HTML 5, the W3C’s recently redoubled effort to shape the next generation of HTML, has, over the last year or so, taken on considerable momentum. It is an enormous project, covering not simply the structure of HTML, but also parsing models, error-handling models, the DOM, algorithms for resource fetching, media content, 2D drawing, data templating, security models, page loading models, client-side data storage, and more.

There are also revisions to the structure, syntax, and semantics of HTML, some of which Lachlan Hunt covered in “A Preview of HTML 5.”

But for this article, let’s turn solely to the semantics of HTML. It’s something I’ve been interested in for many years, and something which I believe is fundamentally important to the future of HTML.

The BBC recently announced that they would drop the hCalendar microformat from their program listings, due to accessibility and usability concerns with the abbr design pattern. This demonstrates that we have, beyond any doubt, pushed the semantic capability of HTML far past what was ever intended, and indeed, what is reasonably possible with the language. We have simply run out of HTML elements and attributes with which to mark up more richly semantic documents. If we continue to be clever with the existing constructs of HTML, more problems such as this will arise. But HTML suffers from a fundamental defect as a semantic markup language—its semantics are fixed, not extensible.
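For readers who haven't met it, the abbr design pattern looks roughly like the fragment below. This is only a sketch: the event and date are invented for illustration, while the class names (vevent, summary, dtstart) are those used by the hCalendar microformat. The machine-readable date sits in the title attribute, and the worry is that assistive technology may read that raw value aloud in place of the human-readable text.

```html
<!-- hCalendar event using the abbr design pattern (illustrative values only) -->
<div class="vevent">
  <span class="summary">May Day concert</span>,
  <abbr class="dtstart" title="2009-05-01">1 May</abbr>
</div>
```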

This is not simply a theoretical problem. Hundreds of thousands of developers use the class and id attributes of HTML to create more richly semantic markup. (They also use them as “hooks” for CSS styling, but that’s another matter.) Almost invariably, those developers use ad hoc vocabularies—that is, values they have made up, rather than values taken from existing schemas. It’s pseudo semantic markup at best.

Many pages around the web use microformats to add more structured semantics than available in HTML’s impoverished set of elements and attributes. In this case, the values used for the class attribute come from agreed-upon vocabularies, sometimes adopted from other standards, such as vCard, sometimes from newly minted vocabularies where no solid pre-existing standard exists (as is the case for hReview).
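As an illustration, here is a small hCard fragment. The person, organization, and URL are made up for the example; the class values (vcard, fn, org, url) are property names that hCard adopts from vCard.

```html
<!-- hCard: class values come from the vCard vocabulary, not from the author's imagination -->
<div class="vcard">
  <a class="url fn" href="http://example.com/">Jane Example</a>,
  <span class="org">Example Widgets</span>
</div>
```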

Extensible semantics

There is a very real problem that needs to be solved here. We need mechanisms in HTML that clearly and unambiguously enable developers to add richer, more meaningful semantics—not pseudo semantics—to their markup. This is perhaps the single most pressing goal for the HTML 5 project.

But it’s not as simple as coming up with a mechanism to create richer semantics in HTML content: there are significant constraints on any solution. Perhaps the biggest one is backward compatibility. The solution can’t break the hundreds of millions of browsing devices in use today, which will continue to be used for years to come. Any solution that isn’t backward compatible won’t be widely adopted by developers for fear of excluding readers. It will quickly wither on the vine.

The solution must be forward compatible as well. Not in the sense that it must work in future browsers—that’s the responsibility of browser developers—but it must be extensible. We can’t expect any single solution we develop right now to solve all imaginable and unimaginable future semantic needs. We can develop a solution that can be extended to help meet future needs as they arise.

These two constraints, in tandem, present a huge challenge. But in the context of a language whose major iterations arrive a decade apart, and whose importance as a global platform for communication is paramount, this is a challenge that must be met.

So, how is HTML 5 addressing this issue? HTML 5 introduces a number of new elements. Some of these are what I’ve termed “structural”—section, nav, aside, header, and footer. The dialog element is a kind of content element, akin to blockquote. There are also a number of data elements, such as meter, which “represents a scalar measurement within a known range, or a fractional value; for example disk usage,” and the time element, which represents a date and/or a time.

While these elements might be useful, and seem to have generated some interest, do they really solve the problem we’ve identified, particularly within the twin constraints of forward and backward compatibility?

Let’s consider each constraint.

Backward compatibility

How do current browsers handle these new elements, such as section? Well, the most recent versions of Safari, Opera, Mozilla, and even IE7 will all render a page as follows.

<h1>Top Level Heading</h1>
<section>
  <h1>Second Level Heading</h1>
  <p>this is text in a section element</p>
  <section>
    <h1>Third Level Heading</h1>
  </section>
</section>

It looks like an excellent start. But when we try styling, for example, section elements with CSS that looks like this:

section {color: red}

…most of the above-mentioned browsers manage to style the element, but IE7 (and so presumably IE6) does not.

So we have a serious backward compatibility issue with 75% of browsers currently in use. Given the half-life of Internet Explorer, we can predict that most users will be using IE6 or IE7 even several years from now.

If HTML 5 introduces these new elements, what is the likelihood they’ll be implemented by the vast majority of developers—given the knowledge that they’re essentially incompatible with the majority of browsers in use?

Unfortunately, if you are looking for alternative solutions to the CSS problem, putting class attributes on your section elements and then trying to style them using the class value won’t work in IE. Perhaps there is some kind of workaround out there, but unless there is, that looks like a deal breaker right there.

Let’s turn to forward compatibility, the second constraint.

Forward compatibility

We’ll start by posing the question: “why are we inventing these new elements?” A reasonable answer would be: “because HTML lacks semantic richness, and by adding these elements, we increase the semantic richness of HTML—that can’t be bad, can it?”

By adding these elements, we are addressing the need for greater semantic capability in HTML, but only within a narrow scope. No matter how many elements we bolt on, we will always think of more semantic goodness to add to HTML. And so, having added as many new elements as we like, we still won’t have solved the problem. We don’t need to add specific terms to the vocabulary of HTML, we need to add a mechanism that allows semantic richness to be added to a document as required. In technical terms, we need to make HTML extensible. HTML 5 proposes no mechanism for extensibility.

HTML 5, therefore, implements a feature that breaks a sizable percentage of current browsers, and doesn’t really allow us to add richer semantics to the language at all.

Several questions remain about the new elements. Where have these new element names come from? How was it decided that there should be a navigation element, and that it should be called “nav”? Why should the same term apply to page-level, site-level, and meta-site-level navigation?

Why not adopt an existing vocabulary, such as DocBook? Its document structure vocabulary is far richer and it’s been developed by publishing experts over many years. This is not an argument in favor of DocBook, specifically: the point is that the extremely important task of providing a mechanism for semantic richness in HTML is being approached in an ad hoc way, paying apparently little attention to best practices in related work going back 30 years or more. (The original work on GML began in the early 1970s.)

Some thoughts on a solution

So, having been critical of current efforts, do I have any practical suggestions on how to solve this problem? Well, I have the start of one.

If adding elements to HTML is out of the question, at least within the parameters of this discussion, attributes are the other logical area of HTML to concentrate on. After all, for nearly a decade, we’ve been using class and id attributes as mechanisms to extend the semantics of HTML. A great many developers are familiar and comfortable with this. The microformats project demonstrated that the existing attributes of HTML are not sufficient, as a generalized mechanism, to extend the semantics of HTML. So, if we are to use attributes to help solve this problem, we need to come up with one or more new attributes. Before we get into the mechanics of how that might work, it’s only fair to subject this suggestion to the same requirements we have for the new elements of HTML 5. Most importantly, is introducing new attributes to HTML backward compatible? And if so, does it provide a workable mechanism for semantic extensibility in HTML?

Let’s invent a new attribute. I’ll call it “structure,” but the particular name isn’t important. We can use it like this:

<div structure="header">

Let’s see how our browsers fare with this.

Of course, all our browsers will style this element with CSS.

div {color: red}

But how about this?

div[structure] {font-weight: bold}

In fact, almost all browsers, including IE7, style the div with an attribute of structure, even if there is no such thing as the structure attribute! Sadly, our luck runs out there, as IE6 does not. But we can use the attribute in HTML and have all existing browsers recognize it. We can even use CSS to style our HTML using the attribute in all modern browsers. And, if we want a workaround for older browsers, we can add a class value to the element for styling. Compare this with the HTML 5 solution, which adds new elements that cannot be styled in Internet Explorer 6 or 7 and you’ll see that this is definitely a more backward-compatible solution.
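A minimal sketch of that fallback, keeping the invented structure attribute from above and adding a class value of the same name. The two rules are written separately on the assumption that an older browser may discard an entire rule when it meets a selector it doesn't understand.

```html
<style>
  /* For browsers that support attribute selectors */
  div[structure="header"] { font-weight: bold; }
  /* Fallback for browsers that do not, such as IE6 */
  div.header { font-weight: bold; }
</style>

<!-- "structure" is the made-up attribute discussed above, not part of any HTML specification -->
<div structure="header" class="header">Site title</div>
```

The duplication is unattractive, but it keeps the semantic attribute in the markup while the class value does the styling work in older browsers.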

Extensibility through attributes

Instead of new elements, HTML 5 should adopt a number of new attributes. Each of these attributes would relate to a category or type of semantics. For example, as I’ve detailed in another article, HTML includes structural semantics, rhetorical semantics, role semantics (adopted from XHTML), and other classes or categories of semantics.

These new attributes could then be used much as the class attribute is used: to attach to an element semantics that describe the nature of the element, or to add metadata about the element.

This is not dissimilar to the role attribute of XHTML, but rather than having a single attribute “bucket” for all element semantics, we should identify the different types of semantics for an element, and separate them out.

For example, the XHTML role attribute works like this:

<ul role="navigation sitemap">
    <li href="downloads">Downloads</li>
    <li href="docs">Documentation</li>
    <li href="news">News</li>
</ul>

The values of the role attribute are a space-separated list of words from the default vocabulary, or from a defined vocabulary.

Why not simply adopt the role attribute as-is? Well, there are other kinds of semantics for which the term role doesn’t apply. For example:

<p rhetoric="irony">He’s a fantastic person.</p>

This demonstrates a theoretical type of semantics, “rhetoric,” which could be used to mark up the rhetorical nature of a document. This element clearly doesn’t play the role of irony in the document. Rather, the contents of the element are ironic.

Here is another example. It’s increasingly obvious that HTML lacks a way to attach a machine-readable version of a human-readable value, e.g., a date. This is at the heart of the problem the BBC has with the hCalendar microformat that we referred to earlier. While <span role="2009-05-01">May Day next year</span> really doesn’t make sense, something along the lines of <span equivalent="2009-05-01">May Day next year</span> would.

Again, whether we use the specific term “equivalent” or some other term for this kind of semantic attribute is not the issue. What’s important to note is that it’s not as simple as using either the class attribute or the role attribute as a one-size-fits-all bucket to hold semantic information. For a properly extensible solution that provides backward compatibility and sufficient flexibility, a solution along these lines looks worth investigating.

I titled this section “some thoughts on a solution” because a significant amount of work needs to be done to really develop a workable solution. Open questions include the following.

How many distinct semantic attributes should there be? Should these categories be extensible, and if so, how?
How are vocabularies determined? Do we simply invent the terms we want, in much the same way that developers have been using class values, or should the possible values all be determined by a standardized specification? Or should there be a mechanism for inventing (and hopefully sharing) vocabularies, using some kind of profile?
If we have a conflict between two vocabularies, such that two identical terms are defined by two different vocabularies, how is this resolved? Do we need a form of namespacing, or does some other mechanism exist?
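
To make the conflict question concrete, consider a sketch like the following. Everything in it is hypothetical: the structure attribute is the invented one from earlier, and the prefixes and the vocabularies they stand for are made up purely to illustrate the problem, not offered as an answer to it.

```html
<!-- Two imagined vocabularies both define the term "title" -->
<!-- Ambiguous: is this a document title, or a person's job title? -->
<span structure="title">Editor</span>

<!-- One conceivable disambiguation: a prefix naming the vocabulary (prefixes are invented) -->
<span structure="doc:title">Semantics in HTML 5</span>
<span structure="person:title">Editor</span>
```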

Rather than rushing to answer these questions, I’m posing them to highlight the issues that need to be addressed, and to start a dialog. The ramifications and reach of decisions made in HTML 5 are too great for decisions to be made in the absence of at least some input from those highly knowledgeable about linguistics, semantics, semiotics, and related fields.

Hopefully, if nothing else, it’s clear that simply “making up new elements” isn’t a solution to the problem of increasing the semantic capacity of HTML.

Let’s not rush into these decisions lightly—after all, with climate change we’ve saddled our grandkids with enough trouble as it is. Let’s at least leave them the best possible HTML we can.

107 Reader Comments

Rob Burns says:

January 24, 2009 at 5:52 am

@Aaron Miller
“I think the phrase tag soup should be retired. It’s one thing to talk about parsing it, another to talk about writing it. It’s important to remember that proper HTML is never ‘tag soup.’ It is however, not necessarily XHTML, and if parsed as such, will be ‘tag soup.’ The reverse is not true. This is precisely what pisses off so many XML purists, because it’s a one-way street.”

Tag soup gets used in two different ways which you’re confusing in this comment. 1) tag soup sometimes refers to the serialized source content of a document where tags are potentially misnested, content models invented out of thin air and attribute values requiring quotations not quoted: in general vended content not conforming to any specification anywhere. 2) tag soup parsing refers to a parser that is capable of parsing tag soup (a Herculean task).

When you say that XHTML parsed as text/html will be tag soup you’re confusing these two definitions. The XHTML is certainly not tag soup as it adheres to the XHTML syntax and sometimes even other syntactic requirements on top of that (such as XHTML 1.0 appendix C), so there is no sense in which that content can be considered tag soup. However, if such an XHTML document is vended as text/html it will be parsed by the UA’s tag soup parsing just like any other conforming or non-conforming HTML 2 through 4.01. So in this sense, both not using XHTML and not vending as application/xhtml+xml mean that the content is parsed by the tag soup parsing processor (just like any other HTML).

It’s important to keep these two meanings of tag soup separate to understand the conversation.

Aaron Miller says:

January 29, 2009 at 3:07 pm

@Rob, looks like we’re talking about the same distinction. The only difference is that I’m saying the phrase “tag soup” makes it sound anomalous, when in fact from the content side it refers to over 95% of the web, and from the browser (parser) side, it’s SOP. See the Opera MAMA study and Ian Hickson’s Google report if you don’t know what I mean.

Russ Michell says:

February 24, 2009 at 6:33 pm

This is not directly related to John’s article, but it reminded me of the following problem I’ve been posing in my head for some time: How is it that the syntax of HTML or any other machine-readable grammar is constructed using English? More specifically, US English? Has anyone ever tried to construct a language, of even a very light grammar, that allows multiple human languages to describe headers, footers, loops, lists, etc.? I appreciate that many of these machine languages were first composed in the US and thus US English has become the lingua franca of programming – but this is the 21st century, and not everyone on the planet who wishes to write code knows how to speak or write English, never mind a specific sub-grammar of it.

Montmorency says:

June 27, 2009 at 1:46 pm

Seems that the first one wasn’t perfect.
Here it is: http://interpretor.ru/html5semantics

Tobias Otte says:

July 13, 2009 at 6:52 pm

German translation available: http://tobias-otte.de/essays/semantik-in-html-5/

blackdog says:

November 9, 2009 at 1:31 pm

I agree with all the principal points in the article, and I have the same opinion about the preferable use of attributes; but I think they’re breaking compatibility on purpose. We all agree it is stupid to still have concerns about IE6 in 2009 (and soon ’10). If they break the cordon everybody will be happier. And in fact the big push on HTML5 came from browser makers, and it looks to me like MS wants to be in the game.
In the aftermath we will all have a common base to discuss upon.

After all, I think some new tags would come in handy.
For example, I think that for something as ubiquitous as a calendar, there should exist a tag. It would end the debate about which solution is more semantic (table vs. list, which oddly relies on the kind of visualization we want to give), it would spare a lot of code, and it would give more artistic freedom to designers, who could target a parameter/class with a simple JavaScript to radically modify the visualization.

Tchalvakspam says:

January 3, 2010 at 12:43 am

http://wiki.whatwg.org/wiki/FAQ#HTML5_should_support_a_way_for_anyone_to_invent_new_elements.21

Contains some of their responses to the extensibility problem.