RDFa (“Resource Description Framework in attributes”) is having its five minutes of fame: Google is beginning to process RDFa and Microformats as it indexes websites, using the parsed data to enhance the display of search results with “rich snippets.” Yahoo!, meanwhile, has been processing RDFa for about a year. With these two giants of search on the same trajectory, a new kind of web is closer than ever before.
The web is designed to be consumed by humans, and much of the rich, useful information our websites contain is inaccessible to machines. People can cope with all sorts of variations in layout, spelling, capitalization, color, position, and so on, and still absorb the intended meaning from the page. Machines, on the other hand, need some help.
A new kind of web—a semantic web—would be made up of information marked up in such a way that software can also easily understand it. Before considering how we might achieve such a web, let’s look at what we might be able to do with it.
Adding machine-friendly data to a web page improves our ability to search. Imagine a news story that says “today the prime minister flew to Australia,” in reference to Britain’s prime minister, Gordon Brown. The article might not call the prime minister by name, but it’s still pretty easy to ensure that this news story shows up when someone searches for “Gordon Brown.”
If the news story in question dates from 1940, however, we wouldn’t want this document to appear when users search for “Gordon Brown”—but we would want it to appear when they search for “Winston Churchill.”
To accomplish this using the same technique as the Gordon Brown example—i.e., by mapping one set of words to another—our search engine must know the start and end dates of the premierships of all British prime ministers, and then cross-reference those with the publication date of the newspaper article. This wouldn’t be completely impossible, but what if the article is a piece of fiction, or if it’s actually about the Australian prime minister? In these cases, a simple list of dates won’t help us.
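To make the cross-referencing idea concrete, here is a minimal Python sketch. The table holds only two terms of office (Churchill's first term and Gordon Brown's), and the function name is my own invention; a real index would need every premiership, and it would still fail on fiction or on articles about another country's prime minister.

```python
from datetime import date

# Terms of office (name, start, end). Only two entries are listed here
# for illustration; a real lookup table would need every premiership.
PREMIERSHIPS = [
    ("Winston Churchill", date(1940, 5, 10), date(1945, 7, 26)),
    ("Gordon Brown", date(2007, 6, 27), date(2010, 5, 11)),
]


def prime_minister_on(published):
    """Return the prime minister in office on the publication date, or None."""
    for name, start, end in PREMIERSHIPS:
        if start <= published <= end:
            return name
    return None


print(prime_minister_on(date(1940, 6, 1)))  # Winston Churchill
print(prime_minister_on(date(2009, 5, 9)))  # Gordon Brown
```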
The indexing algorithms that try to deduce necessary context from the text are sure to improve in the coming years, but extra markup that makes information unambiguous can only make search more accurate.
Improved user interfaces
Yahoo! and Google have both begun to use RDFa to improve user experience by enhancing the appearance of individual search results. Here’s Google’s approach:
A rich snippet on Google.
…and here’s Yahoo!’s:
An enhanced results example on Yahoo!
There’s a commercial advantage to having a better “understanding” of the pages being indexed: more relevant, focused advertisements can be placed alongside search results.
Now that we know why we might want to put more machine-friendly data in our pages, we can ask how we might go about it.
HTML’s metadata features
You’ll no doubt already be familiar with the basic metadata features that HTML supports. The most commonly used are the meta and link elements, and some people will also be aware that the @rel attribute used on link can also be used with a. (Note: I’ll be using the term “HTML” to mean “the HTML family of languages,” since what I’m saying applies equally to both HTML and XHTML.)
We’ll look at these existing features first, because they provide the conceptual foundation upon which RDFa has been built.
HTML’s meta and link elements live in the head of a document, and allow us to provide information that relates to that document. For example, I might want to say that I created my document on May 9th, 2009, that I am the author, and that I give other people the right to use the article however they want:
(Line wraps marked » —Ed.)
<html>
  <head>
    <title>RDFa: Now everyone can have an API</title>
    <meta name="author" content="Mark Birbeck" />
    <meta name="created" content="2009-05-09" />
    <link rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/" />
  </head>
  . . .
</html>
This example shows how HTML neatly packs the document’s metadata into a space distinct from the document’s text. HTML uses the head element for metadata and the body element for whatever content the web page contains.
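To see how little a machine needs in order to read this kind of metadata, here is a rough sketch using Python's standard html.parser module. The class name and the two dictionaries are my own choices for illustration, not part of any RDFa toolkit.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect meta name/content and link rel/href pairs found inside head."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.properties = {}  # textual properties from meta
        self.relations = {}   # relationships from link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif self.in_head and tag == "meta" and attrs.get("name"):
            self.properties[attrs["name"]] = attrs.get("content") or ""
        elif self.in_head and tag == "link" and attrs.get("rel"):
            self.relations[attrs["rel"]] = attrs.get("href") or ""

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


DOCUMENT = """
<html>
  <head>
    <title>RDFa: Now everyone can have an API</title>
    <meta name="author" content="Mark Birbeck" />
    <meta name="created" content="2009-05-09" />
    <link rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/" />
  </head>
  <body></body>
</html>
"""

parser = HeadMetadataParser()
parser.feed(DOCUMENT)
print(parser.properties)
print(parser.relations)
```

Everything the parser collects comes from the head; the body is ignored entirely, which is exactly the separation described above.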
HTML also allows us to blur these two areas: we can place the @rel attribute on a clickable link, yet retain the meaning that it contains in link.
Imagine I want to allow my site visitors to view my Creative Commons license. As things stand, the information about which license I’m referring to is hidden from readers because it’s in the head. But that’s easily addressed by adding an anchor in the body:
<a href="http://creativecommons.org/licenses/by-sa/3.0/"> CC Attribution-ShareAlike</a>
This is fine, and it allows us to achieve our goals: first, we have machine-ready metadata in the head that describes the relationship between the document and the license:
<link rel="license" href="http://creativecommons.org/licenses/ »
…and second, we have a link in the body that allows a human to click through and read the license:
<a href="http://creativecommons.org/licenses/by-sa/3.0/"> CC Attribution-ShareAlike</a>
But HTML also allows us to use the @rel attribute of link on an anchor. In other words, it allows metadata that would normally go into the head of the document to appear in the body.
With this incredibly powerful technique, we can express both the metadata for machines, and the clickable link for humans, in one convenient package:
<a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/"> CC Attribution-ShareAlike</a>
This simple method of augmenting inline markup with metadata is not often used in web pages, but it’s right at the heart of RDFa. This leads to the first principle of RDFa:
link and a elements imply that there is a relationship between the current document and some other document; the @rel attribute allows us to provide a value that will better describe that relationship.
Don’t forget though: using @rel with a is merely taking advantage of an already existing HTML feature, which RDFa then draws attention to.
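From the consuming side, this principle means a and link can be treated alike. The Python sketch below records one (document, relationship, target) entry for each @rel value it finds; the class name and the example.org address standing in for the current document are assumptions made for illustration.

```python
from html.parser import HTMLParser


class RelationParser(HTMLParser):
    """Record (document, rel, href) for every a or link element that has @rel."""

    def __init__(self, document_url):
        super().__init__()
        self.document_url = document_url
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel, href = attrs.get("rel"), attrs.get("href")
        if tag in ("a", "link") and rel and href:
            # @rel may hold several space-separated values.
            for value in rel.split():
                self.triples.append((self.document_url, value, href))


# Hypothetical address of the page being parsed.
parser = RelationParser("http://example.org/article")
parser.feed(
    '<a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">'
    "CC Attribution-ShareAlike</a>"
)
print(parser.triples)
```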
Applying distinct licenses to images
The previous example provides licensing information about the web page that contains it. But what if the page contains multiple items, each of which has a different license? It doesn’t take more than a moment to think up scenarios where this would apply, such as a page of search results on Flickr, YouTube, or SlideShare.
RDFa takes the simple idea behind @rel, that it expresses a relationship between two things, and builds on it by allowing the attribute to be applied to the @src attribute on the img tag.
So, for example, imagine a page of search results on Flickr:
<img src="image1.png" /> <img src="image2.png" />
Let’s say that the first image is licensed with the Creative Commons Attribution-ShareAlike license, but that the second uses CC’s Attribution-Noncommercial-No Derivative works license.
How should we mark it up?
If you guessed that we simply place the @rel attribute on the img tag, then you are exactly right. To express two different licenses, one for each image, we simply do this:
<img src="image1.png" rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/" />
<img src="image2.png" rel="license" href="http://creativecommons.org/licenses/ »
Here, you can see the core principle in action—incrementally building on the metadata features that HTML already provides. Building on HTML concepts in this way makes it easier for people to orient themselves when using RDFa.
@rel and @href attributes are no longer confined to the a and link elements, but can also be used on img to indicate a relationship between the image and some other item.
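As a sketch of what this means for a parser: when an element carries @src, the relationship is about that image rather than about the page. The Python below is a simplification (it does not resolve relative addresses, for one thing), the class name and the example.org page address are hypothetical, and the by-nc-nd address for the second image is my assumption about which license URL would be used.

```python
from html.parser import HTMLParser


class SubjectAwareParser(HTMLParser):
    """Record (subject, rel, href); the subject is @src when present."""

    def __init__(self, document_url):
        super().__init__()
        self.document_url = document_url
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel, href = attrs.get("rel"), attrs.get("href")
        if not (rel and href):
            return
        # An image is described by its own address, not the page's.
        subject = attrs.get("src") or self.document_url
        for value in rel.split():
            self.triples.append((subject, value, href))


PAGE = """
<img src="image1.png" rel="license"
     href="http://creativecommons.org/licenses/by-sa/3.0/" />
<img src="image2.png" rel="license"
     href="http://creativecommons.org/licenses/by-nc-nd/3.0/" />
"""

parser = SubjectAwareParser("http://example.org/search")  # hypothetical page
parser.feed(PAGE)
for triple in parser.triples:
    print(triple)
```

Each image ends up with its own license statement, even though both sit on the same page.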
Adding properties to the body
In our HTML illustration, we saw that we can also add textual properties about the document:
<meta name="author" content="Mark Birbeck" />
<meta name="created" content="2009-05-09" />
This tells us who created the document, and when, but it can only be used in the head of the document. RDFa takes this technique and embellishes it so that it can be used in the body: @content is therefore no longer confined to the meta tag, but can appear on any element.
In ordinary HTML, properties are set in the head of the document, using meta. In HTML documents with RDFa, @content can be used to set properties on any element.
There is one minor change from the way @content is used in head, though: since the @name attribute is already used for a different purpose in other parts of HTML, it would be confusing to also use it to represent the property name in the body. RDFa therefore provides a new attribute, called @property, to play this role.
Although HTML uses the @name attribute to set the name of a property on meta, it can’t be used on other elements, so RDFa provides a new attribute called @property.
Suppose our document’s publication date and author name are in the head of the document, and that the same information is in human-readable form in the body of the document:
<html>
  <head>
    <title>RDFa: Now everyone can have an API</title>
    <meta name="author" content="Mark Birbeck" />
    <meta name="created" content="2009-05-09" />
  </head>
  <body>
    <h1>RDFa: Now everyone can have an API</h1>
    Author: <em>Mark Birbeck</em>
    Created: <em>May 9th, 2009</em>
  </body>
</html>
With RDFa we can coalesce these two sets of information, so that the metadata is located at the same point as the readable text:
<html>
  <head>
    <title>RDFa: Now everyone can have an API</title>
  </head>
  <body>
    <h1>RDFa: Now everyone can have an API</h1>
    Author: <em property="author" content="Mark Birbeck">Mark Birbeck</em>
    Published: <em property="created" content="2009-05-09">May 9th, 2009</em>
  </body>
</html>
We’ll see in a moment how we can improve on this example. For now we just need to recognize that whether the metadata appears in the body of the document or the head, it means the same thing, and that this is merely the text property equivalent of the @rel technique that HTML already has for expressing relationships in the body.
We have to take a small diversion here. We can get away with using @name="author" in the document head because even though the property “author” is not defined in any specification, over the years people have come to expect it. But RDFa allows, and requires, much greater precision. When we use a term such as “author” or “created,” we need to indicate where that term comes from. If we don’t, we have no way to know whether what you mean by “author” is the same thing I mean.
This may seem unnecessary. After all, how could anyone confuse an obvious term such as “author”? But imagine that the term is “country” on a holiday website; does that term define the country the holiday is in, or does it indicate that the holiday takes place in the country, rather than in the city? Many other words also have different meanings in different contexts, and if you then add to that the possibility of different languages, you’ll soon realize that if we want to make any headway with our data, we need to be precise. And that means indicating where our terms come from.
In RDFa, we do this by indicating that we want to use a certain collection of terms, or vocabulary. This is easily done: just specify the address of the vocabulary, in conjunction with a short-form map, like this:
xmlns:dc="http://purl.org/dc/terms/"
(If you understand XML, you’ll recognize this as the syntax for an XML namespace declaration.)
This example provides us access to the list of terms from the Dublin Core vocabulary, by way of the prefix “dc.” Dublin Core has many terms available to us, and the two we’ll use in our example are “creator” and “created.” To put them to work, we need to place the prefix in front of them, like so:
property="dc:creator"
property="dc:created"
Now it’s completely clear: “dc:creator” is not the same as “xyz:creator.”
Note that the prefix mapping needs to be placed in the document somewhere “above” the location where it will be used. In our example, it could be placed on the body element or the html element. The full example might look like this:
<html xmlns:dc="http://purl.org/dc/terms/">
  <head>
    <title>RDFa: Now everyone can have an API</title>
  </head>
  <body>
    <h1>RDFa: Now everyone can have an API</h1>
    Author: <em property="dc:creator" content="Mark Birbeck">Mark Birbeck</em>
    Published: <em property="dc:created" content="2009-05-09">May 9th, 2009</em>
  </body>
</html>
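What the prefix buys us is easiest to see from the parser's point of view: a prefixed term is simply shorthand for a full address. A minimal Python sketch of that expansion follows; the function name is mine, and the “xyz” vocabulary address is made up purely for contrast.

```python
def expand_term(term, prefixes):
    """Expand a prefixed term such as "dc:creator" into a full address.

    Returns None when the term has no prefix or the prefix is undeclared.
    """
    prefix, separator, reference = term.partition(":")
    if not separator or prefix not in prefixes:
        return None
    return prefixes[prefix] + reference


PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "xyz": "http://example.org/xyz#",  # hypothetical vocabulary
}

print(expand_term("dc:creator", PREFIXES))   # http://purl.org/dc/terms/creator
print(expand_term("xyz:creator", PREFIXES))  # http://example.org/xyz#creator
print(expand_term("author", PREFIXES))       # None
```

Two terms that look alike to a reader become two distinct addresses, which is what removes the ambiguity.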
There are plenty of other vocabularies to choose from, and I’ll list a few more in the next article in this series. Of course, there is nothing to stop you from inventing your own for use within your company, organization, or interest group. But note one thing that often surprises people: there is no central organization to police your work. There are best practices to follow, however, and with power comes responsibility, so try to find out as much as you can about the process before you start work on a new vocabulary.
Before we return to our example, I should add one last point about vocabularies. You will no doubt be wondering why @rel="license" didn’t get the same treatment as @property="author", and require a prefix. The answer is that HTML already has some built-in values used with @rel (such as “next” and “prev”), and RDFa adds a few more. One of those added by RDFa is “license.”
But once you want to go outside of this list of values (for example, to use a term from the Dublin Core vocabulary such as “replaces” or a term from FOAF such as “knows”), you must use the prefix mapping technique in exactly the same way as we have for @property.
For example, say our article not only has a CC license as we saw before, but it also replaces some other document—a relationship we can express using Dublin Core’s “replaces” term. We express these two relationships like this:
<html xmlns:dc="http://purl.org/dc/terms/">
  <head>
    <title>RDFa: Now everyone can have an API</title>
  </head>
  <body>
    <h1>RDFa: Now everyone can have an API</h1>
    Author: <em property="dc:creator" content="Mark Birbeck">Mark Birbeck</em>
    Created: <em property="dc:created" content="2009-05-09">May 9th, 2009</em>
    License: <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC Attribution-ShareAlike</a>
    Previous version: <a rel="dc:replaces" href="rdfa.0.8.html">version 0.8</a>
  </body>
</html>
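The difference between built-in and prefixed @rel values can be sketched in a few lines of Python. This is a simplification under stated assumptions: the set of built-in values below is deliberately partial, the function name is mine, and I am assuming the built-in values resolve against the XHTML vocabulary address used by the RDFa specification.

```python
# Address that built-in @rel values are assumed to resolve against.
XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"

# A partial list for illustration; the specification defines more.
BUILT_IN_REL_VALUES = {"license", "next", "prev"}


def resolve_rel(value, prefixes):
    """Return the full address for a @rel value, or None if it is unknown."""
    if ":" in value:
        prefix, reference = value.split(":", 1)
        base = prefixes.get(prefix)
        return base + reference if base is not None else None
    if value.lower() in BUILT_IN_REL_VALUES:
        return XHTML_VOCAB + value.lower()
    return None  # unprefixed and not built in, so it carries no meaning here


PREFIXES = {"dc": "http://purl.org/dc/terms/"}

print(resolve_rel("license", PREFIXES))
print(resolve_rel("dc:replaces", PREFIXES))
print(resolve_rel("replaces", PREFIXES))
```

Note that an unprefixed “replaces” resolves to nothing: outside the built-in list, a term without a prefix has no vocabulary to belong to.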
Now that we understand vocabularies, let’s get back to our main example.
Using inline text to set the value of a property
In the previous example, the duplication of the text “Mark Birbeck” in both the @content attribute and the inline text may have jarred you. If it did, you’re certainly getting into the swing of RDFa. We can indeed remove the @content value if the inline text holds the value that we want to use for metadata:
Author: <em property="dc:creator">Mark Birbeck</em>
If no @content attribute is present, then the value of a property will be set using the element’s inline text.
Although the @content technique is derived from HTML’s meta element, think of the preceding example as the “default” way to set a property. Providing a @content value can be a way to override the inline value, if it doesn’t quite say what you want. It also allows authors more leeway with the text that the user reads, since they can be more precise within the embedded data. The publication date illustrates this; all of the data in the following examples have the same meaning, yet give very different presentations to the reader:
<span property="dc:created" content="2009-05-14">May 14th, 2009</span>
<span property="dc:created" content="2009-05-14">May 14th</span>
<span property="dc:created" content="2009-05-14">14th May</span>
<span property="dc:created" content="2009-05-14">14/05/09</span>
<span property="dc:created" content="2009-05-14">tomorrow</span>
<span property="dc:created" content="2009-05-14">yesterday</span>
<span property="dc:created" content="2009-05-14">14 Mai, 2009</span>
<span property="dc:created" content="2009-05-14">14 maggio, 2009</span>
If a @content attribute is present, it overrides the element’s inline text as the value of the property.
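The two rules about @content can be sketched together in Python. This is an illustration rather than a conforming RDFa processor: the class name is my own, and the sketch assumes that an element carrying @property does not contain another element with the same tag name.

```python
from html.parser import HTMLParser


class PropertyParser(HTMLParser):
    """Collect (property, value) pairs; @content wins over inline text."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        # Elements whose value must come from their inline text:
        # each entry is (tag, property name, list of text fragments).
        self._open = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if not prop:
            return
        if attrs.get("content") is not None:
            # Rule 2: @content overrides whatever the reader sees.
            self.pairs.append((prop, attrs["content"]))
        else:
            # Rule 1: no @content, so wait for the inline text.
            self._open.append((tag, prop, []))

    def handle_data(self, data):
        for _tag, _prop, fragments in self._open:
            fragments.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes no nested element shares this tag name.
        if self._open and self._open[-1][0] == tag:
            _tag, prop, fragments = self._open.pop()
            self.pairs.append((prop, "".join(fragments).strip()))


parser = PropertyParser()
parser.feed(
    'Author: <em property="dc:creator">Mark Birbeck</em> '
    'Created: <span property="dc:created" content="2009-05-14">yesterday</span>'
)
print(parser.pairs)
```

The reader sees “yesterday,” but the extracted value for dc:created is 2009-05-14, while the author's name is taken straight from the visible text.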
In the next issue of ALA, we’ll learn how to add properties to an image—and how to add metadata to any item.