A List Apart

Issue № 147

Using XML

Published in HTML

During my second lecture to an XML class at a local community college, I explained how XML lets you define your own markup language with custom tags and attributes. I had finished defining a simple markup language for use with a list of amateur sports clubs, and had displayed a sample document written with that markup. At that point, one student asked:

“Isn’t it inefficient to have to type all those tags for every club? What good is this? It looks nice, but what can I do with this document? How can I put this in a web page or use it with other programs? Wouldn’t it be easier to just use HTML or a database/word processor/fill-in-the-blank?”

The reason that we use XML instead of a specific application is that XML is not just a pretty face, living in isolation from the rest of the computing world.  XML is more than a rulebook for generating custom markup languages. It is part of a family of technologies, which, working together, make your XML-based documents very useful indeed.  To demonstrate what I mean, I decided to create a new XML-based markup language from scratch, and show what you can do with a document written in that language, using off-the-shelf tools.

Creating a New Markup Language

The language that I created stores the nutritional information that you find on food labels in the United States. The document starts with a <nutrition> tag, followed by a <daily-values> element that gives the maximum amounts of fat, sodium, etc. for a 2000-calorie-a-day diet, and the units in which the amount is measured.

The daily values are followed by a series of <food> elements, each of which gives information about a specific food and its nutritional categories. Because the <daily-values> element has already defined the units in which each category is measured, we don’t need to repeat them for every food; we just enter the numbers for that particular food’s total fat, sodium, etc.  After the last food, we close the document with a closing </nutrition> tag.

<nutrition>

<!-- Establish the daily values -->
<daily-values>
<total-fat units="g"> 65 </total-fat>
<saturated-fat units="g"> 20 </saturated-fat>
<cholesterol units="mg"> 300 </cholesterol>
<sodium units="mg"> 2400 </sodium>
<carb units="g"> 300 </carb>
<fiber units="g"> 25 </fiber>
<protein units="g"> 50 </protein>
</daily-values>

<!-- Now list the individual foods -->
<food>
<name>Avocado Dip</name>
<mfr>Sunnydale</mfr>
<serving units="g"> 29 </serving>
<calories total="110" fat="100"/>
<total-fat> 11 </total-fat>
<saturated-fat> 3 </saturated-fat>
<cholesterol> 5 </cholesterol>
<sodium> 210 </sodium>
<carb> 2 </carb>
<fiber> 0 </fiber>
<protein> 1 </protein>
<vitamins>
    <a> 0 </a>
    <c> 0 </c>
</vitamins>
<minerals>
    <ca> 0 </ca>
    <fe> 0 </fe>
</minerals>
</food>

<!-- etc. -->
</nutrition>

You may see the entire document that is used for the examples in this article. All the numbers are real; only the manufacturers’ names have been changed to protect the innocent and avoid lawsuits.

A quick note: vitamins and minerals are measured in percentages, not grams or milligrams. That’s why we don’t need to establish any units or maximums for them in the <daily-values> element.

I entered the data by hand using the nedit program on Linux.  I could have used any editor that lets me save files as plain ASCII text; notepad on Windows or vi on Linux would have done equally well. To make data entry easier, I created an empty “template” for a food, which you see at the bottom of the file. I copied and pasted it for each new food, so that I didn’t have to type the tags over and over again.

Immediate Benefits

What have we bought by creating this XML file in a text editor rather than creating an HTML document, a spreadsheet, or a database? First, the data is structured; it’s not just a mass of numbers in an HTML table or a text file of tab–separated values. Because of the custom tags, it’s something that humans can read and understand. It’s also open; we don’t need some expensive, proprietary software to extract the information from a binary file. So, as a transport medium, XML already serves us nicely.

Validating the Document

Even if you’re the only person who ever enters data into the document, you’d like to be able to check that you haven’t left out any information or added extra tags. Additionally, you’d like to be sure that your percentages are all between 0 and 100.

This becomes even more important if many people enter data. Even if you give other folks instructions on the proper format, they may ignore it or make errors. In short, you would like to have the computer help you determine that the data in your documents is valid.

You do this by creating a machine-readable grammar which specifies which tags and attributes are valid, and in what combinations, and what values your tags and attributes may contain. You then hand your document and the grammar to a program called a validator, and it checks that the document matches your specifications.

One machine-readable form of specifying such a grammar is a notation called Relax NG.  Relax NG is, itself, an XML-based markup language. Its purpose is to specify what is valid in other markup languages. This isn’t as crazy or impossible as it sounds.  After all, books that tell you how to use English grammar correctly are also written in English.

For example, one of the specifications of our nutritional markup language is that the <calories> element is an empty element, and it has two attributes, the total attribute and the fat attribute. These must both have decimal numbers in them.  We say this in Relax NG as follows:

<element name="calories">
    <empty/>
    <attribute name="total">
        <data type="decimal"/>
    </attribute>
    <attribute name="fat">
        <data type="decimal"/>
    </attribute>
</element>

When we pass nutrition documents through the validator along with this grammar, the validator will tell us that the first tag below is correct, but the second one isn’t:

<calories total="100" fat="10"/>
<calories total="217" fat="don't ask!"/>
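
The same notation can express the range check on percentages mentioned earlier. What follows is only a sketch, not an excerpt from the grammar used in this article: it assumes that the vitamin A element holds a plain number and that the XML Schema datatype library is the one in use.

```xml
<!-- Sketch: vitamin A must be a number from 0 to 100 inclusive. -->
<element name="a"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
    <data type="decimal">
        <param name="minInclusive">0</param>
        <param name="maxInclusive">100</param>
    </data>
</element>
```

With a pattern like this in place, a validator should accept <a> 0 </a> and reject <a> 250 </a>.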

You may see the entire grammar specification for the nutrition markup here. You may also find out more about Relax NG. By the way, Relax NG is not the only game in town if you want to specify grammar. You may use something called a DTD (Document Type Definition), which is not as powerful as Relax NG; or you may use XML Schema, which is about as powerful as Relax NG, but far more complex to learn.

Try it!

If you are feeling adventurous, you may want to try these files yourself.  You will need some XML tools in order to do this.  Here is how to set up the tools for Windows, and here’s the setup for Linux.

To validate a file, go to the command prompt if you are using Windows, or go to a console window and get a shell prompt if you are using Linux. Then use the batch/shell file described in the setup instructions to invoke the Multi-Schema Validator:

msvalidate nutrition.rng nutrition.xml

Now What?

Although we can enter readable data and check to see if it’s OK, we still can’t do anything with it. If we display it in a browser, we just see the text all squeezed together. That’s because the browser doesn’t know how to display a <food> or <vitamins> tag.

Displaying the XML

If you are using the very latest browsers, you can attach a stylesheet to the XML file. We have done that in this example by putting these lines at the top of the file nutrition.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" 
 href="nutrition.css"?>
<nutrition></nutrition>

The style sheet that we write for file nutrition.css looks very much like the style sheets that you use with your HTML files. The difference is that we assign styles to our new nutrition tags, not to the standard HTML tags. For example, to say that a food’s manufacturer should appear in 16 point italic type without starting a new line, you would write:

mfr {
    display: inline;
    font-size: 16pt;
    font-style: italic;
 }

Once you have created the entire stylesheet in the same directory as the XML file, you can open the XML file in a modern browser such as Mozilla, and it will display the information.

Transformation—A Better Way

The problems with the stylesheet are that:

  • It only works with the very latest browsers that handle Cascading Style Sheets Level 2.
  • It can’t extract all the information (for example, the units don’t show up in the output document because they are “hidden” in the attribute values).
  • It can’t calculate percentages.

Additionally, the markup we’ve invented here is data-oriented; it is designed to describe data to be stored or to be transmitted to other programs. In these documents, the order of elements and the type of data in each element is fairly rigid. Stylesheets work better with narrative-oriented markup documents. These are documents which are generally meant for human reading, and are more “free-form” than data-oriented documents. Examples of narrative-oriented markup are XHTML, DocBook (a markup for writing books and articles), and NewsML (for writing news reports).

In order to get around these problems, we can use XSLT, Extensible Stylesheet Language Transformations, to convert the nutrition file into other forms. XSLT is, again, another XML-based markup language. Its purpose is to describe how to take input from one XML file (the “source document”) and output it to a result document.  XSLT has the flexibility to extract data from attributes as well as element content, and it can do calculation and sorting upon the data in the source document.

This power makes XSLT a key technology in the XML family of technologies. For a good introduction, read Norman Walsh’s excellent presentation on the subject or this hands-on tutorial.

Transformation to HTML

The first XSLT file, which you may see here, converts the nutrition document into a very plain HTML file suitable for display on any browser on a desktop or PDA. To do the transformation, you’d type this command:

transform nutrition.xml nutrition_plain.xslt nutrition_plain.html

The result of the transformation is an HTML file named nutrition_plain.html, which you may open in any browser you like. Even this simple transformation has done two things that we could not do with CSS: it uses the information in attributes to display the units for each nutritional category, and it calculates percentages of the daily values.
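
To give a flavor of how those two things are done, here is a small stand-alone stylesheet. It is a sketch written for this explanation, not the nutrition_plain.xslt file itself, and it handles only total fat. It assumes the element names from the sample document shown earlier.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="/nutrition">
    <html>
      <body>
        <xsl:apply-templates select="food"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="food">
    <!-- The daily value for total fat, defined once at the top of the document -->
    <xsl:variable name="dv" select="/nutrition/daily-values/total-fat"/>
    <h2><xsl:value-of select="name"/></h2>
    <p>
      <xsl:text>Total fat: </xsl:text>
      <xsl:value-of select="normalize-space(total-fat)"/>
      <!-- The units come from an attribute, which CSS could not reach -->
      <xsl:value-of select="$dv/@units"/>
      <xsl:text> (</xsl:text>
      <!-- Percentage of the daily value, rounded to a whole number -->
      <xsl:value-of select="round(100 * total-fat div $dv)"/>
      <xsl:text>% of daily value)</xsl:text>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

For the Avocado Dip entry, the calculation is 100 × 11 ÷ 65, which rounds to 17 percent.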

Fancy Transformation

OK, so maybe you want something a bit fancier.  Here’s a more complex transformation which sorts the data by the ratio of fat calories to total calories per serving; sort of a “healthiness index.”

If you have saved the XSLT in a file called nutrition_fancy.xslt you can type this command:

transform nutrition.xml nutrition_fancy.xslt nutrition_fancy.html

That produces a file named nutrition_fancy.html, which looks remarkably different from the plain version. It uses Cascading Style Sheets to produce the little bar graphs; you’ll need a modern browser like Internet Explorer 5+ or Mozilla/Netscape 6 to see the effect. Notice that XSLT lets you pick and choose the data you want to display; the information about carbohydrates, fiber, vitamins, and minerals is omitted in the fancy version. (It could, of course, be added by changing the XSLT file.)
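
The sorting is the part that CSS has no answer for. A minimal sketch of sorting by the “healthiness index” might look like the following; it is an illustration only and leaves out the bar graphs and everything else in the real nutrition_fancy.xslt file.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="/nutrition">
    <ul>
      <xsl:for-each select="food">
        <!-- Foods with the lowest share of calories from fat come first -->
        <xsl:sort select="calories/@fat div calories/@total"
                  data-type="number" order="ascending"/>
        <li>
          <xsl:value-of select="name"/>
          <xsl:text>: </xsl:text>
          <xsl:value-of select="round(100 * calories/@fat div calories/@total)"/>
          <xsl:text>% of calories from fat</xsl:text>
        </li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
```

The data-type="number" attribute matters here; without it, XSLT 1.0 sorts the ratios as text.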

We have used XSLT to take the source XML file and transform it to two different HTML files: a plain version that is suitable for display on old browsers and PDAs, and a fancier version that is suitable for use with desktop computers and modern browsers.

Non-HTML Transformation

But wait, maybe you don’t want HTML; there’s more than just browsers in the world, you know. You might want to take the data and convert it to a text file of tab–separated values for import into a spreadsheet or database program.

Here is a transformation file that does this. You run it with this command:

transform nutrition.xml nutrition_csv.xslt nutrition.csv

And here’s the resulting text file.
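
The trick for non-HTML output is to tell XSLT to produce plain text and to write the separators yourself. The sketch below shows the idea with three columns chosen for illustration; the actual transformation file used in this article may be organized differently.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Emit plain text rather than XML or HTML -->
  <xsl:output method="text"/>

  <xsl:template match="/nutrition">
    <xsl:for-each select="food">
      <xsl:value-of select="normalize-space(name)"/>
      <xsl:text>&#9;</xsl:text>   <!-- tab -->
      <xsl:value-of select="calories/@total"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="normalize-space(total-fat)"/>
      <xsl:text>&#10;</xsl:text>  <!-- newline ends the row -->
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Each food becomes one line of name, total calories, and total fat, with the padding spaces around the numbers removed by normalize-space.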

Conversion to Print

Let’s say you want to create a PDF file from your XML. That’s possible by using a transformation to change the XML to another markup language: XSL-FO (Extensible Stylesheet Language - Formatting Objects). This is a page layout language.  A tool called FOP (Formatting Objects Processor) takes that markup and creates PDF files for you.

Here is a transformation file which takes the nutrition data and converts it to formatting objects.   If you save it in nutrition_fo.xslt, you can use FOP to do the conversion to PDF:

fop -xml nutrition.xml -xsl nutrition_fo.xslt -pdf nutrition.pdf

The result is a PDF file whose pages are approximately 8 centimeters wide and 9 centimeters high, a size that fits comfortably into a shirt pocket.
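
In XSL-FO, the page size is simply a pair of attributes on a page master. The fragment below is a hand-written sketch of the kind of markup a transformation might generate. The master name “pocket” and the sample text are made up for illustration, and the attribute names follow the XSL 1.0 Recommendation, which the version of FOP you have installed may or may not match exactly.

```xml
<?xml version="1.0"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <!-- A shirt-pocket page: 8cm wide by 9cm high -->
    <fo:simple-page-master master-name="pocket"
        page-width="8cm" page-height="9cm">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>

  <fo:page-sequence master-reference="pocket">
    <fo:flow flow-name="xsl-region-body">
      <fo:block font-size="8pt">Avocado Dip: 110 calories</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
```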

Generating Graphics

Finally, you may wish to create an interactive, graphic version of the data. Another XML-based markup, SVG (Scalable Vector Graphics), gives you this capability. SVG has elements like the following, which draw a black diagonal line and a yellow circle with a green outline:

<line x1="0" y1="0" x2="50" y2="50" stroke="black" />
<circle cx="100" cy="100" r="30" fill="yellow" stroke="green" />

By using a transformation file that produces SVG, we can construct a graphic that shows a bar graph for the food whose name you click. Here’s what you type:

transform nutrition.xml nutrition_svg.xslt nutrition.svg
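
What comes out of such a transformation is ordinary SVG. As a rough idea of what one bar could look like, here is a hand-written sketch; the sizes, colors, and layout are assumptions, and it has none of the scripting that makes the real file interactive.

```xml
<?xml version="1.0"?>
<svg xmlns="http://www.w3.org/2000/svg" width="220" height="60">
  <text x="10" y="15" font-size="12">Total fat: 17% of daily value</text>
  <!-- Outline 200 units wide, standing for 100% of the daily value -->
  <rect x="10" y="25" width="200" height="20" fill="none" stroke="black"/>
  <!-- Filled bar: 17% of 200 units is 34 units -->
  <rect x="10" y="25" width="34" height="20" fill="green"/>
</svg>
```

An XSLT template would compute the width of the filled bar from the food’s value and the daily value instead of hard-coding it.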

You may display the result with the SVG browser that is part of the Batik toolkit. If you have installed Batik as per the instructions given for Linux or for Windows, you type batik nutrition.svg. I have not tested the file with the latest version of the Adobe SVG Viewer, but it should work nicely.

[Screenshot: bar chart showing categories for a given food]

Other Ways to Use the XML Tools

In this article, we’ve used the Multi-Schema Validator, Xalan Transformer, FOP converter, and Batik viewer from the command prompt. That’s the fastest and easiest way to get things working so that you can get a feel for what XML can do.

The batch or shell file approach would work in a production environment where you generate a whole website’s worth of HTML files from one or more XML files at regular time intervals. You just set up a batch job to run at scheduled times (a cron job in Unix terms) to generate the files you need.

What if you need to generate HTML pages or PDF files dynamically in response to user requests? Obviously, you don’t want the overhead of starting a Java process every time a request comes in, and a static batch file certainly won’t do the trick. Both the Multi-Schema Validator and Xalan have an API (Application Programming Interface) and can thus become part of a Java servlet running on your server and handling dynamic user requests. Once a servlet is loaded, it stays in memory, so there is no extra overhead for subsequent uses of a transformation.

If you are interested in running servlets, one option is to use the Jakarta Tomcat servlet container. It can run as a stand-alone server for testing or as a module for either Apache or Microsoft IIS.

Timing

There are two aspects to timing: how long it takes to write the grammars and transformations, and how fast they run.

Designing the markup language took me about 25 minutes, and entering the data took me another 25 minutes, some of it running out to the kitchen to grab items from the shelf or refrigerator. Writing and testing the Relax NG grammar required 30 minutes.

The Cascading Style Sheet for displaying the XML directly in Mozilla took all of 15 minutes to write. The “plain HTML” transformation took about 50 minutes, including time for looking up some XSLT constructs and doing some experimentation. The “fancy” transformation took 45 minutes. I needed 20 minutes to figure out how to do the bar graphs with stylesheets in the first place, and I used another 5 minutes for minor aesthetic adjustments. The file for conversion to tab–separated values was a fifteen-minute job.

The transformation for PDF took an hour. The first time through, I designed it for paper the size of a compact disc insert. I thought better of it, and decided to reduce it to shirt-pocket size. That took another 30 to 45 minutes of tweaking and getting the font sizes just the way I wanted them. I also had to make some changes to avoid using parts of XSL Formatting Objects that FOP does not implement yet.

Finally, the SVG transformation took an hour and a half to write. About half that time was experimenting to get everything positioned nicely and making the ECMA Script interaction work properly.

You don’t have to be an expert at Relax NG, XSLT, XSL Formatting Objects, or SVG to do this. I don’t use any of these technologies on a daily basis. I just know enough about each of them to get things to work. In this case, my philosophy was “the first way you think of that works is the right way.” That is why XSLT experts will be shocked when they see an inefficient construct like this in the plain HTML transform file:

select="/nutrition/daily-values/*[name(.)=name($node)]/@units"
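
For the curious, one alternative that XSLT 1.0 offers for this kind of lookup is a key, which indexes the daily values by element name once instead of comparing names on every lookup. The stylesheet below is a sketch of that idea, not a replacement tested against the files in this article; the key name “daily-value” and the choice of two categories are assumptions made for the example.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Index each child of <daily-values> by its element name -->
  <xsl:key name="daily-value" match="daily-values/*" use="name()"/>

  <xsl:template match="/nutrition">
    <xsl:for-each select="food/total-fat | food/sodium">
      <xsl:value-of select="concat(name(), ': ', normalize-space(.))"/>
      <!-- Find the daily value with the same name and read its units -->
      <xsl:value-of select="key('daily-value', name())/@units"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

With only seven daily values in the document, the difference in speed is unlikely to be noticeable, which is part of why the simple version was good enough.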

This is not to say that there is no learning involved here; you will need to spend some time on that. You don’t need to spend a lifetime on it, though. It is definitely possible to learn enough about these technologies to put them to effective use in a short time.

Performance

I tested all of these files on a 400 MHz AMD K-6 with 128 MB of memory running SuSE Linux 7.2. For the transformations, I modified the SimpleTransform.java sample program that comes with Xalan. This program records the total time to generate the output and the time involved in transformation after the XSLT file has been parsed. If you are running transformations on a server, you can cache the parsed XSLT file, so the overhead for parsing occurs only once.

Time in seconds

Transformation          Total     Transform
Plain HTML              3.691     1.018
Fancy HTML              4.057     1.409
Tab–separated Values    3.057     0.548
SVG                     3.386     0.689

I measured the time for the PDF transformation with the Linux time command. Generating the file took 15.115 seconds real time, with 10.920 seconds of user CPU time.

Of course, these are not the only tools available. There are other XSLT processors and other programs for converting XSL Formatting Objects to PDF. I chose MSV, Xalan, FOP, and Batik because they are free and easy to use, and because I was already familiar with them.

Summary

  • Using XML-based markup gives your document structure, and makes it readable and open.

  • XML is part of a family of technologies.

  • You can use grammar markup languages like Relax NG or XML Schema to validate your documents.

  • You can use XSLT transformations to repurpose a document. A single document can serve as the source for XHTML, plain text, PDF, or other XML markup languages like SVG.

  • Programs which do validation and transformation are freely available and easy to use.

These capabilities exist right now, and they are easy to learn and utilize. That is why XML is good, and why people are so excited about it once they start to use it.

You may download the XML files and the resulting HTML, text, and PDF files.
