A List Apart


Illustration by Kevin Cornell

Web Standards for E-books

The internet did not replace television, which did not replace cinema, which did not replace books. E-books aren’t going to replace books either. E-books are books, merely with a different form.


The electronic book is the latest example of how HTML continues to win out over competing, often nonstandardized, formats. E-books aren’t websites, but E-books are distributed electronically. Now the dominant E-book format is XHTML. Web standards take on a new flavor when rendering literature on the screen, and classic assumptions about typography (or “formatting”) have to be adjusted.

HTML isn’t just for the web

It’s for any text distributed online.

Technology predictions can come back to haunt you, but this one I’m sure about: The fate of non-HTML formats has been sealed by HTML5 and the iPad. People are finally noticing what was staring them in the face all along—HTML is great for expressing words. The web is mostly about expressing words, and HTML works well for it. The same holds true for electronic books.

  • E-books are usually not “websites.” You can post your book copy as web pages, but the E-book as a logical entity is not a website.

  • ePub, the international E-book standard, is HTML (XHTML 1.1 with minor exclusions). Two other formats —certain kinds of “true” XML and DTBook— have equal status in ePub; most developers will use XHTML.

  • Every E-reader under the sun except the Amazon Kindle can display ePub electronic books. (A Kindle can show you its own variant, .AZW, of a variant of HTML [Mobipocket]; that’s two steps removed from the real thing. A Kindle can also convert HTML to displayable format, presumably AZW.)

It may be unseemly to dance on graves, but HTML wins again.

HTML doesn’t work for all documents, since it lacks important structural features. (HTML5 addresses some of those deficiencies but won’t help today’s E-books.) HTML does work for huge numbers of documents, many of which we call books. Bet against HTML for online distribution and you’ve backed the wrong horse.

Philosophical digression

Every article on electronic books must ritually address the concept of book and the relation of form to book. In this case I will acknowledge the remarks of internet pioneer Jaron Lanier, who warns in his book You Are Not a Gadget that early software decisions can dramatically constrain what later becomes possible. (Others have stated the same thing—the type designers at LettError complained a decade ago about how software tools constrain ideas.)

I am articulating an HTML-triumphalist view of E-book production. By backing what I feel is obviously the right horse, I am contributing to the strangulation of new or uninvented forms of the book. Advocacy of one digital format is always a process of eugenics; other formats will never be born or will die prematurely. I’m doing that right now by downplaying the importance of XML and DTBook variants of ePub.

I am happy to contribute to the death of “vooks” and other multimedia websites masquerading as books. (I do not want a rectangle of video yammering at me while I’m trying to read.) They’re like animated popunder ads in that no actual “user” wants them, but somebody with an agenda does. Exterminating that species is something to which I am proud to contribute. For other forms of books, advocating strict HTML markup will cause as-yet-unknowable harm.

I nonetheless maintain that typical works of fiction, and many works of nonfiction, can be expressed very well indeed in HTML E-books. To attain this degree of expression, we have to rid ourselves of print conventions that do not work in electronic media.

Another way of saying this is that books should be as bookish as possible under the circumstances. Printed books need to take advantage of everything print has to offer (resolution, tactility, portability, collectibility), while electronic books must do likewise for their own form (economy, copyability, reflow, searching and indexing, interlinking).

Two problems to be solved

If HTML is the dominant markup language for most E-books, then web standards come into play. Frankly, I don’t want to relive the late 1990s and early 2000s, in which standardistas had to come up with one slightly different way after another to convince developers to code their sites properly. You still don’t see valid HTML very often on real-world sites, but tables for layout are largely a thing of the past and semantics are hugely improved. Maybe pure web standards did not “win,” but web standards definitely haven’t lost.

It would be overconfident to assume that this success will immediately replicate itself with E-books. Publishers (there are barely any “developers” in the E-book sphere) will not automatically do the right thing, and so far they seem to be doing exactly the wrong thing.

If we want publishers’ code in E-books to be as good as standardistas’ code on actual websites, we’ve got two problems to solve.


The underlying code for typical ePub electronic books is XHTML 1.1. That means you need valid code with no errors: the ePub standard requires XML error handling, so you can’t get away with HTML 4.0–style tag soup.

Novels and many nonfiction books are semantically simple. Most can get by with a tiny range of tags:

  • P (but don’t mark up everything as a paragraph)
  • Headings (arguably H1 should be reserved for the title of the book)
  • Emphasis (perennial debates over semantics of CITE vs. EM vs. I may hereby resume)
  • Lists
  • Images (with mandatory alternate text)

Even nonexperts can be readily trained to recognize simple structures like these. But people untrained in even the simplest markup are the problem.

Production methods

For E-books to have good code, good code has to be found at every stage of the production process. That is not how things are done right now.

[Screenshot: an ellipsis (…) rendered as ?.?.?.?]

Thin spaces between dots in an ellipsis become question marks. For more examples of typographic tragicomedy in E-books, see this article’s Sidebar.

Hundreds, if not thousands, of commercially available E-books from legacy publishing houses were converted to “electronic format” by scanning printed books and turning the resulting OCR book copy into text files. (Indeed just text files, not structured markup.) Copy errors are so rampant that E-books are the first category of book in human history that could actually be returned as defective. This in turn has led to the equally rampant mythology that E-books are all about “formatting.” (They aren’t: they’re about structured text with styles attached.)

Why would publishers scan hardcopies? Aren’t all books produced on computers these days? Yes, but do publishers own those files, or do various freelance designers? Can anybody even find the files? What if they were saved in an old version of Quark Xpress or Ventura Publisher? Instead of rooting around in files resident on computers they don’t really understand anyway (these are book people), publishers find it easier to just send print books out to low bidders for scanning.

Now there’s a cottage industry selling conversion services for E-texts. One competitor in the E-book “space,” Kobo (né Shortcovers), promises conversions for “as little as $29… per title.” Another competitor, eBook Architects, converts (“to Mobipocket/Kindle first”) for about $400 in typical cases. The New York Times estimated that to “convert the text to a digital file, typeset it in digital form and copy-edit it” costs a mere 50¢.

Fees this low are unsustainable and cannot possibly lead to good markup and clean copy.

This isn’t hypothetical. We have countless examples to look at right now (see sidebar).

Race to the bottom

E-books are barely beginning to catch on, and already the most important parts of an E-book (copy and markup) are suffering from a race to the bottom.

What’s the solution? The canonical format of a book should be HTML. Authors should write in HTML, making a manuscript immediately transformable to an E-book. A manuscript could then be imported into that fossil the publishing industry refuses to leave behind, Microsoft Word. (MS Word’s Track Changes feature has become a kind of methadone for an addicted publishing industry.)

To typeset a print book from this source, translating twice (HTML → Word → InDesign) is a proven workflow with the added advantage of outputting tagged PDFs with good semantics.

Now, the foregoing is so optimistic as to be ridiculous. Authors are not going to start writing in HTML, let alone the full-on XML that Ben Hammersley has called for. Book copy will continue to be saved as MS Word, Xpress, and/or InDesign files. Though mangled and inadequate, such copy will then be “exported” for E-book “formatting.”

Instead of avoiding errors to begin with, the publishing industry may choose to fix errors after they’re made, but only if authors, especially big-name authors with ruthless literary agents, complain loudly until publishers have entire imprints’ E-books repaired. This will not result in authors writing good strong HTML for new books, but will clean up part of the mess.

Ongoing E-book experiments

There’s a lot of activity in the electronic-book “space,” from virtual think tanks like the Book Oven to crowdsourced copy-editing at Bite-Size(d) Edits, to name two sites comanaged by impresarios Hugh McGuire and Stephanie Troeth. Two other projects are working on the possibilities of standardized structured code in the E-book process.

  • ePub Zen Garden aims to do for electronic-book layout and type what CSS Zen Garden did for web design, which was a lot. The new Zen Garden could benefit from the experience of the old Zen Garden by offering more than one canonical text to style, but the concept is a proven winner. (You can help by contributing.)

  • Simon Fraser University’s Thinkubator is slowly developing a project that expands on InDesign’s ability to save a complete round-trip representation of an InDesign file as XML. Converting XML output to ePub XHTML may not be trivial, but it isn’t impossible and could be automated.

    At that point, we wouldn’t have to retrain authors to write in HTML; we’d just have to retrain desktop publishers to use structural, not presentational, style names (Heading 2, Emphasis, Blockquote) for later translation. For code-competent authors, this same production method accepts XHTML as a source file, which can then be translated to a native InDesign document or PDF without intermediary files.

Separation of content and structure has never been more important

ePub uses XHTML 1.1 as a markup language. You may also associate stylesheets — explicitly CSS2, not any other version. As such and as ever, markup must be separated from presentation.

But E-book creators come from the publishing business. They’re writers, editors, desktop publishers. They will naturally attempt to hack and deform code and text to reproduce features from print layouts that should really be governed by CSS, handled by the E-book reader, or forgotten about entirely. In some cases, you actually have to alter the text of a book to make it work as an E-book; in other cases you must not do that.

Tasks CSS must handle

  • Drop caps. It’s easy to find commercial E-books the first word of which has an error: The word is written as its first letter followed by a space and the rest of its letters. It’s an artifact of drop caps, which in desktop publishing are usually rendered as a separate letter disconnected from the rest of the word. In standards-compliant E-books, you have to forget about drop caps or use a CSS selector (:first-letter).

    The same goes for type treatments on the first words (often the first n words) of a chapter or section. Maybe the first five words use small caps or bold. There is no way to do that in CSS as yet, though you can style the entire first line of a paragraph. You might have to wrap the first n words in a SPAN with a classname (which may then carry over into Word and InDesign for later styling).

  • Small caps. Software that renders HTML (not just web browsers) has a hard time with small capitals. The CSS is easy enough to declare — font-variant: small-caps. But even if the software has access to a font with genuine designed small caps, it usually won’t use them. It will use fake small caps instead (regular capitals at a smaller point size). Fake small caps are usually too short, almost always too light, and often spaced too close together.

    E-books must use CSS to specify small caps. But what you’ll end up seeing for now is fake small caps, not real ones.

  • Columns. Despite what former Microsoft researcher Bill Hill may think, multicolumn continuous text makes no sense in a window that can resize and/or scroll. (Do you want your columns continuously redrawing themselves before your very eyes?) Columns may make sense in a screen that stays fixed and immobile. For that purpose, CSS3 columns module can be attempted, though real-world use may show its weaknesses, as with positioning illustrations, column-spanning headings, and callouts.

  • Indents. One of the simplest (also least followed) conventions of book typography, indenting the first line of a paragraph that follows another paragraph but nothing else, has never been simpler to set up than in CSS: p+p { text-indent: number }.

    Blank lines between paragraphs are a Microsoft Word artifact that is also widely used in onscreen text. In book typesetting, they’re a mistake (but don’t tell that to O’Reilly, the computer-book publisher that loves this “format”). If you really want a blank line between paragraphs, add a margin-bottom to P. Source copy should not be polluted with extraneous carriage-return characters, which are difficult to suppress.
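
The CSS tasks in this list can be gathered into one small stylesheet. What follows is a minimal sketch under stated assumptions, not a tested ePub stylesheet: the class names chapter-opening and opening-words are hypothetical, the sizes are placeholder values, and how faithfully any given E-reader honors these rules is not guaranteed.

```css
/* Baseline: no blank lines between paragraphs, no indent by default. */
p {
  margin: 0;
  text-indent: 0;
}

/* Indent only a paragraph that directly follows another paragraph.
   The 1.5em value is a placeholder. */
p + p {
  text-indent: 1.5em;
}

/* Drop cap: style the first letter instead of splitting the word
   in the source text. "chapter-opening" is a hypothetical class. */
p.chapter-opening:first-letter {
  float: left;
  font-size: 3em;       /* placeholder; tune per typeface */
  line-height: 0.8;
  padding-right: 0.05em;
}

/* First n words of a chapter or section: CSS cannot count words,
   so the markup has to wrap them in a span, for example
   <p class="chapter-opening"><span class="opening-words">The
   first five words here</span> and the rest of the paragraph.</p>
   "opening-words" is a hypothetical class. */
span.opening-words {
  font-variant: small-caps;  /* most engines will fake these */
}

/* Optional alternative to indents, if blank lines between
   paragraphs are really wanted: uncomment and drop the p + p rule.
p {
  margin-bottom: 1em;
}
*/
```

Because the indent rule uses the adjacent-sibling selector, a paragraph that follows a heading, list, or blockquote stays flush left without any extra markup.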

Tasks the reader software must handle

  • H&J. Everyone complains about full-justified text in E-readers (text with straight left and right margins). It’s harder to read because letterspacing and wordspacing are worse, causing rivers of whitespace. The reason? E-readers tend not to hyphenate words. Hyphenation is complex and still has not been perfected even for languages where there’s a strong market incentive to do so, like English.

    To use the industry jargon, this issue is all about H&J (hyphenation and justification). Authors need to resist the temptation to add soft-hyphen characters to E-texts. Hyphenation is purely a display convention. Hyphenation changes when the layout changes (like switching from tall view to wide view).

    E-book hyphenation should be carried out by computer algorithms and dictionaries. In print publishing, informed human proofreaders can override a system’s H&J decisions, but when you’re reading an E-book you don’t have one of those informed proofreaders seated alongside you. E-reader software has to implement hyphenation; nobody else should touch it.

  • Ligatures. One of the very first things anyone with an interest in typography learns about is the use of ligatures — usually f followed by f, i, or l. Joining the letters together into ligatures avoids unpleasant collisions, like the top of an f hitting the dot of an i.

    As with hyphenation, ligatures are purely a display artifact. Your rendering engine needs to put them in. Do not pollute your source text with ligature characters. (What if I want to capitalize large blocks of text? What if I want to search the text, or look up a word containing a ligature character in a dictionary? Of course you could program very intelligent software to overcome the problem. It’s easier to avoid the problem.) Rarer ligatures, like ct and st, are also an issue for display engines, not underlying text.

    When you need to actively prevent ligature use, as in a URL that includes the letters fi or fl, there seems to be no way around adding a zero-width nonjoiner character between the letters. (There is no CSS declaration to turn ligatures on and off, though a CSS3 proposal would let you do that.)

  • Hanging (or hung) punctuation. Typesetting some punctuation marks, like quotation marks and dashes, slightly outside the margin makes printed text look better and may also make onscreen text look better. This too is up to the display engine, not the text or its author.
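
On the ligature item above: a later specification, CSS Fonts Module Level 3, defines a font-variant-ligatures property. The sketch below assumes a rendering engine that supports that property, which ePub’s CSS2 profile does not promise, and uses a hypothetical class name, url.

```css
/* Suppress ligatures inside URLs so that "fi" and "fl" stay as
   separate letters. "url" is a hypothetical class applied in the
   markup, e.g. <span class="url">example.com/files/flash</span>. */
.url {
  font-variant-ligatures: none;
}
```

An engine that does not recognize the property simply ignores the rule, so the zero-width nonjoiner remains the fallback there.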

Alterations to book text

Pure separation of structural markup and presentation will be impossible to achieve in books more often than on websites. Common book-typography features can be adequately expressed in E-books only by the sacrilege of altering the source manuscript.

  • Dashes. As commonly used in print books, em dash (—) with no spaces on either side does not work in onscreen text. Rendering engines may be too dumb to break a line before or after the em dash. Of course that may be solved someday. But in any event the character fails at its intended function: to break up text, as for appositives and parenthetical statements. En dash (–) surrounded by spaces avoids linebreak problems and works better at the intended purpose. (Stated concisely: Nospace-emdash-nospace doesn’t work; space-endash-space does.)

  • Space characters. You absolutely can use space characters wider and narrower than a standard word space. Em, en, and thin spaces are all defined in Unicode, along with many others, and display support is quite good and improving. A standard word space or a nonbreaking word space is the wrong character in many constructions, as between nested levels of quotation marks or apostrophe adjacent to quotation mark:

    • “I’ve Got Chills. They’re Multiplyin’ ”
      (apostrophe; thin space; end double quote)
    • “Technical is something techies do. ‘I’m a creative — I don’t touch that!’ ”
      (end single quote; thin space; end double quote)
    • It’s a nod to the “ ’80s New Wave” sound of the Cars and Blondie
      (open double quote; thin space; apostrophe)
  • Superiors, inferiors, fractions. In theory any character can be typeset as a superscript or subscript, usually changing the meaning (πr² and πr2 are two different things). Fonts often come equipped with pre-designed superior and inferior characters, typically digits (⁰¹²³⁴⁵⁶⁷⁸⁹) and letters used in ordinals (13th, 13e) and salutations (Mlle, Sra.). Fonts often have more superscripts and subscripts than are defined in Unicode, but where a Unicode superior or inferior exists, use it instead of SUB or SUP markup.

    Math is a separate discussion. (It always is.) Nonetheless, don’t try to fake out fractions as though you were using a typewriter. The small number of Unicode characters for vulgar fractions should be used in all cases. There is no reliable method in HTML and CSS to construct fractions from superiors and inferiors and fraction slash, nor a method to create stacked fractions.

  • Sections. HTML’s single biggest deficiency for long documents is its lack of sections. They exist in HTML5, but ePub doesn’t use HTML5. Sections in nonfiction books may sometimes be differentiable through the use of headings, but the classic book-design paradigm of leaving extra space between sections (with different type on initial words of the new section) simply can’t be marked up in HTML. (In uncommon cases, section breaks like these occur right at the bottom of a printed page and have to be inferred.)

    There is another tradition in book composition that can be adapted — typesetting a fleuron or dash between sections. It’s functionally equivalent to the use of HR, which can, with difficulty, be styled to be less intrusive. Nonetheless, you are still merely suggesting that sections have changed; what you are not doing is definitively encapsulating sections in their own markup.

  • Footnotes and endnotes. HTML continues to lack structures for these, for sidenotes, and for callouts like pullquotes. Footnotes have to become endnotes, which is troublesome at best for an E-book that already includes endnotes. It’s a serious deficiency.
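
For the section breaks discussed above, HR can be toned down to something closer to a printed ornament. This is a minimal sketch with placeholder values, and it assumes the E-reader applies author styles to HR at all, which not every device may do.

```css
/* Section break: restyle HR as a short, light, centered rule
   instead of a full-width browser-default line. */
hr {
  border: 0;
  border-top: 1px solid #999;  /* placeholder color and weight */
  height: 0;
  width: 20%;                  /* placeholder width */
  margin: 1.5em auto;          /* auto side margins center the rule */
}
```

This only changes how the break looks. As noted above, it still merely suggests that a section has changed and does not encapsulate the section in markup.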

Special note about tables

Over and over again, tables are held up as something E-books pretty much cannot do. I read this as an admission that people doing E-book “conversion” don’t understand table markup. Horrendously complex tables can be marked up in HTML. (What they might really be complaining about is how much width a table takes up — perhaps more than a certain E-reader display natively has.)


Experimenting with the form of the book is one thing, but E-book structure is not something we should make up as we go along. We shouldn’t pretend there aren’t any rules, nor should we import print-book concepts that do not work in onscreen books. The dominant E-book format of the future, ePub, can benefit from our nearly ten years’ experience building standards-compliant websites.
