Web Standards for E-books

by Joe Clark, March 09, 2010

Published in HTML, Industry, Layout & Grids, Mobile/Multidevice, Typography & Web Fonts, Usability

The internet did not replace television, which did not replace cinema, which did not replace books. E-books aren’t going to replace books either. E-books are books, merely with a different form.


The electronic book is the latest example of how HTML continues to win out over competing, often nonstandardized, formats. E-books aren’t websites, but E-books are distributed electronically. Now the dominant E-book format is XHTML. Web standards take on a new flavor when rendering literature on the screen, and classic assumptions about typography (or “formatting”) have to be adjusted.

HTML isn’t just for the web

It’s for any text distributed online.

Technology predictions can come back to haunt you, but this one I’m sure about: The fate of non-HTML formats has been sealed by HTML5 and the iPad. People are finally noticing what was staring them in the face all along—HTML is great for expressing words. The web is mostly about expressing words, and HTML works well for it. The same holds true for electronic books.

- E-books are usually not “websites.” You can post your book copy as web pages, but the E-book as a logical entity is not a website.
- ePub, the international E-book standard, is HTML (XHTML 1.1 with minor exclusions). Two other formats, certain kinds of “true” XML and DTBook, have equal status in ePub; most developers will use XHTML.
- Every E-reader under the sun except the Amazon Kindle can display ePub electronic books. (A Kindle can show you its own variant, .AZW, of a variant of HTML [Mobipocket]; that’s two steps removed from the real thing. A Kindle can also convert HTML to displayable format, presumably AZW.)

It may be unseemly to dance on graves, but HTML wins again.

HTML doesn’t work for all documents, since it lacks important structural features. (HTML5 addresses some of those deficiencies but won’t help today’s E-books.) HTML does work for huge numbers of documents, many of which we call books. Bet against HTML for online distribution and you’ve backed the wrong horse.

Philosophical digression

Every article on electronic books must ritually address the concept of book and the relation of form to book. In this case I will acknowledge the remarks of internet pioneer Jaron Lanier, who warns in his book You Are Not a Gadget that early software decisions can dramatically constrain what later becomes possible. (Others have stated the same thing—the type designers at LettError complained a decade ago about how software tools constrain ideas.)

I am articulating an HTML-triumphalist view of E-book production. By backing what I feel is obviously the right horse, I am contributing to the strangulation of new or uninvented forms of the book. Advocacy of one digital format is always a process of eugenics; other formats will never be born or will die prematurely. I’m doing that right now by downplaying the importance of XML and DTBook variants of ePub.

I am happy to contribute to the death of “vooks” and other multimedia websites masquerading as books. (I do not want a rectangle of video yammering at me while I’m trying to read.) They’re like animated popunder ads in that no actual “user” wants them, but somebody with an agenda does. Exterminating that species is something to which I am proud to contribute. For other forms of books, advocating strict HTML markup will cause as-yet-unknowable harm.

I nonetheless maintain that typical works of fiction, and many works of nonfiction, can be expressed very well indeed in HTML E-books. To attain this degree of expression, we have to rid ourselves of print conventions that do not work in electronic media.

Another way of saying this is that books should be as bookish as possible under the circumstances. Printed books need to take advantage of everything print has to offer (resolution, tactility, portability, collectibility), while electronic books must do likewise for their own form (economy, copyability, reflow, searching and indexing, interlinking).

Two problems to be solved

If HTML is the dominant markup language for most E-books, then web standards come into play. Frankly, I don’t want to relive the late 1990s and early 2000s, in which standardistas had to come up with one slightly different way after another to convince developers to code their sites properly. You still don’t see valid HTML very often on real-world sites, but tables for layout are largely a thing of the past and semantics are hugely improved. Maybe pure web standards did not “win,” but whatever is the opposite of web standards definitely lost.

It would be overconfident to assume that this success will immediately replicate itself with E-books. Publishers (there are barely any “developers” in the E-book sphere) will not automatically do the right thing, and so far they seem to be doing exactly the wrong thing.

If we want publishers’ code in E-books to be as good as standardistas’ code on actual websites, we’ve got two problems to solve.

Semantics

The underlying code for typical ePub electronic books is XHTML 1.1. That means you need valid code with no errors: the ePub standard requires XML error handling, so you can’t get away with HTML 4.0–style tag soup.
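
As a rough illustration of what XML error handling means in practice, the sketch below uses the parser in Python's standard library (xml.etree.ElementTree) to tell parsable markup from tag soup. The function name is_well_formed and both sample strings are illustrative assumptions. The check covers well-formedness only, not validity against the XHTML 1.1 DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False on any parse error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Tag soup an HTML 4 browser would tolerate: unclosed <p>, unquoted attribute.
tag_soup = "<div><p class=intro>First paragraph<p>Second paragraph</div>"

# The same content with every element closed and every attribute quoted.
clean = '<div><p class="intro">First paragraph</p><p>Second paragraph</p></div>'

print(is_well_formed(tag_soup), is_well_formed(clean))
```

An XML parser stops at the first error instead of guessing, which is why markup that "looks fine in a browser" can still be rejected by an ePub reading system.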

Novels and many nonfiction books are semantically simple. Most can get by with a tiny range of tags:

- P (but don’t mark up everything as a paragraph)
- Headings (arguably H1 should be reserved for the title of the book)
- Emphasis (perennial debates over semantics of CITE vs. EM vs. I may hereby resume)
- Lists
- BLOCKQUOTE
- Images (with mandatory alternate text)

Even nonexperts can be readily trained to recognize simple structures like these. But people untrained in even the simplest markup are the problem.

Production methods

For E-books to have good code, good code has to be found at every stage of the production process. That is not how things are done right now.

Thin spaces between dots in an ellipsis become question marks. For more examples of typographic tragicomedy in E-books, see this article’s Sidebar.

Hundreds, if not thousands, of commercially available E-books from legacy publishing houses were converted to “electronic format” by scanning printed books and turning the resulting OCR book copy into text files. (Indeed just text files, not structured markup.) Copy errors are so rampant that E-books are the first category of book in human history that could actually be returned as defective. This in turn has led to the equally rampant mythology that E-books are all about “formatting.” (They aren’t: they’re about structured text with styles attached.)

Why would publishers scan hardcopies? Aren’t all books produced on computers these days? Yes, but do publishers own those files, or do various freelance designers? Can anybody even find the files? What if they were saved in an old version of Quark Xpress or Ventura Publisher? Instead of rooting around in files resident on computers they don’t really understand anyway (these are book people), publishers find it easier to just send print books out to low bidders for scanning.

Now there’s a cottage industry selling conversion services for E-texts. One competitor in the E-book “space,” Kobo (né Shortcovers), promises conversions for “as little as $29… per title.” Another competitor, eBook Architects, converts (“to Mobipocket/Kindle first”) for about $400 in typical cases. The New York Times estimated that to “convert the text to a digital file, typeset it in digital form and copy-edit it” costs a mere 50¢.

Fees this low are unsustainable and cannot possibly lead to good markup and clean copy.

This isn’t hypothetical. We have countless examples to look at right now (see sidebar).

Race to the bottom

E-books are barely beginning to catch on, and already the most important parts of an E-book (copy and markup) are suffering from a race to the bottom.

What’s the solution? The canonical format of a book should be HTML. Authors should write in HTML, making a manuscript immediately transformable to an E-book. A manuscript could then be imported into that fossil the publishing industry refuses to leave behind, Microsoft Word. (MS Word’s Track Changes feature has become a kind of methadone for an addicted publishing industry.)

To typeset a print book from this source, translating twice (HTML → Word → InDesign) is a proven workflow with the added advantage of outputting tagged PDFs with good semantics.

Now, the foregoing is so optimistic as to be ridiculous. Authors are not going to start writing in HTML, let alone the full-on XML that Ben Hammersley has called for. Book copy will continue to be saved as MS Word, Xpress, and/or InDesign files. Though mangled and inadequate, such copy will then be “exported” for E-book “formatting.”

Instead of avoiding errors to begin with, the publishing industry may choose to fix errors after they’re made—but only if authors, especially big-name authors with ruthless literary agents, complain loudly until publishers have entire imprints’ E-books repaired. This will not result in authors writing good strong HTML for new books, but will clean up part of the mess.

Ongoing E-book experiments

There’s a lot of activity in the electronic-book “space,” from virtual think tanks like the Book Oven to crowdsourced copy-editing at Bite-Size(d) Edits, to name two sites comanaged by impresarios Hugh McGuire and Stephanie Troeth. Two other projects are working on the possibilities of standardized structured code in the E-book process.

- ePub Zen Garden aims to do for electronic-book layout and type what CSS Zen Garden did for web design, which was a lot. The new Zen Garden could benefit from the experience of the old Zen Garden by offering more than one canonical text to style, but the concept is a proven winner. (You can help by contributing.)
- Simon Fraser University’s Thinkubator is slowly developing a project that expands on InDesign’s ability to save a complete round-trip representation of an InDesign file as XML. Converting XML output to ePub XHTML may not be trivial, but it isn’t impossible and could be automated.

At that point, we wouldn’t have to retrain authors to write in HTML; we’d just have to retrain desktop publishers to use structural, not presentational, style names (Heading 2, Emphasis, Blockquote) for later translation. For code-competent authors, this same production method accepts XHTML as a source file, which can then be translated to a native InDesign document or PDF without intermediary files.
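
The "later translation" of structural style names could look something like the following. This is a minimal sketch in Python and not the Thinkubator project's actual method. The style names Heading 2 and Blockquote come from the paragraph above; Heading 1, Body, the mapping table, and the function name paragraphs_to_xhtml are assumptions made for illustration. It handles paragraph-level styles only, so a character-level style like Emphasis would need separate inline handling.

```python
from xml.sax.saxutils import escape

# Hypothetical paragraph-style names mapped to XHTML element names.
STYLE_TO_TAG = {
    "Heading 1": "h1",
    "Heading 2": "h2",
    "Body": "p",
    "Blockquote": "blockquote",
}


def paragraphs_to_xhtml(paragraphs):
    """Convert (style_name, text) pairs into a list of XHTML fragments."""
    fragments = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style)
        if tag is None:
            # A presentational name such as "Bold 14pt" has no structural meaning.
            raise ValueError(f"no structural mapping for style {style!r}")
        body = escape(text)  # escape &, < and > so the output stays well-formed
        if tag == "blockquote":
            # XHTML 1.1 expects block-level content inside a blockquote.
            body = f"<p>{body}</p>"
        fragments.append(f"<{tag}>{body}</{tag}>")
    return fragments
```

The deliberate failure on unknown style names is the point of the exercise: a style that cannot be mapped to a structure has to be fixed in the source file instead of being guessed at during conversion.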

Separation of content and structure has never been more important

ePub uses XHTML 1.1 as a markup language. You may also associate stylesheets — explicitly CSS2, not any other version. As such and as ever, markup must be separated from presentation.

But E-book creators come from the publishing business. They’re writers, editors, desktop publishers. They will naturally attempt to hack and deform code and text to reproduce features from print layouts that should really be governed by CSS, handled by the E-book reader, or forgotten about entirely. In some cases, you actually have to alter the text of a book to make it work as an E-book; in other cases you must not do that.

Tasks CSS must handle

Drop caps. It’s easy to find commercial E-books the first word of which has an error: The word is written as its first letter followed by a space and the rest of its letters. It’s an artifact of drop caps, which in desktop publishing are usually rendered as a separate letter disconnected from the rest of the word. In standards-compliant E-books, you have to forget about drop caps or use a CSS selector (:first-letter).

The same goes for type treatments on the first words (often the first n words) of a chapter or section. Maybe the first five words use small caps or bold. There is no way to do that in CSS as yet, though you can style the entire first line of a paragraph. You might have to wrap the first n words in a SPAN with a classname (which may then carry over into Word and InDesign for later styling).
Small caps. Software that renders HTML (not just web browsers) has a hard time with small capitals. The CSS is easy enough to declare — font-variant: small-caps. But even if the software has access to a font with genuine designed small caps, it usually won’t use them. It will use fake small caps instead (regular capitals at a smaller point size). Fake small caps are usually too short, almost always too light, and often spaced too close together.

E-books must use CSS to specify small caps. But what you’ll end up seeing for now is fake small caps, not real ones.
Columns. Despite what former Microsoft researcher Bill Hill may think, multicolumn continuous text makes no sense in a window that can resize and/or scroll. (Do you want your columns continuously redrawing themselves before your very eyes?) Columns may make sense in a screen that stays fixed and immobile. For that purpose, CSS3 columns module can be attempted, though real-world use may show its weaknesses, as with positioning illustrations, column-spanning headings, and callouts.
Indents. One of the simplest (also least followed) conventions of book typography, indenting the first line of a paragraph that follows another paragraph but nothing else, has never been simpler to set up than in CSS: p+p { text-indent: number }.

Blank lines between paragraphs are a Microsoft Word artifact that are additionally widely used in onscreen text. In book typesetting, they’re a mistake (but don’t tell that to O’Reilly, the computer-book publisher that loves this “format”). If you really want a blank line between paragraphs, add a margin-bottom to P. Source copy should not be polluted with extraneous carriage-return characters, which are difficult to suppress.

Tasks the reader software must handle

H&J. Everyone complains about full-justified text in E-readers (text with straight left and right margins). It’s harder to read because letterspacing and wordspacing are worse, causing rivers of whitespace. The reason? E-readers tend not to hyphenate words. Hyphenation is complex and still has not been perfected even for languages where there’s a strong market incentive to do so, like English.

To use the industry jargon, this issue is all about H&J (hyphenation and justification). Authors need to resist the temptation to add soft-hyphen characters to E-texts. Hyphenation is purely a display convention. Hyphenation changes when the layout changes (like switching from tall view to wide view).

E-book hyphenation should be carried out by computer algorithms and dictionaries. In print publishing, informed human proofreaders can override a system’s H&J decisions, but when you’re reading an E-book you don’t have one of those informed proofreaders seated alongside you. E-reader software has to implement hyphenation; nobody else should touch it.
Ligatures. One of the very first things anyone with an interest in typography learns about is the use of ligatures — usually f followed by f, i, or l. Joining the letters together into ligatures avoids unpleasant collisions, like the top of an f hitting the dot of an i.

As with hyphenation, ligatures are purely a display artifact. Your rendering engine needs to put them in. Do not pollute your source text with ligature characters. (What if I want to capitalize large blocks of text? What if I want to search the text, or look up a word containing a ligature character in a dictionary? Of course you could program very intelligent software to overcome the problem. It’s easier to avoid the problem.) Rarer ligatures, like ct and st, are also an issue for display engines, not underlying text.

When you need to actively prevent ligature use, as in an URL that includes the letters fi or fl, there seems to be no way around adding a zero-width nonjoiner character between the letters. (There is no CSS declaration to turn ligatures on and off, though a CSS3 proposal would let you do that.)
Hanging (or hung) punctuation. Typesetting some punctuation marks, like quotation marks and dashes, slightly outside the margin makes printed text look better and may also make onscreen text look better. This too is up to the display engine, not the text or its author.
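
The advice in this section (no soft hyphens or ligature characters in source text, and a zero-width nonjoiner where a ligature must be blocked) can be sketched as a small cleanup pass. This is a minimal Python sketch and not any publisher's actual tooling. The function names are made up for illustration, and it covers only the Latin ligature characters U+FB00 through U+FB06.

```python
import re
import unicodedata

SOFT_HYPHEN = "\u00ad"  # discretionary hyphen someone may have typed into the text
ZWNJ = "\u200c"         # zero-width nonjoiner


def strip_display_artifacts(text: str) -> str:
    """Remove soft hyphens and expand Latin ligature characters into letters."""
    cleaned = []
    for ch in text:
        if ch == SOFT_HYPHEN:
            continue  # hyphenation is the reader software's job
        if "\ufb00" <= ch <= "\ufb06":
            # Compatibility decomposition turns the single ligature
            # code point back into its component letters.
            ch = unicodedata.normalize("NFKD", ch)
        cleaned.append(ch)
    return "".join(cleaned)


def block_f_ligatures(display_text: str) -> str:
    """Insert a zero-width nonjoiner after every f followed by f, i, or l."""
    return re.sub(r"f(?=[fil])", "f" + ZWNJ, display_text)
```

Apply block_f_ligatures only to the visible text of a URL, never to the address in the link itself, since the extra character would change where the link points.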

Alterations to book text

Pure separation of structural markup and presentation will be impossible to achieve in books more often than on websites. Common book-typography features can be adequately expressed in E-books only by the sacrilege of altering the source manuscript.

Dashes. As commonly used in print books, em dash (—) with no spaces on either side does not work in onscreen text. Rendering engines may be too dumb to break a line before or after the em dash. Of course that may be solved someday. But in any event the character fails at its intended function — to break up text, as for appositives and parenthetical statements. En dash (–) surrounded by spaces avoids linebreak problems and works better at the intended purpose. (Stated concisely: Nospace-emdash-nospace doesn’t work; space-endash-space does.)
Space characters. You absolutely can use space characters wider and narrower than a standard word space. Em, en, and thin spaces are all defined in Unicode, along with many others, and display support is quite good and improving. A standard word space or a nonbreaking word space is the wrong character in many constructions, as between nested levels of quotation marks or apostrophe adjacent to quotation mark:
- “I’ve Got Chills. They’re Multiplyin’ ”
  (apostrophe; thin space; end double quote)
- “Technical is something techies do. ‘I’m a creative — I don’t touch that!’ ”
  (end single quote; thin space; end double quote)
- It’s a nod to the “ ’80s New Wave” sound of the Cars and Blondie
  (open double quote; thin space; apostrophe)
Superiors, inferiors, fractions. In theory any character can be typeset as a superscript or subscript, usually changing the meaning (πr² and πr2 are two different things). Fonts often come equipped with pre-designed superior and inferior characters, typically digits (⁰¹²³⁴⁵⁶⁷⁸⁹) and letters used in ordinals (13th, 13e) and salutations (Mlle, Sra.). Fonts often have more superscripts and subscripts than are defined in Unicode, but where a Unicode superior or inferior exists, use it instead of SUB or SUP markup.

Math is a separate discussion. (It always is.) Nonetheless, don’t try to fake out fractions as though you were using a typewriter. The small number of Unicode characters for vulgar fractions should be used in all cases. There is no reliable method in HTML and CSS to construct fractions from superiors and inferiors and fraction slash, nor a method to create stacked fractions.

Sections. HTML’s single biggest deficiency for long documents is its lack of sections. They exist in HTML5, but ePub doesn’t use HTML5. Sections in nonfiction books may sometimes be differentiable through the use of headings, but the classic book-design paradigm of leaving extra space between sections (with different type on initial words of the new section) simply can’t be marked up in HTML. (In uncommon cases, section breaks like these occur right at the bottom of a printed page and have to be inferred.)

There is another tradition in book composition that can be adapted — typesetting a fleuron or dash between sections. It’s functionally equivalent to the use of HR, which can, with difficulty, be styled to be less intrusive. Nonetheless, you are still merely suggesting that sections have changed; what you are not doing is definitively encapsulating sections in their own markup.
Footnotes and endnotes. HTML continues to lack structures for these, for sidenotes, and for callouts like pullquotes. Footnotes have to become endnotes, which is troublesome at best for an E-book that already includes endnotes. It’s a serious deficiency.
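
Two of the manuscript alterations described in this section, spaced en dashes and Unicode vulgar fractions, are mechanical enough to script. Here is a minimal Python sketch with made-up function names. It assumes typewriter-style fractions are single digits around a slash, and it deliberately leaves anything it does not recognize (dates, other ratios) untouched.

```python
import re

EM_DASH = "\u2014"
EN_DASH = "\u2013"

# Typewriter fractions that have a precomposed Unicode vulgar-fraction character.
VULGAR_FRACTIONS = {
    "1/2": "\u00bd", "1/4": "\u00bc", "3/4": "\u00be",
    "1/3": "\u2153", "2/3": "\u2154",
    "1/5": "\u2155", "2/5": "\u2156", "3/5": "\u2157", "4/5": "\u2158",
    "1/6": "\u2159", "5/6": "\u215a",
    "1/8": "\u215b", "3/8": "\u215c", "5/8": "\u215d", "7/8": "\u215e",
}


def to_spaced_en_dash(text: str) -> str:
    """Replace each em dash, with or without adjacent spaces, by space-endash-space."""
    return re.sub(r"[ \t]*" + EM_DASH + r"[ \t]*", " " + EN_DASH + " ", text)


def to_vulgar_fractions(text: str) -> str:
    """Swap digit/digit for the matching Unicode fraction; leave the rest alone."""
    def swap(match):
        return VULGAR_FRACTIONS.get(match.group(0), match.group(0))

    # The lookarounds skip dates (3/4/2010) and multi-digit ratios (10/20).
    return re.sub(r"(?<![\d/])\d/\d(?![\d/])", swap, text)
```

An em dash at the very start or end of a line would pick up a stray space with this approach, so the output still needs a human proofreading pass.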

Special note about tables

Over and over again, tables are held up as something E-books pretty much cannot do. I read this as an admission that people doing E-book “conversion” don’t understand table markup. Horrendously complex tables can be marked up in HTML. (What they might really be complaining about is how much width a table takes up — perhaps more than a certain E-reader display natively has.)

Conclusion

Experimenting with the form of the book is one thing, but E-book structure is not something we should make up as we go along. We shouldn’t pretend there aren’t any rules, nor should we import print-book concepts that do not work in onscreen books. The dominant E-book format of the future, ePub, can benefit from our nearly ten years’ experience building standards-compliant websites.

49 Reader Comments

gijs says:

March 9, 2010 at 8:16 am

I’m currently researching the possibilities in e-books and this article helped me to clear up the answer I should give our clients.
hoa says:

March 9, 2010 at 10:47 am

E-books break the page numbers and the topic was not discussed.
I saw no good solution yet to that issue.
How can we get rid of the page numbers for a human ?
Should we get rid of the page numbers ?
yurikhan says:

March 9, 2010 at 11:32 am

Page numbers almost certainly need to go.

Page numbers are tied to a certain fixed page size. E-books do not and should not have fixed page sizes: the user may want to change the font size (and the font face, too) or read the book on a larger screen. So if page numbers were to stay, they would have to be re-generated on the fly.

Suppose now reader software allows you to scroll one line up. Are you now on page 199.98?

Also, having page numbers allows one to refer to them. “In the book ‘War and Peace’, page 472…” Page 472 in which page size, font size, and page orientation? Even for print books, such a reference has to be disambiguated with a specific print.

Page numbers are meaningless. Instead, we should refer to chapters, sections, subsections, paragraphs and sentences; and, for well-structured text, to anchors.

Tables of content and indexes should better be handled by reader software (what’s so difficult in collecting all headings from a correctly marked-up book?), and cross-references can use anchors.
Livio Mondini says:

March 9, 2010 at 12:52 pm

The concept of a standard is very far from editorial production logic. And many printed books have complex layouts, like school books or manuals. The production of content is not linear, and doesn’t rely on Word or other tools. The final content is assembled in a DTP program like XPress or InDesign.
For these books, XHTML/CSS alone is not enough. Also, getting a clean result out of InDesign is a gamble. And even if the code is valid, it is necessary to verify everything, because images flow around or end up at the bottom of the document.
So this approach works only with simple designs, linear text, and few or no images.
BonnieA says:

March 9, 2010 at 1:05 pm

Thank you for the discussion–you’ve cleared up several questions I have as a potential content provider.

I’ve been trying to research app standards in regard to graphics files, as well. ARE there any as to resolution, pixel dimension, screen ratios, format, etc.? Do standard web screen resolutions apply for teeny-tiny viewing screens?

We picture book illustrators are used to a standard print format of 32 pages–is there any corresponding “average” length (number of screens?) for original content?

Thanks!
llasram says:

March 9, 2010 at 2:42 pm

The non-Shortcovers EPUB edition of _The City & The City_ I read didn’t have any of the problems you described. It appeared to have been produced by InDesign (implying some degree of manual work) and even embedded a font which contains glyphs for all of the unusual characters used.

Not that ebook production doesn’t have problems right now, but I think Shortcovers is a bit worse than the norm.
jmaloney says:

March 9, 2010 at 3:54 pm

Putting aside the question of whether or not page numbers are useful for e-books, it’s not true that e-books do not support fixed page numbering. Epub provides a method to map the specific page numbers so that the e-book page numbers will correspond to print page numbers (see here: http://blog.threepress.org/2009/11/26/adobe-page-map-versus-ncx-pagelist/). Since we’re talking about standards, I think this is worth addressing. Page numbers are not meaningless — they have a definite use in citing references, which is important for any type of scholarly work.
Joe Clark says:

March 9, 2010 at 4:37 pm

For what could be called pure electronic books, even if there is also a printed book, page numbers indeed don’t make any sense. An index or table of contents has to use hyperlinks, not page numbers. (And don’t think an index isn’t useful. I assure you that a search function is not one-tenth as useful.)

However, for alternate formats that are meant to be a conceptual duplicate of a printed book, you do need to encode page numbers somehow. The standard example is books for the blind. Large print never does this, but Braille books can print the original page number and the Braille page number on each sheet; analogue talking books play a tone when the reader turns a page (this is outdated, obviously); DAISY electronic talking books can notate original printed page in various ways, though it’s been a while since I read that spec.

I grant this causes complications when you want to write a bibliographic citation for a “book” when all you have is the E version and your readers are almost certainly going to be looking at the P version.
Joe Clark says:

March 9, 2010 at 4:43 pm

Livio, you are articulating a well-accepted position, but there are flaws in your foundation.

A printed page made for a reader with no relevant disabilities is a random-access medium. You can look anywhere you want and read anything you want in any order, or just put the book on the floor and admire a double-page spread.

Electronically, the issue becomes reading order (“logical” reading order in the terminology). Do the contents, when read start to finish, make sense? If you jump in at a certain point and read _from there_ to the finish, does it make sense?

In InDesign, proper threading order of text frames results in (e.g.) a tagged PDF with a logical reading order. It is true that the designer must make a decision as to when the reader is to experience a callout or a sidebar. My experience is this is only occasionally a real cause for debate.

The same applies to actual E-books. You need a logical reading order. ePub allows CSS placement of callouts and sidebars, which could instead (or in addition) be separate files.

Not every aspect of print graphic design can be duplicated in electronic document design, which relies fundamentally on structure, not inferences drawn from appearance.
mxtbcca says:

March 9, 2010 at 6:18 pm

Not so sure that an “author” should be making the decisions on which type of space character should be used in a given context. Seems like that should be handled by the rules engine that proofs the manuscript for well-formedness.

mxt

THINK
think different
Think Open Source

“I need more space” – Creature Comforts
Joe Clark says:

March 9, 2010 at 7:32 pm

More accurately, authors shouldn’t be allowed to use the _wrong_ characters. That’s why we need editors, the first ones to be fired (or cheapened by outsourcing or hiring greenhorns straight out of university).
Kip Robinson says:

March 9, 2010 at 8:06 pm

“Everyone complains about full-justified text in E-readers (text with straight left and right margins)”

I guess I’m in the minority then. I much prefer justified text, and lament its absence on most websites. Things just look cleaner when that right margin is aligned.
Dena Tasarım says:

March 9, 2010 at 9:17 pm

Page numbers definitely need to go…
epub says:

March 9, 2010 at 9:39 pm

Earlier this evening I made a Twitter (@epub) response to this article which was, “Great article from ALA…Web Standards for E-books…but I still have doubts about HTML”. Joe responded by asking what doubts – perhaps I didn’t use my 140 Twitter characters well enough – 140 characters probably aren’t enough anyway.

In essence I agree with most of what you say Joe, though it’s all those “issues” that is the reason why I said I have doubts.

I’d put the argument of HTML for eBooks in to two camps; Yes for the indie author, enthusiast or small publisher and No for the big publishers.

Let’s face it, those indie authors just want to get their books out there; they certainly don’t want to go learning some new-fangled XML markup. They know HTML, they’ve been reading ALA so are up on standards 😉 and they are happy to continue that way.

The big publishers though (who are ultimately the important sector as they are making vast amounts of money from selling these eBooks and so should be getting it right) will need to consider something more structured and which is designed specifically for marking up the _book_ language. This would be where a language like DTBook would give them much better control of the structural elements, which ultimately will give the reading system (eReader) better control over how to display that text. If the eReader sees a footnote tag, then it knows exactly what it is and can apply its rendering appropriately, rather than having to guess as to what the attribute “foot” or “fn” or even “ftnt” means. I believe we should use CSS only as a _guide_ for the eReaders, as perhaps the end user may wish to use their own custom styling.

“_…I’m doing that right now by downplaying the importance of XML and DTBook variants of ePub._”

Personally I think we should be playing-up the importance of DTBook. It is certainly not some new upstart and could have been a great language for publishers to use on their titles, especially with its use in accessibility circles, but I have to say that I’m not aware of anyone having actually released an EPUB book with DTBook under the bonnet – not very positive.

You wrote a long article Joe and it’d take me another long article to reply to all what I’d like. To wrap up;

I feel a good Industry Standard XML Markup (DTBook) would cause far fewer issues than what would be produced from using HTML (I should note that I haven’t yet looked at HTML5..! so can’t comment on that) though an eBook format needs to also support HTML for the smaller developers. The big publishers, who will need to use mashups, extracts, etc., in a much bigger way, should probably go down the XML route.

I’m a big supporter of EPUB because it supports both these.
charleski says:

March 9, 2010 at 10:17 pm

The biggest problem with ebooks is that they’re being produced by technicians who just need to get them out the door. Sometimes I wonder if they’re paid by the kilobyte or something. Publishers seem to be having a hard time hiring people who know about typography _and_ know how to produce well-crafted markup code _and_ are willing to check their results and fix the mistakes.

Open up a book from Penguin and you’ll find a massive 38kB css file containing every class the designer has ever used, each helpfully named with an obscure code. Just in case anyone thinks this might be a central cache of house styles, the values defined in the classes often change from book to book, usually with extra styles tacked on at the end with their own cryptic name. In one case a 34kB core repository of hermetic mystery is helped along the way by another 109 separate css files, 3 for each chapter, and most of which are empty.

Random House US does at least use meaningful names, but its early attempt to maintain a standardised list of styles is groaning under the weight of an ever-lengthening list of ‘Added Styles’. To be fair, RH UK does seem to have avoided the blight and produce tailored styles.

At least one technician working for Macmillan doesn’t understand the difference between ISO-8859-1 and UTF-8 and why the former should never be allowed anywhere near an ePub.

I could go on, but you get the point.

It’s no wonder the end result suffers. The principles for producing well-crafted maintainable code apply to xhtml/css just as much as to any other form.

Each book deserves to be crafted with the same care. Develop a house-style and stick to it. Import only the styles you need and add new styles only when needed. Above all, *check the output*. Check it on the desktop version of ADE and check it on an actual reader.
PeterSchoppert says:

March 10, 2010 at 4:24 am

As a publisher and epub newbie I’ve got lots of concerns and questions about epub, especially its ‘default’ xhtml flavor, aside from lack of quality in implementation. Here are a few:

I think we should try and support pages, if only to make references from print map onto our ebooks. Joe linked to the ThreePress “blogpost”:http://blog.threepress.org/2009/11/26/adobe-page-map-versus-ncx-pagelist that mentions two approaches, from Adobe and the epub standard-compliant method via NCX/DTBook. But Keith of ThreePress says that no reading systems he knows of support the standards-compliant method.

We definitely *also* need a chapter>paragraph>sentence>(word?) reference system for ePub. I would be very interested in discussions of different approaches here. Endless divs and spans with unique ids?

We need this also as a base to support annotations. Everyone (Adobe, Stanza, Ibis, Apple) seems to be introducing proprietary solutions for annotations, in some flavor of XML (that gets written into a file and added to the epub manifest? or which lives only in the reader?), but eventually we want to share and aggregate annotations.

In his recent NYBooks “piece”:http://www.nybooks.com/articles/23683, Jason Epstein suggests that only physical books can retain a morally-important kind of authority (of authorship, of time) that texts need. Only print can provide against the insults of centrally-controlled DRM, ease of changing digital files, etc. He’s got a point. We’ve got a lot of work to do to make ebooks not only convenient to read, but also authoritative records of texts at particular points of time. Records which can, like print books, acquire their own histories.
David Leader says:

March 10, 2010 at 4:00 pm

I find this article truly bizarre. Mr Clark is clearly in ecstasies over the fact that the ePub format that has been adopted for delivery of most ebooks is based on XHTML, but singularly fails to explain why. He assumes that because HTML (and the failed XHTML) are web standards then it is automatically a good thing that they should be used for web books, even though he admits that they are not designed for and lack the constructs to properly display books on the web. Anyway, if you were starting now, would you really design HTML and CSS the way they are? No, neither would I. I would profit from the fifteen years’ experience, wipe the slate clean, and construct new and better standards (Java v. C anyone?). Clearly an HTML/CSS-based format was the easiest thing to do, the easiest thing to get agreement on, and convenient for Apple to use to fight Kindle. But a triumph for the W3C? Get a grip.

And from the sublime discussion of web standards the article equally bizarrely descends to the level of presentation of em dashes. Well let’s hope most of the books are written in English English, where the typographical convention solves Mr Clark’s problem. But, somehow I don’t think the e-Publishers will be consulting Mr Clark about the presentation of em dashes or anything else.
johankool says:

March 10, 2010 at 4:45 pm

The last character in this line is not an end double quote, but an open double quote:

“I’ve Got Chills. They’re Multiplyin' ”
(apostrophe; thin space; end double quote)

Otherwise, great article!
Joe Clark says:

March 10, 2010 at 5:36 pm

Go back and reread the entire article, making sure to follow any link that even remotely seems to be discussing the limitations of HTML.

Important note: Your objections were easily structured in HTML.
Joe Clark says:

March 10, 2010 at 5:40 pm

Peter Schoppert, I have come to believe that page numbers in E-books make sense only if the format actually is meant as a copy of a printed book, viz. an alternate format for a blind user. Otherwise page numbers are a mere skeuomorph.

It’s trivial to include fragment identifiers for all block-level elements, including e.g. headings and paragraphs, making it easy to link to any of those. I frankly don’t see the need to be able to link to any individual word post-facto.
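As a rough illustration of how little work this involves, the following Python sketch assigns an id to every heading and paragraph that lacks one. The function name, the `b` prefix, and the set of tags treated as block-level are assumptions made for the example, not part of ePub or of any tool mentioned in this thread.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Assumed set of block-level elements worth linking to.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "blockquote", "li"}


def add_fragment_ids(xhtml: str, prefix: str = "b") -> str:
    """Give every block-level element without an id a sequential one."""
    # Serialize XHTML elements without an "ns0:" prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml)
    counter = 0
    for element in root.iter():
        # Tags look like "{namespace}p"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in BLOCK_TAGS:
            counter += 1
            if "id" not in element.attrib:
                element.set("id", f"{prefix}{counter}")
    return ET.tostring(root, encoding="unicode")


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<h1>Chapter One</h1><p>First paragraph.</p><p id="keep">Second.</p>'
    "</body></html>"
)
result = add_fragment_ids(sample)
print(result)
```

Existing ids are left alone so that links already pointing at them keep working. A production version would also need to check that a generated id does not collide with one the author wrote by hand.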

Nonetheless, the problem of a defined structure for annotations is real and rather pressing for production workflow. (Why else are people addicted to the methadone of MS Word?) I have no solution, but then again, I can’t be expected to have one.
Joe Clark says:

March 10, 2010 at 5:40 pm

JohanKool, we are aware of more than one necessary correction.
Richard Fink says:

March 10, 2010 at 6:18 pm

Your main message is absolutely correct – the same mix of HTML, CSS, and JavaScript that drives the web will drive e-books, too. As you point out, there’s no prescience involved, just a willingness to accept the obvious. I would say that I, too, am an HTML triumphalist, but it’s too damned difficult to pronounce. 😉
I *would* go one step further, though. That same mix of HTML+ will be the engine of print publishing, as well. When I see the kind of time and energy being spent by someone like Håkon Wium Lie on HTML-to-PDF conversion software like “Prince”:http://www.princexml.com/overview/ , it seems I’m not alone in that belief, either.
If you take into account the number of web pages printed in a day, you could argue that HTML already *is* the main engine of desktop publishing.
We just don’t think that much about media=“print”, yet. The tools aren’t far enough along. The economics of print publishing don’t quite yet demand the move to one-off print runs. But the day is coming. And this throws a different cast on what e-books can be – which, in your article, assumes a strictly onscreen display.
Some thoughts:
*Justification and Hyphenation* – Dismissing this as “purely a display artifact” seems, well, dismissive. hah! Why don’t you just say H&J isn’t necessary and nobody should bother with it? I think that’s wrong but happily there is no conflict (nor should there be) between providing – within the same HTML-based e-book document – “typeset” H&J text and “ordinary” text. Each can have its own stylesheet. One does not preclude the other. The only downside is more bytes in the file, that’s all. If someone wants to put in the work so it looks a particular way onscreen or in print, let ’em. Please take the “don’t touch” sign off.
*Spacing* – Geez, is it possible I “influenced”:http://readableweb.com/ie8-bug-html-spacer-entities-create-one-pixel-jog-in-line-height/#comment-14 you?
Yes, the emspace, enspace, and thinspace characters are perfectly legit – they, and the other general punctuation characters (8192 thru 8203) can and should be used. Turning to CSS and spans and jumping through hoops for what these spacing characters do effortlessly is ridiculous. There’s a couple of wrinkles though. Most browsers synthesize spacing characters from the metrics of the font, even if the font does not specifically contain the character. (Many newer fonts *do* contain these “empty” punctuation characters and certainly *every* font to be used with @font-face should.) Strangely, synthesizing the space is actually non-conformant with CSS 2.1. What Opera does – which is show the “not defined” character in a rectangular box – feels wrong, but it’s technically correct. Incidentally, the “web safe” fonts don’t contain these spacing characters either – what you see when you specify   is synthesized by the browser.
*Scrolling* – I believe the need to scroll is a provable “concentration breaker”. And talk about display artifacts – was the scrolling window not mostly a matter of programmatic convenience? And why doesn’t Barnes & Noble carry scrolls instead of bound volumes? A scrolled page must end somewhere, right? So why not end it where there is no more screen to display it? (But on the other hand, an insistence that a page onscreen must look like the page of a book is nonsensical, I’m with you there.) Related to scrolling is columnized layout – there too, I believe, there are provable advantages. Especially when “skimming” the text for points of interest.
*E-Readers* – right now, apps like Mobi and Stanza and the like may be necessary but in the long term they will fade away. The idea of an e-book as a “book” existing isolated and apart, unchanged and unchangeable until the next “edition”, disconnected from a network, is absurd.
Mobi, Stanza, who needs ’em? Everybody already has an e-reader, it’s called a _browser_.
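For reference, the general punctuation range mentioned under Spacing (8192 through 8203) can be listed with a few lines of Python. This is only a lookup against the Unicode character database that ships with Python; it says nothing about which fonts or reading systems actually render each character.

```python
import unicodedata

# Decimal 8192 through 8203 corresponds to U+2000 through U+200B.
spacing_characters = {
    code_point: unicodedata.name(chr(code_point))
    for code_point in range(8192, 8204)
}

for code_point, name in spacing_characters.items():
    # Show the hex code point, the numeric character reference, and the name.
    print(f"U+{code_point:04X}  &#{code_point};  {name}")
```

The numeric character reference in the middle column is the form that can be typed directly into an XHTML content document.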

“Rich”:http://readableweb.com
knott says:

March 10, 2010 at 8:47 pm

Dear Joe, I write all my papers with a plain text-editor
(emacs), in the TeX “markup language” (some people refer to
‘latex’, since that is a popular macro package for TeX.)
(e.g. X_i means X subscript i and alpha is the letter alpha in TeX.)
I end up with a PDF file full of nice math text and
everything else I need. (Go see some of the papers at http://www.civilized.com) I don’t think HTML5 or anything else
I have heard of would handle the preparation of a document that
might have some technical content. – Gary Knott.
netman49 says:

March 11, 2010 at 11:47 am

Very interesting article and worthwhile reading.

It’s also my experience that HTML is the best basis for ePub eBook production, but the problem is that most book source files (anyway in my case) are in Word or RTF.

Converting those sources to proper HTML as described in this article seems to be a utopia. Carriage returns, for example, are converted to paragraphs and blank lines between paragraphs are converted to separate paragraphs with

And so on…..

So what we need is a proper DOC/RTF to HTML converter, not the “Save as HTML (filtered)” in Word, because that doesn’t work!
ncarr says:

March 12, 2010 at 4:18 am

The lack of HTML approximations, or equivalents, for page numbers, sections, footnotes/endnotes (citations), cross references and indexes, isn’t because the markup language isn’t rich enough. It’s because the content isn’t rich enough. This can be overcome by simply adding unique identifiers to the document objects for use as anchors.

Still, as simple as the solution is, it would be great if someone would take the lead and publish some basic conventions so that the problem didn’t have to be solved and re-solved over and over and over again. The citation industry is both progressive and vibrant but it is geared towards problems a lot bigger than putting a few footnotes in an ebook.

This is almost a job for microformats…
imdesigner says:

March 15, 2010 at 12:23 pm

Hi there,

Thanks for sharing such a great article. I totally agree with the statement you made that “HTML isn’t just for the web. It’s for any text distributed online.” Any text that is being shared online is definitely related to HTML.

I am a designer myself and I find HTML the main source of all the data being transferred or shared on the web. The design we create tells the visitor what things are displayed and used on our page, but everything that is present on the page is there because of the HTML.

Nauman Akhtar
Gunnar Bittersmann says:

March 15, 2010 at 2:50 pm

Thanks for the article. A remark or two:

bq. Dashes. As commonly used in print books, em dash (—) with no spaces on either side does not work in onscreen text. […] En dash (–) surrounded by spaces avoids linebreak problems and works better at the intended purpose.

The ALA styleguide seems to say: use em dash with no spaces, not en dash with spaces. Subject to change?

I’m not sure about English, but German typography forbids a line to start with a dash; correct usage would be: no-break space; en dash; space.

bq. “I’ve Got Chills. They’re Multiplyin' ” (apostrophe; thin space; end double quote)

There must not occur a line break between punctuation characters, hence U+202F narrow no-break space should be used: apostrophe; narrow no-break space; end double quote.
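Both substitutions described in this comment are mechanical, so they can be sketched in a few lines of Python. The function names are hypothetical, and the patterns are deliberately narrow: the first only touches a thin space sitting between an apostrophe and a closing double quote, the second only touches an en dash that already has a plain space on each side.

```python
import re

THIN_SPACE = "\u2009"  # THIN SPACE, which permits a line break
NNBSP = "\u202f"       # NARROW NO-BREAK SPACE
NBSP = "\u00a0"        # NO-BREAK SPACE
EN_DASH = "\u2013"

# A thin space between an apostrophe and a closing double quote.
QUOTE_GAP = re.compile("(?<=['\u2019])" + THIN_SPACE + "(?=[\"\u201d])")


def glue_quotes(text: str) -> str:
    """Swap the thin space between adjacent quote marks for U+202F."""
    return QUOTE_GAP.sub(NNBSP, text)


def german_dash(text: str) -> str:
    """Turn space, en dash, space into no-break space, en dash, space."""
    return text.replace(f" {EN_DASH} ", f"{NBSP}{EN_DASH} ")


title = "Multiplyin\u2019" + THIN_SPACE + "\u201d"
print(repr(glue_quotes(title)))
print(repr(german_dash("Wort \u2013 Wort")))
```

With the no-break space in front of the dash, a line can still break after the dash but can never start with it.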
Kat B says:

March 15, 2010 at 11:48 pm

Perhaps I have misunderstood the article, but it seems to be confusing HTML with XHTML. They are very different beasts, although, at first, it is difficult to tell them apart because they do look similar. They have very different behaviours. XHTML is cool 🙂

It is more acceptable to interchange XML with XHTML, as XHTML is a flavour of XML.

Also, the link to Ben Hammersley’s call for XML seems to end in 404.

And as for Java vs. c, there are some things that c can do that Java struggles with, even though it means the programmer must then struggle with c. *sigh*
sdelong says:

March 17, 2010 at 10:45 am

Thank you for an excellent article and discussion. I hope you are planning to write a book on this topic. I would love to see real-life examples and would definitely purchase it.
charleski says:

March 19, 2010 at 1:34 am

bq. The ALA styleguide seems to say: use em dash with no spaces, not en dash with spaces. Subject to change?

That’s the American standard from the ??Chicago Manual of Style??, and frankly I think it looks terrible on reflowable material unless the renderer is smart. Bringhurst says to use en dashes with spaces, which is the English standard and generally works better with the rather dumb rendering systems used in embedded devices.
daniel.perera says:

March 22, 2010 at 5:47 pm

Very interesting article, it shows you were born in that jungle. This is why you should visit http://www.makeyourebooks.com. We got the XML tagging to the industrial processes. I am sure you will be interested.
10basetom says:

March 23, 2010 at 8:27 am

Great article. Maybe it’s just a subtle difference in semantics, but I feel the use of the term “online” is too restrictive; I would use the term “digitally” instead. When people read ebooks, they won’t necessarily be online (i.e., connected to the internet). Hence:

“HTML isn’t just for the web… It’s for any text distributed -online- +digitally+.”

“HTML is the preferred way to mark up and publish -online- +digital+ documents that are _not_ websites.”
spotrick says:

March 23, 2010 at 10:25 am

Hi. I’ve been producing ebooks since 1998, so I may have something worth contributing to this debate. I began way back then with the premise that one could use HTML to format a book _so as to make it readable online_. This premise was borne out of an aesthetic reaction to reading in plain text.

I quickly adopted a second premise: that these ebooks would be new editions, rather than attempting to be facsimiles of a particular print edition.

I’ll be the first to admit that my earliest efforts were quite horrible. (Although I thought they were good at the time.) HTML by itself is inadequate to the task. But HTML with CSS produces, I believe, results that are the equal of print for the vast majority of books. I’m slowly upgrading the older editions to match my current standard.

To address some of the issues raised:

1. Numbering. Page numbers make no sense in ebooks (as new editions) because there are no pages. They are an artifact of print. However, we are so used to being able to reference text via page number, that many readers are lost without them, no matter how many times I say it’s perfectly OK to cite using a URL.

In my ebooks, chapters are numbered, as are sections, parts and anything else likely to be referenced in a ToC. But I don’t number paragraphs (although it would be easy to add id="n" to each para.) because I don’t see an easy, unobtrusive way to inform the user of their existence.

2. Sections. My approach here is to simply use a DIV wrapper to each section (or chapter, …) with a class attribute. E.g. your example of spacing between sections is easily achieved with

and “section { margin-bottom:2em; }”

To that extent, I don’t believe HTML lacks structural features, since you can define them arbitrarily as needed.

3. I note the appearance in the comments of various suggestions as to why HTML is not as good as TeX/Docbook/etc. This is depressing because I recall similar discussions from ten years back.

ePub is _the_ best format for ebooks because, unlike all those other formats, HTML is ubiquitous, and easy enough for authors to manage. Or, to be completely minimal, authoring can be done in plain text, which can then be easily converted to ePub/HTML in a largely automated process. I do this daily with books from Project Gutenberg.

4. I would argue that many of the typographical conventions are actually “house style” — e.g. the use of small caps. Much can be managed with css.

5. XHTML/CSS can handle quite complex formatting — e.g. positioning of text around images, margin notes.

If interested, my ebooks are available at http://ebooks.adelaide.edu.au
I do not claim to have everything right. But I welcome constructive criticism.
luismorais says:

March 25, 2010 at 7:36 pm

Now, the foregoing is so optimistic as to be ridiculous. Authors are not going to start writing in HTML, let alone the full-on XML that Ben Hammersley has called for. Book copy will continue to be saved as MS Word, Xpress, and/or InDesign files. Though mangled and inadequate, such copy will then be “exported” for E-book “formatting.”

I believe Ben’s article has been taken down.

I believe those still in confusion between the writing creative process and the technical (although not destitute of its own creativity) book creation process would benefit from reading on how writers write.

I believe that after a few testimonials one will see that with matters of heart and inspiration, technology is more than welcome to be useful but only whilst staying ubiquitous:

J.G. Ballard: How I write:
http://entertainment.timesonline.co.uk/tol/arts_and_entertainment/books/article439694.ece

How I Write by Bertrand Russell
http://www.solstice.us/russell/write.html

How I Write by Richard Milward
http://www.faber.co.uk/article/2009/2/how-i-write/

Neil Gaiman: how I write
http://www.timeout.com/london/books/features/2100/Neil_Gaiman-how_I_write.html

http://www.thewritingcentre.com/how-i-write

It still is all about the users, even when they are the authors themselves.

Cheers!
bowerbird says:

March 26, 2010 at 9:36 am

so, joe clark has joined the .epub adherents proclaiming
that all other formats, heretofore and in the future, must
die. yawn. html, with or without the x, is already fading.

many of us are already in the process of making _better_
formats, and we ain’t stoppin’ just ’cause joe says so…

look at the comment box for this very blog article.

it allows _textile_, because authors *like* light-markup
— because it stays out of our way when we’re writing.

and sure, then we convert it to (x)html, and from there
it might get shuttled on to .epub format, but _why_?

why not just have the rendering agent take textile input,
convert it to (x)html itself, and then to .epub if it must,
and _then_ display it. why do we have to do all of these
conversions, that the machine can do just as well itself?

the answer is that we don’t.

we feed the machine textile (or another light markup).

and once the machine realizes that it could render our
textile file, as raw input itself, and render it as easily as
(x)html or .epub, it won’t even bother to do conversions.

make all the proclamations you want, joe.

we ain’t listening. and we ain’t stopping.

we’re making the future, and you can’t fight it.

-bowerbird
luismorais says:

March 26, 2010 at 12:42 pm

Hi bowerbird,

I find Joe Clark the kind of writer that forces me to re-read things constantly, he’s got a writing style that declares the most absurd things upfront, factually and without prejudice, and then discusses if they are relevant/feasible/sensible or not, establishing exceptions just later.

But I think he is not really saying that HTML will serve all books as a single distribution solution:

***

“HTML doesn’t work for all documents, since it lacks important structural features. (HTML5 addresses some of those deficiencies but won’t help today’s E-books.) HTML does work for huge numbers of documents, many of which we call books. Bet against HTML for online distribution and you’ve backed the wrong horse.” – Joe Clark

***

As far as I can see, HTML can serve most books out there but it might not serve those books whose text formatting is also part of the narrative as it happens with certain styles of poetry where the space and arrangement of the words on the canvas/paper also convey meaning, as it is the case of concrete poetry, a poetic style still very popular in my motherland Brazil (http://en.wikipedia.org/wiki/Concrete_poetry).

I believe the article doesn’t exclude the possibility of those literary exceptions that make literature a 4 dimensional experience, an experience which HTML still can’t represent easily.

I am not totally against the creation of a full-blown XML language for literary content that tried to close this gap instead of HTML (as long as we don’t assume the Philistine attitude that writers must write under the terms of a technical format standard).

Nevertheless, in the current context, HTML can still be useful for a great range of standard publications and books. In my humble opinion.

Cheers,

Luis
joseph says:

March 31, 2010 at 1:31 am

I enjoyed the article — it’s exactly what needs to be said. We’re in the middle of an illusory “backlist goldrush” at the moment, meaning digitize and damn the quality, but the glut of bad ebooks will soon prompt publishers to think they can differentiate big releases on quality again. I don’t think it’ll take too long.

Anyway, a correction on the NY Times reference. The linked article says: “So on a $12.99 e-book, the publisher takes in $9.09. Out of that gross revenue, the publisher pays about 50 cents to convert the text to a digital file, typeset it in digital form and copy-edit it. Marketing is about 78 cents.”

It’s kind of mind-boggling that one could read that and think it meant 50 cents total. It’s 50 nominal cents per copy sold.

But otherwise, a great article. More of this please, ALA.
citizencontact says:

April 9, 2010 at 3:19 pm

There have been comments about how to render footnotes within .epub/xhtml. Although this is important, a corresponding feature should be to allow the ebook to be cited, especially with non-fiction. Generally, it is simple to reference a paper copy, but citing an electronic version is oddly trickier.

First there should be a URL that is associated with the publication. It may be that there are thousands of posted versions of public domain books, but each should have a corresponding URL. On the possibility that each version should have differences, intentional or not, having a separate URL for each instance is important.

Second, the publisher should make sure that each portion that might be cited should have an id attribute built into the document. That could be based on div or paragraph tags, or even span tags.

Third, there should be an obvious way to show the embedded id so that anyone citing or copying a small portion can easily include the citation back to the document being cited. I came up with a simple way using standard HTML tags that allow this, that I call Embedded Self Cites. The citation would be a URL that would include an HTML fragment reference (e.g. http://ebook.example/booktitle#chapter1para4 ). (see http://advocatehope.org/tech-tidbits/embedded-citations )
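The second and third points can be sketched together: once the id attributes exist, producing a citable URL for each one is a matter of joining the publication's URL and the id. The Python below is only an illustration of that idea, not the Embedded Self Cites technique itself. The function name and sample markup are assumptions, and the base URL is the example one given above.

```python
import xml.etree.ElementTree as ET


def citation_urls(xhtml: str, base_url: str) -> list[str]:
    """Return one fragment URL per element that carries an id attribute."""
    root = ET.fromstring(xhtml)
    # iter() walks the tree in document order, starting with the root.
    return [f"{base_url}#{el.get('id')}" for el in root.iter() if el.get("id")]


# Hypothetical chapter markup; only elements with an id can be cited.
chapter = (
    '<div id="chapter1">'
    '<p id="chapter1para1">Opening paragraph.</p>'
    "<p>No id here, so this paragraph cannot be cited directly.</p>"
    '<p id="chapter1para4">A later paragraph.</p>'
    "</div>"
)
urls = citation_urls(chapter, "http://ebook.example/booktitle")
for url in urls:
    print(url)
```

The paragraph without an id is silently skipped, which is the practical argument for the publisher adding ids to every portion that might be cited.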

In addition to using a URL for a citation, each ebook, ebook abstract or ebook portion that can be cited with a URL, could be printed on paper with a QR barcode for URL that would allow most smart phones or ebook reader with camera to scan in the URL barcode and go to the ebook directly or a pay page/log in. (see: http://docs.google.com/Doc?docid=0AV6jPr0LRFa0ZGZ4Z2NkZmNfODZnZmpnbXJkdg&hl=en_GB#Bar_Codes )

– Daniel Bennett
uk-hosting says:

April 14, 2010 at 2:54 pm

Great help on e-books. I find e-books can be great for learning online, and with modern devices such as iPhones they are easily accessible.
Jeff Seager says:

April 15, 2010 at 3:06 pm

Dead-on commentary, Joe. Since the early days of computer typesetting, publishers have been using some form of simple inline markup for formatting. You’re so right to say that HTML is the evolutionary heir to all that.

People need not be daunted by the apparent complexity, either. As you say, it’s only a limited subset of HTML that’s needed for most E-books. Learning HTML is almost unnecessary for publishers if they adopt the use of “Markdown (a text-to-HTML conversion tool for web writers)”:http://daringfireball.net/projects/markdown/ (I have no personal interest in Markdown, which is offered free, but as soon as I saw it I recognized its value).

As always, Joe, I really appreciate your insights.
dpapathanasiou says:

April 16, 2010 at 1:14 pm

Sigil is an open source project which is also challenging InDesign. It is a WYSIWYG editor which runs on Mac, Windows, and Linux.
Elizabeth Castro says:

April 16, 2010 at 3:20 pm

I was dismayed that in your article about eBook Standards, you did not address the issue of eReaders deliberately not following the spec in their benevolent but misguided attempt to display ebooks more attractively to readers (as described at the beginning of this article: http://www.pigsgourdsandwikis.com/2010_03_01_archive.html)

I think it is absolutely essential for Web standardistas to stand up right now and call for eReaders to follow those specs. Otherwise, it is only a matter of time before someone does develop an eReader that listens to the book designer regardless of the standards (Netscape anyone?) But by then it will be too late.

Further, I am so incredibly tired of people being paternalistic about what should be allowed and what shouldn’t. Whether you think video in a book is useful is not relevant to eBook standards. I can think of many instances, particularly in technical books, in which case a short video tutorial would be much more instructive than a series of screenshots.

And Apple says not to use fonts when designing eBooks for the iBookstore, because it “creates a bad user experience”. Excuse me? Get your hands off my design. If you don’t like it, don’t buy it.

Regardless, it’s not for you, Apple, or anyone else to decide on aesthetic grounds. Either it follows the standards, or it does not.

(and frankly, it’s a pain constructing a complicated rebuttal in this tiny text box!)
David Leader says:

May 3, 2010 at 2:13 pm

When I posted initially I hadn’t actually looked at any ebooks, which is why the idea that they should be done in HTML seemed so bizarre to me. Since then I’ve acquired an iPhone (not an iPad) and have downloaded a few free ebooks in different formats, and the eReader. I can see now why Mr Clark was going on about em-dashes and block paragraphs (hatred of the latter being one area I agree with him) – and to which I’d add dumb quotes – because the typographic execution in the examples I’ve seen (and those in screen shots from paid books) varies from poor to dreadful. I certainly couldn’t bring myself to read anything on my iPhone set like that, and I see little prospect of any improvement, except perhaps on the iPad, where the Kennedy book that Jobs presented may raise expectations. Mr Clark seems to have got his wish for the triumph of HTML in eBooks. He should have perhaps been a bit more careful what he was wishing for.
Brian Kim says:

May 18, 2010 at 5:37 am

Thanks Joe Clark. Good points. Many times change for the good means diminishing status for icons.

Thanks a list apart. Another fine topic well presented.

With respect to ebooks: I purchased two books a couple of months ago from O’Reilly.

Automating System Administration with Perl, 2Ed
CSS Cookbook, 3Ed

Purchased – both print and ebook. It was a bundle, and I was curious. For my purposes, with Acrobat 8 Pro, the PDF version has more utility than EPUB. At least for now. BTW, I don’t have a reader.

I viewed the EPUBs with Adobe Digital Editions 1.7.4 (I think) on a Vista PC.

But for those interested, I looked into the EPUB files with PKZIP. Perhaps some of the other commenters may be interested.

I unzipped the EPUB file and looked inside the files with Wordpad. There’s both HTML and XML inside.

E.g., for Automating System Administration with Perl, 2Ed
There’s XML here –
in OEBPS/content.opf
in OEBPS/toc.ncx
in META-INF/container.xml
and HTML
in OEBPS/index.html
Kind of quirky, it looks like this –

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<html ><head>
… etc.

EPUB looks more like a hybrid.
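The same inspection can be done without PKZIP and Wordpad. The Python sketch below builds a tiny stand-in archive using the file names listed above, since the O’Reilly file itself is not available here, then lists its entries and reads META-INF/container.xml to find the package file. The helper names and the stub file contents are assumptions; the container namespace is the one defined by the OCF specification.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"


def build_sample_epub() -> bytes:
    """Build a minimal stand-in archive; the file contents are stubs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr(
            "META-INF/container.xml",
            '<?xml version="1.0"?>'
            '<container version="1.0" '
            'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
            '<rootfiles><rootfile full-path="OEBPS/content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles>'
            "</container>",
        )
        archive.writestr("OEBPS/content.opf", "<package/>")
        archive.writestr("OEBPS/toc.ncx", "<ncx/>")
        archive.writestr("OEBPS/index.html", "<html/>")
    return buffer.getvalue()


def list_entries(epub_bytes: bytes) -> list[str]:
    """Return the names of all files inside the archive."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return archive.namelist()


def package_path(epub_bytes: bytes) -> str:
    """Read META-INF/container.xml and return the path of the package file."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        container = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = container.find(f"{CONTAINER_NS}rootfiles/{CONTAINER_NS}rootfile")
    return rootfile.get("full-path")


epub = build_sample_epub()
print(list_entries(epub))
print(package_path(epub))
```

To try it on a real book, read the .epub file's bytes and pass them to the same two functions. The "hybrid" impression comes from the XML packaging files sitting alongside the XHTML content documents in one zip archive.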

I tried looking here too –
OEBPS Container Format (OCF) 1.0 Specification Documents
HTML Version
http://www.openebook.org/ocf/ocf1.0/download/ocf10.htm
See APPENDIX B: Example

Have a good week.

Brian
abathur says:

May 20, 2010 at 2:05 am

I’m a writer, a poet at least. My current project, though–a custom document management system to resolve some bad version control problems I run into as I write at home and on the road–led me to see what ALA had to say about the topic.

While I agree with the value of HTML for presenting these documents (which represents the potential for a massive open-standards win over the proprietary systems to which we lend our literary art and scholarship) I’m not sure the XHTML-now approach is the best. I’m inevitably biased by my own proclivities, but I think it’s easier to expect writers, editors and publishers (“book people” as Clark says) to produce well-formed XML with semantic markup in terms they already understand.

I don’t think it’s hard to convince a poet that she can write her poems in XML with tags like and <TITLE> and and but it’s going to be a much greater leap (and will, undoubtedly, be pushed off on someone else) to expect her to understand how to use HTML’s not-quite-perfect-yet markup in a semantic manner. Even if a few variations of the XML tags crop up, the sets used will be relatively simple, and this sort of parsing made relatively trivial.

Let writers and publishers invoke their natural vocabularies through XML, and let the standardistas make decisions about how to parse this XML into display-ready XHTML as necessary. Then our parsers can adapt to the further evolution of HTML as a language (and common practice in e-publishing, no doubt) by continually adapting comprehensive XML documents into the markup that makes the most sense today, and markup which makes sense 20 years from now.
Brian Kim says:

June 25, 2010 at 9:28 pm

I don’t know anything about the business of publishing, but I use CSS, and this was interesting to me,

“This book has been written entirely in HTML and CSS.”
… page 353, Chapter 18 – The CSS Saga, the last sentence of the last paragraph.

Cascading Style Sheets: Designing for the Web, 3/E
Hakon Wium Lie
Bert Bos,
ISBN-10: 0321193121
ISBN-13: 9780321193124
Publisher: Addison-Wesley Professional
Copyright: 2005
Format: Paper; 416 pp
Published: 04/25/2005

http://www.pearsonhighered.com/educator/product/Cascading-Style-Sheets-Designing-for-the-Web/9780321193124.page#takeacloserlook

The sample chapter is “Chapter 3 The amazing em unit and other best practices”
http://www.pearsonhighered.com/assets/hip/us/hip_us_pearsonhighered/samplechapter/0321193121.pdf

I’m not a poet, but I humbly suggest that markup – HTML, XHTML, or XML, should be easy for a poet. Or for each poem.

To me, from a formatting standpoint, a poem has a lot to do with where the line ends and the “paragraphs”.

“Music is the space between the notes.” Debussy

In this case, I would “try” surrounding the poem with a div and an id. Then use classes with repeating block elements to make related sets of lines look “right”.

If I were generating the content, I might try writing in Notepad without markup tags, and separating each set of lines with a pair of carriage returns and linefeeds. Then, when done, I would turn word wrap off, and save.

This would make it easy to add the markup at the beginning of each “line” or “paragraph”. And somewhat simple to add the closing tag.

Then, into my web page, with something like –
<body>
<div id="poem">
</div><!-- close poem -->
</body>
… Cut and Paste from the notepad file.

Perhaps the classes might be defined like –
#poem p.haiku
#poem p.limerick
#poem p.the-bop
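
The workflow described above (plain text from Notepad, sets of lines separated by blank lines, then wrapped in a div with an id and classed block elements) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `poem_to_html`, the default class `haiku`, and the use of `<br />` to keep line endings inside each paragraph are hypothetical choices, not something the comment specifies.

```python
import re
from html import escape


def poem_to_html(text: str, css_class: str = "haiku") -> str:
    """Wrap a plain-text poem in <div id="poem"> with one <p> per set of lines.

    Assumptions (hypothetical, for illustration): sets of lines are separated
    by one or more blank lines, and line endings inside a set are kept with
    <br /> elements.
    """
    # Normalise Windows line endings (carriage return + linefeed) to "\n".
    normalised = text.replace("\r\n", "\n")
    # Split on blank lines to recover each related set of lines.
    stanzas = [s.strip() for s in re.split(r"\n\s*\n", normalised) if s.strip()]

    parts = ['<div id="poem">']
    for stanza in stanzas:
        # Escape characters such as & and < so the output stays well-formed.
        lines = [escape(line.strip()) for line in stanza.split("\n")]
        parts.append(f'<p class="{css_class}">' + "<br />\n".join(lines) + "</p>")
    parts.append("</div><!-- close poem -->")
    return "\n".join(parts)


if __name__ == "__main__":
    sample = "first line\r\nsecond line\r\n\r\nthird line & last\r\n"
    print(poem_to_html(sample))
```

The selectors listed above would then apply to each generated paragraph through its class.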

Just a thought.

Sometimes it can be straightforward to do a stand-alone markup document, compared with a multitude of documents and media types in a website.

Have a good weekend.
elmimmo says:

October 22, 2010 at 5:22 am

bq. Dashes. As commonly used in print books, em dash (—) with no spaces on either side does not work in onscreen text.

Not only that, in Spanish you just do _not_ use em dash (—) with no spaces on either side. Em dashes —if used correctly in that language, that is— function just like parentheses when in the middle of a sentence —although you do not close them at the end of one.

Even if the above is incorrect usage in English, I just wanted to illustrate.

They require a space before the interruption (and after, if there is no period), and just like parentheses, you absolutely do *not* break the line between the dash and the letter that sits next to it.

Unfortunately, “Unicode’s Line Breaking Algorithm”:http://unicode.org/reports/tr14/ is English-centric (booh!) and says that the em dash “provides a line break opportunity before and after the character”, a complete aberration in Spanish typesetting (it should be the exact opposite). As a result, pretty much any engine that displays text on screen, modern or old (including, of course, any browser or ebook reader out there), chops lines in Spanish text and leaves orphan em dashes at the end of lines. Not a single ebook or webpage survives this, unless one goes and manually litters all em dashes with zero width no-break spaces on both sides, which is rather gross.
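
The manual workaround mentioned above could be automated along these lines. This is a sketch under assumptions, not a tested typesetting fix: it uses WORD JOINER (U+2060), which Unicode recommends in place of the older zero width no-break space (U+FEFF) for this purpose, and it only glues a dash to an adjacent non-space character. The function name `glue_em_dashes` is hypothetical, and whether a given browser or E-reader honours the joiner is not something this code can guarantee.

```python
import re

EM_DASH = "\u2014"
# WORD JOINER: prohibits a line break on either side of it (UAX #14 class WJ).
# Swapped in here for the zero width no-break space (U+FEFF) named in the text.
WORD_JOINER = "\u2060"

# An em dash followed by a character that is neither whitespace nor a joiner.
_DASH_THEN_CHAR = re.compile(f"{EM_DASH}(?=[^\\s{WORD_JOINER}])")
# An em dash preceded by a character that is neither whitespace nor a joiner.
_CHAR_THEN_DASH = re.compile(f"(?<=[^\\s{WORD_JOINER}]){EM_DASH}")


def glue_em_dashes(text: str) -> str:
    """Insert a word joiner between each em dash and the character it touches.

    Spaces next to a dash are left alone, so the break opportunity at the
    space (outside the interruption) is preserved.
    """
    text = _DASH_THEN_CHAR.sub(EM_DASH + WORD_JOINER, text)
    text = _CHAR_THEN_DASH.sub(WORD_JOINER + EM_DASH, text)
    return text


if __name__ == "__main__":
    sample = "palabra \u2014inciso\u2014 palabra"
    # Show the code points so the invisible joiners can be seen.
    print([hex(ord(ch)) for ch in glue_em_dashes(sample) if ord(ch) > 0x7F])
```

Excluding the joiner itself from the patterns means running the function twice on the same text does not pile up extra joiners.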
matt bear says:

May 25, 2011 at 9:11 am

bq. “Peter’s comment”:http://www.alistapart.com/comments/ebookstandards/P10/#16
We need this also as a base to support annotations. Everyone (Adobe, Stanza, Ibis, Apple) seems to be introducing proprietary solutions for annotations, in some flavor of XML (that gets written into a file and added to the epub manifest? or which lives only in the reader?), but eventually we want to share and aggregate annotations.

bq. “Joe’s comment”:http://www.alistapart.com/comments/ebookstandards/P10/#20
Nonetheless, the problem of a defined structure for annotations is real and rather pressing for production workflow. (Why else are people addicted to the methadone of MS Word?) I have no solution, but then again, I can’t be expected to have one.

Supporting “marginalia”:https://secure.wikimedia.org/wikipedia/en/wiki/Marginalia (annotations, etc.) has been a goal for a long time, and not just for ePubs. See these references:
* “Seeing the picture – Crowdsourcing annotations for books (and eBooks)”:http://blog.lib.uiowa.edu/hardinmd/2009/06/08/crowdsourcing-annotations-for-books-and-ebooks/
* “From Personal to Shared Annotations”:http://www.csdl.tamu.edu/~marshall/CCM-AJB.pdf
* “Social Annotations in Digital Library Collections”:http://www.dlib.org/dlib/november08/gazan/11gazan.html

“How to express and exchange annotations”:https://github.com/nichtich/marginalia/wiki/Support-of-PDF-annotations focuses on PDF annotation methods.

“The Fascinator”:https://fascinator.usq.edu.au/trac/wiki/Annotate/existing also has some information, as does “Wikipedia’s Web Annotation article”:https://secure.wikimedia.org/wikipedia/en/wiki/Web_annotation

bq. “ncarr’s comment”:http://www.alistapart.com/comments/ebookstandards/P20/#25
This can be overcome by simply adding unique identifiers to the document objects for use as anchors. This is almost a job for microformats…
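
For illustration only, the anchor idea quoted above might look something like this in practice. The sketch assumes the chapter is well-formed XHTML and that paragraphs are the objects worth anchoring; the function name `add_anchor_ids` and the `p1`, `p2` naming scheme are hypothetical and not taken from any ePub convention.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def add_anchor_ids(xhtml: str, prefix: str = "p") -> str:
    """Give every <p> without an id a unique one, leaving existing ids alone."""
    # Serialise XHTML elements without an "ns0:" prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml)

    # Ids already in the document must not be reused.
    used = {el.get("id") for el in root.iter() if el.get("id")}

    counter = 0
    for para in root.iter(f"{{{XHTML_NS}}}p"):
        if para.get("id"):
            continue
        counter += 1
        # Skip over any generated id that collides with an existing one.
        while f"{prefix}{counter}" in used:
            counter += 1
        new_id = f"{prefix}{counter}"
        para.set("id", new_id)
        used.add(new_id)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    chapter = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        '<p>One</p><p id="intro">Two</p><p>Three</p>'
        "</body></html>"
    )
    print(add_anchor_ids(chapter))
```

An annotation could then point at a fragment such as `#p2`, though such ids only stay stable if they are assigned once and kept across editions.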

I disagree. I think it should be based on “DocBook”:http://www.docbook.org/ or some other XML format. (In DocBook, it’s a solved problem.) DocBook has support for several features missing from ePub, among them a collection of books (like an encyclopedia or _The Art of Computer Programming_), as well as support for “MathML”:http://www.w3.org/Math/ and “SVG”:http://www.w3.org/Graphics/SVG/ .

bq. Still, as simple as the solution is, it would be great if someone would take the lead and publish some basic conventions so that the problem didn’t have to be solved and re-solved over and over and over again. The citation industry is both progressive and vibrant but it is geared towards problems a lot bigger than putting a few footnotes in an ebook.

I agree. Having some high quality examples would be beneficial.

bq. “Daniel Bennet’s comment”:http://www.alistapart.com/comments/ebookstandards/P30/#38
First there should be a URL that is associated with the publication. It may be that there are thousands of posted versions of public domain books, but each should have a corresponding URL. On the possibility that each version should have differences, intentional or not, having a separate URL for each instance is important.

This actually exists. See “Digital Object Identifier”:https://secure.wikimedia.org/wikipedia/en/wiki/Digital_object_identifier (though this also “has problems”:https://secure.wikimedia.org/wikipedia/en/wiki/Baen_Books#Baen_Digital_Object_Identifiers_.28DOI.29 .)
Beginner eBook Publishing says:

June 15, 2012 at 6:58 pm

This is by far one of the most well written and thoroughly researched articles on ebooks and html. Having a little knowledge in html can make a huge difference in creating ebooks. Without it good luck keeping the formatting of your original document. Keep up the great work!
