Web Standards for E-books

by Joe Clark

49 Reader Comments

  1. Very interesting article; it shows you were born in that jungle. This is why you should visit www.makeyourebooks.com. We have brought XML tagging to industrial processes. I am sure you will be interested.

  2. Great article. Maybe it’s just a subtle difference in semantics, but I feel the use of the term “online” is too restrictive; I would use the term “digitally” instead. When people read ebooks, they won’t necessarily be online (i.e., connected to the internet). Hence:

    “HTML isn’t just for the web… It’s for any text distributed digitally.”

    “HTML is the preferred way to mark up and publish digital documents that are not websites.”

  3. Hi. I’ve been producing ebooks since 1998, so I may have something worth contributing to this debate. I began way back then with the premise that one could use HTML to format a book so as to make it readable online. This premise was born out of an aesthetic reaction to reading in plain text.

    I quickly adopted a second premise: that these ebooks would be new editions, rather than attempting to be facsimiles of a particular print edition.

    I’ll be the first to admit that my earliest efforts were quite horrible. (Although I thought they were good at the time.) HTML by itself is inadequate to the task. But HTML with CSS produces—I believe—results that are the equal of print for the vast majority of books. I’m slowly upgrading the older editions to match my current standard.

    To address some of the issues raised:

    1. Numbering. Page numbers make no sense in ebooks (as new editions) because there are no pages. They are an artifact of print. However, we are so used to being able to reference text via page number, that many readers are lost without them, no matter how many times I say it’s perfectly OK to cite using a URL.

    In my ebooks, chapters are numbered, as are sections, parts and anything else likely to be referenced in a ToC. But I don’t number paragraphs (although it would be easy to add ‘id=“n”’ to each para.) because I don’t see an easy, unobtrusive way to inform the user of their existence.

    2. Sections. My approach here is to simply use a DIV wrapper for each section (or chapter, …) with a class attribute. E.g. your example of spacing between sections is easily achieved with <div class="section"> and “.section { margin-bottom: 2em; }”

    To that extent, I don’t believe HTML lacks structural features, since you can define them arbitrarily as needed.

    3. I note the appearance in the comments of various suggestions as to why HTML is not as good as TeX/Docbook/etc. This is depressing because I recall similar discussions from ten years back.

    HTML is the best format for ebooks because, unlike all those other formats, it is ubiquitous and easy enough for authors to manage. Or, to be completely minimal, authoring can be done in plain text, which can then be easily converted to ePub/HTML in a largely automated process. I do this daily with books from Project Gutenberg.

    4. I would argue that many of the typographical conventions are actually “house style”—e.g. the use of small caps. Much can be managed with CSS.

    5. XHTML/CSS can handle quite complex formatting—e.g. positioning of text around images, margin notes.
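
    A minimal Python sketch of the plain-text route described in point 3, combined with the section wrappers from point 2 and the paragraph ids mentioned under point 1. It is not the process actually used for these editions; the function name `text_to_html`, the id scheme (`s1p2`), and the convention that two or more blank lines start a new section are all assumptions made for illustration.

```python
import html
import re


def text_to_html(text, section_class="section"):
    """Turn plain text into <div>-wrapped sections of numbered paragraphs.

    Assumed input convention (hypothetical): one blank line separates
    paragraphs, two or more blank lines separate sections.
    """
    out = []
    sections = re.split(r"\n{3,}", text.strip())
    for s_num, section in enumerate(sections, start=1):
        out.append(f'<div class="{section_class}" id="s{s_num}">')
        for p_num, para in enumerate(section.strip().split("\n\n"), start=1):
            # Join hard-wrapped lines, then escape &, < and >.
            body = html.escape(" ".join(para.split()), quote=False)
            out.append(f'  <p id="s{s_num}p{p_num}">{body}</p>')
        out.append("</div>")
    return "\n".join(out)


sample = "First para,\nwrapped.\n\nSecond & last.\n\n\nNew section."
print(text_to_html(sample))
```

    Paired with a rule such as `.section { margin-bottom: 2em; }`, this gives the spacing between sections discussed above, and each paragraph becomes citable by appending a fragment such as `#s1p2` to the page URL.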

    If interested, my ebooks are available at http://ebooks.adelaide.edu.au
    I do not claim to have everything right. But I welcome constructive criticism.

  4. Now, the foregoing is so optimistic as to be ridiculous. Authors are not going to start writing in HTML, let alone the full-on XML that Ben Hammersley has called for. Book copy will continue to be saved as MS Word, Xpress, and/or InDesign files. Though mangled and inadequate, such copy will then be “exported” for E-book “formatting.”

    I believe Ben’s article has been taken down.

    I believe those who still confuse the creative process of writing with the technical (although not destitute of its own creativity) process of book creation would benefit from reading about how writers write.

    I believe that after a few testimonials one will see that, in matters of heart and inspiration, technology is more than welcome to be useful, but only whilst staying ubiquitous:

    J.G. Ballard: How I write:
    http://entertainment.timesonline.co.uk/tol/arts_and_entertainment/books/article439694.ece

    How I Write by Bertrand Russell
    http://www.solstice.us/russell/write.html

    How I Write by Richard Milward
    http://www.faber.co.uk/article/2009/2/how-i-write/

    Neil Gaiman: how I write
    http://www.timeout.com/london/books/features/2100/Neil_Gaiman-how_I_write.html

    http://www.thewritingcentre.com/how-i-write

    It still is all about the users, even when they are the authors themselves.

    Cheers!

  5. so, joe clark has joined the .epub adherents proclaiming
    that all other formats, heretofore and in the future, must
    die.  yawn.  html, with or without the x, is already fading.

    many of us are already in the process of making better
    formats, and we ain’t stoppin’ just ‘cause joe says so…

    look at the comment box for this very blog article.

    it allows textile, because authors like light-markup
    —because it stays out of our way when we’re writing.

    and sure, then we convert it to (x)html, and from there
    it might get shuttled on to .epub format, but why?

    why not just have the rendering agent take textile input,
    convert it to (x)html itself, and then to .epub if it must,
    and then display it.  (why do we have to do all of these
    conversions, that the machine can do just as well itself?)

    the answer is that we don’t.

    we feed the machine textile (or another light markup).

    and once the machine realizes that it could render our
    textile file, as raw input itself, and render it as easily as
    (x)html or .epub, it won’t even bother to do conversions.

    make all the proclamations you want, joe.

    we ain’t listening.  and we ain’t stopping.

    we’re making the future, and you can’t fight it.

    -bowerbird

  6. Hi bowerbird,

    I find Joe Clark the kind of writer who forces me to re-read things constantly. He’s got a writing style that declares the most absurd things upfront, factually and without prejudice, and then discusses whether they are relevant/feasible/sensible or not, establishing exceptions only later.

    But I think he is not really saying that HTML will serve all books as a single distribution solution:

    ***

    “HTML doesn’t work for all documents, since it lacks important structural features. (HTML5 addresses some of those deficiencies but won’t help today’s E-books.) HTML does work for huge numbers of documents, many of which we call books. Bet against HTML for online distribution and you’ve backed the wrong horse.” – Joe Clark

    ***

    As far as I can see, HTML can serve most books out there, but it might not serve those books whose text formatting is also part of the narrative, as happens with certain styles of poetry where the space and arrangement of the words on the canvas/paper also convey meaning. This is the case with concrete poetry, a poetic style still very popular in my motherland Brazil (http://en.wikipedia.org/wiki/Concrete_poetry).

    I believe the article doesn’t exclude the possibility of those literary exceptions that make literature a 4 dimensional experience, an experience which HTML still can’t represent easily.

    I am not totally against the creation of a full-blown XML language for literary content that tried to close this gap instead of HTML (as long as we don’t assume the Philistine attitude that writers must write under the terms of a technical format standard).

    Nevertheless, in the current context, HTML can still be useful for a great range of standard publications and books. In my humble opinion.

    Cheers,

    Luis

  7. I enjoyed the article — it’s exactly what needs to be said. We’re in the middle of an illusory “backlist goldrush” at the moment, meaning digitize and damn the quality, but the glut of bad ebooks will soon prompt publishers to think they can differentiate big releases on quality again. I don’t think it’ll take too long.

    Anyway, a correction on the NY Times reference. The linked article says: “So on a $12.99 e-book, the publisher takes in $9.09. Out of that gross revenue, the publisher pays about 50 cents to convert the text to a digital file, typeset it in digital form and copy-edit it. Marketing is about 78 cents.”

    It’s kind of mind-boggling that one could read that and think it meant 50 cents total. It’s 50 nominal cents per copy sold.
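
    To make the per-copy reading concrete, here is a small Python sketch using the figures from the quoted passage. The number of copies sold is a made-up value for illustration only; it does not come from the New York Times article.

```python
# Per-copy figures from the quoted passage, held as integer cents
# to avoid floating-point rounding.
PRICE_CENTS = 1299       # list price of the e-book
PUBLISHER_CENTS = 909    # what the publisher takes in per copy
CONVERSION_CENTS = 50    # conversion, typesetting and copy-editing
MARKETING_CENTS = 78     # marketing


def conversion_total_cents(copies_sold):
    """Total conversion spend if 50 cents is a per-copy figure."""
    return CONVERSION_CENTS * copies_sold


def left_per_copy_cents():
    """Publisher's take per copy after conversion and marketing only."""
    return PUBLISHER_CENTS - CONVERSION_CENTS - MARKETING_CENTS


copies = 10_000  # hypothetical sales figure, not from the article
print(f"Conversion budget: ${conversion_total_cents(copies) / 100:,.2f}")
print(f"Left per copy: ${left_per_copy_cents() / 100:.2f}")
```

    Read this way, the 50 cents is an allocation that scales with sales, not the whole cost of producing the digital edition.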

    But otherwise, a great article. More of this please, ALA.

  8. There have been comments about how to render footnotes within .epub/xhtml. Although this is important, a corresponding feature should be to allow the ebook to be cited, especially with non-fiction. Generally, it is simple to reference a paper copy, but citing an electronic version is oddly trickier.

    First there should be a URL that is associated with the publication. It may be that there are thousands of posted versions of public domain books, but each should have a corresponding URL. On the possibility that each version should have differences, intentional or not, having a separate URL for each instance is important.

    Second, the publisher should make sure that each portion that might be cited should have an id attribute built into the document. That could be based on div or paragraph tags, or even span tags.

    Third, there should be an obvious way to show the embedded id so that anyone citing or copying a small portion can easily include the citation back to the document being cited. I came up with a simple way using standard HTML tags that allow this, that I call Embedded Self Cites. The citation would be a URL that would include an HTML fragment reference (e.g. http://ebook.example/booktitle#chapter1para4 ). (see http://advocatehope.org/tech-tidbits/embedded-citations )
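
    The first two steps above (one URL per publication, an id on every citable portion) can be sketched in Python as follows. This is not the Embedded Self Cites technique itself, only an illustration of how fragment URLs fall out of id attributes; the class `IdCollector`, the function `citation_urls`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class IdCollector(HTMLParser):
    """Collect every id attribute in document order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def citation_urls(base_url, markup):
    """Return one citable URL (base plus #fragment) per id in the markup."""
    collector = IdCollector()
    collector.feed(markup)
    collector.close()
    base = base_url.split("#", 1)[0]  # drop any fragment already present
    return [f"{base}#{quote(found)}" for found in collector.ids]


sample = '<div id="chapter1"><p id="chapter1para4">Text</p><p>No id</p></div>'
for url in citation_urls("http://ebook.example/booktitle", sample):
    print(url)
```

    Anything without an id, like the second paragraph in the sample, simply cannot be cited at that granularity, which is the argument for building ids in at publication time.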

    In addition to using a URL for a citation, each ebook, ebook abstract, or ebook portion that can be cited with a URL could be printed on paper with a QR barcode for the URL, allowing most smartphones or ebook readers with a camera to scan the barcode and go to the ebook directly or to a pay page/login. (see: http://docs.google.com/Doc?docid=0AV6jPr0LRFa0ZGZ4Z2NkZmNfODZnZmpnbXJkdg&hl=en_GB#Bar_Codes )

    - Daniel Bennett

  9. Great help on e-books. I find e-books can be great for learning online, and with modern devices such as iPhones they are easily accessible.

  10. Dead-on commentary, Joe. Since the early days of computer typesetting, publishers have been using some form of simple inline markup for formatting. You’re so right to say that HTML is the evolutionary heir to all that.

    People need not be daunted by the apparent complexity, either. As you say, it’s only a limited subset of HTML that’s needed for most E-books. Learning HTML is almost unnecessary for publishers if they adopt the use of Markdown, a text-to-HTML conversion tool for web writers (http://daringfireball.net/projects/markdown/). I have no personal interest in Markdown, which is offered free, but as soon as I saw it I recognized its value.

    As always, Joe, I really appreciate your insights.

  11. Sigil is an open source project which is also challenging InDesign. It is a WYSIWYG editor which runs on Mac, Windows, and Linux.

  12. I was dismayed that in your article about eBook Standards, you did not address the issue of eReaders deliberately not following the spec in their benevolent but misguided attempt to display ebooks more attractively to readers (as described at the beginning of this article: http://www.pigsgourdsandwikis.com/2010_03_01_archive.html)

    I think it is absolutely essential for Web standardistas to stand up right now and call for eReaders to follow those specs. Otherwise, it is only a matter of time before someone does develop an eReader that listens to the book designer regardless of the standards (Netscape anyone?) But by then it will be too late.

    Further, I am so incredibly tired of people being paternalistic about what should be allowed and what shouldn’t. Whether you think video in a book is useful is not relevant to eBook standards. I can think of many instances, particularly in technical books, in which a short video tutorial would be much more instructive than a series of screenshots.

    And Apple says not to use fonts when designing eBooks for the iBookstore, because it “creates a bad user experience”. Excuse me? Get your hands off my design. If you don’t like it, don’t buy it.

    Regardless, it’s not for you, Apple, or anyone else to decide on aesthetic grounds. Either it follows the standards, or it does not.

    (and frankly, it’s a pain constructing a complicated rebuttal in this tiny text box!)

  13. When I posted initially I hadn’t actually looked at any ebooks, which is why the idea that they should be done in HTML seemed so bizarre to me. Since then I’ve acquired an iPhone (not an iPad) and have downloaded a few free ebooks in different formats, and the eReader. I can see now why Mr Clark was going on about em-dashes and block paragraphs (hatred of the latter being one area I agree with him) – and to which I’d add dumb quotes – because the typographic execution in the examples I’ve seen (and those in screen shots from paid books) varies from poor to dreadful. I certainly couldn’t bring myself to read anything on my iPhone set like that, and I see little prospect of any improvement, except perhaps on the iPad, where the Kennedy book that Jobs presented may raise expectations. Mr Clark seems to have got his wish for the triumph of HTML in eBooks. He should have perhaps been a bit more careful what he was wishing for.

  14. Thanks Joe Clark. Good points. Many times change for the good means diminishing status for icons.

    Thanks, A List Apart. Another fine topic well presented.

    With respect to ebooks: I purchased two books a couple of months ago from O’Reilly.

    Automating System Administration with Perl, 2Ed
    CSS Cookbook, 3Ed

    Purchased – both print and ebook. It was a bundle, and I was curious. For my purposes, with Acrobat 8 Pro, the PDF version has more utility than EPUB. At least for now. BTW, I don’t have a reader.

    I viewed the EPUBs with Adobe Digital Editions 1.7.4 (I think) on a Vista PC.

    For those interested, including perhaps some of the other commenters, I looked into the EPUB files with PKZIP.

    I unzipped the EPUB file and looked inside the files with Wordpad. There’s both HTML and XML inside.

    E.g., for Automating System Administration with Perl, 2Ed
    There’s XML here –
    in OEBPS/content.opf
    in OEBPS/toc.ncx
    in META-INF/container.xml
    and HTML
    in OEBPS/index.html
    Kind of quirky, it looks like this –

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html ><head>
    … etc.

    EPUB looks more like a hybrid.
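
    For anyone who would rather script this inspection than use PKZIP and Wordpad, here is a minimal Python sketch using only the standard library. An EPUB is a ZIP archive, and META-INF/container.xml points at the package (.opf) file. The helper names `list_epub` and `make_sample_epub` are hypothetical, and the sample archive is a stripped-down stand-in built in memory, not a valid book.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by META-INF/container.xml in the OCF container format.
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def list_epub(epub_file):
    """Return (member names, path of the package document) for an EPUB."""
    with zipfile.ZipFile(epub_file) as archive:
        names = archive.namelist()
        container = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
    opf_path = rootfile.get("full-path") if rootfile is not None else None
    return names, opf_path


def make_sample_epub():
    """Build a tiny stand-in archive in memory (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
        archive.writestr("OEBPS/content.opf", "<package/>")
        archive.writestr("OEBPS/index.html", "<html/>")
    buffer.seek(0)
    return buffer


names, opf_path = list_epub(make_sample_epub())
print(names)
print(opf_path)
```

    `list_epub` also accepts a filename, so pointing it at a purchased .epub file should list the same kind of entries described above.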

    I tried looking here too –
    OEBPS Container Format (OCF) 1.0 Specification Documents
    HTML Version
    http://www.openebook.org/ocf/ocf1.0/download/ocf10.htm
    See APPENDIX B: Example

    Have a good week.

    Brian

  15. I’m a writer, a poet at least. My current project, though—a custom document management system to resolve some bad version control problems I run into as I write at home and on the road—led me to see what ALA had to say about the topic.

    While I agree with the value of HTML for presenting these documents (which represents the potential for a massive open-standards win over the proprietary systems to which we lend our literary art and scholarship) I’m not sure the XHTML-now approach is the best. I’m inevitably biased by my own proclivities, but I think it’s easier to expect writers, editors and publishers (“book people” as Clark says) to produce well-formed XML with semantic markup in terms they already understand.

    I don’t think it’s hard to convince a poet that she can write her poems in XML with tags like <TITLE>, <STANZA>, and <LINE>, but it’s going to be a much greater leap (and will, undoubtedly, be pushed off on someone else) to expect her to understand how to use HTML’s not-quite-perfect-yet markup in a semantic manner. Even if a few variations of the XML tags crop up, the sets used will be relatively simple, and this sort of parsing made relatively trivial.

    Let writers and publishers invoke their natural vocabularies through XML, and let the standardistas make decisions about how to parse this XML into display-ready XHTML as necessary. Then our parsers can adapt to the further evolution of HTML as a language (and common practice in e-publishing, no doubt) by continually adapting comprehensive XML documents into the markup that makes the most sense today, and markup which makes sense 20 years from now.
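
    As a rough illustration of how trivial that parsing step can be, here is a Python sketch that maps a poem marked up with <TITLE>, <STANZA> and <LINE> into display-ready XHTML. The <POEM> root element, the function name `poem_to_xhtml`, and the choice of h2/p/br as output are assumptions for this sketch; a different house style would just be a different mapping.

```python
import xml.etree.ElementTree as ET


def poem_to_xhtml(poem_xml):
    """Map <POEM>/<TITLE>/<STANZA>/<LINE> onto a <div> of <p> stanzas."""
    poem = ET.fromstring(poem_xml)
    div = ET.Element("div", {"class": "poem"})

    title = poem.find("TITLE")
    if title is not None:
        ET.SubElement(div, "h2").text = title.text

    for stanza in poem.findall("STANZA"):
        p = ET.SubElement(div, "p", {"class": "stanza"})
        for index, line in enumerate(stanza.findall("LINE")):
            text = line.text or ""
            if index == 0:
                p.text = text
            else:
                # Later lines follow a <br/>; their text is the break's tail.
                ET.SubElement(p, "br").tail = text

    return ET.tostring(div, encoding="unicode")


source = (
    "<POEM><TITLE>Dawn</TITLE>"
    "<STANZA><LINE>First light</LINE><LINE>on the water</LINE></STANZA>"
    "</POEM>"
)
print(poem_to_xhtml(source))
```

    Because the writer's file only says what each piece is, the mapping function can be swapped out later without touching the poems.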

  16. I don’t know anything about the business of publishing, but I use CSS, and this was interesting to me,

    “This book has been written entirely in HTML and CSS.”
    … page 353, Chapter 18 – The CSS Saga, the last sentence of the last paragraph.

    Cascading Style Sheets: Designing for the Web, 3/E
    Hakon Wium Lie
    Bert Bos,
    ISBN-10: 0321193121
    ISBN-13:  9780321193124
    Publisher:  Addison-Wesley Professional
    Copyright:  2005
    Format:  Paper; 416 pp
    Published:  04/25/2005

    http://www.pearsonhighered.com/educator/product/Cascading-Style-Sheets-Designing-for-the-Web/9780321193124.page#takeacloserlook

    The sample chapter is “Chapter 3 The amazing em unit and other best practices”
    http://www.pearsonhighered.com/assets/hip/us/hip_us_pearsonhighered/samplechapter/0321193121.pdf

    I’m not a poet, but I humbly suggest that markup – HTML, XHTML, or XML, should be easy for a poet. Or for each poem.

    To me, from a formatting standpoint, a poem has a lot to do with where the line ends and the “paragraphs”.

    “Music is the space between the notes.” Debussy

    In this case, I would “try” surrounding the poem with a div and an id. Then use classes with repeating block elements to make related sets of lines look “right”.

    If I were generating the content, I might try writing in Notepad without markup tags, and separating each set of lines with a pair of carriage returns and linefeeds. Then, when done, I would turn word wrap off, and save.

    This would make it easy to add the markup at the beginning of each “line” or “paragraph”. And somewhat simple to add the closing tag.

    Then, into my web page, with something like –
    <body>
    <div id="poem">

    </div> <!-- close poem -->
    </body>
    … Cut and Paste from the notepad file.

    Perhaps the classes might be defined like –
    #poem p.haiku
    #poem p.limerick
    #poem p.the-bop
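
    The Notepad workflow above (plain lines, a blank line between sets of lines, markup added afterwards) can be automated. A minimal Python sketch, assuming the poem was saved with Windows line endings; the function name `poem_to_html` and the decision to mark line ends with <br /> are illustrative choices, not something specified in this comment.

```python
import html


def poem_to_html(text, form):
    """Wrap a plain-text poem in <div id="poem">, one <p> per set of lines.

    `form` becomes the class name, e.g. "haiku" or "limerick".
    """
    text = text.replace("\r\n", "\n")  # Notepad saves CR+LF line endings
    stanzas = [s for s in text.strip().split("\n\n") if s.strip()]
    parts = ['<div id="poem">']
    for stanza in stanzas:
        lines = [html.escape(line.strip(), quote=False)
                 for line in stanza.splitlines()]
        parts.append(f'  <p class="{form}">'
                     + "<br />\n    ".join(lines)
                     + "</p>")
    parts.append("</div> <!-- close poem -->")
    return "\n".join(parts)


print(poem_to_html("line a\r\nline b\r\n\r\nline c", "haiku"))
```

    The selectors listed above (`#poem p.haiku` and so on) would then style each form without any change to the generated markup.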

    Just a thought.

    Sometimes it can be straight-forward to do a stand-alone markup document. As compared to a multitude of documents and media types in a web site.

    Have a good weekend.

  17. “Dashes. As commonly used in print books, em dash (—) with no spaces on either side does not work in onscreen text.”

    Not only that, in Spanish you just do not use em dash (—) with no spaces on either side. Em dashes —if used correctly in that language, that is— function just like parentheses if in the middle of the sentence —although you do not close them at the end of one.

    Even if the above is incorrect usage in English, I just wanted to illustrate.

    They require a space before the interruption (and after, if there is no period), and just like parentheses, you absolutely do not break the line between it and the letter that sits next to it.

    Unfortunately, Unicode’s Line Breaking Algorithm (http://unicode.org/reports/tr14/) is English-centric (booh!) and says that em dash “provides a line break opportunity before and after the character”, a complete aberration in Spanish typesetting (should be the exact opposite). As a result, pretty much any engine that displays text on screen, modern or old (including of course any browser or ebook reader out there), is chopping lines in Spanish text, leaving orphan em dashes at the end of lines. No single ebook or webpage is surviving this, unless one goes and manually litters all em dashes with zero width no-break spaces at both sides, which is rather gross.
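
    The manual littering can at least be scripted. A Python sketch, with two caveats: it uses U+2060 WORD JOINER, which Unicode recommends for this purpose in place of the zero width no-break space (U+FEFF), and whether a particular browser or reading system honours it is not something this sketch can promise. The function name `protect_em_dashes` is made up for illustration.

```python
import re

WORD_JOINER = "\u2060"  # U+2060 WORD JOINER: invisible, forbids a line break


def protect_em_dashes(text):
    """Glue each em dash to the non-space character it touches.

    A dash next to a space is left alone on that side, so the normal
    break opportunity at the space survives. Running the function twice
    does not add a second joiner.
    """
    # Character before the dash is neither whitespace nor a joiner.
    text = re.sub(r"(?<=[^\s\u2060])—", WORD_JOINER + "—", text)
    # Character after the dash is neither whitespace nor a joiner.
    text = re.sub(r"—(?=[^\s\u2060])", "—" + WORD_JOINER, text)
    return text


sample = "palabra —inciso— palabra"
print(protect_em_dashes(sample).encode("unicode_escape").decode("ascii"))
```

    Running this as a build step keeps the joiners out of the manuscript itself, so the source text stays readable.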

  18. Peter’s comment (http://www.alistapart.com/comments/ebookstandards/P10/#16):
    “We need this also as a base to support annotations. Everyone (Adobe, Stanza, Ibis, Apple) seems to be introducing proprietary solutions for annotations, in some flavor of XML (that gets written into a file and added to the epub manifest? or which lives only in the reader?), but eventually we want to share and aggregate annotations.”

    Joe’s comment (http://www.alistapart.com/comments/ebookstandards/P10/#20):
    “Nonetheless, the problem of a defined structure for annotations is real and rather pressing for production workflow. (Why else are people addicted to the methadone of MS Word?) I have no solution, but then again, I can’t be expected to have one.”

    Supporting marginalia (annotations, etc.; see https://secure.wikimedia.org/wikipedia/en/wiki/Marginalia) has been a goal for a long time, and not just for ePubs. See these references:

    • Seeing the picture – Crowdsourcing annotations for books (and eBooks): http://blog.lib.uiowa.edu/hardinmd/2009/06/08/crowdsourcing-annotations-for-books-and-ebooks/
    • From Personal to Shared Annotations: http://www.csdl.tamu.edu/~marshall/CCM-AJB.pdf
    • Social Annotations in Digital Library Collections: http://www.dlib.org/dlib/november08/gazan/11gazan.html

    “How to express and exchange annotations” (https://github.com/nichtich/marginalia/wiki/Support-of-PDF-annotations) focuses on PDF annotation methods.

    The Fascinator (https://fascinator.usq.edu.au/trac/wiki/Annotate/existing) also has some information, as does Wikipedia’s Web annotation article (https://secure.wikimedia.org/wikipedia/en/wiki/Web_annotation).

    ncarr’s comment (http://www.alistapart.com/comments/ebookstandards/P20/#25):
    “This can be overcome by simply adding unique identifiers to the document objects for use as anchors. This is almost a job for microformats…”

    I disagree. I think it should be based on DocBook (http://www.docbook.org/) or some other XML format. (In DocBook, it’s a solved problem.) DocBook has support for several missing features of ePub: <chapter>, <section>, <sidebar>, <equation>, <figure>, <footnote>, <annotation>, <set> (a collection of books, like an encyclopedia or The Art of Computer Programming), as well as support for MathML (http://www.w3.org/Math/) and SVG (http://www.w3.org/Graphics/SVG/).

    “Still, as simple as the solution is, it would be great if someone would take the lead and publish some basic conventions so that the problem didn’t have to be solved and re-solved over and over and over again. The citation industry is both progressive and vibrant but it is geared towards problems a lot bigger than putting a few footnotes in an ebook.”

    I agree. Having some high quality examples would be beneficial.

    Daniel Bennett’s comment (http://www.alistapart.com/comments/ebookstandards/P30/#38):
    “First there should be a URL that is associated with the publication. It may be that there are thousands of posted versions of public domain books, but each should have a corresponding URL. On the possibility that each version should have differences, intentional or not, having a separate URL for each instance is important.”

    This actually exists. See Digital Object Identifier (https://secure.wikimedia.org/wikipedia/en/wiki/Digital_object_identifier), though this also has problems (https://secure.wikimedia.org/wikipedia/en/wiki/Baen_Books#Baen_Digital_Object_Identifiers_.28DOI.29).

  19. This is by far one of the most well written and thoroughly researched articles on ebooks and HTML. Having a little knowledge of HTML can make a huge difference in creating ebooks. Without it, good luck keeping the formatting of your original document. Keep up the great work!
