Web Standards for E-books

by Joe Clark

49 Reader Comments

  1. I’m currently researching the possibilities in e-books, and this article helped me clarify the answer I should give our clients.

  2. E-books break page numbers, and the topic was not discussed.
    I have not yet seen a good solution to that issue.
    How can we get rid of page numbers for a human?
    Should we get rid of page numbers?

  3. Page numbers almost certainly need to go.

    Page numbers are tied to a certain fixed page size. E-books do not and should not have fixed page sizes — the user may want to change the font size (and the font face, too) or read the book on a larger screen. So if page numbers were to stay, they would have to be re-generated on the fly.

    Suppose now reader software allows you to scroll one line up. Are you now on page 199.98?

    Also, having page numbers allows one to refer to them. “In the book ‘War and Peace’, page 472…” Page 472 in which page size, font size, and page orientation? Even for print books, such a reference has to be disambiguated with a specific printing.

    Page numbers are meaningless. Instead, we should refer to chapters, sections, subsections, paragraphs and sentences; and, for well-structured text, to anchors.

    Tables of contents and indexes are better handled by reader software (what’s so difficult in collecting all headings from a correctly marked-up book?), and cross-references can use anchors.
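
To illustrate the point about collecting headings, here is a minimal sketch of how reader software might build a table of contents from marked-up text. It assumes headings are plain `h1` to `h6` elements that carry `id` attributes; the class name `HeadingCollector`, the function `build_toc`, and the sample markup are all hypothetical, not part of any real reading system.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, id, text) for every h1-h6 element in a document."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.toc = []
        self._current = None  # [level, id, text parts] while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = [int(tag[1]), dict(attrs).get("id"), []]

    def handle_data(self, data):
        # Text inside nested inline elements (em, span, ...) is gathered too.
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._current is not None:
            level, anchor, parts = self._current
            self.toc.append((level, anchor, "".join(parts).strip()))
            self._current = None


def build_toc(markup):
    """Return a list of (level, anchor id, heading text) tuples."""
    collector = HeadingCollector()
    collector.feed(markup)
    collector.close()
    return collector.toc


sample = """
<h1 id="ch1">Chapter One</h1>
<p>Text.</p>
<h2 id="ch1-s1">A <em>first</em> section</h2>
<p>More text.</p>
"""

for level, anchor, title in build_toc(sample):
    print("  " * (level - 1) + f"{title} -> #{anchor}")
```

Each entry pairs the heading text with an anchor, so the generated table of contents can link by fragment identifier rather than by page number.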

  4. The concept of a standard is far removed from the logic of editorial production. Many printed books, such as school books or manuals, have complex layouts. The production of content is not linear and does not rely on Word or similar tools. The final content is assembled in a DTP program like XPress or InDesign.
    For these books, XHTML/CSS alone is not enough. Getting clean output from InDesign is also a gamble. And even if the code is valid, everything must be checked, because images flow around the text or end up at the bottom of the document.
    So this approach works only with simple designs, linear text, and few or no images.

  5. Thank you for the discussion—you’ve cleared up several questions I have as a potential content provider.

    I’ve been trying to research app standards in regard to graphics files, as well. ARE there any as to resolution, pixel dimension, screen ratios, format, etc.? Do standard web screen resolutions apply for teeny-tiny viewing screens?

    We picture book illustrators are used to a standard print format of 32 pages—is there any corresponding “average” length (number of screens?) for original content?

    Thanks!

  6. The non-Shortcovers EPUB edition of The City & The City I read didn’t have any of the problems you described. It appeared to have been produced by InDesign (implying some degree of manual work) and even embedded a font which contains glyphs for all of the unusual characters used.

    Not that ebook production doesn’t have problems right now, but I think Shortcovers is a bit worse than the norm.

  7. Putting aside the question of whether or not page numbers are useful for e-books, it’s not true that e-books do not support fixed page numbering.  Epub provides a method to map the specific page numbers so that the e-book page numbers will correspond to print page numbers (see here: http://blog.threepress.org/2009/11/26/adobe-page-map-versus-ncx-pagelist/).  Since we’re talking about standards, I think this is worth addressing.  Page numbers are not meaningless—they have a definite use in citing references, which is important for any type of scholarly work.
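
As a rough sketch of what such a page map looks like to software, the snippet below reads a `pageList` from an NCX document and maps each print page label to a location in the book. The sample is a trimmed fragment written for illustration (a complete NCX file needs more than this), and the file names, ids, and the function name `page_map` are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace used by NCX documents.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Illustrative fragment only: file names and ids are made up.
SAMPLE_NCX = """
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <pageList>
    <pageTarget id="page1" type="normal" value="1" playOrder="1">
      <navLabel><text>1</text></navLabel>
      <content src="chapter1.xhtml#page1"/>
    </pageTarget>
    <pageTarget id="page2" type="normal" value="2" playOrder="2">
      <navLabel><text>2</text></navLabel>
      <content src="chapter1.xhtml#page2"/>
    </pageTarget>
  </pageList>
</ncx>
"""


def page_map(ncx_text):
    """Map each print page label to the href where that page begins."""
    root = ET.fromstring(ncx_text.strip())
    pages = {}
    for target in root.findall("./ncx:pageList/ncx:pageTarget", NCX_NS):
        label = target.findtext("ncx:navLabel/ncx:text", default="", namespaces=NCX_NS)
        content = target.find("ncx:content", NCX_NS)
        if content is not None:
            pages[label.strip()] = content.get("src")
    return pages


print(page_map(SAMPLE_NCX))
```

A citation such as "page 2" can then be resolved to a fragment identifier regardless of font size or screen dimensions.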

  8. For what could be called pure electronic books, even if there is also a printed book, page numbers indeed don’t make any sense. An index or table of contents has to use hyperlinks, not page numbers. (And don’t think an index isn’t useful. I assure you that a search function is not one-tenth as useful.)

    However, for alternate formats that are meant to be a conceptual duplicate of a printed book, you do need to encode page numbers somehow. The standard example is books for the blind. Large print never does this, but Braille books can print the original page number and the Braille page number on each sheet; analogue talking books play a tone when the reader turns a page (this is outdated, obviously); DAISY electronic talking books can notate original printed page in various ways, though it’s been a while since I read that spec.

    I grant this causes complications when you want to write a bibliographic citation for a “book” when all you have is the E version and your readers are almost certainly going to be looking at the P version.

  9. Livio, you are articulating a well-accepted position, but there are flaws in your foundation.

    A printed page made for a reader with no relevant disabilities is a random-access medium. You can look anywhere you want and read anything you want in any order, or just put the book on the floor and admire a double-page spread.

    Electronically, the issue becomes reading order (“logical” reading order in the terminology). Do the contents, when read start to finish, make sense? If you jump in at a certain point and read from there to the finish, does it make sense?

    In InDesign, proper threading order of text frames results in (e.g.) a tagged PDF with a logical reading order. It is true that the designer must make a decision as to when the reader is to experience a callout or a sidebar. My experience is this is only occasionally a real cause for debate.

    The same applies to actual E-books. You need a logical reading order. ePub allows CSS placement of callouts and sidebars, which could instead (or in addition) be separate files.

    Not every aspect of print graphic design can be duplicated in electronic document design, which relies fundamentally on structure, not inferences drawn from appearance.

  10. Not so sure that an “author” should be making the decisions on which type of space character should be used in a given context. Seems like that should be handled by the rules engine that proofs the manuscript for well-formedness.

    mxt

    THINK
    think different
    Think Open Source


    “I need more space” – Creature Comforts

  11. More accurately, authors shouldn’t be allowed to use the wrong characters. That’s why we need editors, the first ones to be fired (or cheapened by outsourcing or hiring greenhorns straight out of university).

  12. Everyone complains about full-justified text in E-readers (text with straight left and right margins)

    I guess I’m in the minority then. I much prefer justified text, and lament its absence on most websites. Things just look cleaner when that right margin is aligned.

  13. Page numbers definitely need to go…

  14. Earlier this evening I made a Twitter (@epub) response to this article which was, “Great article from ALA…Web Standards for E-books…but I still have doubts about HTML”. Joe responded by asking what doubts – perhaps I didn’t use my 140 Twitter characters well enough – 140 characters probably aren’t enough anyway.

    In essence I agree with most of what you say Joe, though it’s all those “issues” that is the reason why I said I have doubts.

    I’d put the argument over HTML for eBooks into two camps: yes for the indie author, enthusiast or small publisher, and no for the big publishers.

    Let’s face it, those indie authors just want to get their books out there; they certainly don’t want to go learning some new-fangled XML markup. They know HTML, they’ve been reading ALA so are up on standards ;-) and they are happy to continue that way.

    The big publishers though (who are ultimately the important sector as they are making vast amounts of money from selling these eBooks and so should be getting it right) will need to consider something more structured and which is designed specifically for marking up the book language. This would be where a language like DTBook would give them much better control of the structural elements, which ultimately will give the reading system (eReader) better control over how to display that text. If the eReader sees a footnote tag, then it knows exactly what it is and can apply its rendering appropriately, rather than having to guess as to what the attribute “foot” or “fn” or even “ftnt” means. I believe we should use CSS only as a guide for the eReaders, as perhaps the end user may wish to use their own custom styling.

    …I’m doing that right now by downplaying the importance of XML and DTBook variants of ePub.

    Personally I think we should be playing up the importance of DTBook. It is certainly not some new upstart and could have been a great language for publishers to use on their titles, especially with its use in accessibility circles, but I have to say that I’m not aware of anyone having actually released an EPUB book with DTBook under the bonnet – not very positive.

    You wrote a long article, Joe, and it’d take me another long article to reply to everything I’d like to. To wrap up:

    I feel a good Industry Standard XML Markup (DTBook) would cause far fewer issues than what would be produced from using HTML (I should note that I haven’t yet looked at HTML5..! so can’t comment on that) though an eBook format needs to also support HTML for the smaller developers. The big publishers, who will need to use mashups, extracts, etc., in a much bigger way, should probably go down the XML route.

    I’m a big supporter of EPUB because it supports both these.

  15. The biggest problem with ebooks is that they’re being produced by technicians who just need to get them out the door. Sometimes I wonder if they’re paid by the kilobyte or something. Publishers seem to be having a hard time hiring people who know about typography and know how to produce well-crafted markup code and are willing to check their results and fix the mistakes.

    Open up a book from Penguin and you’ll find a massive 38kB css file containing every class the designer has ever used, each helpfully named with an obscure code. Just in case anyone thinks this might be a central cache of house styles, the values defined in the classes often change from book to book, usually with extra styles tacked on at the end with their own cryptic name. In one case a 34kB core repository of hermetic mystery is helped along the way by another 109 separate css files, 3 for each chapter, and most of which are empty.

    Random House US does at least use meaningful names, but its early attempt to maintain a standardised list of styles is groaning under the weight of an ever-lengthening list of ‘Added Styles’. To be fair, RH UK does seem to have avoided the blight and produce tailored styles.

    At least one technician working for Macmillan doesn’t understand the difference between ISO-8859-1 and UTF-8 and why the former should never be allowed anywhere near an ePub.

    I could go on, but you get the point.

    It’s no wonder the end result suffers. The principles for producing well-crafted maintainable code apply to xhtml/css just as much as to any other form.

    Each book deserves to be crafted with the same care. Develop a house-style and stick to it. Import only the styles you need and add new styles only when needed. Above all, check the output. Check it on the desktop version of ADE and check it on an actual reader.
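
The ISO-8859-1 versus UTF-8 confusion mentioned above is easy to reproduce. The following sketch shows what happens when UTF-8 bytes are read as ISO-8859-1; the sample string and variable names are made up for the demonstration.

```python
# Text with non-ASCII characters of the kind found in real book copy.
text = "naïve café"

# A production tool writes the file as UTF-8 ...
utf8_bytes = text.encode("utf-8")

# ... and another tool wrongly reads those bytes as ISO-8859-1.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # each accented letter turns into two junk characters

# The damage is reversible only if nothing else touched the text:
# re-encode with the wrong codec, then decode with the right one.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == text)
```

Because every accented or typographic character occupies more than one byte in UTF-8, reading the bytes with a single-byte encoding splits each such character into several wrong ones, which is one source of the garbled quotes and dashes seen in poorly produced e-books.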

  16. As a publisher and epub newbie I’ve got lots of concerns and questions about epub esp its ‘default’ xhtml flavor, aside from lack of quality in implementation. Here are a few:

    I think we should try and support pages, if only to make references from print map onto our ebooks. Joe linked to the ThreePress “blogpost”:http://blog.threepress.org/2009/11/26/adobe-page-map-versus-ncx-pagelist that mentions two approaches, from Adobe and the epub standard-compliant method via NCX/DTBook. But Keith of ThreePress says that no reading systems he knows of support the standards-compliant method.

    We definitely also need a chapter>paragraph>sentence>(word?) reference system for ePub. I would be very interested in discussions of different approaches here. Endless divs and spans with unique ids? 

    We need this also as a base to support annotations. Everyone (Adobe, Stanza, Ibis, Apple) seems to be introducing proprietary solutions for annotations, in some flavor of XML (that gets written into a file and added to the epub manifest? or which lives only in the reader?), but eventually we want to share and aggregate annotations.

    In his recent NYBooks “piece”:http://www.nybooks.com/articles/23683, Jason Epstein suggests that only physical books can retain a morally-important kind of authority (of authorship, of time) that texts need. Only print can provide against the insults of centrally-controlled DRM,  ease of changing digital files, etc. He’s got a point. We’ve got a lot of work to do to make ebooks not only convenient to read, but also authoritative records of texts at particular points of time. Records which can, like print books, acquire their own histories.

  17. I find this article truly bizarre. Mr Clark is clearly in ecstasies over the fact that ePub format that has been adopted for delivery of most ebooks is based on XHTML, but singularly fails to explain why. He assumes that because HTML (and the failed XHTML) are web standards then it is automatically a good thing that they should be used for web books, even though he admits that they are not designed for and lack the constructs to properly display books on the web. Anyway, if you were starting now, would you really design HTML and CSS the way it is? No, neither would I, I would profit from the fifteen years experience and wipe the slate clean and construct new and better standards (Java v. C anyone?). Clearly an HTML/CSS-based format was the easiest thing to do, the easiest thing to get agreement on, and convenient for Apple to use to fight Kindle. But a triumph for the W3C? Get a grip.

    And from the sublime discussion of web standards the article equally bizarrely descends to the level of presentation of em dashes. Well let’s hope most of the books are written in English English, where the typographical convention solves Mr Clark’s problem. But, somehow I don’t think the e-Publishers will be consulting Mr Clark about the presentation of em dashes or anything else.

  18. The last character in this line is not an end double quote, but an open double quote:

    “I’ve Got Chills. They’re Multiplyin’ ”
    (apostrophe; thin space; end double quote)

    Otherwise, great article!

  19. Go back and reread the entire article, making sure to follow any link that even remotely seems to be discussing the limitations of HTML.

    Important note: Your objections were easily structured in HTML.

  20. Peter Schoppert, I have come to believe that page numbers in E-books make sense only if the format actually is meant as a copy of a printed book, viz an alternate format for a blind user. Otherwise page numbers are a mere skeuomorph.

    It’s trivial to include fragment identifiers for all block-level elements, including e.g. headings and paragraphs, making it easy to link to any of those. I frankly don’t see the need to be able to link to any individual word post-facto.

    Nonetheless, the problem of a defined structure for annotations is real and rather pressing for production workflow. (Why else are people addicted to the methadone of MS Word?) I have no solution, but then again, I can’t be expected to have one.
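
A minimal sketch of the fragment-identifier idea from this comment: give every block-level element that lacks an `id` a generated one, so any heading or paragraph can be linked to. The tag list, the `x` prefix, and the function name `add_fragment_ids` are assumptions, and the regular expression is only for illustration; a production workflow would more likely use a proper XML parser on the XHTML.

```python
import re
from itertools import count

# Opening tags of a few block-level elements; the selection is illustrative.
BLOCK_TAG = re.compile(r"<(p|h[1-6]|blockquote|li)\b([^>]*)>", re.IGNORECASE)
HAS_ID = re.compile(r"(?<![\w-])id\s*=", re.IGNORECASE)


def add_fragment_ids(markup, prefix="x"):
    """Add id="<prefix>N" to block-level opening tags that have no id."""
    counter = count(1)

    def repl(match):
        tag, attrs = match.group(1), match.group(2)
        if HAS_ID.search(attrs):
            return match.group(0)  # keep existing ids untouched
        return f'<{tag} id="{prefix}{next(counter)}"{attrs}>'

    return BLOCK_TAG.sub(repl, markup)


print(add_fragment_ids('<h1>Title</h1><p class="a">One</p><p id="keep">Two</p>'))
```

Existing ids are left alone so that links already pointing at them keep working.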

  21. JohanKool, we are aware of more than one necessary correction.

  22. Your main message is absolutely correct – the same mix of HTML, CSS, and JavaScript that drives the web will drive e-books, too. As you point out, there’s no prescience involved, just a willingness to accept the obvious. I would say that, I too, am an HTML triumphalist, but it’s too damned difficult to pronounce. ;)
    I would go one step further, though. That same mix of HTML+ will be the engine of print publishing, as well. When I see the kind of time and energy being spent by someone like Håkon Wium Lie on HTML-to-PDF conversion software like “Prince”:http://www.princexml.com/overview/ , it seems I’m not alone in that belief, either.
    If you take into account the number of web pages printed in a day, you could argue that HTML already is the main engine of desktop publishing.
    We just don’t think that much about media="print", yet. The tools aren’t far enough along. The economics of print publishing don’t quite yet demand the move to one-off print runs. But the day is coming. And this throws a different cast on what e-books can be – which, in your article, assumes a strictly onscreen display.
    Some thoughts:
    Justification and Hyphenation – Dismissing this as “purely a display artifact” seems, well, dismissive. hah! Why don’t you just say H&J isn’t necessary and nobody should bother with it? I think that’s wrong but happily there is no conflict (nor should there be) between providing – within the same HTML-based e-book document – “typeset” H&J text and “ordinary” text. Each can have its own stylesheet. One does not preclude the other. The only downside is more bytes in the file, that’s all. If someone wants to put in the work so it looks a particular way onscreen or in print, let ’em. Please take the “don’t touch” sign off.
    Spacing – Geez, is it possible I “influenced”:http://readableweb.com/ie8-bug-html-spacer-entities-create-one-pixel-jog-in-line-height/#comment-14 you?
    Yes, the emspace, enspace, and thinspace characters are perfectly legit – they, and the other general punctuation characters (8192 thru 8203) can and should be used. Turning to CSS and spans and jumping through hoops for what these spacing characters do effortlessly is ridiculous. There’s a couple of wrinkles though. Most browsers synthesize spacing characters from the metrics of the font, even if the font does not specifically contain the character. (Many newer fonts do contain these “empty” punctuation characters and certainly every font to be used with @font-face should.) Strangely, synthesizing the space is actually non-conformant with CSS 2.1. What Opera does – which is show the “not defined” character in a rectangular box – feels wrong, but it’s technically correct. Incidentally, the “web safe” fonts don’t contain these spacing characters either – what you see when you specify   is synthesized by the browser.
    Scrolling – I believe the need to scroll is a provable “concentration breaker”. And talk about display artifacts – was the scrolling window not mostly a matter of programmatic convenience? And why doesn’t Barnes & Noble carry scrolls instead of bound volumes? A scrolled page must end somewhere, right? So why not end it where there is no more screen to display it? (But on the other hand, an insistence that a page onscreen must look like the page of a book is nonsensical, I’m with you there.) Related to scrolling is columnized layout – there too, I believe, there are provable advantages. Especially when “skimming” the text for points of interest.
    E-Readers – right now, apps like Mobi and Stanza and the like may be necessary but in the long term they will fade away. The idea of an e-book as a “book” existing isolated and apart, unchanged and unchangeable until the next “edition”, disconnected from a network, is absurd.
    Mobi, Stanza, who needs ‘em? Everybody already has an e-reader, it’s called a browser.

    “Rich”:http://readableweb.com
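
For reference, the general punctuation characters in the range 8192 through 8203 mentioned in this comment can be listed with Python's standard `unicodedata` module. This is only a lookup sketch; the variable name `space_names` is made up.

```python
import unicodedata

# Decimal code points 8192..8203 correspond to U+2000..U+200B.
space_names = {cp: unicodedata.name(chr(cp)) for cp in range(8192, 8204)}

for cp, name in space_names.items():
    # Show the decimal value, the hex code point, and a numeric character reference.
    print(f"{cp}  U+{cp:04X}  &#{cp};  {name}")
```

The numeric character references in the third column can be used in markup even when a named entity is unavailable.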

  23. Dear Joe,  I write all my papers with a plain text-editor
    (emacs), in the TeX “markup language” (some people refer to
    ‘latex’, since that is a popular macro package for TeX.)
    (e.g. X_i means X subscript i and \alpha is the letter alpha in TeX.)
    I end up with a PDF file full of nice math text and
    everything else I need.  (Go see some of the papers at www.civilized.com)  I don’t think HTML5 or anything else
    I have heard of would handle the preparation of a document that
    might have some technical content.  – Gary Knott.

  24. Very interesting article and worthwhile reading.

    It’s also my experience that HTML is the best basis for ePub eBook production, but the problem is that most book source files (anyway in my case) are in Word or RTF.

    Converting those sources to proper HTML as described in this article seems to be a utopia. Carriage returns, for example, are converted to paragraphs, and blank lines between paragraphs are converted to separate paragraphs with &nbsp;

    And so on…..

    So what we need is a proper DOC/RTF to HTML converter, Not the “Save as HTML (filtered)” in Word, because that doesn’t work!

  25. The lack of HTML approximates, or equivalents, for page numbers, sections, footnotes/endnotes (citations), cross references and indexes, isn’t because the markup language isn’t rich enough. It’s because the content isn’t rich enough. This can be overcome by simply adding unique identifiers to the document objects for use as anchors.

    Still, as simple as the solution is, it would be great if someone would take the lead and publish some basic conventions so that the problem didn’t have to be solved and re-solved over and over and over again. The citation industry is both progressive and vibrant but it is geared towards problems a lot bigger than putting a few footnotes in an ebook.

    This is almost a job for microformats…

  26. Hi there,

    Thanks for sharing such a great article. I totally agree with the statement you made that “HTML isn’t just for the web. It’s for any text distributed online.” Any text that is being shared online is definitely related to HTML.

    I am a designer myself and I find HTML the main source of all the data being transferred or shared on the web. The design we create tells the visitor what things are displayed and used on our page, but everything that is present on the page is there because of the HTML.

    Nauman Akhtar

  27. Thanks for the article. A remark or two:

    Dashes. As commonly used in print books, em dash (—) with no spaces on either side does not work in onscreen text. […] En dash (–) surrounded by spaces avoids linebreak problems and works better at the intended purpose.

    The ALA styleguide seems to say: use em dash with no spaces, not en dash with spaces. Subject to change?

    I’m not sure about English, but German typography forbids a line to start with a dash; correct usage would be: no-break space; en dash; space.

    bq.    “I’ve Got Chills. They’re Multiplyin’ ” (apostrophe; thin space; end double quote)

    There must not occur a line break between punctuation characters, hence U+202F narrow no-break space should be used: apostrophe; narrow no-break space; end double quote.
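
A small sketch of the suggested sequence (apostrophe; narrow no-break space; end double quote), built from explicit code points so the invisible character cannot be lost in copying. The helper `escape_non_ascii` is hypothetical and simply rewrites non-ASCII characters as numeric character references for use in markup.

```python
import unicodedata

APOSTROPHE = "\u2019"   # right single quotation mark, used as apostrophe
NNBSP = "\u202F"        # narrow no-break space: thin, and forbids a line break
CLOSE_QUOTE = "\u201D"  # right double quotation mark

tail = "Multiplyin" + APOSTROPHE + NNBSP + CLOSE_QUOTE


def escape_non_ascii(text):
    """Replace every non-ASCII character with a hexadecimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in text)


print(unicodedata.name(NNBSP))
print(escape_non_ascii(tail))
```

Whether a given reading system or font renders U+202F at the intended width is a separate question that needs testing on the device.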

  28. Perhaps I have misunderstood the article, but it seems to be confusing HTML with XHTML. They are very different beasts, although, at first, it is difficult to tell them apart because they do look similar. They have very different behaviours. XHTML is cool :)

    It is more acceptable to interchange XML with XHTML, as XHTML is a flavour of XML.

    Also, the link to Ben Hammersley’s call for XML seems to end in 404.

    And as for Java vs. C, there are some things that C can do that Java struggles with, even though it means the programmer must then struggle with C. sigh

  29. Thank you for an excellent article and discussion.  I hope you are planning to write a book on this topic.  I would love to see real-life examples and would definitely purchase it.

  30. The ALA styleguide seems to say: use em dash with no spaces, not en dash with spaces. Subject to change?

    That’s the American standard from the Chicago Manual of Style, and frankly I think it looks terrible on reflowable material unless the renderer is smart. Bringhurst says to use en dashes with spaces, which is the English standard and generally works better with the rather dumb rendering systems used in embedded devices.

  31. Very interesting article, it shows you were born in that jungle. This is why you should visit www.makeyourebooks.com. We got the XML tagging to the industrial processes. I am sure you will be interested.

  32. Great article. Maybe it’s just a subtle difference in semantics, but I feel the use of the term “online” is too restrictive; I would use the term “digitally” instead. When people read ebooks, they won’t necessarily be online (i.e., connected to the internet). Hence:

    “HTML isn’t just for the web… It’s for any text distributed online digitally.”

    “HTML is the preferred way to mark up and publish online digital documents that are not websites.”

  33. Hi. I’ve been producing ebooks since 1998, so I may have something worth contributing to this debate. I began way back then with the premise that one could use HTML to format a book so as to make it readable online. This premise was borne out of an aesthetic reaction to reading in plain text.

    I quickly adopted a second premise: that these ebooks would be new editions, rather than attempting to be facsimiles of a particular print edition.

    I’ll be the first to admit that my earliest efforts were quite horrible. (Although I thought they were good at the time.) HTML by itself is inadequate to the task. But HTML with CSS produces—I believe—results that are the equal of print for the vast majority of books. I’m slowly upgrading the older editions to match my current standard.

    To address some of the issues raised:

    1. Numbering. Page numbers make no sense in ebooks (as new editions) because there are no pages. They are an artifact of print. However, we are so used to being able to reference text via page number, that many readers are lost without them, no matter how many times I say it’s perfectly OK to cite using a URL.

    In my ebooks, chapters are numbered, as are sections, parts and anything else likely to be referenced in a ToC. But I don’t number paragraphs (although it would be easy to add ‘id="n"’ to each para.) because I don’t see an easy, unobtrusive way to inform the user of their existence.

    2. Sections. My approach here is to simply use a DIV wrapper to each section (or chapter, …) with a class attribute. E.g. your example of spacing between sections is easily achieved with <div class="section"> and “.section { margin-bottom:2em; }”

    To that extent, I don’t believe HTML lacks structural features, since you can define them arbitrarily as needed.

    3. I note the appearance in the comments of various suggestions as to why HTML is not as good as TeX/Docbook/etc. This is depressing because I recall similar discussions from ten years back.

    ePub is the best format for ebooks because, unlike all those other formats, HTML is ubiquitous, and easy enough for authors to manage. Or, to be completely minimal, authoring can be done in plain text, which can then be easily converted to ePub/HTML in a largely automated process. I do this daily with books from Project Gutenberg.

    4. I would argue that many of the typographical conventions are actually “house style”—e.g. the use of small caps. Much can be managed with css.

    5. XHTML/CSS can handle quite complex formatting—e.g. positioning of text around images, margin notes.
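
As a rough illustration of the plain-text route described in point 3, the sketch below turns text with blank-line-separated paragraphs into simple HTML. It is not the commenter's actual process; the function name `text_to_html` and the conventions it assumes (blank lines separate paragraphs, a single title supplied by the caller) are hypothetical.

```python
import html
import re


def text_to_html(text, title):
    """Convert plain text to an h1 plus one <p> per blank-line-separated paragraph."""
    # Split on blank lines; hard line breaks inside a paragraph are rejoined.
    chunks = [c for c in re.split(r"\n\s*\n", text.strip()) if c.strip()]
    paragraphs = [f"<p>{html.escape(' '.join(c.split()))}</p>" for c in chunks]
    return f"<h1>{html.escape(title)}</h1>\n" + "\n".join(paragraphs)


source = "First line\ncontinues here.\n\nSecond & last paragraph."
print(text_to_html(source, "A Sample Chapter"))
```

Escaping with `html.escape` keeps characters such as `&` and `<` in the source text from producing invalid markup.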

    If interested, my ebooks are available at http://ebooks.adelaide.edu.au
    I do not claim to have everything right. But I welcome constructive criticism.

  34. Now, the foregoing is so optimistic as to be ridiculous. Authors are not going to start writing in HTML, let alone the full-on XML that Ben Hammersley has called for. Book copy will continue to be saved as MS Word, Xpress, and/or InDesign files. Though mangled and inadequate, such copy will then be “exported” for E-book “formatting.”

    I believe Ben’s article has been taken down.

    I believe those still in confusion between the writing creative process and the technical (although not destitute of its own creativity) book creation process would benefit from reading on how writers write.

    I believe that after a few testimonials one will see that with matters of heart and inspiration, technology is more than welcome to be useful but only whilst staying ubiquitous:

    J.G. Ballard: How I write:
    http://entertainment.timesonline.co.uk/tol/arts_and_entertainment/books/article439694.ece

    How I Write by Bertrand Russell
    http://www.solstice.us/russell/write.html

    How I Write by Richard Milward
    http://www.faber.co.uk/article/2009/2/how-i-write/

    Neil Gaiman: how I write
    http://www.timeout.com/london/books/features/2100/Neil_Gaiman-how_I_write.html

    http://www.thewritingcentre.com/how-i-write

    It still is all about the users, even when they are the authors themselves.

    Cheers!

  35. so, joe clark has joined the .epub adherents proclaiming
    that all other formats, heretofore and in the future, must
    die.  yawn.  html, with or without the x, is already fading.

    many of us are already in the process of making better
    formats, and we ain’t stoppin’ just ‘cause joe says so…

    look at the comment box for this very blog article.

    it allows textile, because authors like light-markup
    —because it stays out of our way when we’re writing.

    and sure, then we convert it to (x)html, and from there
    it might get shuttled on to .epub format, but why?

    why not just have the rendering agent take textile input,
    convert it to (x)html itself, and then to .epub if it must,
    and then display it.  why do we have to do all of these
    conversions, that the machine can do just as well itself?

    the answer is that we don’t.

    we feed the machine textile (or another light markup).

    and once the machine realizes that it could render our
    textile file, as raw input itself, and render it as easily as
    (x)html or .epub, it won’t even bother to do conversions.

    make all the proclamations you want, joe.

    we ain’t listening.  and we ain’t stopping.

    we’re making the future, and you can’t fight it.

    -bowerbird

  36. Hi bowerbird,

    I find Joe Clark the kind of writer who forces me to re-read things constantly; he’s got a writing style that declares the most absurd things upfront, factually and without prejudice, and then discusses whether they are relevant/feasible/sensible or not, establishing exceptions only later.

    But I think he is not really saying that HTML will serve all books as a single distribution solution:

    ***

    “HTML doesn’t work for all documents, since it lacks important structural features. (HTML5 addresses some of those deficiencies but won’t help today’s E-books.) HTML does work for huge numbers of documents, many of which we call books. Bet against HTML for online distribution and you’ve backed the wrong horse.” – Joe Clark

    ***

    As far as I can see, HTML can serve most books out there but it might not serve those books whose text formatting is also part of the narrative as it happens with certain styles of poetry where the space and arrangement of the words on the canvas/paper also convey meaning, as it is the case of concrete poetry, a poetic style still very popular in my motherland Brazil (http://en.wikipedia.org/wiki/Concrete_poetry).

    I believe the article doesn’t exclude the possibility of those literary exceptions that make literature a 4 dimensional experience, an experience which HTML still can’t represent easily.

    I am not totally against the creation of a full-blown XML language for literary content that tried to close this gap instead of HTML (as long as we don’t assume the Philistine attitude that writers must write under the terms of a technical format standard).

    Nevertheless, in the current context, HTML can still be useful for a great range of standard publications and books. In my humble opinion.

    Cheers,

    Luis

  37. I enjoyed the article — it’s exactly what needs to be said. We’re in the middle of an illusory “backlist goldrush” at the moment, meaning digitize and damn the quality, but the glut of bad ebooks will soon prompt publishers to think they can differentiate big releases on quality again. I don’t think it’ll take too long.

    Anyway, a correction on the NY Times reference. The linked article says: “So on a $12.99 e-book, the publisher takes in $9.09. Out of that gross revenue, the publisher pays about 50 cents to convert the text to a digital file, typeset it in digital form and copy-edit it. Marketing is about 78 cents.”

    It’s kind of mind-boggling that one could read that and think it meant 50 cents total. It’s 50 nominal cents per copy sold.

    But otherwise, a great article. More of this please, ALA.

  38. There have been comments about how to render footnotes within .epub/xhtml. Although this is important, a corresponding feature should be to allow the ebook to be cited, especially with non-fiction. Generally, it is simple to reference a paper copy, but citing an electronic version is oddly trickier.

    First there should be a URL that is associated with the publication. It may be that there are thousands of posted versions of public domain books, but each should have a corresponding URL. On the possibility that each version should have differences, intentional or not, having a separate URL for each instance is important.

    Second, the publisher should make sure that each portion that might be cited should have an id attribute built into the document. That could be based on div or paragraph tags, or even span tags.

    Third, there should be an obvious way to show the embedded id so that anyone citing or copying a small portion can easily include the citation back to the document being cited. I came up with a simple way using standard HTML tags that allow this, that I call Embedded Self Cites. The citation would be a URL that would include an HTML fragment reference (e.g. http://ebook.example/booktitle#chapter1para4 ). (see http://advocatehope.org/tech-tidbits/embedded-citations )
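
A minimal sketch of the second and third points, assuming a hypothetical ID scheme (`chapterNparaM`, modelled on the example fragment in the comment) and the example URL given there. The names `add_paragraph_ids` and `citation_url` are illustrative only and are not taken from the Embedded Self Cites write-up.

```python
import re

BOOK_URL = "http://ebook.example/booktitle"  # example URL from the comment above


def add_paragraph_ids(xhtml, chapter):
    """Give every bare <p> tag a sequential id (hypothetical naming scheme)."""
    counter = 0

    def number(match):
        nonlocal counter
        counter += 1
        return f'<p id="chapter{chapter}para{counter}">'

    # Only attribute-less <p> tags are matched; paragraphs that already
    # carry attributes are left untouched.
    return re.sub(r"<p>", number, xhtml)


def citation_url(fragment_id):
    """Build a citable URL that ends in an HTML fragment reference."""
    return f"{BOOK_URL}#{fragment_id}"


chapter_one = "<p>First.</p><p>Second.</p>"
marked_up = add_paragraph_ids(chapter_one, chapter=1)
print(marked_up)
print(citation_url("chapter1para2"))
```

A regular expression is enough for this toy input; a real book would more likely be processed with an XML parser so that existing attributes and nested markup are handled properly.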

    In addition to using a URL for a citation, each ebook, ebook abstract or ebook portion that can be cited with a URL could be printed on paper with a QR barcode for the URL, which would allow most smartphones or ebook readers with a camera to scan the barcode and go to the ebook directly or to a pay/log-in page. (see: http://docs.google.com/Doc?docid=0AV6jPr0LRFa0ZGZ4Z2NkZmNfODZnZmpnbXJkdg&hl=en_GB#Bar_Codes )

    - Daniel Bennett

  39. Great help on e-books. I find e-books can be great for learning online, and with modern applications such as iPhones they are easily accessible.

  40. Dead-on commentary, Joe. Since the early days of computer typesetting, publishers have been using some form of simple inline markup for formatting. You’re so right to say that HTML is the evolutionary heir to all that.

    People need not be daunted by the apparent complexity, either. As you say, it’s only a limited subset of HTML that’s needed for most E-books. Learning HTML is almost unnecessary for publishers if they adopt the use of “Markdown (a text-to-HTML conversion tool for web writers)”:http://daringfireball.net/projects/markdown/ (I have no personal interest in Markdown, which is offered free, but as soon as I saw it I recognized its value).

    As always, Joe, I really appreciate your insights.

  41. Sigil is an open source project which is also challenging InDesign. It is a WYSIWYG editor which runs on Mac, Windows, and Linux.

  42. I was dismayed that in your article about eBook Standards, you did not address the issue of eReaders deliberately not following the spec in their benevolent but misguided attempt to display ebooks more attractively to readers (as described at the beginning of this article: http://www.pigsgourdsandwikis.com/2010_03_01_archive.html)

    I think it is absolutely essential for Web standardistas to stand up right now and call for eReaders to follow those specs. Otherwise, it is only a matter of time before someone does develop an eReader that listens to the book designer regardless of the standards (Netscape anyone?) But by then it will be too late.

    Further, I am so incredibly tired of people being paternalistic about what should be allowed and what shouldn’t. Whether you think video in a book is useful is not relevant to eBook standards. I can think of many instances, particularly in technical books, in which a short video tutorial would be much more instructive than a series of screenshots.

    And Apple says not to use fonts when designing eBooks for the iBookstore, because it “creates a bad user experience”. Excuse me? Get your hands off my design. If you don’t like it, don’t buy it.

    Regardless, it’s not for you, Apple, or anyone else to decide on aesthetic grounds. Either it follows the standards, or it does not.

    (and frankly, it’s a pain constructing a complicated rebuttal in this tiny text box!)

  43. When I posted initially I hadn’t actually looked at any ebooks, which is why the idea that they should be done in HTML seemed so bizarre to me. Since then I’ve acquired an iPhone (not an iPad) and have downloaded a few free ebooks in different formats, and the eReader. I can see now why Mr Clark was going on about em-dashes and block paragraphs (hatred of the latter being one area I agree with him) – and to which I’d add dumb quotes – because the typographic execution in the examples I’ve seen (and those in screen shots from paid books) varies from poor to dreadful. I certainly couldn’t bring myself to read anything on my iPhone set like that, and I see little prospect of any improvement, except perhaps on the iPad, where the Kennedy book that Jobs presented may raise expectations. Mr Clark seems to have got his wish for the triumph of HTML in eBooks. He should have perhaps been a bit more careful what he was wishing for.

  44. Thanks Joe Clark. Good points. Many times change for the good means diminishing status for icons.

    Thanks, A List Apart. Another fine topic well presented.

    With respect to ebooks: I purchased two books a couple of months ago from O’Reilly.

    Automating System Administration with Perl, 2Ed
    CSS Cookbook, 3Ed

    Purchased – both print and ebook. It was a bundle, and I was curious. For my purposes, with Acrobat 8 Pro, the PDF version has more utility than EPUB. At least for now. BTW, I don’t have a reader.

    I viewed the EPUBs with Adobe Digital Editions 1.7.4 (I think) on a Vista PC.

    But for those interested, I looked into the EPUB files with PKZIP. Perhaps some of the other commenters may be interested.

    I unzipped the EPUB file and looked inside the files with Wordpad. There’s both HTML and XML inside.

    E.g., for Automating System Administration with Perl, 2Ed
    There’s XML here –
    in OEBPS/content.opf
    in OEBPS/toc.ncx
    in META-INF/container.xml
    and HTML
    in OEBPS/index.html
    Kind of quirky, it looks like this –

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html ><head>
    … etc.

    EPUB looks more like a hybrid.
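
The same inspection can be sketched in a few lines of Python, since an EPUB is a ZIP archive. To stay self-contained, the example below builds a tiny stand-in archive in memory using the file names listed above; the file contents are placeholders, and `list_epub_members` and `build_demo_epub` are illustrative names, not part of any EPUB tool.

```python
import io
import zipfile


def list_epub_members(epub_bytes):
    """Return the file names stored inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return archive.namelist()


def build_demo_epub():
    """Build a stand-in archive with the file names from the comment above.

    The contents are placeholders, so this is not a valid EPUB; it only
    mimics the layout for demonstration purposes.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in (
            "META-INF/container.xml",
            "OEBPS/content.opf",
            "OEBPS/toc.ncx",
            "OEBPS/index.html",
        ):
            archive.writestr(name, "<placeholder/>")
    return buffer.getvalue()


for member in list_epub_members(build_demo_epub()):
    kind = "HTML" if member.endswith(".html") else "XML"
    print(f"{kind:4}  {member}")
```

For a real book, passing the path of the `.epub` file to `zipfile.ZipFile` directly should work the same way.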

    I tried looking here too –
    OEBPS Container Format (OCF) 1.0 Specification Documents
    HTML Version
    http://www.openebook.org/ocf/ocf1.0/download/ocf10.htm
    See APPENDIX B: Example

    Have a good week.

    Brian

  45. I’m a writer, a poet at least. My current project, though—a custom document management system to resolve some bad version control problems I run into as I write at home and on the road—led me to see what ALA had to say about the topic.

    While I agree with the value of HTML for presenting these documents (which represents the potential for a massive open-standards win over the proprietary systems to which we lend our literary art and scholarship) I’m not sure the XHTML-now approach is the best. I’m inevitably biased by my own proclivities, but I think it’s easier to expect writers, editors and publishers (“book people” as Clark says) to produce well-formed XML with semantic markup in terms they already understand.

    I don’t think it’s hard to convince a poet that she can write her poems in XML with tags like <TITLE> and <STANZA> and <LINE>, but it’s going to be a much greater leap (and will, undoubtedly, be pushed off on someone else) to expect her to understand how to use HTML’s not-quite-perfect-yet markup in a semantic manner. Even if a few variations of the XML tags crop up, the sets used will be relatively simple, and this sort of parsing made relatively trivial.

    Let writers and publishers invoke their natural vocabularies through XML, and let the standardistas make decisions about how to parse this XML into display-ready XHTML as necessary. Then our parsers can adapt to the further evolution of HTML as a language (and common practice in e-publishing, no doubt) by continually adapting comprehensive XML documents into the markup that makes the most sense today, and markup which makes sense 20 years from now.
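
As a rough illustration of how small such a parser could be, here is a sketch that turns a poem marked up with the tags mentioned above into XHTML. The `<POEM>` root element, the sample text, the output class names, and the function name `poem_to_xhtml` are all assumptions made for the example; the comment does not define a schema.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample input: TITLE, STANZA and LINE come from the comment above;
# the POEM root element is an assumption.
POEM_XML = """<POEM>
  <TITLE>Untitled</TITLE>
  <STANZA>
    <LINE>first line</LINE>
    <LINE>second line</LINE>
  </STANZA>
</POEM>"""


def poem_to_xhtml(source):
    """Map the poet's vocabulary onto plain XHTML elements."""
    root = ET.fromstring(source)
    parts = ['<div class="poem">']
    title = root.findtext("TITLE")
    if title:
        parts.append(f"<h2>{escape(title)}</h2>")
    for stanza in root.findall("STANZA"):
        # One paragraph per stanza, with an explicit break between lines.
        lines = [escape(line.text or "") for line in stanza.findall("LINE")]
        parts.append('<p class="stanza">' + "<br />".join(lines) + "</p>")
    parts.append("</div>")
    return "\n".join(parts)


print(poem_to_xhtml(POEM_XML))
```

Because the mapping lives in one function, the target markup can change later without the poet's source files changing at all.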

  46. I don’t know anything about the business of publishing, but I use CSS, and this was interesting to me,

    “This book has been written entirely in HTML and CSS.”
    … page 353, Chapter 18 – The CSS Saga, the last sentence of the last paragraph.

    Cascading Style Sheets: Designing for the Web, 3/E
    Hakon Wium Lie
    Bert Bos,
    ISBN-10: 0321193121
    ISBN-13:  9780321193124
    Publisher:  Addison-Wesley Professional
    Copyright:  2005
    Format:  Paper; 416 pp
    Published:  04/25/2005

    http://www.pearsonhighered.com/educator/product/Cascading-Style-Sheets-Designing-for-the-Web/9780321193124.page#takeacloserlook

    The sample chapter is “Chapter 3 The amazing em unit and other best practices”
    http://www.pearsonhighered.com/assets/hip/us/hip_us_pearsonhighered/samplechapter/0321193121.pdf

    I’m not a poet, but I humbly suggest that markup – HTML, XHTML, or XML, should be easy for a poet. Or for each poem.

    To me, from a formatting standpoint, a poem has a lot to do with where the line ends and the “paragraphs”.

    “Music is the space between the notes.” Debussy

    In this case, I would “try” surrounding the poem with a div and an id. Then use classes with repeating block elements to make related sets of lines look “right”.

    If I were generating the content, I might try writing in Notepad without markup tags, and separating each set of lines with a pair of carriage returns and linefeeds. Then, when done, I would turn word wrap off, and save.

    This would make it easy to add the markup at the beginning of each “line” or “paragraph”. And somewhat simple to add the closing tag.

    Then, into my web page, with something like –
    <body>
    <div id="poem">

    </div> <!-- close poem -->
    </body>
    … Cut and Paste from the notepad file.

    Perhaps the classes might be defined like –
    #poem p.haiku
    #poem p.limerick
    #poem p.the-bop
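
The Notepad workflow described above could also be automated. The sketch below assumes stanzas are separated by a blank line (a pair of CR/LF line endings) and wraps each one in a paragraph carrying one of the classes listed; the function name `poem_text_to_html` and the use of `<br>` between lines are choices made for the example, not something the comment prescribes.

```python
from html import escape


def poem_text_to_html(text, form):
    """Wrap blank-line-separated stanzas in <p class="form"> inside #poem.

    `form` is assumed to be a plain class name such as "haiku".
    """
    # Normalise Windows line endings, then split on blank lines.
    stanzas = [s for s in text.replace("\r\n", "\n").split("\n\n") if s.strip()]
    blocks = []
    for stanza in stanzas:
        lines = [escape(line.strip()) for line in stanza.strip().split("\n")]
        blocks.append(f'<p class="{form}">' + "<br>".join(lines) + "</p>")
    return '<div id="poem">\n' + "\n".join(blocks) + "\n</div>"


print(poem_text_to_html("a\r\nb\r\n\r\nc\r\n", "haiku"))
```

A selector such as `#poem p.haiku` would then apply to every stanza the function emits.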

    Just a thought.

    Sometimes it can be straight-forward to do a stand-alone markup document. As compared to a multitude of documents and media types in a web site.

    Have a good weekend.

  47. Dashes. As commonly used in print books, em dash (—) with no spaces on either side does not work in onscreen text.

    Not only that, in Spanish you just do not use em dash (—) with no spaces on either side. Em dashes —if used correctly in that language, that is— function just like parentheses if in the middle of the sentence —although you do not close them at the end of one.

    Even if the above is incorrect usage in English, I just wanted to illustrate.

    They require a space before the interruption (and after, if there is no period), and just like parentheses, you absolutely do not break the line between it and the letter that sits next to it.

    Unfortunately, “Unicode’s Line Breaking Algorithm”:http://unicode.org/reports/tr14/ is English centric (booh!) and says that em dash “provides a line break opportunity before and after the character”, a complete aberration in Spanish typesetting (should be the exact opposite). As a result, pretty much any engine that displays text on screen, modern or old (including of course any browser or ebook reader out there) is chopping lines in Spanish text leaving orphan em dashes at the end of lines. No single ebook or webpage is surviving this. Unless one goes and manually litters all em dashes with zero width no-break spaces at both sides, which is rather gross.
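
The manual workaround can at least be scripted. The sketch below inserts U+2060 WORD JOINER, the character Unicode recommends in place of U+FEFF for zero-width no-break use, between each em dash and any adjacent non-space character. The function name `glue_em_dashes` is made up for the example, and whether a given browser or reader honours the joiner is an assumption that would need checking.

```python
import re

WORD_JOINER = "\u2060"  # zero-width character that prohibits a line break
EM_DASH = "\u2014"

# An em dash followed by a character that is neither whitespace nor a joiner.
_DASH_THEN_CHAR = re.compile("\u2014(?=[^\\s\u2060])")
# An em dash preceded by a character that is neither whitespace nor a joiner.
_CHAR_THEN_DASH = re.compile("(?<=[^\\s\u2060])\u2014")


def glue_em_dashes(text):
    """Tie each em dash to the adjacent non-space character.

    Joiners already in place are skipped, so running this twice is safe.
    """
    text = _DASH_THEN_CHAR.sub(EM_DASH + WORD_JOINER, text)
    text = _CHAR_THEN_DASH.sub(WORD_JOINER + EM_DASH, text)
    return text


sample = "Em dashes \u2014if used\u2014 function"
print(repr(glue_em_dashes(sample)))
```

Sides of the dash that face a space are left alone, so the normal break opportunity at the space is preserved.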

  48. “Peter’s comment”:http://www.alistapart.com/comments/ebookstandards/P10/#16
    We need this also as a base to support annotations. Everyone (Adobe, Stanza, Ibis, Apple) seems to be introducing proprietary solutions for annotations, in some flavor of XML (that gets written into a file and added to the epub manifest? or which lives only in the reader?), but eventually we want to share and aggregate annotations.

    “Joe’s comment”:http://www.alistapart.com/comments/ebookstandards/P10/#20
    Nonetheless, the problem of a defined structure for annotations is real and rather pressing for production workflow. (Why else are people addicted to the methadone of MS Word?) I have no solution, but then again, I can’t be expected to have one.

    Supporting “marginalia”:https://secure.wikimedia.org/wikipedia/en/wiki/Marginalia (annotations, etc.) has been a goal for a long time, and not just for ePubs. See these references:

    • “Seeing the picture – Crowdsourcing annotations for books (and eBooks)”:http://blog.lib.uiowa.edu/hardinmd/2009/06/08/crowdsourcing-annotations-for-books-and-ebooks/
    • “From Personal to Shared Annotations”:http://www.csdl.tamu.edu/~marshall/CCM-AJB.pdf
    • “Social Annotations in Digital Library Collections”:http://www.dlib.org/dlib/november08/gazan/11gazan.html

    “How to express and exchange annotations”:https://github.com/nichtich/marginalia/wiki/Support-of-PDF-annotations focuses on PDF annotation methods.

    “The Fascinator”:https://fascinator.usq.edu.au/trac/wiki/Annotate/existing also has some information, as does “Wikipedia’s Web Annotation article”:https://secure.wikimedia.org/wikipedia/en/wiki/Web_annotation

    “ncarr’s comment”:http://www.alistapart.com/comments/ebookstandards/P20/#25
    This can be overcome by simply adding unique identifiers to the document objects for use as anchors. This is almost a job for microformats…

    I disagree. I think it should be based on “DocBook”:http://www.docbook.org/ or some other XML format. (In DocBook, it’s a solved problem.) DocBook has support for several missing features of ePub: <chapter>, <section>, <sidebar>, <equation>, <figure>, <footnote>, <annotation>, <set> (a collection of books-like an encyclopedia or The Art of Computer Programming), as well as support for “MathML”:http://www.w3.org/Math/ and “SVG”:http://www.w3.org/Graphics/SVG/ .

    Still, as simple as the solution is, it would be great if someone would take the lead and publish some basic conventions so that the problem didn’t have to be solved and re-solved over and over and over again. The citation industry is both progressive and vibrant but it is geared towards problems a lot bigger than putting a few footnotes in an ebook.

    I agree. Having some high quality examples would be beneficial.

    “Daniel Bennett’s comment”:http://www.alistapart.com/comments/ebookstandards/P30/#38
    First there should be a URL that is associated with the publication. It may be that there are thousands of posted versions of public domain books, but each should have a corresponding URL. On the possibility that each version should have differences, intentional or not, having a separate URL for each instance is important.

    This actually exists. See “Digital Object Identifier”:https://secure.wikimedia.org/wikipedia/en/wiki/Digital_object_identifier (though this also “has problems”:https://secure.wikimedia.org/wikipedia/en/wiki/Baen_Books#Baen_Digital_Object_Identifiers_.28DOI.29 .)

  49. This is by far one of the most well written and thoroughly researched articles on ebooks and html. Having a little knowledge in html can make a huge difference in creating ebooks. Without it good luck keeping the formatting of your original document. Keep up the great work!
