PDF files on the web are sometimes annoying and very often unnecessary. But when they aren’t either of those things, we need to make them accessible for the same reasons we make other web content accessible.
Contrary to popular opinion – and also contrary to quasi-judicial claims in some places – PDF documents can be no less accessible than HTML. While this may be a shocking revelation, it is nonetheless true. This article will explain how PDF does and does not support accessibility.
- Most PDFs on the web should be HTML.
- Some documents really should be PDFs.
- You can add XML-like tags to give structure to a PDF.
- Tags weren’t available until a recent upgrade to the PDF file format.
- Most screen readers in common use can read PDFs.
- Screen readers had to be upgraded to understand tags.
- Screen readers have been continuously updated throughout their history, and even today some screen readers cannot handle parts of the HTML spec.
- Even an untagged PDF can be accessible if you’re using the right technology.
- Posting a PDF online with no HTML alternative does not automatically constitute discrimination.
Let me dedicate this discussion to two nations that struggle under the yoke of lies and misunderstandings concerning PDF accessibility.
First, to the people of Australia, whose federal human-rights body, the Human Rights and Equal Opportunity Commission (HREOC), takes the official position that posting a document online only as a PDF is inaccessible, hence a violation of the Disability Discrimination Act.
I learned this the hard way from Bruce Maguire. He vanquished Juan Antonio Samaranch in a hearing with that selfsame HREOC over the inaccessibility of the Sydney Olympics website, and he now works for the HREOC. Maguire monologuized endlessly about the topic when I met him in 2004. When I got a word in edgewise, Maguire was able to agree, as most of us will, that HTML is the preferred format, but he also suggested Microsoft Word as an alternative to PDF. This may tell you all you need to know about HREOC’s understanding of accessibility and interoperability.
I told Maguire then and I’m telling everyone now: Don’t believe the HREOC if they come after you claiming your PDFs are inaccessible, hence illegal. If you created your PDFs incorrectly, HREOC may be right, but only sometimes, and there are many cases where your PDFs may be just fine.
If you get hauled in front of HREOC for “illegal, inaccessible” PDFs, consider this article a case for the case for the defense, to paraphrase Christopher Hitchens. (See discussion below.)
Second, to the people of Canada, where the federal government’s guidelines for its own websites – known euphemistically as Common Look and Feel, and the source of great pain – state falsely that PDF is “not directly accessible to persons with (primarily) visual impairments” and that only “minimum version 2.1” should be used. Oddly, there is no such thing as a PDF version 2.1.
The Alliance for the Equality of Blind Canadians, a lobby group that definitely is not the Canadian National Institute “for” the Blind, passed a resolution at its 2005 annual meeting stating that “Portable Document Format (PDF) continues to provide barriers to blind, deaf-blind and partially sighted people and their enabling technologies; therefore, be it resolved that the AEBC advocate that PDF not be used as a standard for providing documents on all websites.” Well, who was suggesting that in the first place?
A view of the landscape
Let’s begin at the beginning and discuss the entire landscape of PDF on the web before we learn what makes a PDF accessible or inaccessible.
The first thing to do with a PDF
...is Google the URL. Seriously. Most of the time, Google does a half-decent job of making a PDF readable in HTML. Poor character encoding, ill-constructed multicolumn PDFs, and document security features can prevent Google from indexing a PDF at all, or doing so readably. Nonetheless, that’s what I do first.
PDF is overused
There aren’t many categories of online document that really should be PDFs and nothing else. And the list has decreased by one in the last year, since presentation slides can now be adequately handled by Eric Meyer’s S5 method. (Hence I have no excuse anymore for publishing my own presentation slides in PDF, so I’m going to stop.)
But if your document is one of the following, PDF may be fine:
- Footnoted, endnoted, or sidenoted, since there is no way to mark up any of those structures in HTML. (You can use a hack like `sup` for the footnote reference, but there are no `sidenote`, or even `note` elements. That hack may be adequate for simple footnoted documents, but try rendering David Foster Wallace’s footnotes-within-footnotes in HTML 4.)
- An interactive form, since PDF interactivity can do more than HTML can. (Use with caution and only if HTML really cannot do what you want.) For examples, check Jeremy Tankard’s order forms, especially for TypeBookOne (PDF).
- A multimedia presentation, since later versions of PDF can truly embed multimedia rather than simply refer to or call multimedia, as HTML does. (Same warning as above.) PDF multimedia can include captions and/or audio descriptions.
- Combined accessible and inaccessible versions. A typical case is a scan of a historical document that also includes live text. (You really need that live text. The Smoking Gun’s scanned court documents wouldn’t pass muster here.) Another example – one that is legal in Canada under a copyright exemption – is a sign-language translation inside or alongside a written text or audio recording.
- Custom-crafted solely for printing. I really mean that, and not a document so badly designed that people have no choice but to print it out because reading onscreen is so tedious. Your service-bureau files, if they are on the web at all, can stay PDFs.
- Designed for annotation and round-trip travel: If you’re posting something to elicit comments, which are then sent back to you, PDF has useful structures that HTML doesn’t.
- A type specimen, which is all but impossible to create in HTML, unless the specimen involved is a “typeface” like Arial.
- A sample of a format that cannot be rendered in a browser (e.g., Illustrator or Photoshop documents) or can only be rendered unsatisfactorily (CAD drawings where GIF and JPEG don’t have enough resolution). (In theory you could use SVG for CAD, but SVG remains mostly theoretical, doesn’t it?) This case also includes PDF files meant as samples of PDF files.
- A record of a document’s state at a specific moment. In this context, PDF is useful as a preservation format even for HTML web pages.
- A document in a language whose script has no satisfactory support in web browsers. This example must be used with caution: In 2005, there aren’t many “minority” languages that cannot be rendered in a browser. Perhaps this case must be limited to scripts that have not been encompassed by Unicode (of which there are several). This can also be a subset of the type-sample case if your PDF is meant as an illustration or documentation of the writing system used by a language.
- Mathematical, since even MathML cannot render certain notations.
- Documents with a legally restricted format, like U.S. tax forms.
- Documents with digital rights management, which everybody hates and which has likely accessibility barriers. (The use of 128-bit encryption with PDF is compatible with screen readers.)
- Multicolumnar, particularly if figures and illustrations are included, since multicolumn web layouts are a mere hack and are unreliable as a method of reproducing print layouts. (Your multicolumn document should be HTML if it is presented that way merely to save paper and it can work as a single column. It can be difficult to distinguish that case from a document that is structurally multicolumnar, and this category is somewhat iffy.)
PDF is not Acrobat, or even Adobe
Let’s get something else out of the way: Acrobat isn’t PDF and PDF isn’t Acrobat. Many programs can display PDFs other than Acrobat. GSview is a popular choice and works on Windows, Mac, OS/2, and Linux. Other options:
- On Windows: Jaws PDF Editor (not the screen reader)
- On Mac OS X: Preview, GraphicConverter, Safari, and OmniGraffle
- On Linux or Unix: OpenOffice
- On PocketPC and Windows CE: Primer PDF Viewer
- On Symbian: PDF+
- On Amiga (!): Apdf
And many programs can create PDFs other than Acrobat. Nearly every application on Mac OS X can save a PDF, for example. Utilities to export to PDF without Acrobat on platforms like Windows and Linux are too numerous to mention, but sites like VersionTracker list them. Moreover, you don’t even need application software to create a PDF; some elite developers directly write their own native PDF files.
Your Acrobat version number and the PDF file format version number are two different things. It may come as a surprise that PDF actually has a version number, but it does. There is no single “PDF” format any more than there’s a single HTML format.
- The latest version of the Adobe PDF file format is Version 1.6 (released November 2004).
- Each new version has added a few features, many of them structural. PDF tags, which are rather important for accessibility, were added in PDF 1.4.
- PDF versions for archiving (PDF/A) and “exchange of print-ready pages” (PDF/X-1a) are already ratified or in the process of ratification by standards bodies.
- There’s a working group to define an “accessible” PDF format (I’m on it) and another for engineering.
The easiest way to keep things straight is to take the current Acrobat release number, subtract 1, and put the result after the decimal point. Acrobat 7 can read PDF 1.6 documents, for example.
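To make the two version numbers concrete, here is a minimal sketch in Python. It assumes the usual convention that a PDF file opens with a header comment such as `%PDF-1.4`; the function names are my own inventions, not part of any library. Note that from PDF 1.4 onward a document can also declare a newer version inside the file, which this sketch ignores, and that the subtract-one rule is only the rule of thumb given above and is not guaranteed for releases later than those discussed here.

```python
import re


def pdf_header_version(data: bytes):
    """Return the (major, minor) version declared in the file header, or None."""
    # A PDF file normally begins with a comment such as "%PDF-1.4".
    match = re.match(rb"%PDF-(\d)\.(\d)", data[:16])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pdf_version_for_acrobat(acrobat_release: int) -> str:
    """Rule of thumb from the text: subtract 1, put it after the decimal point."""
    return f"1.{acrobat_release - 1}"


# A made-up file fragment, just enough to carry a header.
sample = b"%PDF-1.4\n% the rest of the file would follow here\n"

print(pdf_header_version(sample))    # (1, 4)
print(pdf_version_for_acrobat(7))    # 1.6
```

The point is simply that the version lives in the file, not in whatever program happens to open it.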
Proprietary vs. open
You can’t make the categorical statement that PDF is “proprietary” and HTML is “open.” The World Wide Web Consortium copyrights its specs, for example, though with reasonable usage terms. Adobe publishes its versions of the PDF format, and has done so since Version 1.3.
I’ve never understood the objection that Adobe could change the PDF format overnight and render your documents useless. That objection applies to the mysterious Microsoft Word file format, but not here. (Word’s XML schemas have been published, but I’m not talking about those. Microsoft’s PR apparatchiks in the U.S. and Canada promised to get back to me about the actual state of disclosure of the Word file format, but never did.) PDF specs are published, and any of your documents that comply with the published spec will remain unchanged when the spec is updated. Just as a document validated against HTML 3.2 remained unchanged after XHTML 1.1 came out, your PDF 1.4 documents (for example) will continue to work into the indefinite future.
The entire discussion of proprietary vs. open is bogus. The relevant distinction is between published and secret. PDF and HTML are both published formats. End of story.
Multiple formats need to be accessible
The goal of the accessibility advocate is to improve accessibility for people with disabilities, period. We’re not interested in making only HTML web pages accessible. The entirety of web content is our purview, and that includes formats like PDF and indeed Flash. (Same goes for multimedia.)
To draw a historical analogy, when we in Canada and the U.S. managed to get one or two open-captioned TV shows on the air in the 1970s, we didn’t stop there. We invented a closed-captioning system that could be applied to all programs (and the Europeans reused their existing teletext system for the same purpose). Then we made sure that home videos and laserdiscs (remember those?) were captioned. Then we started releasing a very few prints of open-captioned movies.
Then we figured out a way to add audio descriptions for the blind to television programs. Then we hacked a method to produce described home videos. Then we developed closed-captioning and -description systems for first-run movies (and new systems for open captioning there). Then we used closed captions, subpictures, and audio tracks to make DVDs accessible. Then we developed methods to caption and describe online video. We kept up with technology and made each new format accessible.
Even if you never create PDFs yourself, I’m sure you will admit that it is necessary for this widespread format to be accessible for the same reasons we made the widespread format of HTML accessible.
We’re not just talking about blind people
And keep in mind that accessibility is not about making things work fine for blind people and no one else. Everybody falls into that trap at one time or another – the Web Accessibility Initiative included. In PDF accessibility, notable additional groups include deaf and hard-of-hearing users and people with learning disabilities. Motor or dexterity impairment becomes an issue in scrolling a PDF.
Keep this in mind the next time someone complains that a certain PDF is inaccessible because he or she couldn’t get it to work with a certain version of Jaws. We’re not working just for you.
Content vs. user agent
To finish this preliminary discussion, we need to understand the interaction between content and “user agents,” the latter being a term for browsers, media players, and other devices that present content to the user. As web authors, we’re so concerned with HTML and CSS that we either forget the role of the user agent completely – or we forget it until it bites us in the arse, as with CSS bugs in browsers. We tend not to notice the fact that web content is (in almost all cases) rendered by a browser; the user agent becomes invisible.
But because we have to switch to another program most of the time to read a PDF, we are suddenly reminded that a user agent is actually in play. This, too, is a source of confusion between Acrobat and PDF. The interaction of user agent and PDF content is as important as it is with HTML, as we’ll see; it’s just that we’re more conscious of that interaction.
The complaint that you have to use a “special program” to read a PDF document is bogus. You’re already using a special program to read an HTML document. It’s just that you use that program so much it no longer seems special.
Tags and structure
As with HTML, what makes PDFs “robust,” reformattable, and otherwise accessible to many people with disabilities is structure. An unstructured data format like a JPEG picture is hard to make accessible, at least to a blind person, but wrap it in the HTML structure of an `img` element and suddenly accessibility becomes a real option.
A PDF is a database of different data types. You can include a wide range of text, graphics, and multimedia formats inside a PDF, a fact that led to a common misunderstanding that PDFs are glorified pictures. (They certainly can be, but they aren’t necessarily.) There really was no such thing as a structure to PDF until tags were introduced in PDF 1.4. It is OK to call them tags and not elements.
PDF tags are XML-like and will be immediately understandable to anyone with HTML knowledge. Many tags are functionally equivalent to analogues in HTML, such as `P`, headings (including a generic, unnumbered `Heading` element), and `Figure` (image). But some of those tags have more features than their analogues in HTML. For images, you’ve got three levels of replacement text – “actual text,” useful for text rendered as an image, a drop capital, or an illuminated manuscript; “alternate text,” exactly as in HTML; and “title,” also as in HTML. You can and still should declare a language for your PDF document, just as with HTML.
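The three levels of replacement text can be sketched as a small data structure. This is an illustration only: the `Figure` class below is a hypothetical stand-in for a tag's text properties, not a real PDF API, and the order in which a reading program falls back from one level to the next is my assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Figure:
    """Hypothetical model of the text properties attached to a Figure tag."""
    actual_text: Optional[str] = None      # the exact text an image stands in for
    alternate_text: Optional[str] = None   # a description, like alt in HTML
    title: Optional[str] = None            # like title in HTML


def spoken_text(figure: Figure) -> Optional[str]:
    # Assumed precedence: actual text, then alternate text, then title.
    for candidate in (figure.actual_text, figure.alternate_text, figure.title):
        if candidate:
            return candidate
    return None  # nothing for a screen reader to say


# Made-up examples.
drop_cap = Figure(actual_text="O", title="Decorated initial")
photo = Figure(alternate_text="Harbour bridge at dusk")
bare = Figure()

print(spoken_text(drop_cap))  # O
print(spoken_text(photo))     # Harbour bridge at dusk
print(spoken_text(bare))      # None
```

The last case is the one that matters: a figure with none of the three is as mute as an `img` without `alt`.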
PDF tags are extensible and you can create your own. However, there’s a predefined set.
A key difference here is that you cannot just fire up a text editor to add tags to your document, as you can with (X)HTML. Currently, you need to use application software, very much including Acrobat, to add the tags; because PDF is a binary format, that is unlikely to change.
As with semantic HTML, a tagged PDF can be reused and reformatted, since the application software knows what you meant by the data in the document. It knows that this text is a headline and this other text is a paragraph, so the software can, for example, reflow your text from a two-column to a single-column document. Reflow turns your multicolumn PDF into a zoom layout.
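Here is a rough sketch of why structure makes reflow possible. Everything in it is hypothetical: the `Block` class, the coordinates, and the tag names are made up for illustration, and real reading software is far more involved. The idea is only that a program with no structure has to guess an order from positions on the page, while a program with tags can follow the order the author recorded.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: str      # illustrative tag name, e.g. "H1" or "P"
    text: str
    column: int   # 0 = left column, 1 = right column on the printed page
    top: float    # distance from the top of the page


# Listed in tag order, i.e. the logical order the author intended.
tagged_blocks = [
    Block("H1", "Headline", 0, 0),
    Block("P", "First paragraph", 0, 10),
    Block("P", "Second paragraph", 0, 50),
    Block("P", "Third paragraph", 1, 10),
    Block("P", "Fourth paragraph", 1, 50),
]


def naive_order(blocks):
    # No structure: read straight across the page, top to bottom,
    # which interleaves the two columns.
    return [b.text for b in sorted(blocks, key=lambda b: (b.top, b.column))]


def reflowed(blocks):
    # With tags: emit one column in the recorded logical order.
    return [b.text for b in blocks]


print(naive_order(tagged_blocks))
# ['Headline', 'First paragraph', 'Third paragraph', 'Second paragraph', 'Fourth paragraph']
print(reflowed(tagged_blocks))
# ['Headline', 'First paragraph', 'Second paragraph', 'Third paragraph', 'Fourth paragraph']
```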
In general, we can say that a PDF is probably accessible if it is tagged. As with HTML, you can tag things improperly or unsemantically, though there is no concept of valid tagging with PDF. The mere presence of tags does not guarantee accessibility, because you might be using them wrong, but the absence of tags guarantees that the PDF itself is not accessible. Note the emphasis on itself; this is not the end of the story.
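If you want a quick first guess at whether a file is tagged, a crude check is possible without any PDF software. The sketch below relies on the fact that a tagged PDF's document catalog carries a `/StructTreeRoot` entry; the function name is mine, and the approach is a heuristic with known blind spots, not a substitute for a real accessibility check.

```python
def looks_tagged(data: bytes) -> bool:
    """Rough heuristic: does the raw file mention a structure tree?"""
    # Caveat: in newer files the catalog may sit inside a compressed
    # object stream, in which case this byte scan reports a false negative.
    # A match also says nothing about whether the tags are used well.
    return b"/StructTreeRoot" in data


# Made-up fragments standing in for real files.
tagged_sample = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>\n"
    b"endobj\n"
)
untagged_sample = b"%PDF-1.3\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

print(looks_tagged(tagged_sample))    # True
print(looks_tagged(untagged_sample))  # False

# With a real file (the path is hypothetical):
# with open("report.pdf", "rb") as handle:
#     print(looks_tagged(handle.read()))
```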
The user agent’s job
Here is where the user agent comes in. Just as web browsers have had to be engineered, at vast effort and expense, to cope with tag-soup HTML, Acrobat in particular has had to be engineered to cope with real-world PDFs. It’s a bigger problem, since few PDFs are tagged and the free-form database structure of PDF lacks even the quasi-structure of tag-soup HTML.
In this case, the user agent overcomes the inaccessibility of the content, and that’s how it should be even with HTML: The entire chain from author to reader has to be accessible, and any link in the chain can take up the slack. In accordance with the classic advice to be strict in what you produce and lenient in what you accept, if the PDF author creates inaccessible content, the reader software should try to fix it.
- Acrobat versions since 4.05 (with a Windows-only plug-in) have been at least adequately competent some of the time in making “inaccessible” PDFs functionally accessible.
- Acrobat 5 and later can infer a reading order and reflow text.
- Acrobat 6 and later can read text out loud on Windows and Macintosh, functioning as a de facto screen reader.
Note that this discussion mostly relates to blind or learning-disabled readers. A deaf person might just read the document with no trouble. Mobility or dexterity impairment is also involved here; later Acrobat versions can autoscroll a document without having to tediously click or actuate a scrollbar using your slow adaptive technology.
Thus it is impossible to state that even an untagged PDF is inaccessible. Acrobat (or some other program) may be able to use artificial intelligence to hack its way through a document to make it adequately accessible. If someone complains that your PDF isn’t accessible, you need to ask them what program they’re using to read it. Given that Adobe Reader (né Acrobat Reader) is free for Windows, Mac OS, and Linux and has all the accessibility features listed above, and is, moreover, compatible with many screen readers, it is a bit of a stretch to say that an untagged PDF could not possibly be read accessibly. Some PDFs will be inaccessible even with a really good reading program, but many will work adequately well.
Another recurring complaint in PDF accessibility – also bogus – is that screen readers either cannot handle PDFs or require costly upgrades to handle them.
All leading screen readers in use on Windows can read PDFs, including Jaws, Window-Eyes, IBM Home Page Reader, and Hal.
Remember Bruce Maguire? His presentation at the Web Essentials 2004 conference in Sydney – in whose lunchroom he and I talked – stated the following:
The PDF format has become widely used for making documents available on Web pages. Despite considerable work done by Adobe, PDF remains a relatively inaccessible format to people who are blind or vision-impaired. Software exists to provide some access to the text of some PDF documents, but for a PDF document to be accessible to this software, it must be prepared in accordance with the guidelines that Adobe have developed. Even when these guidelines are followed (and there are 32 pages of them), the resulting document will only be accessible to those people who have the required software and the skills to use it. Many blind or vision-impaired people do not have the financial freedom to spend the $1,000+ typically required to upgrade their screen-reader software to take advantage of the latest accessibility features. Requiring a user to upgrade to this extent in order to read a standard document is like designing Web content presentation in such a way that most people will have to buy a new computer in order to read it. Clearly, this is not a reasonable approach to the discharge of a government’s social responsibility to provide relevant information to its citizens. In any case, some of the PDAs used by blind people have no facilities for accessing PDF files.
Let’s unpack these objections.
- Preparing PDFs “in accordance with the guidelines that Adobe has developed” is in no way different from preparing HTML pages in accordance with the guidelines the W3C has developed. (Want to print those out? It’s a 34-page PDF.)
- It’s not like PDFs are the only item on your computer for which you require software and skills. You require both of those to surf the web and use HTML pages.
- It’s false to claim that blind people are “typically required” to pay “$1,000+” to upgrade their screen readers. Some programs that can read PDFs aloud are free, like Adobe Reader. Nobody is requiring an expensive software upgrade.
- Let’s look at the upgrade prices for Windows screen readers. Assumptions: All prices in U.S. dollars; you already own a copy of a screen reader that cannot read PDFs (increasingly unlikely).
  - A Jaws for Windows Software Maintenance Agreement gives you the two subsequent releases for $180 or $260. Hence an upgrade to a PDF-capable version might be “free” under this plan.
  - A similar scheme provides for three upgrades of Window-Eyes for $299.
  - Upgrading Home Page Reader from any Version 2.5 or 3 release to Version 3.04 is free. Otherwise you buy the whole package again for a discounted price of $79.
  - Upgrading from V5 to V6 of Hal costs $160 or $220.
- “PDAs used by blind people” need to be upgraded if they don’t understand PDF. Essentially, this objection boils down to “if it doesn’t work with what I’ve already got, it doesn’t work, period.” I guess time does not march on for these people. In that case, I hope you’re enjoying HTML 2.0 and your Geocities homepage.
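For anyone who wants the arithmetic spelled out, the upgrade prices quoted in the list above can be set against the “$1,000+” claim in a few lines. The figures are the ones given in this article; the dictionary labels are just my shorthand.

```python
# Upgrade prices quoted above, in U.S. dollars.
upgrade_prices = {
    "Jaws maintenance agreement (two releases)": [180, 260],
    "Window-Eyes (three upgrades)": [299],
    "Home Page Reader (free upgrade or discounted repurchase)": [0, 79],
    "Hal V5 to V6": [160, 220],
}

claimed_minimum = 1000

# The most expensive option anywhere on the list.
highest = max(max(prices) for prices in upgrade_prices.values())

print(highest)                    # 299
print(highest < claimed_minimum)  # True
```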
The assumption seems to be that blind people only ever or only can or only must use Windows, and, as we all know, Windows screen readers are overpriced. Well, yes, they are, but you the blind person have many options now for computer accessibility.
Of course you may have to update your Windows screen reader and of course that might cost you money. You can’t complain that PDF is inaccessible (as it was for many years) and then act as though the problem hasn’t been addressed. Adobe rewrote the PDF spec to include tagging for accessibility, and, just as with any improved technology, your screen reader had to be upgraded to handle features that never existed before. You asked for something new and you need something new to make it work.
Most of the time when I run across this complaint, it strikes me as a peevish attempt to cling to the discredited idea that PDF isn’t accessible and Adobe (in particular) doesn’t care about the problem. Really, we need these complainers to grow up and face facts. You can’t ask for a format to be upgraded to include accessibility and then complain that your own software has to be upgraded.
If you don’t want to fork over the money for a Windows screen reader, you can use Mac or Linux. VoiceOver on Mac OS X 10.4 Tiger can read PDF 1.5 files and earlier, though not always very well. The Sun accessibility package for Linux (part of Solaris 10), which is free of charge, includes built-in screen reading. There’s now a version of Adobe Reader 7 for Linux, though it doesn’t have speech output.
If you’re really concerned about cost, install the free Linux software or install Tiger on a used Mac. (Actually, a new Mac Mini without monitor costs less than a new license for Jaws.) If you think that Windows-screen-reader makers are overcharging, complain to the makers or vote with your feet. The actual issue – PDF accessibility – is being handled.
Let’s be consistent about screen-reader flaws
Also, if you’re going to complain about how long it’s taking screen readers to handle PDF, even though that problem is behind us, let’s look at how well screen readers handle HTML.
In theory, HTML is a stable standard that screen readers have had a long time to get right. (HTML 4.01 was published in 1999, XHTML 1.0 in 2000 [revised 2002], XHTML 1.1 in 2001.) But in reality, screen readers are still catching up. How is this different from PDF support? It isn’t, except in one way: It’s worse, because HTML has been around longer. PDF support went from nothing to pretty good in the space of two years, while screen readers are still moping along barely able to handle the full HTML spec.
If you’re trying to suggest that the combination of PDF-plus-screen-reader is a problem, what happens if HTML-plus-screen-reader is also a problem? The complaint that screen readers have trouble with PDF and no trouble with HTML is false both ways. Why don’t we hear any complaints about having to upgrade screen readers to handle HTML?
Let’s look at the evidence.
[Tables not reproduced: each listed screen-reader versions under the columns “Version” and “HTML support added.”]
- IBM and GW Micro have a habit of ritually destroying release notes for previous versions when new ones come out. Here, Window-Eyes release notes were gathered mostly through Internet Archive documents. A review of HPR 2.5 was used.
- Jaws 5.0 is not listed above. It was documented only in rambling audio files (14 MB `.exe`). Apparently it added support for lists and `blockquote` (used for “indentation,” the recordings tell us).
- HTML support in Hal is difficult to ascertain even after Dolphin Computer Access sent me the various release notes. Version 6.51 fixed a Flash problem and a problem with an offscreen-positioned skip-navigation link (marginally relevant to spec support); version 6.03 announced, hence recognized, links, frames and headings, also
So you see, HTML support in screen readers has evolved and is still evolving. But all of a sudden when screen readers had to be upgraded to handle PDF, some critics pretended that such upgrades were unreasonable and unique. You’ve been upgrading your screen readers all along just to handle HTML documents using specifications that are up to six years old.
Where PDF accessibility falls down embarrassingly is with “authoring tools,” the software used to create PDFs. Only a few programs can natively create a tagged PDF file, including InDesign; PageMaker 7.0 (!); FrameMaker 6.0 and later; and Microsoft Office with an Adobe export plug-in (Office 2000 and later only, Windows only). Products that use PDFlib 6.0 and later can produce tagged PDFs. There may be a few other minor utilities here and there.
The average person, however, will be faced with touching up an untagged or poorly-tagged original. You pretty much have no choice but to use the tagging function built into Adobe Acrobat (the full version, not just the Reader, and for some functions you need the Pro version). There are already a few not-very-helpful tutorials on tagging with Acrobat, and, at the risk of disappointing my readers, I’m not going to write another one, as life is too short. However, the basics of what you have to do are easy to state:
- Open your PDF.
- The Description pane of the Document Properties screen (File menu) will tell you if the document is tagged or not.
- If it isn’t, dismiss that screen. Go to the Advanced menu and choose Accessibility → Add Tags to Document.
- Run a full accessibility check from that same menu.
- If the checker reports any problems, open the little-known Tags palette (View → Navigation Tabs → Tags). Use the disclosure triangles to step through your document’s new tag structure. You’re better off if you select Highlight Content from the palette’s Options menu, as Acrobat will then draw a hard-to-see border around the object whose tag you select.
To handle the most common problems:
- If Acrobat complains that your document lacks a language specification, find the topmost tag in your document (immediately within the self-referential Tags tag). Right- or Ctrl-click it and select Properties. Select a language from the pop-up menu in the Language field, or type your own two-letter language code.
- For images lacking a text equivalent, do something similar, except you have to manually locate the Figure element that lacks the text equivalent. Context-click the Figure, select Properties, and fill in Alternate Text (exactly like `alt` in HTML) or Actual Text (for a picture of text).
- A document from a printed source may contain “artifacts” like headers and footers that you never want screen-reader users to hear. You can context-click on those items (which may be deemed `P`, or something else) and Create Artifact, which will cause Acrobat and compliant screen readers to ignore them when voicing. (You can also use the Touch-Up Reading Order tool to select the artifact on the actual page and mark it as Background.)
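The three problems above are regular enough to express as a simple check. The sketch below is not Acrobat’s checker and does not read PDF files; it walks a made-up tag tree, modelled as nested dictionaries with key names I invented, and reports a missing language and any figure without a text equivalent while skipping anything marked as an artifact.

```python
def find_problems(root: dict) -> list:
    """Report the common tagging problems in a hypothetical tag tree."""
    problems = []

    # The language is declared on the topmost tag.
    if not root.get("lang"):
        problems.append("document has no language")

    def walk(node, path):
        here = f"{path}/{node['tag']}"
        if node.get("artifact"):
            return  # artifacts are ignored when voicing, so nothing to check
        if node["tag"] == "Figure" and not (node.get("alt") or node.get("actual_text")):
            problems.append(f"{here} has no text equivalent")
        for child in node.get("children", []):
            walk(child, here)

    walk(root, "")
    return problems


# A made-up document: no language, one artifact, one figure missing its text.
document = {
    "tag": "Document",
    "lang": "",
    "children": [
        {"tag": "P", "artifact": True},   # running header marked as an artifact
        {"tag": "H1"},
        {"tag": "Figure", "alt": "Bar chart of upgrade prices"},
        {"tag": "Figure"},
    ],
}

print(find_problems(document))
# ['document has no language', '/Document/Figure has no text equivalent']
```

Finding the problems is the easy part. Fixing them still means stepping through the tag tree by hand.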
If this task seems tedious, it is, and it’s also quite inaccessible to many people with disabilities. Remember, we are not working toward a web in which nondisabled people create content for disabled people; people with disabilities must also be creators.
Acrobat is an unusual program in that it must arguably comply both with the Authoring Tools Accessibility Guidelines and with the User Agent Accessibility Guidelines, because you can create and view content using Acrobat. (And PDFs themselves are subject to Web Content Accessibility Guidelines; they will be covered in WCAG 2.0, which is expected to be technology-neutral.) Acrobat and PDF are not fully compliant with any of those guidelines, but few things are – and, when it comes to ATAG, nothing is.
PDF accessibility is not as straightforward as HTML accessibility. But we need to stand up to the untruths that are spoken about PDF, especially since many of those untruths come from authorities with the power to find authors guilty of discrimination.
PDF accessibility is OK some of the time when it’s handled by competent authors with what few tools are available. All of those components need improvement, but let’s not pretend we don’t already have the power to create accessible PDFs. We do.
- Jacques Distler
- Andy Dulson
- Loretta Guarino Reid
- Phill Jenkins
- Greg Pisocky
- Ted Padova