The printing press gave us type that was clearer and easier to read than that produced from a typewriter, because the typesetter had additional tools at his disposal—and knew how to use them. The web has cost us some of those tools.
Lack of tools and knowledge
There are two problems here. The first is that until HTML 4 came along, the web was missing almost all of these tools (it’s still missing many important ones).
But the larger problem is, now that they’re available, almost no one publishing on the web today knows how to use them—or often even knows of their existence.
Read this, though, and you’ll understand the answers to both problems far better than almost anyone else, including your English teachers.
Most HTML References Are Wrong
I’ve lost count of all the books, articles, and websites that claim an em dash is “&#151;”—but they’re all wrong. The entire range from &#128; to &#159; consists of invalid characters, and consequently should not be used.
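If you want to check an existing page for these references, a quick scan will do. What follows is a minimal sketch rather than a validator: the name `find_invalid_refs` is invented for this example, and it only looks at decimal references.

```python
import re

# Decimal references 128-159: the range described above as invalid.
INVALID_RANGE = range(128, 160)

def find_invalid_refs(html: str) -> list[str]:
    """Return every decimal character reference that falls in 128-159."""
    refs = re.finditer(r"&#(\d+);", html)
    return [m.group(0) for m in refs if int(m.group(1)) in INVALID_RANGE]

# Lists the two invalid references and skips the valid &#8212;.
print(find_invalid_refs("Wrong: &#151; and &#147;. Right: &#8212;."))
```

Anything this reports should be replaced with the proper Unicode decimal reference for the character you meant.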
Since Netscape 4.x browsers don’t understand many of the named entity references (such as &rsquo; for a right single quote), I’m not going to mention any of them here (though they have been used by A List Apart, bless its little heart).
Do we Decimal?
The most reliable way to insert special characters by far is to use decimal entity notation. Some characters have four methods of reference: named, decimal, hexadecimal, and UTF-8 (Unicode), but only the decimal form is reliable across browsers and platforms. Use the others if you wish, but only if you want to be bombarded by Netscape 4.x users complaining about your “corrupted” pages.
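If your text already contains the real characters, converting them to decimal notation can be automated. Here is a small sketch in Python; the function name `to_decimal_refs` is made up for illustration, while the `xmlcharrefreplace` error handler it relies on is part of Python’s standard codecs.

```python
def to_decimal_refs(text: str) -> str:
    """Replace every non-ASCII character with its decimal reference (&#N;)."""
    # Encoding to ASCII with "xmlcharrefreplace" rewrites each character
    # that ASCII cannot represent as a decimal character reference.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_decimal_refs("pages 2–3—or so"))
```

Plain ASCII characters pass through untouched, so the output stays readable in any editor.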
UTF-8 encoding to the rescue—almost
The only way to insert these characters (and any character beyond 127) properly without using entity codes is to use the UTF-8 character encoding (the default for XHTML and XML documents).
Unfortunately, very few text editors support this, and many more browsers choke on UTF-8 characters than do on named entities, so don’t use them unless you don’t give a hoot about Netscape 4 users.
(FrontPage and Dreamweaver don’t insert most of these characters properly, so don’t rely on their “insert symbol” tools either.)
Hyphens are Not Dashes
Stop! Go back and re-read the subhead above—at least 2–3 times—then let it sink in before continuing.
The sentence above illustrates the proper use of the hyphen and the two main types of dashes. They are not the same, and must not be confused with each other. In some fancy fonts the difference is more than just the width—hyphens have a distinct serif. If you don’t know the rules already, let’s review them. First, though, a definition:
An “em” is a unit of measurement defined as the point size of the font—12 point type uses a 12 point “em.” An “en” is one-half of an “em.”
Though some of the finer points in the rules are complex, their basic applications are clear-cut and their misuse easily identifiable. First, neither an em dash nor an en dash should be confused with the hyphen (-), which is used to join compound words together.
The correct use of em and en
The em dash (&#8212;) is used to indicate a sudden break in thought (“I was thinking about writing a—what time did you say the movie started?”), a parenthetical statement that deserves more attention than parentheses indicate, or instead of a colon or semicolon to link clauses. It is also used to indicate an open range, such as from a given date with no end yet (as in “Peter Sheerin [1969—] authored this document.”), or vague dates (as a stand-in for the last two digits of a four-digit year).
Two adjacent em dashes (a 2-em dash) are used to indicate missing letters in a word (“I just don’t f——ing care about 3.0 browsers”).
Three adjacent em dashes (a 3-em dash) are used to substitute for the author’s name when a repeated series of works are presented in a bibliography, as well as to indicate an entire missing word in the text.
The en dash (&#8211;) is used to indicate a range of just about anything with numbers, including dates, numbers, game scores, and pages in any sort of document.
It is also used instead of the word “to” or a hyphen to indicate a connection between things, including geographic references (like the Mason–Dixon Line) and routes (such as the New York–Boston commuter train).
It is used to hyphenate compounds of compounds, where at least one pair is already hyphenated (as in “Netscape 6.1 is an Open-Source–based browser.”). The Chicago Manual of Style also states that it should be used “Where one of the components of a compound adjective contains more than one word,” instead of a hyphen (as in “Netscape 6.1 is an Open Source–based browser”). Both of these rules are for clarity in indicating exactly what is being modified by the compound.
Other sources also specify the use of an en dash when referring to joint authors, as in the “Bose–Einstein” paper. Some also prefer it to a hyphen when text is set in all capital letters.
Some typographers prefer to use an en dash surrounded by full spaces instead of an em dash. Others prefer to insert hair spaces on either side of the em dash, but this is problematic with some web browsers (see the section on spaces for more detail).
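The en dash’s most common job, numeric ranges, is easy to patch up in existing text. The sketch below is deliberately naive, and the function name `dash_number_ranges` is invented for this example.

```python
import re

def dash_number_ranges(text: str) -> str:
    """Replace a hyphen-minus between two digits with an en dash reference."""
    # Naive on purpose: it cannot tell a range (2-3) from a subtraction
    # (4-2) or a phone number, so review the result by hand.
    return re.sub(r"(?<=\d)-(?=\d)", "&#8211;", text)

print(dash_number_ranges("See pages 10-12 of this well-known book."))
```

Hyphens between letters, as in “well-known,” are left alone because the pattern only matches when a digit sits on both sides.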
That hyphen you can insert with the key next to the zero on your keyboard is an ambiguous character suffering from an identity crisis. It can’t decide if it’s a hyphen, a minus, or an en dash—in fact, the Unicode specification describes it as “hyphen-minus” and defines very specific replacements for each of its personalities.
Use it if you need to insert a hyphen, but never for a minus (&#8722;) or a dash, since it does not have the correct width for either, or the vertical position for the latter (compare “1+4-2=3” to “1+4−2=3”).
The soft hyphen (&#173;, a.k.a. “discretionary hyphen” and “optional hyphen”) is to be used for one purpose only—to indicate where a word may be broken at the end of a line. Otherwise, it is to remain invisible and not affect the appearance of the word.
Some browsers display it no matter where it falls, but this is not the correct behavior. Others in the past have recommended against its use because its behavior was not well-defined, but the HTML 4.01 spec makes its use and behavior clear and unambiguous.
Three other hyphen characters exist in Unicode, but are unfortunately not defined in the HTML entity set (although they should be):
- The non-breaking hyphen (&#8209;, not in HTML) does just what its name implies.
- The hyphen character (&#8208;, not in HTML) is meant to be used in place of the hyphen-minus when a hyphen is exactly the desired character.
- The hyphenation point (&#8231;, not in HTML) is that bullet-like character you find in some dictionaries to separate syllables. That is its only use, but if you’re creating an online dictionary, using it will make your entries look more professional.
There are fifteen space characters defined in Unicode. That’s right, fifteen. Most aren’t defined in HTML, and you can ignore many of these. Though many should be a part of the web, let’s deal with the ones that are defined first.
A normal space (a.k.a. “word space”) is your trusty old friend coming in at &#32;.
The non-breaking space (&#160;), commonly found in otherwise-empty table cells, is safely referred to by either its numeric or named entity reference in all 3.0-level and higher browsers.
Off by a hair
I said earlier that some prefer to surround single em dashes with a hair space (&#8202;), which is between one-tenth and one-sixteenth of an em, but it isn’t defined in HTML 4.01.
The thin space (&#8201;) is the most similar space character that is defined in HTML. It is supposed to be one-fifth of an em in width, but is almost always rendered much wider. The only font I’ve found with a correctly designed hair space is Arial Unicode MS, and it renders both with almost exactly the same width.
Bottom line: Unless you can be sure that your target audience has Arial Unicode MS installed, neither of these spaces has anything close to the desired and correct appearance.
Em and en spaces
The last two spaces in the HTML repertoire are the en space (&#8194;) and the em space (&#8195;). Can you guess how wide each is?
Both are visibly wider than a normal space, and once again, Arial Unicode MS is the only mainstream font that includes both, even though they are part of the official HTML 4.01 specification.
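Since these characters are invisible, it helps to see them listed by number and name. The sketch below prints the spaces covered so far, using their Unicode decimal values and the official names from Python’s standard `unicodedata` module; the `describe` helper and the `SPACES` list are just illustrative names.

```python
import unicodedata

# Unicode decimal values: space, non-breaking space, hair space,
# thin space, en space, em space.
SPACES = [32, 160, 8202, 8201, 8194, 8195]

def describe(code_point: int) -> str:
    """Format one code point as 'decimal reference, U+hex, Unicode name'."""
    name = unicodedata.name(chr(code_point))
    return f"&#{code_point}; U+{code_point:04X} {name}"

for cp in SPACES:
    print(describe(cp))
```

The same helper works for any other character in this article if you add its decimal value to the list.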
That leaves the spaces defined by Unicode but not HTML. Use them at your own risk:
Sometimes in typesetting you need to provide a hint that the computer can break a long word in a particular position without any other interpretation or visible indication. This is the zero width space (&#8203;). It’s not defined in HTML 4.01, and it doesn’t work in IE unless you’re using Arial Unicode MS.
Its evil twin is the zero width no-break space (&#65279;), which can (theoretically) be used to keep a word from breaking at that point. Strangely, though this isn’t defined in HTML either, it works as designed in IE6/Win.
Don’t Misquote Me
I’m going to make life easy on you here (well, mostly). There are actually fourteen quotation characters. (Eighteen if you count the big, bold versions in the Dingbats section of Unicode.) I’m going to pretend that most of them don’t exist—you’ll only need them for foreign languages anyway.
Newspapers of (broken) record
Methods for correctly inserting curly quotes in web pages are not well understood. Do not, under any circumstances, use &#145;–&#148; for curly quotes.
Don’t ever trust the 8-bit representations to be correct, because they almost certainly won’t be. The biggest problem is that many web browsers assume that 8-bit characters refer to the local character system, translating your curly quotes or dashes into Greek or accented Latin characters on other platforms. These same browsers always get the numeric entity references right.
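You can watch this happen without opening a browser. The sketch below decodes one and the same byte under three legacy 8-bit encodings; the byte value and the choice of encodings are picked for illustration, and `decode_as` is a made-up helper name.

```python
# The single byte 0x93 is an opening curly double quote in Windows-1252.
RAW = b"\x93"

def decode_as(encoding: str) -> str:
    """Decode the same byte under a given legacy 8-bit encoding."""
    return RAW.decode(encoding)

# Windows, classic Mac OS, and DOS each read the byte differently.
for enc in ("cp1252", "mac_roman", "cp437"):
    print(enc, repr(decode_as(enc)))
```

Only the first line shows a curly quote. A numeric reference such as &#8220; names the character itself, so it does not depend on which of these tables the reader’s system assumes.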
And don’t ever try to ``fake it´´ with doubled-up grave accents and straight single quotes or acute accents, as most of the ``best-known newspapers'' do.
Instead, use:
- &#8216; for an opening single quote (Ctrl + ` ` in Word—that’s two grave accents—that character on the tilde key).
- &#8217; for a closing single quote (or an apostrophe) (Ctrl + ' ' in Word).
- &#8220; for an opening double quote (Ctrl + ` " in Word).
- &#8221; for a closing double quote (Ctrl + ' " in Word).
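If you have a pile of text with straight quotes, the four characters above can be swapped in mechanically. This is a rough sketch under a simple assumption: a quote opens when it starts the text or follows whitespace or an opening bracket, and closes otherwise. The names `curl_quotes`, `OPENING`, and `CLOSING` are invented for this example.

```python
import re

OPENING = {'"': "&#8220;", "'": "&#8216;"}
CLOSING = {'"': "&#8221;", "'": "&#8217;"}

def curl_quotes(text: str) -> str:
    """Naively swap straight quotes for decimal curly-quote references."""
    def replace(match: re.Match) -> str:
        start = match.start()
        # Opening: at the start of the text, or after whitespace or an
        # opening bracket. Everything else is treated as closing.
        opens = start == 0 or text[start - 1] in " \t\n([{"
        table = OPENING if opens else CLOSING
        return table[match.group(0)]
    return re.sub(r"[\"']", replace, text)

print(curl_quotes('"Don\'t," she said.'))
```

The rule gets apostrophes inside words right, but it stumbles on nested quotes and on leading apostrophes (as in ’90s), so proofread the result.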
I’ll bet you didn’t know this about HTML—the <q> and <blockquote> elements are designed to have quote marks automatically inserted in the appropriate locations. No current browser does this by default, however, and even those that do when faced with the appropriate style sheet markup (as detailed in CSS) get it wrong, especially with curly quotes.
HTML 4.01 mandates that this occur for the <q> element, and advises authors against placing quotes manually, since this could result in double quotes.
My suggestion: Avoid the use of <q> entirely until this is widely supported, and either do the same with <blockquote> (possible because indented paragraphs are implied to be quotations by convention in the English language), or place the automatic quote code in your style sheet and tolerate the fact that some browsers will produce garbage. (Or, just define all of these quotes to be plain-old straight quotes, and avoid most of the problems.)
This is such a shame, since CSS can automatically apply one of the forgotten rules about quotes: When quoting multiple paragraphs, each one begins with an opening quote, but only the last paragraph has a closing quote. Oh, well…
Many people (most, from what I’ve observed) believe that curly single opening and double opening quotes are the correct symbols for feet and inches. If you are one of these people, put out your hand so I can slap it with a ruler.
The correct symbols to use are prime and double prime. They look similar to curly quotes in a few fonts, but are usually much more distinct. They never, ever look like commas. They are usually set at a slight angle of 75–80 degrees, and are also usually tapered from the top to the bottom.
A single prime is used to represent feet or minutes (&#8242;, not in HTML 4.01), while a double prime is used to indicate inches or seconds (&#8243;, not in HTML 4.01). (I won’t make you learn about the triple prime and the three reversed versions of these characters.)
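For pages that generate measurements, it is easy to emit the right marks from the start. A tiny sketch follows; `feet_inches` is a made-up name, and running the two parts together without a space is an arbitrary choice for this example.

```python
def feet_inches(feet: int, inches: int) -> str:
    """Format a length using prime (feet) and double prime (inches)."""
    # &#8242; is the prime, &#8243; the double prime.
    return f"{feet}&#8242;{inches}&#8243;"

print(feet_inches(5, 11))
```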
Finally, here are some fine points on the use of the ellipsis (&#8230;):
- An ellipsis is most often used to indicate one or more missing words in a quotation. It is also used to indicate when a thought or quotation trails off.
- When it occurs at the end of a sentence, it should be treated in one of three ways, depending on usage:
- If the ellipsis is being used to indicate one or more missing words in the sentence, then it should be followed by a period.
- If it indicates one or more missing sentences, then it should appear after the period of the preceding sentence, and with a space on either side.
- But if it indicates that the thought or quote is just trailing off at the end of a sentence, then only the ellipsis is used, to clarify that no words from a quotation were omitted, as would be the case if the additional period were there.
There are more shady characters lurking in the background, but the ones described above are the most common and important.
In conclusion, I could tell you about all the places where you should be using the one dot leader (&#8228;, and not in HTML 4.01) instead of the plain old period, but that would just be too cruel. Besides, all you really need to know is the period’s official Unicode name—full stop.
- William Strunk: Elements of Style
- W3C HTML 4.01 entity definitions
- Unicode Consortium
- International System of Units
- NASA’s A Handbook for Technical Writers and Editors is of great help even if you don’t write about technical subjects.
- Jukka Korpela provides a great amount of detail on specific characters as part of a larger series on characters, and a buttload of additional web authoring–related information.
- Got a detailed question about which characters allow line breaks to occur? An update to the Unicode specification has all the answers.