Validating a Custom DTD

In his article in this issue, Peter-Paul Koch proposes
adding custom attributes to form elements to allow triggers for specialized
behaviors. The W3C validator won’t validate a document with these
attributes, as they aren’t part of the XHTML specification.

This
article will show you how to create a custom
DTD that will add those
custom attributes, and will show you how to validate documents that use those
new attributes. Here is a sample of the HTML with the custom attributes that
let us specify the maximum length of a text area and whether a form element
is required or not:

<form>
<p>
  Name:
  <input type="text" name="yourName" size="40" />
</p>
<p>
  Email:
  <input type="text" name="email" size="40"
  required="true" />
</p>
<p>
  Comments:
  <textarea maxlength="300" required="false"
  rows="7" cols="50"></textarea>
</p>
<p>
  <input type="submit" value="Send Data" />
</p>
</form>

What’s a DTD?

A Document Type Definition (DTD) is a file that
specifies which elements and attributes exist in a markup language and
where they can appear. Thus, the XHTML DTD specifies that
<p> is a valid element, and that it can appear
inside a <div>, but not inside a <b>.
The URL at the end of your DOCTYPE declaration points
to a place where you will find the DTD for the flavor of HTML you’re
using. Neither your browser nor the W3C Validator goes out to the web to find
the DTD — they have a “wired-in” list of the valid
DOCTYPEs and use the URL for identification purposes only. As you will see
later, this will change when you make a custom DTD.

Specifying the attributes

Adding attributes to an existing DTD is easy. For each attribute, you
need to specify which element it goes with, what the attribute name is,
what type of values it may have, and whether the attribute is optional or
not.  This information is specified in this model:

<!ATTLIST
  elementName attributeName type optionalStatus
>

To add the maxlength attribute to the
<textarea> element, you write this:

<!ATTLIST textarea maxlength CDATA #IMPLIED>

The CDATA specification means that the attribute value
can contain any old character data you please; thus
maxlength="300" or maxlength="ten" will both
be valid. For “open-ended” data, DTDs don’t let you
get more specific.  The #IMPLIED specification means that
the attribute is optional.  A required attribute would specify
#REQUIRED.

When you have a list of possible values for an attribute, you may specify
them in the DTD.  This is the case with the attribute named
required,
which has the values true and false. The values
are case sensitive; in this example only the lowercase values are specified, so
a value of TRUE would not be considered valid.

<!ATTLIST textarea required (true|false) #IMPLIED>

Confusion alert! This attribute is named “required,”
but you don’t have to put it on every <textarea>
element, so it’s an optional attribute.

The attribute named required should also be available to the
<input> and <select> elements. All
in all, the specifications to modify the DTD look like this:

<!ATTLIST textarea maxlength CDATA #IMPLIED>
<!ATTLIST textarea required (true|false) #IMPLIED>
<!ATTLIST input required (true|false) #IMPLIED>
<!ATTLIST select required (true|false) #IMPLIED>

Note: Adding new attributes to existing
elements is easy; adding new elements is somewhat more difficult and beyond
the scope of this article.
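
The difference between a CDATA attribute and an enumerated one can be tried out with any validating XML parser. What follows is a minimal sketch, assuming the validating parser bundled with the JDK (JAXP). The class name AttlistDemo and the tiny two-element DTD are made up for illustration and stand in for the full XHTML DTD; only the two ATTLIST lines come from this article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of the article's sample files.
class AttlistDemo {
    // Made-up miniature DTD: just enough to exercise the two ATTLIST lines.
    static final String DOCTYPE =
        "<!DOCTYPE form [\n"
      + "  <!ELEMENT form (textarea)>\n"
      + "  <!ELEMENT textarea (#PCDATA)>\n"
      + "  <!ATTLIST textarea maxlength CDATA #IMPLIED>\n"
      + "  <!ATTLIST textarea required (true|false) #IMPLIED>\n"
      + "]>\n";

    // Validates a <textarea> carrying the given attributes and returns
    // the validity errors reported by the parser (empty if valid).
    static List<String> check(String attributes) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            String doc = DOCTYPE + "<form><textarea " + attributes + "></textarea></form>";
            builder.parse(new InputSource(new StringReader(doc)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("maxlength=ten:  " + check("maxlength=\"ten\""));
        System.out.println("required=false: " + check("required=\"false\""));
        System.out.println("required=TRUE:  " + check("required=\"TRUE\""));
        System.out.println("no attributes:  " + check(""));
    }
}
```

Under these assumptions, only the required="TRUE" case should come back with an error: CDATA accepts any text, the enumerated values are case sensitive, and both attributes may be left off because they are #IMPLIED.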

Placing the attributes

Now that you’ve defined the custom attributes, how do you place
them where a validator can find them?  The very best place to put them
would be as the
internal subset
directly in your document:

<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[
  <!ATTLIST textarea maxlength CDATA #IMPLIED>
  <!ATTLIST textarea required (true|false) #IMPLIED>
  <!ATTLIST input required (true|false) #IMPLIED>
  <!ATTLIST select required (true|false) #IMPLIED>
]>

If you run such a file through the W3C
validator, you find that it validates wonderfully well.
If you download the sample files for this article and validate
file internal.html, you can see this for yourself.
Unfortunately,
when you display the file in a browser, the ]>
shows up on the screen.  There’s no way around this bug, so this
approach is right out.

Modifying the DTD

An approach that does work requires you to obtain the
XHTML transitional DTD and add your modifications to that file.
The original version of the DTD is file
xhtml1-transitional.dtd in directory dtd
from this article’s sample files.  You will also find
three files with the .ent extension in that
directory. These three files
define all the entities that you use in HTML,
such as &nbsp; and &ntilde;. You
need to keep all these files together in the same directory.

The customized file, named xhtml1-custom.dtd, was
created by opening file xhtml1-transitional.dtd and
adding the new attribute specifications at the end of the file. When
adding attributes, you
want to add your customizations at the end of the DTD to
ensure that everything they need to reference
has already been defined.
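
If you would rather script that copy-and-append step than do it by hand, it can be sketched as follows. This is only an illustration: the class name CustomDtdBuilder is made up, and the paths in main assume you run it from the directory that contains the dtd directory from the sample files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the article's sample files.
class CustomDtdBuilder {
    // The four declarations from this article, appended after the original DTD.
    static final String CUSTOM_ATTRIBUTES = String.join("\n",
        "",
        "<!-- Custom form attributes -->",
        "<!ATTLIST textarea maxlength CDATA #IMPLIED>",
        "<!ATTLIST textarea required (true|false) #IMPLIED>",
        "<!ATTLIST input required (true|false) #IMPLIED>",
        "<!ATTLIST select required (true|false) #IMPLIED>",
        "");

    // Copies the original DTD and adds the custom attributes at the end,
    // so everything they reference has already been declared.
    static void build(Path original, Path custom) {
        try {
            String dtd = new String(Files.readAllBytes(original), StandardCharsets.UTF_8);
            Files.write(custom, (dtd + CUSTOM_ATTRIBUTES).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed layout: ./dtd/xhtml1-transitional.dtd next to the .ent files.
        build(Paths.get("dtd", "xhtml1-transitional.dtd"),
              Paths.get("dtd", "xhtml1-custom.dtd"));
    }
}
```

Writing the result into the same directory keeps the custom DTD next to the three .ent files it depends on.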

Changing the DOCTYPE

You must now change the <!DOCTYPE> in your HTML
file to indicate that you are now using this custom “flavor”
of XHTML.
Since the custom DTD isn’t one of the publicly registered ones,
the DOCTYPE will not use the PUBLIC specifier. Instead,
you use the keyword SYSTEM followed by the location of the
custom DTD. This may be a relative or absolute path name, or, if your
DTD is on a server, a URL.  The path must point to where your
custom DTD really is!
File custom.html in the sample files for this article
uses a relative path name:

<!DOCTYPE html SYSTEM
   "dtd/xhtml1-custom.dtd">

When you try to use the W3C validator on
custom.html, it rejects
the document because you aren’t using one of the validator’s
approved DTDs.
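
A relative path in a SYSTEM identifier is resolved against the location of the document that contains it, which is why the DTD has to travel with the page. The sketch below shows that resolution using java.net.URI. The class name DtdLocation and the example.com addresses are hypothetical and only serve to illustrate the three kinds of path mentioned above.

```java
import java.net.URI;

// Hypothetical demo class; the example.com URLs are placeholders.
class DtdLocation {
    // Returns the location a parser would look in for the DTD, given the
    // document's own location and the SYSTEM identifier in its DOCTYPE.
    static URI resolve(String documentUri, String systemId) {
        return URI.create(documentUri).resolve(systemId);
    }

    public static void main(String[] args) {
        String page = "http://example.com/forms/custom.html";
        // Relative path: looked up next to the document.
        System.out.println(resolve(page, "dtd/xhtml1-custom.dtd"));
        // Absolute path: looked up from the server root.
        System.out.println(resolve(page, "/dtd/xhtml1-custom.dtd"));
        // Full URL: used as written.
        System.out.println(resolve(page, "http://example.com/shared/xhtml1-custom.dtd"));
    }
}
```

If you move custom.html without moving the dtd directory along with it, the relative form stops pointing at a real file.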

Using a different validator

The solution is to use a different validator which will actually go
out to the URL that you have specified and use it to check whether your
document is valid or not.
Because the document you’re validating is XHTML,
you can use any XML parser that
does validation. This article uses the Xerces parser,
available from
xml.apache.org.  This parser is written in
Java™, so you will need to have Java installed on your system.
When you unzip the Xerces download file, it will create a directory named
xerces-2_6_2 (or whatever version is current).  In the
following text, the assumption is that you have unzipped it to the top
level of the
C: drive on Windows or to /usr/local on Linux.

One of the sample files that comes
with Xerces is the Counter program. This program
counts the number of elements,
attributes, ignorable whitespaces, and characters appearing in
an XML (or, in this case, XHTML) document. This program has an option
to turn on validation as it parses the document, making it perfect for
the task at hand.
You run the Counter program (which is going to be your
“validator”) from
a batch file for Windows or a shell script for Linux.
Here is the
batch file, named
validate.bat.
It is all on one line, but shown here split across lines to
fit on the page. Please note: there is a blank before the word
dom and after the -v.

java -cp c:\xerces-2_6_2\xercesImpl.jar; »
c:\xerces-2_6_2\xmlParserAPIs.jar; »
c:\xerces-2_6_2\xercesSamples.jar dom/Counter -v »
%1 %2 %3 %4 %5 %6 %7 %8

Here is the Linux shell script, named validate.sh.

java -cp /usr/local/xerces-2_6_2/xercesImpl.jar:\
/usr/local/xerces-2_6_2/xmlParserAPIs.jar:\
/usr/local/xerces-2_6_2/xercesSamples.jar \
dom/Counter -v $1 $2 $3 $4 $5 $6 $7 $8

Of course, if you have unzipped Xerces to a different location, you
will have to change the path names.
Once this is all set up, you can validate the file
custom.html by typing
this on a Windows command line:

validate custom.html

Or this at a Linux shell prompt:

./validate.sh custom.html

If your file is valid, you will receive a message giving the
filename and some statistics about the file, like this:

custom.html: 543;50;0 ms
  (15 elems, 20 attrs, 9 spaces, 43 chars)

If the file isn’t valid, you will get error messages as well.
For example, if you try to validate a file named badfile.html
which contains these errors:

<p>Email: <input type="text" name="email" size="40"
 required="yes" /></p>
<p>Comments:
<textarea maxlength="300" inquirer="false"
 rows="7" cols="50"></textarea>

You’ll get this output from the validator:

[Error] badfile.html:12:70: Attribute "required"
  with value "yes" must have a value from the
  list "true false ".
[Error] badfile.html:14:63: Attribute "inquirer"
  must be declared for element type "textarea"
badfile.html:
  611;82;0 ms (15 elems, 20 attrs, 9 spaces, 43 chars)
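
The Counter sample is not the only way to get these messages. Since any validating XML parser will do, the same check can be sketched in a few lines against the parser that ships with the JDK. This is an illustration under that assumption rather than the method used for this article: the class name DtdValidator is made up, and the exact wording of the messages depends on the parser.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical stand-in for the Counter sample, using only JDK classes.
class DtdValidator {
    // Parses the file with validation on, following the SYSTEM identifier
    // in its DOCTYPE, and returns one message per problem found.
    static List<String> validate(File file) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { errors.add(describe(file, e)); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(file);
        } catch (SAXParseException e) {
            // The document is not well-formed; report it and stop.
            errors.add(describe(file, e));
        } catch (Exception e) {
            // For example, the DTD named in the DOCTYPE could not be read.
            throw new IllegalStateException(e);
        }
        return errors;
    }

    // Formats a message as file:line:column: text.
    static String describe(File file, SAXParseException e) {
        return file.getName() + ":" + e.getLineNumber() + ":"
            + e.getColumnNumber() + ": " + e.getMessage();
    }

    public static void main(String[] args) {
        for (String name : args) {
            List<String> errors = validate(new File(name));
            for (String error : errors) {
                System.out.println("[Error] " + error);
            }
            System.out.println(name + ": "
                + (errors.isEmpty() ? "valid" : errors.size() + " error(s)"));
        }
    }
}
```

After compiling with javac, you would run it as java DtdValidator custom.html from the directory that holds the file, so that the relative path to the DTD resolves.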

Another validation method

If you are using the
jEdit editor,
you may download the XML plugin. If you name your file with the
extension .xhtml, jEdit will validate using your custom
DTD as specified in the DOCTYPE.

Conclusion

It is easy to specify additional attributes for XHTML elements; with a little
bit of work, you can set up a validator to check your files against your
custom version of HTML.  Download all the
sample files from this article
and give it a whirl.

25 Reader Comments

  1. For implementation of certain parts Web Forms 2.0 (http://www.whatwg.org/specs/)? Most of these attributes could be processed by javascript and servers instead of the UA.

    Now we’re starting to see the true benefits of XML. But I wonder what are the benefits here for sites serving text/html? Technically its allowed for XHTML 1.0 and javascript should work the same, but all this effort really does is trick the validating programs–tag soup is tag soup is tag soup. Altering your DTD won’t make a difference one way or another unless browsers look at your page as XML, which only the most zen developers even dare think about.

  2. I agree with Ryan! This solution is really interesting with XML technology (and not HTML)… unfortunately, there are few servers able to display an HTML page as XML! (It’s a pity, because it’s not so difficult.)

    Just a small criticism of the article itself: it would have been better to build your examples with the XHTML 1.1 recommendation, which is technically built to make such things. Modularization is an interesting way to customize a DTD.

  3. Of course your site validates against your own custom DTD.

    But I thought that the idea behind validating your online document against a public schema was to show that we care about COMMON standards.

    As the local part of a DTD possibly overwrites ANY element or attribute declaration in your original DTD, and designing your own DTD gives you the same powers, anyone could add, modify or delete ANY element or attribute and still have documents that validate perfectly.
    As browsers tend to ignore unknown tags and even allow you to style them with CSS, everyone can happily design his own markup language.

    This is, what the XML-idea was all about….

    The only interesting question is:

    “Are there any good reasons to create your documents in a way that they validate against PUBLIC COMMON schemas?”

    Greetings Benjamin

    Anyway, still the best SGML/XML/DTD Parser/Validator is James Clarks “nsgmls”:
    http://www.jclark.com/sp/index.htm

  4. Support for XHTML and “at-the-browser” XML parsing is very limited right now, which is why in the battle of “HTML vs. XHTML,” HTML is the winner because it is more widely supported. I completely agree that XML is very powerful and flexible, but you should parse your XML and output it as HTML 4.01 strict until at-the-browser parsing is not so iffy. Otherwise, an excellent article at explaining something that is not very well-known.

  5. After I added my
    <!DOCTYPE html SYSTEM "dtd/xhtml1-custom.dtd">
    tag to the top of my document a curious oddity manifested itself. All my class attribute values became case sensitive. I suppose it’s my own fault for putting capitalisation on my class name (coretable vs CoreTable) but it’s still a bit weird though imho.

  6. If I ever wanted to use this tip, I’d really want to use the inline DTD stuff instead of creating a new page and referencing that. But of course the issue with ]> showing up in browsers is bad.

    I haven’t actually tested this yet, but I would expect that serving your page with a MIME type of ‘text/xml+xhtml’ would fix this issue for at least Safari and Mozilla, since, IIRC, using that MIME type causes those browsers to use a real XML parser. Of course, the downside is if the page isn’t well-formed then it’s not displayed at all, but the upside is you can do whatever you want that’s legal in XML, including things like declaring new entities inline in the DOCTYPE and using them in the page (which might be handy).

  7. Just to tickle Kevin, the fact that a badly formed XML document is not displayed is not a downside but a plus. No other programming language will let you compile or execute faulty code. Everything is simpler that way; there’s no guesswork involved.

    Yes, at first it seems harder for the programmer, but in the end it makes everything easier because you don’t waste time battling different browser interpretations of a missing .

  8. How do custom DTDs affect doctype sniffing (for purposes of deciding between rendering in quirks or standard mode)? Do browsers sniff for *any* doctype or specific ones? I.e. will using custom DTDs make the rendering model behave differently?

    Also, why can’t we use namespace switching instead? As long as IE (and other browsers) simply ignore the switching, and the validator is pacified by them, everyone is happy, right?

  9. Ah, that explains why I thought I was going crazy in 2003 and could not figure out why the “]>” appeared. Thus I finally used a separate DTD fragment file and used:

    <![INCLUDE[
    %xhtmldtd;]]>

    However, you’re correct when served as “application/xhtml+xml” no appearance of “]>” on canvas.

    Though still I am happy with the method I finally used 2003 as it was cleaner for multiple files.

  10. Please bear in mind, I’ve been a standards buff for quite a while, so I’m not ragging on the ideal of clean code that validates. That having been said:

    The idea behind the standards movement was to ensure that developers could code efficiently by leveraging a common set of languages that all browsers would respond to predictably (and identically). This is obviously an ideal, and we aren’t there yet, but we’re a heckuva lot closer than we were 5 years ago (thanks in no small part to this magazine and its founders).

    Now that we have these standards, why would we then splinter, and create “personal” standards?

    Validation in and of itself has little *practical* value. It is a goal I strive for, to be sure, and it certainly feels good (it’s a job well done on large-scale sites, that’s certain).

    But if the reason I don’t make it is either not under my control (CMS output) or if it makes the site better (as in the custom attributes mentioned in the previous article, or hacks required to overcome browser bugs) I’d rather forgo validation and uphold the common standard we’ve fought for.

  11. Tim’s got an important point about the danger of splintering standards that took a lot of work to get in place. On the other hand, I see a very practical use for this technique.

    My team produces e-learning modules. The practical value of the standards for us is quality control: if every html page in the module validates, that’s one more assurance the client isn’t getting something that’s broken.

    But: our work requires that we use. One of our clients still has some of its employees using Netscape 4.7. won’t cut it. So when we validate, we ignore the multiple errors generated by the tag.

    That means: every time we validate we need to manually read the error messages and make sure that they came from and not from something else. Which means that someday, we’re going to trip up and ignore an error we should have fixed.

    Here’s where the custom DTD comes in: we validate against a custom DTD that allows the tag. When we validate, is silently passed by, and we know that any errors we see are errors we need to fix. We’ll likely do the same for the custom tags generated by Macromedia’s CourseBuilder (although placing them in a custom namespace would be more in the spirit of XHTML).

    That said, it’s still troubling to work this way. I haven’t tried it out, but I’m guessing that some browsers will remain in quirks mode unless they see one of the public doctypes. For cross-platform consistency, we need to avoid quirks mode. For this reason, we may end up validating against our custom DTD but delivering pages with the XHTML Strict or Transitional doctype. It ain’t perfect, but it’ll do what we need to do for our clients.

  12. Tim Murtaugh wrote:

    >>>Now that we have these standards, why would we then splinter, and create “personal” standards?<<<

    It’s not splintering any more than using classes and id:s in regular HTML is. XML *is* the standard, and the whole point of XML (well, one of them) is that you can create your own elements and define your own languages. This is *not* in *any* way against the ideal of common standards. The reason we’ve been taught not to use proprietary markup like and isn’t that inventing elements without W3C’s blessing is evil in itself, but because they weren’t in the doctype. If you write your own doctype, the new elements *are* in the doctype. That’s the only important difference; the XML reader knows how to treat your elements.

    However, writing your own doctypes is, with a few exceptions like this article, probably of little use if you’re just targetting common web browsers, since the most common of them all, and many others, won’t know what to do with them. “Doctype switching” (browsers changing their rendering depending on the doctype) probably won’t work either.

    Footnote: Of course, what we refer to as “standards” on the web usually *aren’t* actual standards, but that’s beside the point.

  13. Peterman wrote:
    >Footnote: Of course, what we refer to as “standards” on the web usually *aren’t* actual standards, but that’s beside the point.

    True. But if The WaSP hadn’t persuaded designers, developers, and browser makers to treat these W3C and ECMA specifications as “baseline standards,” our support for CSS, ECMAScript, XHTML and the DOM would probably be little better than it was in 1998, and we’d be coding our sites 28 ways, to accommodate multiple generations of proprietary browserdom.

  14. Just as an example of a site that does use a custom DTD.

    The Swedish IBM site uses a custom DTD (ibmxhtml1-transitional); the newer American version uses a standard W3C DTD (xhtml1-transitional). When I visit the two sites in FF and look at the page properties, it clearly states what mode the browser is in. Visiting the Swedish site, FF is in quirks mode. So using a custom DTD would make the browser slower and render the pages with quirks mode in mind, right?

    http://www.ibm.com/us/
    http://www.ibm.com/se/

    Jens Wedin

  15. I’m no XML guru, but why can’t we do this with an XSD? I’m envisioning something like this:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

    To me, that would be the optimal approach, since we’re still including the original DTD, just extending it with our own XSD. Is there a reason this won’t work?

  16. Actually, custom Doctypes will trigger standards mode in Mozilla/Firefox (as well as in most other browsers, although haven’t tested this with safari).

    The IBM doctype in question is listed as recognized by Mozilla and is set to trigger quirks mode.

    http://www.mozilla.org/docs/web-developer/quirks/doctypes.html

    On the same page under Full Standards Mode you will see listed:

    “Any ‘DOCTYPE HTML SYSTEM’ as opposed to ‘DOCTYPE HTML PUBLIC’, except for the IBM doctype noted below”

  17. Coincidentally, I’ve just released a cross-platform graphical tool called GooeySAX that is a wrapper for the Xerces tool referenced in this article. GooeySAX allows you to validate your custom DTDs easily. It is also great for those times when your document is only available on a private network, and thus unreachable by the W3C’s web-based validation tool.

    http://ditchnet.org/gooeysax

  18. Hey,

    Just off the top of my head, removing the ugly “]>” at the top of documents with internal subsets can be done with a little [removed]

    with (document.body) for (var i=0;i<2;++i) if (/]>/.test(childNodes[ i ].data))removeChild(childNodes[ i ]);

    Of course, it’s not ideal, but seems to do the job. It loops over the first two elements since different browsers place the characters in different positions. Originally I thought to simply strip the first two body child nodes, but realized that’s more prone to worst-case scenarios.

    The script can even be placed in the head tags without being attached to an onload event, since it depends on contents that’re already parsed. Should remove the node as soon as it hits the script.

  19. Why go through all this trouble of creating custom stuff? Is your client really ready to pay for all this? Is it justified over time? Will the site structure stay the same long enough for this to work? I think for smaller websites this takes way to much time.

    Do like the idea though

  20. I’m confused? Why not use attributes in a different namespace? EG:

    <input type="text" name="yourName" myns:required="true" />

    That way you aren’t touching the XHTML DTDs at all…

  21. So where can I find a module, library or app that will take some of my database schemas from my content manglement system and generate a valid DTD so I don’t shoot my food clean off (something a bit … well alot less expensive than XMLSpy ya’know) ?
