
Tags to DOM

In our previous segment, “Server to Client,” we saw how a URL is requested from a server and learned all about the many conditions and caches that help optimize delivery of the associated resource. Once the browser engine finally gets the resource, it needs to start turning it into a rendered web page. In this segment, we focus primarily on HTML resources, and how the tags of HTML are transformed into the building blocks for what will eventually be presented on screen.


To use a construction metaphor, we’ve drafted the blueprints, acquired all the permits, and collected all the raw materials at the construction site; it’s time to start building!

Parsing

Once content gets from the server to the client through the networking system, its first stop is the HTML parser, which is composed of a few systems working together: encoding, pre-parsing, tokenization, and tree construction. The parser is the part of the construction project metaphor where we walk through all the raw materials: unpacking boxes; unbinding pallets, pipes, wiring, etc.; and pouring the foundation before handing off everything to the experts working on the framing, plumbing, electrical, etc.

Encoding

The payload of an HTTP response body can be anything from HTML text to image data. The first job of the parser is to figure out how to interpret the bits just received from the server. Assuming we’re processing an HTML document, the decoder must figure out how the text document was translated into bits in order to reverse the process.

Binary-to-text representation

Characters      D         O         M
ASCII Values    68        79        77
Binary Values   01000100  01001111  01001101
Bits            8         8         8

(Remember that ultimately even text must be translated to binary in the computer. Encoding—in this case ASCII encoding—defines that a binary value such as “01000100” means the letter “D,” as shown in the figure above.) Many possible encodings exist for text—it’s the browser’s job to figure out how to properly decode the text. The server should provide hints via Content-Type headers, and the leading bits themselves can be analyzed (for a byte order mark, or BOM). If the encoding still cannot be determined, the browser can apply its best guess based on heuristics. Sometimes the only definitive answer comes from the (encoded) content itself in the form of an HTML <meta> tag. In the worst-case scenario, the browser makes an educated guess and then later finds a contradicting <meta> tag after parsing has started in earnest. In these rare cases, the parser must restart, throwing away the previously decoded content. Browsers sometimes have to deal with old web content (using legacy encodings), and a lot of these systems are in place to support that.

When saving your HTML documents for the web today, the choice is clear: use UTF-8 encoding. Why? It nicely supports the full Unicode range of characters, has good compatibility with ASCII for single-byte characters common to languages like CSS, HTML, and JavaScript, and is likely to be the browser’s fallback default. You can tell when encoding goes wrong, because text won’t render properly (you will tend to get garbage characters or boxes where legible text is usually visible).
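
To avoid the guessing games described above, declare the encoding explicitly. A minimal sketch (the title and text are placeholders); the same information can also be sent in an HTTP header such as Content-Type: text/html; charset=utf-8:

<!doctype html>
<html lang="en">
<head>
  <!-- Declare the encoding before any non-ASCII characters appear -->
  <meta charset="utf-8">
  <title>Example page</title>
</head>
<body>
  <p>Héllo, wörld!</p>
</body>
</html>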

Pre-parsing/scanning

Once the encoding is known, the parser starts an initial pre-parsing step to scan the content with the goal of minimizing round-trip latency for additional resources. The pre-parser is not a full parser; for example, it doesn’t understand nesting levels or parent/child relationships in HTML. However, the pre-parser does recognize specific HTML tag names and attributes, as well as URLs. For example, if you have an <img src="https://somewhere.example.com/​images/​dog.png" alt=""> somewhere in your HTML content, the pre-parser will notice the src attribute, and queue a resource request for the dog picture via the networking system. The dog image is requested as quickly as possible, minimizing the time you need to wait for it to arrive from the network. The pre-parser may also notice certain explicit requests in the HTML such as preload and prefetch directives, and queue these up for processing as well.
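
For example, a page can hand the pre-parser explicit hints about resources it will need soon; the URLs below are placeholders:

<link rel="preload" href="/fonts/body.woff2" as="font" type="font/woff2" crossorigin>
<link rel="prefetch" href="/next-article.html">
<img src="https://somewhere.example.com/images/dog.png" alt="A dog">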

Tokenization

Tokenization is the first half of parsing HTML. It involves turning the markup into individual tokens such as “begin tag,” “end tag,” “text run,” and “comment,” which are fed into the next stage of the parser. The tokenizer is a state machine that transitions between the different states of the HTML language, such as “in tag open state” (<|video controls>), “in attribute name state” (<video con|trols>), and “after attribute name state” (<video controls|>), doing so iteratively as each character in the HTML markup text document is read.

(In each of those example tags, the vertical pipe illustrates the tokenizer’s position.)
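
As a rough illustration, the markup below might produce the following stream of tokens. (The exact token names and shapes are internal to each browser engine; this sketch is only meant to convey the idea.)

<p class="intro">Hi</p>

  → start tag token: name "p", attributes [class="intro"]
  → character tokens for the text "Hi"
  → end tag token: name "p"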

Diagram showing HTML tags being run through a tokenizer to create tokens

The HTML spec (see “12.2.5 Tokenization”) currently defines eighty separate states for the tokenizer. The tokenizer and parser are very adaptable: both can handle and convert any text content into an HTML document—even if code in the text is not valid HTML. Resiliency like this is one of the features that has made the web so approachable by developers of all skill levels. However, the drawback of the tokenizer and parser’s resilience is that you may not always get the results you expect, which can lead to some subtle programming bugs. (Checking your code in the HTML validator can help you avoid bugs like this.)
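
You can watch this resilience at work from JavaScript using the standard DOMParser API. A quick sketch with mis-nested tags:

const doc = new DOMParser().parseFromString('<b><i>one</b>two', 'text/html');
// The parser repairs the mis-nesting rather than failing.
console.log(doc.body.innerHTML); // "<b><i>one</i></b><i>two</i>"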

For those who prefer a more black-and-white approach to markup language correctness, browsers have an alternate parsing mechanism built in that treats any failure as catastrophic (meaning the content will not render at all). This parsing mode uses the rules of XML to process HTML, and can be enabled by sending the document to the browser with the “application/xhtml+xml” MIME type (or any XML-based MIME type that uses elements in the HTML namespace).
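
The same DOMParser API can demonstrate the strict XML behavior. With an XML MIME type, a markup error produces an error document instead of repaired content (the exact representation of the error varies by browser):

const xml = new DOMParser().parseFromString('<p>broken', 'application/xhtml+xml');
// No recovery happens; the result describes the parse failure.
console.log(xml.getElementsByTagName('parsererror').length > 0); // true in major browsers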

Browsers may combine the pre-parser and tokenization steps together as an optimization.

Parsing/tree construction

The browser needs an internal (in-memory) representation of a web page, and the DOM standard defines exactly what shape that representation should take. The parser’s responsibility is to take the tokens created by the tokenizer in the previous step, and create and insert the objects into the Document Object Model (DOM) in the appropriate way (specifically, using the twenty-three separate states of its state machine; see “12.2.6.4 The rules for parsing tokens in HTML content”). The DOM is organized into a tree data structure, so this process is sometimes referred to as tree construction. (As an aside, Internet Explorer did not use a tree structure for much of its history.)

Diagram showing tokens being turned into the DOM

HTML parsing is complicated by the variety of error-handling cases that ensure that legacy HTML content on the web continues to have compatible structure in today’s modern browsers. For example, many HTML tags have implied end tags, meaning that if you don’t provide them, the browser auto-closes the matching tag for you. Consider, for instance, this HTML:

<p>sincerely<p>The authors</p>

The parser has a rule that will create an implied end tag for the paragraph, like so:

<p>sincerely</p><p>The authors</p>

This ensures that the two paragraph objects in the resulting tree are siblings, rather than a single paragraph object produced by ignoring the second open tag. HTML tables are perhaps the most complicated case, where the parser’s rules attempt to ensure that tables have the proper structure.
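
You can confirm the implied end tag from the browser console; a quick sketch:

document.body.innerHTML = '<p>sincerely<p>The authors</p>';
console.log(document.body.children.length); // 2: the parser created sibling <p> elements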

Once the DOM tree is created, however, the parsing rules that try to create a “correct” HTML structure are no longer enforced. Using JavaScript, a web page can rearrange the DOM tree in almost any way it likes, even if the result doesn’t make sense (for example, adding a table cell as the child of a <video> tag)! The rendering system becomes responsible for figuring out how to deal with any weird inconsistencies like that.
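
For example, here is a sketch of a nonsensical, but perfectly legal, manipulation:

const video = document.createElement('video');
const cell = document.createElement('td');
video.appendChild(cell); // no error: structural rules are not enforced after parsing
console.log(video.firstChild.tagName); // "TD"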

Another complicating factor in HTML parsing is that JavaScript can add more content to be parsed while the parser is in the middle of doing its job. <script> tags contain text that the parser must collect and then send to a scripting engine for evaluation. While the script engine parses and evaluates the script text, the parser waits. If the script evaluation includes invoking the document.write API, a second instance of the HTML parser must start running (reentrantly). To quickly revisit our construction metaphor, <script> and document.write require stopping all in-progress work to go back to the store to get some additional materials that we hadn’t realized we needed. While we’re away at the store, all progress on the construction is stalled.
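
A sketch of the reentrant case: when the parser reaches the <script> tag below, it pauses, and the markup passed to document.write is handled by a nested parser instance before the original parser continues on to the final paragraph.

<p>before</p>
<script>
  document.write('<p>injected while parsing</p>');
</script>
<p>after</p>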

All of these complications make writing a compliant HTML parser a non-trivial undertaking.

Events

When the parser finishes, it announces its completion via an event called DOMContentLoaded. Events are the broadcast system built into the browser that JavaScript can listen for and respond to. In our construction metaphor, events are the reports that various workers bring to the foreman when they encounter a problem or finish a task. Like DOMContentLoaded, there are a variety of events that signal significant state changes in the web page, such as load (meaning parsing is done, and all the resources requested by the parser, like images, CSS, video, etc., have been downloaded) and unload (meaning the web page is about to be closed). Many events are specific to user input, such as the user touching the screen (pointerdown, pointerup, and others), using a mouse (mouseover, mousemove, and others), or typing on the keyboard (keydown, keyup, and keypress).
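
Listening for these lifecycle events from JavaScript is straightforward; a minimal sketch:

document.addEventListener('DOMContentLoaded', () => {
  console.log('parsing is finished; the DOM tree is ready');
});
window.addEventListener('load', () => {
  console.log('all parser-requested resources have downloaded');
});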

The browser creates an event object in the DOM, packs it full of useful state information (such as the location of the touch on the screen, the key on the keyboard that was pressed, and so on), and “fires” that event. Any JavaScript code that happens to be listening for that event is then run and provided with the event object.

The tree structure of the DOM makes it convenient to “filter” how frequently code responds to an event by allowing events to be listened for at any level in the tree (i.e., at the root of the tree, in the leaves of the tree, or anywhere in between). The browser first determines where to fire the event in the tree (meaning which DOM object, such as a specific <input> control), and then calculates a route for the event starting from the root of the tree, down each branch until it reaches the target (the <input>, for example), and then back along the same path to the root. Each object along the route then has its event listeners triggered, so that listeners at the root of the tree will “see” more events than specific listeners at the leaves of the tree.
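
A sketch of that routing: the root-level listener below “sees” clicks from the entire document, while the element-level listener sees only clicks on its own element (the button is a hypothetical element on the page):

// Root-level listener: fires for clicks anywhere in the tree.
document.addEventListener('click', (event) => {
  console.log('document saw a click on', event.target.tagName);
});

// Leaf-level listener: fires only for clicks on this one element.
const button = document.querySelector('button');
button.addEventListener('click', () => {
  console.log('the button itself was clicked');
});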

Diagram showing a route being calculated for an event, and then event listeners being called

Some events can also be canceled, which provides, for example, the ability to stop a form submission if the form isn’t filled out properly. (A submit event is fired from a <form> element, and a JavaScript listener can check the form and optionally cancel the event if fields are empty or invalid.)
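
A sketch of canceling a submit event (the form and its email field are hypothetical):

const form = document.querySelector('form');
form.addEventListener('submit', (event) => {
  if (form.elements.email.value === '') {
    event.preventDefault(); // cancel the submission; nothing is sent
  }
});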

DOM

The HTML language provides a rich feature set that extends far beyond the markup that the parser processes. The parser builds the structure of which elements contain other elements and what state those elements have initially (their attributes). The combination of the structure and state is enough to provide both a basic rendering and some interactivity (such as through built-in controls like <textarea>, <video>, <button>, etc.). But without the addition of CSS and JavaScript, the web would be very boring (and static). The DOM provides an additional layer of functionality both to the elements of HTML and to other objects that are not related to HTML at all.

In the construction metaphor, the parser has assembled the final building—all the walls, doors, floors, and ceilings are installed, and the plumbing, electrical, gas, and such, are ready. You can open the doors and windows, and turn the lights on and off, but the structure is otherwise quite plain. CSS provides the interior details—color on the walls and baseboards, for example. (We’ll get to CSS in the next installment.) JavaScript enables access to the DOM—all the furniture and appliances inside, as well as the services outside the building, such as the mailbox, storage shed and tools, solar panels, water well, etc. We describe the “furniture” and outside “services” next.

Element interfaces

As the parser is constructing objects to put into the tree, it looks up the element’s name (and namespace) and finds a matching HTML interface to wrap around the object.
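
You can observe this interface lookup from JavaScript; a quick sketch:

console.log(document.createElement('video') instanceof HTMLVideoElement); // true
console.log(document.createElement('table') instanceof HTMLTableElement); // true
console.log(document.createElement('bogus') instanceof HTMLUnknownElement); // true: no matching interface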

Interfaces add features to basic HTML elements that are specific to their kind or type of element. Some generic features include:

  • access to HTML collections representing all or a subset of the element’s children;
  • the ability to search the element’s attributes, children, and parent elements;
  • and importantly, ways to create new elements (without using the parser), and attach them to (or detach them from) the tree.

For specific elements like <table>, the interface contains additional table-specific features for locating all the rows, columns, and cells within the table, as well as shortcuts for removing and adding rows and cells from and to the table. Likewise, <canvas> interfaces have features for drawing lines, shapes, text, and images. JavaScript is required to use these APIs—they are not available using HTML markup alone.
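
For example, a small sketch using the table-specific shortcuts mentioned above:

const table = document.createElement('table');
const row = table.insertRow(); // HTMLTableElement shortcut
const cell = row.insertCell(); // HTMLTableRowElement shortcut
cell.textContent = 'A brand-new cell';
document.body.appendChild(table); // attach the new subtree to the DOM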

After parsing ends, any DOM changes made to the tree via the APIs described above (such as changing the hierarchical position of an element in the tree, toggling an element’s state via an attribute name or value, or invoking any of the API actions from an element’s interface) will trigger a chain reaction of browser systems whose job is to analyze the change and update what you see on the screen as soon as possible. The tree maintains many optimizations for making these repeated updates fast and efficient, such as:

  • representing common element names and attributes via a number (using hash tables for fast identification);
  • collection caches that remember an element’s frequently-visited children (for fast child-element iteration);
  • and sub-tree change-tracking to minimize what parts of the whole tree get “dirty” (and will need to be re-validated).

Other APIs

The HTML elements and their interfaces in the DOM are the browser’s only mechanism for showing content on the screen. CSS can affect layout, but only for content that exists in HTML elements. Ultimately, if you want to see content on screen, it must be done through HTML interfaces that are part of the tree. (For those wondering about the Scalable Vector Graphics (SVG) and MathML languages—those elements must also be added to the tree to be seen—I’ve skipped them for brevity.)

We learned how the parser is one way of getting HTML from the server into the DOM tree, and how element interfaces in the DOM can be used to add, remove, and modify that tree after the fact. Yet, the browser’s programmable DOM is quite vast and not scoped to just HTML element interfaces.

The scope of the browser’s DOM is comparable to the set of features that apps can use in any operating system. Things like (but not limited to) the following (a brief sketch follows the list):

  • access to storage systems (databases, key/value storage, network cache storage);
  • devices (geolocation, proximity and orientation sensors of various types, USB, MIDI, Bluetooth, Gamepads);
  • the network (HTTP exchanges, bidirectional server sockets, real-time media streaming);
  • graphics (2D and 3D graphics primitives, shaders, virtual and augmented reality);
  • and multithreading (shared and dedicated execution environments with rich message passing capabilities).
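
As a small taste, here is a sketch touching two of those areas, storage and the network (the URL is a placeholder):

// Key/value storage: persists across page loads.
localStorage.setItem('theme', 'dark');
console.log(localStorage.getItem('theme')); // "dark"

// Network: an HTTP exchange without reloading the page.
fetch('https://api.example.com/data.json')
  .then((response) => response.json())
  .then((data) => console.log(data));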

The capabilities exposed by the DOM continue to grow as new web standards are developed and implemented by major browser engines. Most of these “extra” APIs of the DOM are out of scope for this article, however.

Moving on from markup

In this segment, you’ve learned how parsing and tree construction create the foundation for the DOM: the stateful, in-memory representation of the HTML tags received from the network.

With the DOM model in place, services such as the event model and element APIs enable web developers to change the DOM structure at any time. Each change kicks off a sequence of “re-building” work, of which updating the DOM is only the first step.

Going back to the construction analogy, the on-site raw materials have been formed into the structural framing of the building and built to the right dimensions with internal plumbing, electrical, and other services installed, but with no real sense yet of the building’s final look—its exterior and interior design.

In the next installment, we’ll cover how the browser takes the DOM tree as input to a layout engine that incorporates CSS and transforms the tree into something you can finally see on the screen.

About the Author

Travis Leithead

Travis works on the Microsoft Edge web platform, focusing on DOM APIs and their related standards. He helps coordinate standards engagement and serves as an elected member of the W3C’s technical architecture group where he primarily reviews proposals for new web standards.
