In our previous segment, “Server to Client,” we saw how a URL is requested from a server and learned all about the many conditions and caches that help optimize delivery of the associated resource. Once the browser engine finally gets the resource, it needs to start turning it into a rendered web page. In this segment, we focus primarily on HTML resources, and how the tags of HTML are transformed into the building blocks for what will eventually be presented on screen.
To use a construction metaphor, we’ve drafted the blueprints, acquired all the permits, and collected all the raw materials at the construction site; it’s time to start building!
Once content gets from the server to the client through the networking system, its first stop is the HTML parser, which is composed of a few systems working together: encoding, pre-parsing, tokenization, and tree construction. The parser is the part of the construction project metaphor where we walk through all the raw materials: unpacking boxes; unbinding pallets, pipes, wiring, etc.; and pouring the foundation before handing off everything to the experts working on the framing, plumbing, electrical, etc.
The payload of an HTTP response body can be anything from HTML text to image data. The first job of the parser is to figure out how to interpret the bits just received from the server. Assuming we’re processing an HTML document, the decoder must figure out how the text document was translated into bits in order to reverse the process.
(Remember that ultimately even text must be translated to binary in the computer. Encoding—in this case ASCII encoding—defines that a binary value such as “01000100” means the letter “D,” as shown in the figure above.) Many possible encodings exist for text—it’s the browser’s job to figure out how to properly decode the text. The server should provide hints via
Content-Type headers, and the leading bits themselves can be analyzed (for a byte order mark, or BOM). If the encoding still cannot be determined, the browser can apply its best guess based on heuristics. Sometimes the only definitive answer comes from the (encoded) content itself in the form of a
<meta> html tag. Worst case scenario, the browser makes an educated guess and then later finds a contradicting
<meta> tag after parsing has started in earnest. In these rare cases, the parser must restart, throwing away the previously decoded content. Browsers sometimes have to deal with old web content (using legacy encodings), and a lot of these systems are in place to support that.
Once the encoding is known, the parser starts an initial pre-parsing step to scan the content with the goal of minimizing round-trip latency for additional resources. The pre-parser is not a full parser; for example, it doesn’t understand nesting levels or parent/child relationships in HTML. However, the pre-parser does recognize specific HTML tag names and attributes, as well as URLs. For example, if you have an
<img src="https://somewhere.example.com/images/dog.png" alt=""> somewhere in your HTML content, the pre-parser will notice the
src attribute, and queue a resource request for the dog picture via the networking system. The dog image is requested as quickly as possible, minimizing the time you need to wait for it to arrive from the network. The pre-parser may also notice certain explicit requests in the HTML such as preload and prefetch directives, and queue these up for processing as well.
Tokenization is the first half of parsing HTML. It involves turning the markup into individual tokens such as “begin tag,” “end tag,” “text run,” “comment,” and so forth, which are fed into the next state of the parser. The tokenizer is a state machine that transitions between the different states of the HTML language, such as “in tag open state” (
<|video controls>), “in attribute name state” (
<video con|trols>), and “after attribute name state” (
<video controls|>), doing so iteratively as each character in the HTML markup text document is read.
(In each of those example tags, the vertical pipe illustrates the tokenizer’s position.)
The HTML spec (see “12.2.5 Tokenization”) currently defines eighty separate states for the tokenizer. The tokenizer and parser are very adaptable: both can handle and convert any text content into an HTML document—even if code in the text is not valid HTML. Resiliency like this is one of the features that has made the web so approachable by developers of all skill levels. However, the drawback of the tokenizer and parser’s resilience is that you may not always get the results you expect, which can lead to some subtle programming bugs. (Checking your code in the HTML validator can help you avoid bugs like this.)
For those who prefer a more black-and-white approach to markup language correctness, browsers have an alternate parsing mechanism built in that treats any failure as a catastrophic failure (meaning any failure will cause the content to not render). This parsing mode uses the rules of XML to process HTML, and can be enabled by sending the document to the browser with the “application/xhtml+xml” MIME type (or any XML-based MIME type that uses elements in the HTML namespace).
Browsers may combine the pre-parser and tokenization steps together as an optimization.
The browser needs an internal (in-memory) representation of a web page, and, in the DOM standard, web standards define exactly what shape that representation should be. The parser’s responsibility is to take the tokens created by the tokenizer in the previous step, and create and insert the objects into the Document Object Model (DOM) in the appropriate way (specifically using the twenty-three separate states of its state machine; see “22.214.171.124 The rules for parsing tokens in HTML content”). The DOM is organized into a tree data structure, so this process is sometimes referred to as tree construction. (As an aside, Internet Explorer did not use a tree structure for much of its history.)
HTML parsing is complicated by the variety of error-handling cases that ensure that legacy HTML content on the web continues to have compatible structure in today’s modern browsers. For example, many HTML tags have implied end tags, meaning that if you don’t provide them, the browser auto-closes the matching tag for you. Consider, for instance, this HTML:
The parser has a rule that will create an implied end tag for the paragraph, like so:
This ensures the two paragraph objects in the resulting tree are siblings, as opposed to one paragraph object by ignoring the second open tag. HTML tables are perhaps the most complicated where the parser’s rules attempt to ensure that tables have the proper structure.
<video> tag). The rendering system becomes responsible for figuring out how to deal with any weird inconsistencies like that.
<script> tags contain text that the parser must collect and then send to a scripting engine for evaluation. While the script engine parses and evaluates the script text, the parser waits. If the script evaluation includes invoking the
document.write API, a second instance of the HTML parser must start running (reentrantly). To quickly revisit our construction metaphor,
document.write require stopping all in-progress work to go back to the store to get some additional materials that we hadn’t realized we needed. While we’re away at the store, all progress on the construction is stalled.
All of these complications make writing a compliant HTML parser a non-trivial undertaking.
When the parser finishes, it announces its completion via an event called
DOMContentLoaded, there are a variety of events that signal significant state changes in the web page such as
load (meaning parsing is done, and all the resources requested by the parser, like images, CSS, video, etc., have been downloaded) and
unload (meaning the web page is about to be closed). Many events are specific to user input, such as the user touching the screen (
pointerup, and others), using a mouse (
mousemove, and others), or typing on the keyboard (
The tree structure of the DOM makes it convenient to “filter” how frequently code responds to an event by allowing events to be listened for at any level in the tree (i.e.., at the root of the tree, in the leaves of the tree, or anywhere in between). The browser first determines where to fire the event in the tree (meaning which DOM object, such as a specific
<input> control), and then calculates a route for the event starting from the root of the tree, then down each branch until it reaches the target (the
<input> for example), and then back along the same path to the root. Each object along the route then has its event listeners triggered, so that listeners at the root of the tree will “see” more events than specific listeners at the leaves of the tree.
Some events can also be canceled, which provides, for example, the ability to stop a form submission if the form isn’t filled out properly. (A
submit event is fired from a
The HTML language provides a rich feature set that extends far beyond the markup that the parser processes. The parser builds the structure of which elements contain other elements and what state those elements have initially (their attributes). The combination of the structure and state is enough to provide both a basic rendering and some interactivity (such as through built-in controls like
As the parser is constructing objects to put into the tree, it looks up the element’s name (and namespace) and finds a matching HTML interface to wrap around the object.
Interfaces add features to basic HTML elements that are specific to their kind or type of element. Some generic features include:
- access to HTML collections representing all or a subset of the element’s children;
- the ability to search the element’s attributes, children, and parent elements;
- and importantly, ways to create new elements (without using the parser), and attach them to (or detach them from) the tree.
For specific elements like
<table>, the interface contains additional table-specific features for locating all the rows, columns, and cells within the table, as well as shortcuts for removing and adding rows and cells from and to the table. Likewise,
Any DOM changes made to the tree via the APIs described above (such as the hierarchical position of an element in the tree, the element’s state by toggling an attribute name or value, or any of the API actions from an element’s interface) after parsing ends will trigger a chain-reaction of browser systems whose job is to analyze the change and update what you see on the screen as soon as possible. The tree maintains many optimizations for making these repeated updates fast and efficient, such as:
- representing common element names and attributes via a number (using hash tables for fast identification);
- collection caches that remember an element’s frequently-visited children (for fast child-element iteration);
- and sub-tree change-tracking to minimize what parts of the whole tree get “dirty” (and will need to be re-validated).
The HTML elements and their interfaces in the DOM are the browser’s only mechanism for showing content on the screen. CSS can affect layout, but only for content that exists in HTML elements. Ultimately, if you want to see content on screen, it must be done through HTML interfaces that are part of the tree.” (For those wondering about Scalable Vector Graphics (SVG) and MathML languages—those elements must also be added to the tree to be seen—I’ve skipped them for brevity.)
We learned how the parser is one way of getting HTML from the server into the DOM tree, and how element interfaces in the DOM can be used to add, remove, and modify that tree after the fact. Yet, the browser’s programmable DOM is quite vast and not scoped to just HTML element interfaces.
The scope of the browser’s DOM is comparable to the set of features that apps can use in any operating system. Things like (but not limited to):
- access to storage systems (databases, key/value storage, network cache storage);
- devices (geolocation, proximity and orientation sensors of various types, USB, MIDI, Bluetooth, Gamepads);
- the network (HTTP exchanges, bidirectional server sockets, real-time media streaming);
- graphics (2D and 3D graphics primitives, shaders, virtual and augmented reality);
- and multithreading (shared and dedicated execution environments with rich message passing capabilities).
The capabilities exposed by the DOM continue to grow as new web standards are developed and implemented by major browser engines. Most of these “extra” APIs of the DOM are out of scope for this article, however.
Moving on from markup
In this segment, you’ve learned how parsing and tree construction create the foundation for the DOM: the stateful, in-memory representation of the HTML tags received from the network.
With the DOM model in place, services such as the event model and element APIs enable web developers to change the DOM structure at any time. Each change begins a sequence of “re-building” work of which updating the DOM is only the first step.
Going back to the construction analogy, the on-site raw materials have been formed into the structural framing of the building and built to the right dimensions with internal plumbing, electrical, and other services installed, but with no real sense yet of the building’s final look—its exterior and interior design.
In the next installment, we’ll cover how the browser takes the DOM tree as input to a layout engine that incorporates CSS and transforms the tree into something you can finally see on the screen.