Indexing the Web—It’s Not Just Google’s Business

by Lyle Mullican, June 09, 2009

Interface responsiveness is one of many details web developers must consider in their quest to deliver a good user experience. An application that responds quickly enhances the user’s sense of control. In working to maximize application speed, though, it’s easy to look in the wrong places. We optimize images and try to reduce page sizes. We compare the performance of web server software, programming languages, frameworks, and hardware, even though the differences in those tools may be minimal.


There’s another, often-overlooked element, however, that can affect performance more than almost anything else: database design. When a database lacks indices on the right columns, speed issues are sure to follow, slowly eroding the user experience as the volume of data increases. Fortunately, the problem is easily addressed.

Web databases do much more than passively store information. Part of their power comes from indexing records efficiently. An index serves as a map, identifying the precise location of a small piece of data in a much larger pile. For example, when I search for “web development,” Google identifies two hundred million results and displays the first ten—in a quarter of a second. But Google isn’t loading every one of those pages and scanning their contents when I perform my search: they’ve analyzed the pages ahead of time and matched my search terms against an index that only references the original content.

Does it really matter?

Yes! In one simple test case, missing indices caused an application to respond 20 to 60 times slower than it should. Let’s take a basic blogging application as an example. We’ll create a few tables and populate them with some randomly generated data:

articles

articles_categories

categories

comments

users

Imagine that our blog is relatively new. There’s only a single author, ten articles, and five comments on each article. Our database does contain indices, but only on the primary key (ID) columns for these tables.
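The table definitions are not reproduced here, so what follows is only a minimal sketch of a schema consistent with the queries in this article. The id, user_id, email, password, article_id, and category_id columns come from the queries themselves. Everything else (the title, body, and name columns, the column types and sizes, and the id column on articles_categories) is an assumption made for illustration. As described above, only the primary keys are indexed.

```sql
-- Hypothetical schema: only the primary key of each table is indexed.
CREATE TABLE users (
  id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
  email    VARCHAR(100) NOT NULL,
  password VARCHAR(64)  NOT NULL,
  PRIMARY KEY (id)
);

CREATE TABLE articles (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id INT UNSIGNED NOT NULL,  -- refers to users.id, but is not indexed
  title   VARCHAR(255) NOT NULL,  -- assumed column
  body    TEXT         NOT NULL,  -- assumed column
  PRIMARY KEY (id)
);

CREATE TABLE categories (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,     -- assumed column
  PRIMARY KEY (id)
);

CREATE TABLE articles_categories (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- assumed column
  article_id  INT UNSIGNED NOT NULL,  -- refers to articles.id, not indexed
  category_id INT UNSIGNED NOT NULL,  -- refers to categories.id, not indexed
  PRIMARY KEY (id)
);

CREATE TABLE comments (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
  article_id INT UNSIGNED NOT NULL,  -- refers to articles.id, not indexed
  body       TEXT         NOT NULL,  -- assumed column
  PRIMARY KEY (id)
);
```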

First, let’s do a simple query to find all the articles by a particular author, using his e-mail address as the search term.

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
WHERE users.email = 'john.doe@example.com';
    
0.01 seconds

Not surprisingly, this query runs very quickly. After all, there’s only one author, so it doesn’t really matter that our search term (the e-mail address) isn’t in an index.

Let’s take a more complex example. This query finds all article comments for a particular author, including data about which categories the article belongs to:

SELECT * FROM articles
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories
  ON articles_categories.category_id = categories.id
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN comments
  ON comments.article_id = articles.id
WHERE users.email = 'john.doe@example.com';

0.02 seconds

Again, the query takes almost no time at all. But in reality, it’s performing a very resource-intensive operation called a full-table scan to get the results. The response is quick only because our data is so limited. Consider what we’re asking the database to do:

Identify a user whose e-mail address is john.doe@example.com.

Find every article whose user_id matches the id value for that user.

Find every category whose ID is listed in the articles_categories table alongside an article_id from the list of articles we’ve identified.

Finally, locate every comment whose article_id also matches that list of articles.

Not one of those steps actually looks up a record by its own ID—only by the IDs of other linked records. Since only the ID columns are indexed, our database engine must examine every row in at least some of these tables to complete the search. Fast-forward a couple of years to expand the scenario for our blog: we have 1,000 articles, 15 contributing authors, and an average of 25 comments per article. Let’s repeat our simple query:

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
WHERE users.email = 'john.doe@example.com';

0.65 seconds

The simple query still finishes in under a second, but the change in response time is significant. The delay makes an application start to feel sluggish and undermines the user experience. Even more dramatic, though, is the difference additional data makes to the second query:

SELECT * FROM articles
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories
  ON articles_categories.category_id = categories.id
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN comments
  ON comments.article_id = articles.id
WHERE users.email = 'john.doe@example.com';

6.69 seconds

Now we’re in dangerous territory. Performed routinely, such long-running queries could hamstring an application. The interface slows down. Server processes stack up, waiting for queries to finish. Browsers may time out waiting for data. Since the database and the web server accept a limited number of simultaneous connections, fewer connections are available for incoming requests while processes wait for a query.

Here’s what happens when we add indices to our tables:

Simple query: 0.01 seconds
Complex query: 0.32 seconds

Note that the simple query finishes just as quickly with 1,000 articles as it did when there were only ten. The complex query isn’t quite as fast, but it’s 20 times faster than before. Without indices, we were scanning every row of some tables, and the response time was directly related to the amount of data being scanned. When we index the specific columns we’re searching, the process becomes far more efficient. Think of it this way: it’s like looking up the word “poultry” in the back of the cookbook instead of flipping through each page and putting a marker on all the chicken dishes. Using indices saves time. The more data searched, the more time is saved by indexing.
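The exact statements used to add the indices are not listed, so the following is a sketch of what they might look like in MySQL. The index names follow the key names that appear in the EXPLAIN output later in this article. The joint_index listed there is left out because the article does not say which columns it covers.

```sql
-- Index the column used in the WHERE clause.
ALTER TABLE users ADD INDEX email (email);

-- Index the foreign keys used to join the tables.
ALTER TABLE articles ADD INDEX user_id (user_id);
ALTER TABLE comments ADD INDEX article_id (article_id);
ALTER TABLE articles_categories
  ADD INDEX article_id (article_id),
  ADD INDEX category_id (category_id);
```

Each statement rebuilds or extends the table's index data, so on a large production table it may be worth running them during a quiet period.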

What to index

In general, place an index on all foreign keys in a database. Like the user_id column in the articles table, a foreign key is a column that references the ID (or primary key) of another table, linking records across tables. Less obviously, indices should also be applied to any column used to limit a query that searches a large number of records. In our blogging example, we might want to index the email column of our users table, because we often need to identify users by their e-mail addresses. If we know the application will never have more than a dozen users, it won’t make much difference, but if we expect the application’s user base to grow, it could be crucial. Every time a user signs in, we might have a query like this:

SELECT password FROM users WHERE email = '$my_email';

Running this query on a table with thousands of records could be problematic without an index on the email column. To create it in MySQL, use the following command:

ALTER TABLE users ADD INDEX (email);

Most graphical database management tools (phpMyAdmin, for example) offer built-in controls for creating and managing indices. An index can also reference multiple columns together, which is useful when records are often identified by a combination of attributes. To make the following query more efficient, we could index both the email and password columns jointly:

SELECT * FROM users WHERE email = '$my_email' 
  AND password = '$my_password';
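As a rough sketch, a joint index for that query could be created as shown below. The index name email_password is hypothetical, and the statement assumes both columns are short enough to fit within the storage engine's key-length limit.

```sql
-- Multi-column index on email and password (index name is hypothetical).
ALTER TABLE users ADD INDEX email_password (email, password);
```

Column order matters here. MySQL can use the leftmost column of a multi-column index on its own, so this index can also serve queries that filter on email alone, but not queries that filter only on password.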

Know your tools

Understanding indices is particularly important when we rely on frameworks to write SQL for us. Frameworks are useful for a number of reasons, but we need to be careful. Installing Dreamweaver doesn’t obviate the need to understand XHTML and CSS, and building tables with Ruby on Rails migration scripts doesn’t eliminate the need to index them. And while a mature platform such as WordPress will create well-indexed tables, the plugins we install for it may not. However, it’s also important not to go overboard. Building too many indices can also be problematic, because the server spends time determining which one(s) to use to satisfy a given query. The database must also update these indices whenever records are added or deleted, or an indexed column changes, so each extra index makes writes a little slower. As in any aspect of design, create what’s needed to solve the problem: nothing more, nothing less.
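To check what a framework or plugin has actually created, MySQL can list the indices on a table. The sketch below uses the articles table from the earlier examples; unused_index is a hypothetical name standing in for an index that turns out not to be needed.

```sql
-- List the indices currently defined on a table.
SHOW INDEX FROM articles;

-- Remove an index that is not earning its keep
-- (unused_index is a hypothetical name).
ALTER TABLE articles DROP INDEX unused_index;
```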

The next level

Indexing database tables is an easy way to boost performance, and in many cases offers huge benefits, but it doesn’t solve every problem. Although optimizing queries for performance is a broad topic, there are a few general guidelines that may help.

Identify the problem child

Finding performance bottlenecks can be difficult. A single rendered page might involve several database queries, and responsiveness may vary depending on the specific data being loaded. Fortunately, many database servers will do the heavy lifting and create a log of slow queries. In MySQL, add the following lines to the server configuration file to create a running log of any query that takes longer than a second to run:

long_query_time  = 1
log-slow-queries = /var/log/mysqld.slow.log

A line like the following one in the resulting log would reveal an opportunity for better indexing:

# Query_time: 7  Lock_time: 0  Rows_sent: 296  Rows_examined: 75872

This means that the database server examined over 75,000 rows to identify fewer than 300 results. That kind of disproportion usually indicates that full-table scans are happening.
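The configuration lines above match the MySQL 5.0 server used for this article. On MySQL 5.1 and later, the slow query log can also be switched on at runtime without editing the configuration file. The sketch below assumes such a server and an account with permission to set global variables; it reuses the log path from the configuration example.

```sql
-- Assumes MySQL 5.1 or later and privileges to set global variables.
SET GLOBAL slow_query_log_file = '/var/log/mysqld.slow.log';
SET GLOBAL long_query_time = 1;
SET GLOBAL slow_query_log = 'ON';

-- Confirm the settings that are now in effect.
SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';
SHOW GLOBAL VARIABLES LIKE 'long_query_time';
```

The new long_query_time value applies to connections opened after the change, and settings made this way are lost when the server restarts unless they are also written to the configuration file.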

Ask the server to explain itself

Most database servers offer a way to see the execution plan for a query—in other words, how the database thinks through the task we set for it. In MySQL, this is as simple as putting the word EXPLAIN in front of a query, which we can copy and paste from the log. For example:

EXPLAIN
SELECT * FROM articles
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories
  ON articles_categories.category_id = categories.id
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN comments
  ON comments.article_id = articles.id
WHERE users.email = 'john.doe@example.com';

In the original scenario with limited indices, the result might look something like this (omitting several columns for clarity):

table	type	possible_keys	key	rows	Extra
articles_categories	ALL	NULL	NULL	1000
categories	eq_ref	PRIMARY	PRIMARY	1
articles	eq_ref	PRIMARY	PRIMARY	1
users	eq_ref	PRIMARY	PRIMARY	1	Using where
comments	ALL	NULL	NULL	25000	Using where

This tells us a lot about how the database server handles each table included in the query. The type column shows how rows are matched when tables are joined together. Tables marked ALL indicate that a full-table scan is used. The possible_keys column lists indices the server considered potentially useful to satisfy the query, and the key column shows which one, if any, it chose. The rows column shows how many rows the server thinks it will need to examine in each table. The total number of potential results is the product of all values in that column (not the sum). In other words, for this query, the server anticipates processing up to 1,000 × 1 × 1 × 1 × 25,000 = 25 million combinations of rows, though the actual number will depend on what the real data looks like.

The Extra column contains other hints about how the database processes information. In this case, it shows which data sets are limited by a WHERE clause. With more complex queries, especially those that involve grouping or sorting operations, we could look for red flags such as the phrases Using filesort or Using temporary.
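As an illustration of the kind of query that can raise those flags, here is a sketch that sorts comments by a column with no index. The created_at column and the index name article_created are hypothetical; neither appears in the sample database described in this article.

```sql
-- created_at is a hypothetical column, used here only for illustration.
-- With no index covering the sort, Extra may report "Using filesort".
EXPLAIN
SELECT comments.id
FROM comments
WHERE comments.article_id = 42
ORDER BY comments.created_at DESC;

-- An index on the filter column followed by the sort column may let
-- the server read the rows already in order (index name is hypothetical).
ALTER TABLE comments ADD INDEX article_created (article_id, created_at);
```

Running the EXPLAIN again after adding the index is the simplest way to see whether the server actually chose it.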

Compare the output of EXPLAIN for the indexed version of our database:

table	type	possible_keys	key	rows	Extra
users	ref	PRIMARY, email	email	1	Using where
articles	ref	PRIMARY, user_id	user_id	67
articles_categories	ref	category_id, joint_index, article_id	article_id	1
categories	eq_ref	PRIMARY	PRIMARY	1
comments	ref	article_id	article_id	25	Using where

Notice that not only are the indices used, the order of operations is entirely different. The database server is smart enough to do the best it can with what it’s given, so in the first instance, it developed a plan to work as efficiently as possible without indices. Now that they’re available, the entire approach is different. Multiplying the values in the rows column (1 × 67 × 1 × 1 × 25) now gives us only 1,675 potential results, a tiny fraction of the original set.

Fine joinery

Getting data from multiple tables means joining them together based on columns that link records. In the sample queries above, we specified an INNER JOIN to select only rows matching all the specified conditions. Sometimes it’s helpful to use other join types, such as a LEFT JOIN, where data can be returned even when there’s no matching row in one of the tables. In general, though, this second approach requires more work by the database.

For example, if we wanted to locate all the articles by a particular author and include category information, we might use a query like this with inner joins:

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories 
  ON articles_categories.category_id = categories.id 
WHERE users.email = 'john.doe@example.com';

As long as our data integrity is strong, this will work well. But what if the application interface doesn’t require an author to list a category when publishing an article? In that case, this query might not give us all the articles we’re looking for by that author. Any articles that aren’t categorized will be left out of the results, because an inner join requires matching data in both tables. We could rewrite the query as follows:

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
LEFT JOIN articles_categories
  ON articles_categories.article_id = articles.id
LEFT JOIN categories 
  ON articles_categories.category_id = categories.id 
WHERE users.email = 'john.doe@example.com';

This version will catch all the articles by the author in question, simply substituting NULL values for missing category data. But it’s a much more intensive operation. In our sample database, the query using inner joins takes 0.06 seconds to complete, compared to 0.29 seconds for the left joins, which makes the inner-join version nearly five times faster. Because they still include everything that’s returned by an inner join, left joins are sometimes used when they aren’t really needed.

Use only what you need

For simplicity, all the sample queries above used SELECT * to load data, meaning we asked the database to return all the data for the rows we matched. This is almost certainly far more data than we actually need. This is particularly problematic with large text columns, such as blog posts in our articles table. In the second sample query, we wanted to retrieve all the article comments for a particular author. The way the query is written, we’ll actually load the complete contents of the article (along with a lot of other data) not just once per article, but once per comment. So, if an article has 300 comments, the database will load the full text of that article 300 times. It would be much faster to select only the columns we’re interested in, like this:

SELECT comments.* FROM articles
INNER JOIN [...]

In fact, making this change further speeds our earlier sample query by a factor of four, bringing it down to a respectable 0.08 seconds.
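Going one step further, the columns can be named individually. The sketch below is a variation and not the query that was timed above: it assumes the category data is not needed, so the category joins are dropped, and comments.body is a hypothetical column name.

```sql
-- Name the columns explicitly instead of using comments.*
-- (comments.body is a hypothetical column name).
SELECT comments.id, comments.article_id, comments.body
FROM articles
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN comments
  ON comments.article_id = articles.id
WHERE users.email = 'john.doe@example.com';
```

Without the category joins, each comment is returned once, not once per category its article belongs to.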

The basics are usually good enough

Fortunately, the easy changes often provide the greatest payoff. Deep analysis is usually only required for serious tuning. Following a handful of best practices in database design will improve speed and efficiency for most web applications.

Bear in mind that particular database servers, such as MySQL, PostgreSQL, or Microsoft SQL Server, differ in the way they execute queries and implement indices. It’s best to refer to your platform documentation for details, but the concepts remain the same in most cases. All the examples cited above were performed with MySQL 5.0. It’s also worth noting that there are different subtypes of index with specific properties—like forcing values to be unique or enabling full-text search.

It’s easy to blame sluggish performance on hardware or application code when all too often, the culprit lurks at the database level. If they’re implemented early, before responsiveness becomes a problem, a few small steps can make a huge improvement in your users’ experience.

22 Reader Comments

welovenicethings says:

June 9, 2009 at 11:01 am

This is a fantastic article to see on ALA. In one of my past lives I carried out a lot of performance testing and tuning of web applications. If developers can make sure these simple things are done when the application is coded it will make everyone’s lives easier in the long term.

The difference to the user experience will also be dramatic; the first time I ran a performance-tuned application I could not believe the difference it made.

Phil
AdriaanNel says:

June 9, 2009 at 11:26 am

I really enjoyed this article and could implement a couple of the points listed – especially indexes on varchar type fields, which already show some speed increases in one of my sites.
Christophe BENOIT says:

June 9, 2009 at 11:35 am

I really appreciate this article because it was a post more technical than usual. Most of the time, we discuss size reduction (removing exif on pictures, better compressions), number of files downloaded (css sprites, css group), cache and http issues so this advice is welcome!
Charlie Clark says:

June 9, 2009 at 1:17 pm

Webbies failing to think seriously about databases has to be one of the major problems of web application development. Unfortunately “toys” (and dangerous ones if incorrectly used) like MySQL and the associated literature don’t help. Articles like this can help to change this. Anyone starting out and looking for a free DBMS might want a look at PostgreSQL as it automatically creates indexes for any foreign keys that a table references, so you get performance + referential integrity. Add bound parameters to the client and you give the database a chance to cache the query and you’re protected against SQL injection. Happy days!
carolinecblaker says:

June 9, 2009 at 3:05 pm

So let me get this straight – having the same fields in the database yet declaring a few select ones as ‘index’ changes all of this? Really??

By the way, the site == great in expressionengine!
Michael Newton says:

June 9, 2009 at 5:38 pm

I believe MySQL automatically creates an index when you create a foreign key constraint, which makes this advice even easier to follow, assuming you’re using a storage engine with foreign keys.

Another related topic that could help out the same audience is an introduction to normalization…
Lyle Mullican says:

June 9, 2009 at 6:35 pm

Michael, yes, I believe constrained foreign keys are indexed automatically. However, in MySQL the default storage engine is MyISAM, which doesn’t support constraints (as InnoDB does). In many applications, the relationships are only conceptual, which is why it’s so easy to overlook these things.
forex says:

June 9, 2009 at 9:17 pm

Thanks for a very informative article. The little things that make a huge difference in indexing the web… And they do make a difference.
Bjørn Enki says:

June 9, 2009 at 10:29 pm

Thanks Lyle, great explanations and very informative!
darsh39 says:

June 10, 2009 at 4:41 am

Yes this is true, I found a huge difference in indexing the web.
This article help you to indexing a web…thanx for giving this type of articles which help other…Thanx Again
magicman24 says:

June 14, 2009 at 10:14 am

that’s a really interesting article. I remember the days where it took about 30 minutes to index a page 🙂

Mike Hersh
imblogging.com
elearningflash says:

June 14, 2009 at 8:31 pm

Sounds like building a good website is like building a good building
If the foundations are stable, it will last for a good while.
Maneet Puri says:

June 16, 2009 at 11:28 am

Great article. You have really illustrated the key things that are needed for indexing the web. In fact, these are the most important but oft-forgotten tweaks!

Thanks for sharing!
unbrandedc says:

June 16, 2009 at 7:56 pm

“Think of it this way—it’s like looking up the word “poultry” in the back of the cookbook instead of flipping through each page and putting a marker on all the chicken dishes.”

That summarizes the entire article in a nutshell. Good advice on how to keep databases running smoothly and efficiently, thanks.
stuartmarsh says:

June 18, 2009 at 6:43 pm

It was interesting that you mentioned frameworks. I use Django, I just checked the MySQL tables it creates, and it seems to index the foreign keys.

But it’s a lesson learned for me, as I’ve never really looked at database optimisation before. Time to start I feel!

Thanks
shapiawebdesigns says:

June 19, 2009 at 7:25 am

It is not only me most of the webmasters and SEO experts would be thinking that indexing is Google’s job. But now I got what to do to make the search engines index the pages of the websites.
JohnShuffle says:

June 20, 2009 at 2:57 pm

Reading the article I hate to say that Google dictates our behavior forcing us to pay 24/7 attention if you do website promotion
Justen Robertson says:

June 25, 2009 at 10:20 am

Database optimization is an often-overlooked aspect of web development. Many of the projects I’ve worked on pay absolutely no attention to it – to the extent that the only index on a given table is a primary key. Optimizations can have vast impacts on server performance; I recently built an ajax-based real time chat app that saw performance gains of about 60 times with some indexing and SELECT query optimizations. The “SELECT … USING” syntax is a godsend for hinting the database to use an indexed column in a multi-column select condition.
Montmorency says:

July 10, 2009 at 6:14 am

Here is the translation of the article: http://interpretor.ru/sql_indexing

Thanks for the great article!
flash bannere says:

November 4, 2010 at 3:19 pm

Interesting article! And this commentfield is really well designed 🙂
webdesign center says:

June 10, 2011 at 6:48 am

Great reading, very good article. Thank you
1smykke.dk says:

June 11, 2011 at 6:39 pm

Helped me realize some very good points.
