A List Apart

Menu
Indexing the Web—It's Not Just Google's Business Issue № 285

Indexing the Web—It’s Not Just Google’s Business

by Published in The Server Side, Usability22 Comments

Interface responsiveness is one of many details web developers must consider in their quest to deliver a good user experience. An application that responds quickly enhances the user’s sense of control. In working to maximize application speed, though, it’s easy to look in the wrong places. We optimize images and try to reduce page sizes. We compare the performance of web server software, programming languages, frameworks, and hardware, even though the differences in those tools may be minimal.

There’s another, often-overlooked element, however, that can affect performance more than almost anything else: database design. When a database lacks indices on the right columns, speed issues are sure to follow, slowly eroding the user experience as the volume of data increases. Fortunately, the problem is easily addressed.

Web databases do much more than passively store information. Part of their power comes from indexing records efficiently. An index serves as a map, identifying the precise location of a small piece of data in a much larger pile. For example, when I search for “web development,” Google identifies two hundred million results and displays the first ten—in a quarter of a second. But Google isn’t loading every one of those pages and scanning their contents when I perform my search: they’ve analyzed the pages ahead of time and matched my search terms against an index that only references the original content.

Does it really matter?

Yes! In one simple test case, missing indices caused an application to respond 20 to 60 times slower than it should. Let’s take a basic blogging application as an example. We’ll create a few tables and populate them with some randomly generated data:

        
  • articles
  •     
  • articles_categories
  •     
  • categories
  •     
  • comments
  •     
  • users
  •    

Imagine that our blog is relatively new. There’s only a single author, ten articles, and five comments on each article. Our database does contain indices, but only on the primary key (ID) columns for these tables.

First, let’s do a simple query to find all the articles by a particular author, using his e-mail address as the search term.

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
WHERE users.email = 'john.doe@example.com';
    
0.01 seconds

Not surprisingly, this query runs very quickly. After all, there’s only one author, so it doesn’t really matter that our search term (the e-mail address) isn’t in an index.

Let’s take a more complex example. This query finds all article comments for a particular author, including data about which categories the article belongs to:

SELECT * FROM articles
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories
  ON articles_categories.category_id = categories.id
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN comments
  ON comments.article_id = articles.id
WHERE users.email = 'john.doe@example.com';0.02 seconds

Again, the query takes almost no time at all. But in reality, it’s performing a very resource-intensive operation called a full-table scan to get the results. The response is quick only because our data is so limited. Consider what we’re asking the database to do:

        
  1. Identify a user whose e-mail address is john.doe@example.com.
  2.     
  3. Find every article whose user_id matches the id value for that user.
  4.     
  5. Find every category whose ID is listed in the articles_categories table alongside an article_id from the list of articles we’ve identified.
  6.     
  7. Finally, locate every comment whose article_id also matches that list of articles.
  8.    

Not one of those steps actually looks up a record by its own ID—only by the IDs of other linked records. Since only the ID columns are indexed, our database engine must examine every row in at least some of these tables to complete the search. Fast-forward a couple of years to expand the scenario for our blog: we have 1,000 articles, 15 contributing authors, and an average of 25 comments per article. Let’s repeat our simple query:

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
WHERE users.email = 'john.doe@example.com';0.65 seconds

The simple query still finishes in under a second, but the change in response time is significant. The delay makes an application start to feel sluggish and undermines the user experience. Even more dramatic, though, is the difference additional data makes to the second query:

SELECT * FROM articles
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories
  ON articles_categories.category_id = categories.id
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN comments
  ON comments.article_id = articles.id
WHERE users.email = 'john.doe@example.com';6.69 seconds

Now we’re in dangerous territory. Performed routinely, such long-running queries could hamstring an application. The interface slows down. Server processes stack up, waiting for queries to finish. Browsers may time out waiting for data. Since the database and the web server accept a limited number of simultaneous connections, fewer connections are available for incoming requests while processes wait for a query.

Here’s what happens when we add indices to our tables:

Simple query: 0.01 seconds
       Complex query: 0.32 seconds

Note that the simple query finishes just as quickly with 1,000 articles as it did when there were only 15. The complex query isn’t quite as fast, but it’s 20 times faster than before. Without indices, we were scanning every row of some tables, and the response time was directly related to the amount of data being scanned. When we index the specific columns we’re searching, the process becomes far more efficient. Think of it this way—it’s like looking up the word “poultry” in the back of the cookbook instead of flipping through each page and putting a marker on all the chicken dishes. Using indices saves time. The more data searched, the more time is saved by indexing.

What to index

In general, place an index on all foreign keys in a database. Like the user_id column in the articles table, a foreign key is a column that references the ID (or primary key) of another table, linking records across tables. Less obviously, indices should also be applied to any column used to limit a query that searches a large number of records. In our blogging example, we might want to index the email column of our users table, because we often need to identify users by their e-mail addresses. If we know the application will never have more than a dozen users, it won’t make much difference, but if we expect the application’s user base to grow, it could be crucial. Every time a user signs in, we might have a query like this:

SELECT password FROM users WHERE email = '$my_email';

Running this query on a table with thousands of records could be problematic without an index on the email column. To create it in MySQL, use the following command:

ALTER TABLE users ADD INDEX (email);

Most graphical database management tools (phpMyAdmin, for example) offer built-in controls for creating and managing indices. An index can also reference multiple columns together, which is useful when records are often identified by a combination of attributes. To make the following query more efficient, we could index both the email and password columns jointly:

SELECT * FROM users WHERE email = '$my_email' 
  AND password = '$my_password';

Know your tools

Understanding indices is particularly important when we rely on frameworks to write SQL for us. Frameworks are useful for a number of reasons, but we need to be careful. Installing Dreamweaver doesn’t obviate the need to understand XHTML and CSS, and building tables with Ruby on Rails migration scripts doesn’t eliminate the need to index them. And while a mature platform such as Wordpress will create well-indexed tables, the plugins we install for it may not. However, it’s also important not to go overboard. Building too many indices can also be problematic, because the server spends time determining which one(s) to use to satisfy a given query. The database must also update these indices when new records are added. As in any aspect of design, create what’s needed to solve the problem—nothing more or less.

The next level

Indexing database tables is an easy way to boost performance, and in many cases offers huge benefits, but it doesn’t solve every problem. Although optimizing queries for performance is a broad topic, there are a few general guidelines that may help.

Identify the problem child

Finding performance bottlenecks can be difficult. A single rendered page might involve several database queries, and responsiveness may vary depending on the specific data being loaded. Fortunately, many database servers will do the heavy lifting and create a log of slow queries. In MySQL, add the following lines to the server configuration file to create a running log of any query that takes longer than a second to run:

long_query_time  = 1
log-slow-queries = /var/log/mysqld.slow.log

A line like the following one in the resulting log would reveal an opportunity for better indexing:

# Query_time: 7  Lock_time: 0  Rows_sent: 296  
Rows_examined: 75872

This means that the database server examined over 75,000 rows to identify fewer than 300 results. That kind of disproportion usually indicates that full-table scans are happening.

Ask the server to explain itself

Most database servers offer a way to see the execution plan for a query—in other words, how the database thinks through the task we set for it. In MySQL, this is as simple as putting the word EXPLAIN in front of a query, which we can copy and paste from the log. For example:

EXPLAIN
SELECT * FROM articles
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories
  ON articles_categories.category_id = categories.id
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN comments
  ON comments.article_id = articles.id
WHERE users.email = 'john.doe@example.com';

In the original scenario with limited indices, the result might look something like this (omitting several columns for clarity):

                                                                                                                                                                                                                                                                                                                                                                                 
tabletypepossible_keyskeyrowsExtra
articles_categoriesALLNULLNULL1000 
categorieseq_refPRIMARYPRIMARY1 
articleseq_refPRIMARYPRIMARY1 
userseq_refPRIMARYPRIMARY1Using where
commentsALLNULLNULL25000Using where

This tells us a lot about how the database server handles each table included in the query. The type column shows how rows are matched when tables are joined together. Tables marked ALL indicate that a full-table scan is used. Columns labeled possible_keys and key list some indices the server considered potentially useful to satisfy the query, and which, if any, are chosen. The rows column shows how many rows the server thinks it will need to examine. The total number of potential results is the product of all values in that column (not the sum). In other words, for this query, the server anticipates processing up to 25 million combinations of rows—though the actual number will depend on what the real data looks like.

The Extra column contains other hints about how the database processes information. In this case, it shows which data sets are limited by a WHERE clause. With more complex queries, especially those that involve grouping or sorting operations, we could look for red flags such as the phrases Using filesort or Using temporary.

Compare the output of EXPLAIN for the indexed version of our database:

                                                                                                                                                                                                                                                                                                                                                                                 
tabletypepossible_keyskeyrowsExtra
usersrefPRIMARY, emailemail1Using where
articlesrefPRIMARY, user_iduser_id67 
articles_categoriesrefcategory_id, joint_index, article_idarticle_id1 
categorieseq_refPRIMARYPRIMARY1 
commentsrefarticle_idarticle_id25Using where

Notice that not only are the indices used, the order of operations is entirely different. The database server is smart enough to do the best it can with what it’s given, so in the first instance, it developed a plan to work as efficiently as possible without indices. Now that they’re available, the entire approach is different. Multiplying the values in the rows column now gives us only 1,675 potential results—a tiny fraction of the original set.

Fine joinery

Getting data from multiple tables means joining them together based on columns that link records. In the sample queries above, we specified an INNER JOIN to select only rows matching all the specified conditions. Sometimes it’s helpful to use other join types, such as a LEFT JOIN, where data can be returned even when there’s no matching row in one of the tables. In general, though, this second approach requires more work by the database.

For example, if we wanted to locate all the articles by a particular author and include category information, we might use a query like this with inner joins:

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories 
  ON articles_categories.category_id = categories.id 
WHERE users.email = 'john.doe@example.com';

As long as our data integrity is strong, this will work well. But what if the application interface doesn’t require an author to list a category when publishing an article? In that case, this query might not give us all the articles we’re looking for by that author. Any articles that aren’t categorized will be left out of the results, because an inner join requires matching data in both tables. We could rewrite the query as follows:

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
LEFT JOIN articles_categories
  ON articles_categories.article_id = articles.id
LEFT JOIN categories 
  ON articles_categories.category_id = categories.id 
WHERE users.email = 'john.doe@example.com';

This version will catch all the articles by the author in question, simply substituting NULL values for missing category data. But it’s a much more intensive operation. In our sample database, the query using inner joins takes 0.06 seconds to complete, compared to 0.29 seconds for the left joins—nearly five times faster. Because they still include everything that’s returned by an inner join, left joins are sometimes used when they aren’t really needed.

Use only what you need

For simplicity, all the sample queries above used SELECT * to load data, meaning we asked the database to return all the data for the rows we matched. This is almost certainly far more data than we actually need. This is particularly problematic with large text columns, such as blog posts in our articles table. In the second sample query, we wanted to retrieve all the article comments for a particular author. The way the query is written, we’ll actually load the complete contents of the article (along with a lot of other data) not just once per article, but once per comment. So, if an article has 300 comments, the database will load the full text of that article 300 times. It would be much faster to select only the columns we’re interested in, like this:

SELECT comments.* FROM articles
INNER JOIN [...]

In fact, making this change further speeds our earlier sample query by a factor of four, bringing it down to a respectable 0.08 seconds.

The basics are usually good enough

Fortunately, the easy changes often provide the greatest payoff. Deep analysis is usually only required for serious tuning. Following a handful of best practices in database design will improve speed and efficiency for most web applications.

Bear in mind that particular database servers, such as MySQL, PostgreSQL, or Microsoft SQL Server, differ in the way they execute queries and implement indices. It’s best to refer to your platform documentation for details, but the concepts remain the same in most cases. All the examples cited above were performed with MySQL 5.0. It’s also worth noting that there are different subtypes of index with specific properties—like forcing values to be unique or enabling full-text search.

It’s easy to blame sluggish performance on hardware or application code when all too often, the culprit lurks at the database level. If they’re implemented early, before responsiveness becomes a problem, a few small steps can make a huge improvement in your users’ experience.

22 Reader Comments

Load Comments