Indexing the Web—It's Not Just Google's Business
Issue № 285

Indexing the Web—It’s Not Just Google’s Business

Interface responsiveness is one of many details web developers must consider in their quest to deliver a good user experience. An application that responds quickly enhances the user’s sense of control. In working to maximize application speed, though, it’s easy to look in the wrong places. We optimize images and try to reduce page sizes. We compare the performance of web server software, programming languages, frameworks, and hardware, even though the differences in those tools may be minimal.

Article Continues Below

There’s another, often-overlooked element, however, that can affect performance more than almost anything else: database design. When a database lacks indices on the right columns, speed issues are sure to follow, slowly eroding the user experience as the volume of data increases. Fortunately, the problem is easily addressed.

Web databases do much more than passively store information. Part of their power comes from indexing records efficiently. An index serves as a map, identifying the precise location of a small piece of data in a much larger pile. For example, when I search for “web development,” Google identifies two hundred million results and displays the first ten—in a quarter of a second. But Google isn’t loading every one of those pages and scanning their contents when I perform my search: they’ve analyzed the pages ahead of time and matched my search terms against an index that only references the original content.

Does it really matter?#section1

Yes! In one simple test case, missing indices caused an application to respond 20 to 60 times slower than it should. Let’s take a basic blogging application as an example. We’ll create a few tables and populate them with some randomly generated data:

        

  • articles
  •     

  • articles_categories
  •     

  • categories
  •     

  • comments
  •     

  • users
  •   

Imagine that our blog is relatively new. There’s only a single author, ten articles, and five comments on each article. Our database does contain indices, but only on the primary key (ID) columns for these tables.

First, let’s do a simple query to find all the articles by a particular author, using his e-mail address as the search term.

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
WHERE users.email = 'john.doe@example.com';
    
0.01 seconds

Not surprisingly, this query runs very quickly. After all, there’s only one author, so it doesn’t really matter that our search term (the e-mail address) isn’t in an index.

Let’s take a more complex example. This query finds all article comments for a particular author, including data about which categories the article belongs to:

SELECT * FROM articles
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories
  ON articles_categories.category_id = categories.id
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN comments
  ON comments.article_id = articles.id
WHERE users.email = 'john.doe@example.com';0.02 seconds

Again, the query takes almost no time at all. But in reality, it’s performing a very resource-intensive operation called a full-table scan to get the results. The response is quick only because our data is so limited. Consider what we’re asking the database to do:

        

  1. Identify a user whose e-mail address is john.doe@example.com.
  2.     

  3. Find every article whose user_id matches the id value for that user.
  4.     

  5. Find every category whose ID is listed in the articles_categories table alongside an article_id from the list of articles we’ve identified.
  6.     

  7. Finally, locate every comment whose article_id also matches that list of articles.
  8.   

Not one of those steps actually looks up a record by its own ID—only by the IDs of other linked records. Since only the ID columns are indexed, our database engine must examine every row in at least some of these tables to complete the search. Fast-forward a couple of years to expand the scenario for our blog: we have 1,000 articles, 15 contributing authors, and an average of 25 comments per article. Let’s repeat our simple query:

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
WHERE users.email = 'john.doe@example.com';0.65 seconds

The simple query still finishes in under a second, but the change in response time is significant. The delay makes an application start to feel sluggish and undermines the user experience. Even more dramatic, though, is the difference additional data makes to the second query:

SELECT * FROM articles
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories
  ON articles_categories.category_id = categories.id
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN comments
  ON comments.article_id = articles.id
WHERE users.email = 'john.doe@example.com';6.69 seconds

Now we’re in dangerous territory. Performed routinely, such long-running queries could hamstring an application. The interface slows down. Server processes stack up, waiting for queries to finish. Browsers may time out waiting for data. Since the database and the web server accept a limited number of simultaneous connections, fewer connections are available for incoming requests while processes wait for a query.

Here’s what happens when we add indices to our tables:

Simple query: 0.01 seconds
      Complex query: 0.32 seconds

Note that the simple query finishes just as quickly with 1,000 articles as it did when there were only 15. The complex query isn’t quite as fast, but it’s 20 times faster than before. Without indices, we were scanning every row of some tables, and the response time was directly related to the amount of data being scanned. When we index the specific columns we’re searching, the process becomes far more efficient. Think of it this way—it’s like looking up the word “poultry” in the back of the cookbook instead of flipping through each page and putting a marker on all the chicken dishes. Using indices saves time. The more data searched, the more time is saved by indexing.

What to index#section2

In general, place an index on all foreign keys in a database. Like the user_id column in the articles table, a foreign key is a column that references the ID (or primary key) of another table, linking records across tables. Less obviously, indices should also be applied to any column used to limit a query that searches a large number of records. In our blogging example, we might want to index the email column of our users table, because we often need to identify users by their e-mail addresses. If we know the application will never have more than a dozen users, it won’t make much difference, but if we expect the application’s user base to grow, it could be crucial. Every time a user signs in, we might have a query like this:

SELECT password FROM users WHERE email = '$my_email';

Running this query on a table with thousands of records could be problematic without an index on the email column. To create it in MySQL, use the following command:

ALTER TABLE users ADD INDEX (email);

Most graphical database management tools (phpMyAdmin, for example) offer built-in controls for creating and managing indices. An index can also reference multiple columns together, which is useful when records are often identified by a combination of attributes. To make the following query more efficient, we could index both the email and password columns jointly:

SELECT * FROM users WHERE email = '$my_email' 
  AND password = '$my_password';

Know your tools#section3

Understanding indices is particularly important when we rely on frameworks to write SQL for us. Frameworks are useful for a number of reasons, but we need to be careful. Installing Dreamweaver doesn’t obviate the need to understand XHTML and CSS, and building tables with Ruby on Rails migration scripts doesn’t eliminate the need to index them. And while a mature platform such as WordPress will create well-indexed tables, the plugins we install for it may not. However, it’s also important not to go overboard. Building too many indices can also be problematic, because the server spends time determining which one(s) to use to satisfy a given query. The database must also update these indices when new records are added. As in any aspect of design, create what’s needed to solve the problem—nothing more or less.

The next level#section4

Indexing database tables is an easy way to boost performance, and in many cases offers huge benefits, but it doesn’t solve every problem. Although optimizing queries for performance is a broad topic, there are a few general guidelines that may help.

Identify the problem child#section5

Finding performance bottlenecks can be difficult. A single rendered page might involve several database queries, and responsiveness may vary depending on the specific data being loaded. Fortunately, many database servers will do the heavy lifting and create a log of slow queries. In MySQL, add the following lines to the server configuration file to create a running log of any query that takes longer than a second to run:

long_query_time  = 1
log-slow-queries = /var/log/mysqld.slow.log

A line like the following one in the resulting log would reveal an opportunity for better indexing:

# Query_time: 7  Lock_time: 0  Rows_sent: 296  
Rows_examined: 75872

This means that the database server examined over 75,000 rows to identify fewer than 300 results. That kind of disproportion usually indicates that full-table scans are happening.

Ask the server to explain itself#section6

Most database servers offer a way to see the execution plan for a query—in other words, how the database thinks through the task we set for it. In MySQL, this is as simple as putting the word EXPLAIN in front of a query, which we can copy and paste from the log. For example:

EXPLAIN
SELECT * FROM articles
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories
  ON articles_categories.category_id = categories.id
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN comments
  ON comments.article_id = articles.id
WHERE users.email = 'john.doe@example.com';

In the original scenario with limited indices, the result might look something like this (omitting several columns for clarity):

 

     

       

       

       

       

       

       

     

 

     

       

       

       

       

       

       

     

     

       

       

       

       

       

       

     

     

       

       

       

       

       

       

     

     

       

       

       

       

       

       

     

     

       

       

       

       

       

       

     

  

table type possible_keys key rows Extra
articles_categories ALL NULL NULL 1000  
categories eq_ref PRIMARY PRIMARY 1  
articles eq_ref PRIMARY PRIMARY 1  
users eq_ref PRIMARY PRIMARY 1 Using where
comments ALL NULL NULL 25000 Using where

This tells us a lot about how the database server handles each table included in the query. The type column shows how rows are matched when tables are joined together. Tables marked ALL indicate that a full-table scan is used. Columns labeled possible_keys and key list some indices the server considered potentially useful to satisfy the query, and which, if any, are chosen. The rows column shows how many rows the server thinks it will need to examine. The total number of potential results is the product of all values in that column (not the sum). In other words, for this query, the server anticipates processing up to 25 million combinations of rows—though the actual number will depend on what the real data looks like.

The Extra column contains other hints about how the database processes information. In this case, it shows which data sets are limited by a WHERE clause. With more complex queries, especially those that involve grouping or sorting operations, we could look for red flags such as the phrases Using filesort or Using temporary.

Compare the output of EXPLAIN for the indexed version of our database:

 

     

       

       

       

       

       

       

     

 

     

       

       

       

       

       

       

     

     

       

       

       

       

       

       

     

     

       

       

       

       

       

       

     

     

       

       

       

       

       

       

     

     

       

       

       

       

       

       

     

  

table type possible_keys key rows Extra
users ref PRIMARY, email email 1 Using where
articles ref PRIMARY, user_id user_id 67  
articles_categories ref category_id, joint_index, article_id article_id 1  
categories eq_ref PRIMARY PRIMARY 1  
comments ref article_id article_id 25 Using where

Notice that not only are the indices used, the order of operations is entirely different. The database server is smart enough to do the best it can with what it’s given, so in the first instance, it developed a plan to work as efficiently as possible without indices. Now that they’re available, the entire approach is different. Multiplying the values in the rows column now gives us only 1,675 potential results—a tiny fraction of the original set.

Fine joinery#section7

Getting data from multiple tables means joining them together based on columns that link records. In the sample queries above, we specified an INNER JOIN to select only rows matching all the specified conditions. Sometimes it’s helpful to use other join types, such as a LEFT JOIN, where data can be returned even when there’s no matching row in one of the tables. In general, though, this second approach requires more work by the database.

For example, if we wanted to locate all the articles by a particular author and include category information, we might use a query like this with inner joins:

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
INNER JOIN articles_categories
  ON articles_categories.article_id = articles.id
INNER JOIN categories 
  ON articles_categories.category_id = categories.id 
WHERE users.email = 'john.doe@example.com';

As long as our data integrity is strong, this will work well. But what if the application interface doesn’t require an author to list a category when publishing an article? In that case, this query might not give us all the articles we’re looking for by that author. Any articles that aren’t categorized will be left out of the results, because an inner join requires matching data in both tables. We could rewrite the query as follows:

SELECT * FROM articles 
INNER JOIN users
  ON articles.user_id = users.id
LEFT JOIN articles_categories
  ON articles_categories.article_id = articles.id
LEFT JOIN categories 
  ON articles_categories.category_id = categories.id 
WHERE users.email = 'john.doe@example.com';

This version will catch all the articles by the author in question, simply substituting NULL values for missing category data. But it’s a much more intensive operation. In our sample database, the query using inner joins takes 0.06 seconds to complete, compared to 0.29 seconds for the left joins—nearly five times faster. Because they still include everything that’s returned by an inner join, left joins are sometimes used when they aren’t really needed.

Use only what you need#section8

For simplicity, all the sample queries above used SELECT * to load data, meaning we asked the database to return all the data for the rows we matched. This is almost certainly far more data than we actually need. This is particularly problematic with large text columns, such as blog posts in our articles table. In the second sample query, we wanted to retrieve all the article comments for a particular author. The way the query is written, we’ll actually load the complete contents of the article (along with a lot of other data) not just once per article, but once per comment. So, if an article has 300 comments, the database will load the full text of that article 300 times. It would be much faster to select only the columns we’re interested in, like this:

SELECT comments.* FROM articles
INNER JOIN [...]

In fact, making this change further speeds our earlier sample query by a factor of four, bringing it down to a respectable 0.08 seconds.

The basics are usually good enough#section9

Fortunately, the easy changes often provide the greatest payoff. Deep analysis is usually only required for serious tuning. Following a handful of best practices in database design will improve speed and efficiency for most web applications.

Bear in mind that particular database servers, such as MySQL, PostgreSQL, or Microsoft SQL Server, differ in the way they execute queries and implement indices. It’s best to refer to your platform documentation for details, but the concepts remain the same in most cases. All the examples cited above were performed with MySQL 5.0. It’s also worth noting that there are different subtypes of index with specific properties—like forcing values to be unique or enabling full-text search.

It’s easy to blame sluggish performance on hardware or application code when all too often, the culprit lurks at the database level. If they’re implemented early, before responsiveness becomes a problem, a few small steps can make a huge improvement in your users’ experience.

22 Reader Comments

  1. This is a fantastic article to see on ALA. In one of my past lifes I carried out a lot of performance testing and tuning web applications. If developers can make sure these simple things are done when the application is coded it will make everyone lives easier in the long term.

    The difference to the user experiance will also be dramatic, the first time I ran a performanced tuned application I could not believe the difference it made.

    Phil

  2. I really enjoyed this article and could impliment a couple of the points listed – especially indexes on varchar type fields, which already shows some speed increases in one of my sites.

  3. I really appreciate this article because it was a post more technical than usual. Most of the time, we discuss size reduction (removing exif on pictures, better compressions), number of files downloaded (css sprites, css group), cache and http issues so theses advices are welcome !

  4. Webbies failing to think seriously about databases has to be one of the major problems of web application development. Unfortunately “toys” (and dangerous ones if incorrectly used) like MySQL and the associated literature don’t help. Articles like this can help to change this. Anyone starting out and looking for a free DBMS might one a look at PostgreSQL as it automatically creates indexes for any foreign keys that a table references, so you get performance + referential integrity. Add bound parameters to the client and you give the database chance to cache the query and you’re protected against SQL injection. Happy days!

  5. So let me get this straight – having the same fields in the database yet declaring a few select ones as ‘index’ changes all of this? Really??

    By the way, the site == great in expressionengine!

  6. I believe MySQL automatically creates an index when you create a foreign key constraint, which makes this advice even easier to follow, assuming you’re using a storage engine with foreign keys.

    Another related topic that could help out the same audience is an introduction to normalization…

  7. Michael, yes, I believe constrained foreign keys are indexed automatically. However, in MySQL the default storage engine in MyISAM, which doesn’t support constraints (as InnoDB does). In many applications, the relationships are only conceptual, which is why it’s so easy to overlook these things.

  8. Thanks for a very informative article. The little things that make a huge difference in indexing the web… And they do make a difference.

  9. Yes this id true i found a huge difference in indexing the web.
    This article help you to indexing a web…thanx for giving this type of articles which help other…Thanx Again

  10. Great article. You have really illustrated the key things that are needed for indexing the web. In fact, these are the most important but oft-forgotten tweaks!

    Thanks for sharing!

  11. “Think of it this way—it’s like looking up the word “poultry” in the back of the cookbook instead of flipping through each page and putting a marker on all the chicken dishes.”

    That summarizes the entire article in a nutshell. Good advice on how to keep databases running smoothly and efficiently, thanks.

  12. It was interesting that you mentioned frameworks. I use Django, I just checked the MySQL tables it creates, and it seems to index the foreign keys.

    But it’s a lesson learned for me, as I’ve never really looked at database optimisation before. Time to start I feel!

    Thanks

  13. It is not only me most of the webmasters and SEO experts would be thinking that indexing is Google’s job. But now I got what to do to make the search engines index the pages of the websites.

  14. Reading the acticle I hate to say that Google dictates our behavior forcing us to pay 24/7 attention if you do website promotion

  15. Database optimization is an often-overlooked aspect of web development. Many of the projects I’ve worked on pay absolutely no attention to it – to the extent that the only index on a given table is a primary key. Optimizations can have vast impacts on server performance; I recently built an ajax-based real time chat app that saw performance gains of about 60 times with some indexing and SELECT query optimizations. The “SELECT … USING” syntax is a godsend for hinting the database to use an indexed column in a multi-column select condition.

Got something to say?

We have turned off comments, but you can see what folks had to say before we did so.

More from ALA