Interface responsiveness is one of many details web developers must consider in their quest to deliver a good user experience. An application that responds quickly enhances the user’s sense of control. In working to maximize application speed, though, it’s easy to look in the wrong places. We optimize images and try to reduce page sizes. We compare the performance of web server software, programming languages, frameworks, and hardware, even though the differences in those tools may be minimal.
There’s another, often-overlooked element, however, that can affect performance more than almost anything else: database design. When a database lacks indices on the right columns, speed issues are sure to follow, slowly eroding the user experience as the volume of data increases. Fortunately, the problem is easily addressed.
Web databases do much more than passively store information. Part of their power comes from indexing records efficiently. An index serves as a map, identifying the precise location of a small piece of data in a much larger pile. For example, when I search for “web development,” Google identifies two hundred million results and displays the first ten—in a quarter of a second. But Google isn’t loading every one of those pages and scanning their contents when I perform my search: they’ve analyzed the pages ahead of time and matched my search terms against an index that only references the original content.
Does it really matter?
Yes! In one simple test case, missing indices caused an application to respond 20 to 60 times slower than it should. Let’s take a basic blogging application as an example. We’ll create a few tables and populate them with some randomly generated data:
Imagine that our blog is relatively new. There’s only a single author, ten articles, and five comments on each article. Our database does contain indices, but only on the primary key (ID) columns for these tables.
First, let’s do a simple query to find all the articles by a particular author, using his e-mail address as the search term.
SELECT * FROM articles INNER JOIN users ON articles.user_id = users.id WHERE users.email = 'firstname.lastname@example.org'; 0.01 seconds
Not surprisingly, this query runs very quickly. After all, there’s only one author, so it doesn’t really matter that our search term (the e-mail address) isn’t in an index.
Let’s take a more complex example. This query finds all article comments for a particular author, including data about which categories the article belongs to:
SELECT * FROM articles INNER JOIN articles_categories ON articles_categories.article_id = articles.id INNER JOIN categories ON articles_categories.category_id = categories.id INNER JOIN users ON articles.user_id = users.id INNER JOIN comments ON comments.article_id = articles.id WHERE users.email = 'email@example.com';0.02 seconds
Again, the query takes almost no time at all. But in reality, it’s performing a very resource-intensive operation called a full-table scan to get the results. The response is quick only because our data is so limited. Consider what we’re asking the database to do:
- Identify a user whose e-mail address is firstname.lastname@example.org.
- Find every article whose
idvalue for that user.
- Find every category whose ID is listed in the
articles_categoriestable alongside an
article_idfrom the list of articles we’ve identified.
- Finally, locate every comment whose
article_idalso matches that list of articles.
Not one of those steps actually looks up a record by its own ID—only by the IDs of other linked records. Since only the ID columns are indexed, our database engine must examine every row in at least some of these tables to complete the search. Fast-forward a couple of years to expand the scenario for our blog: we have 1,000 articles, 15 contributing authors, and an average of 25 comments per article. Let’s repeat our simple query:
SELECT * FROM articles INNER JOIN users ON articles.user_id = users.id WHERE users.email = 'email@example.com';0.65 seconds
The simple query still finishes in under a second, but the change in response time is significant. The delay makes an application start to feel sluggish and undermines the user experience. Even more dramatic, though, is the difference additional data makes to the second query:
SELECT * FROM articles INNER JOIN articles_categories ON articles_categories.article_id = articles.id INNER JOIN categories ON articles_categories.category_id = categories.id INNER JOIN users ON articles.user_id = users.id INNER JOIN comments ON comments.article_id = articles.id WHERE users.email = 'firstname.lastname@example.org';6.69 seconds
Now we’re in dangerous territory. Performed routinely, such long-running queries could hamstring an application. The interface slows down. Server processes stack up, waiting for queries to finish. Browsers may time out waiting for data. Since the database and the web server accept a limited number of simultaneous connections, fewer connections are available for incoming requests while processes wait for a query.
Here’s what happens when we add indices to our tables:
Simple query: 0.01 seconds
Complex query: 0.32 seconds
Note that the simple query finishes just as quickly with 1,000 articles as it did when there were only 15. The complex query isn’t quite as fast, but it’s 20 times faster than before. Without indices, we were scanning every row of some tables, and the response time was directly related to the amount of data being scanned. When we index the specific columns we’re searching, the process becomes far more efficient. Think of it this way—it’s like looking up the word “poultry” in the back of the cookbook instead of flipping through each page and putting a marker on all the chicken dishes. Using indices saves time. The more data searched, the more time is saved by indexing.
What to index
In general, place an index on all foreign keys in a database. Like the
user_id column in the
articles table, a foreign key is a column that references the ID (or primary key) of another table, linking records across tables. Less obviously, indices should also be applied to any column used to limit a query that searches a large number of records. In our blogging example, we might want to index the
users table, because we often need to identify users by their e-mail addresses. If we know the application will never have more than a dozen users, it won’t make much difference, but if we expect the application’s user base to grow, it could be crucial. Every time a user signs in, we might have a query like this:
SELECT password FROM users WHERE email = '$my_email';
Running this query on a table with thousands of records could be problematic without an index on the
ALTER TABLE users ADD INDEX (email);
Most graphical database management tools (phpMyAdmin, for example) offer built-in controls for creating and managing indices. An index can also reference multiple columns together, which is useful when records are often identified by a combination of attributes. To make the following query more efficient, we could index both the
password columns jointly:
SELECT * FROM users WHERE email = '$my_email' AND password = '$my_password';
Know your tools
Understanding indices is particularly important when we rely on frameworks to write SQL for us. Frameworks are useful for a number of reasons, but we need to be careful. Installing Dreamweaver doesn’t obviate the need to understand XHTML and CSS, and building tables with Ruby on Rails migration scripts doesn’t eliminate the need to index them. And while a mature platform such as Wordpress will create well-indexed tables, the plugins we install for it may not. However, it’s also important not to go overboard. Building too many indices can also be problematic, because the server spends time determining which one(s) to use to satisfy a given query. The database must also update these indices when new records are added. As in any aspect of design, create what’s needed to solve the problem—nothing more or less.
The next level
Indexing database tables is an easy way to boost performance, and in many cases offers huge benefits, but it doesn’t solve every problem. Although optimizing queries for performance is a broad topic, there are a few general guidelines that may help.
Identify the problem child
Finding performance bottlenecks can be difficult. A single rendered page might involve several database queries, and responsiveness may vary depending on the specific data being loaded. Fortunately, many database servers will do the heavy lifting and create a log of slow queries. In MySQL, add the following lines to the server configuration file to create a running log of any query that takes longer than a second to run:
long_query_time = 1 log-slow-queries = /var/log/mysqld.slow.log
A line like the following one in the resulting log would reveal an opportunity for better indexing:
# Query_time: 7 Lock_time: 0 Rows_sent: 296 Rows_examined: 75872
This means that the database server examined over 75,000 rows to identify fewer than 300 results. That kind of disproportion usually indicates that full-table scans are happening.
Ask the server to explain itself
Most database servers offer a way to see the execution plan for a query—in other words, how the database thinks through the task we set for it. In MySQL, this is as simple as putting the word
EXPLAIN in front of a query, which we can copy and paste from the log. For example:
EXPLAIN SELECT * FROM articles INNER JOIN articles_categories ON articles_categories.article_id = articles.id INNER JOIN categories ON articles_categories.category_id = categories.id INNER JOIN users ON articles.user_id = users.id INNER JOIN comments ON comments.article_id = articles.id WHERE users.email = 'email@example.com';
In the original scenario with limited indices, the result might look something like this (omitting several columns for clarity):
This tells us a lot about how the database server handles each table included in the query. The
type column shows how rows are matched when tables are joined together. Tables marked
ALL indicate that a full-table scan is used. Columns labeled
key list some indices the server considered potentially useful to satisfy the query, and which, if any, are chosen. The
rows column shows how many rows the server thinks it will need to examine. The total number of potential results is the product of all values in that column (not the sum). In other words, for this query, the server anticipates processing up to 25 million combinations of rows—though the actual number will depend on what the real data looks like.
Extra column contains other hints about how the database processes information. In this case, it shows which data sets are limited by a
WHERE clause. With more complex queries, especially those that involve grouping or sorting operations, we could look for red flags such as the phrases
Using filesort or
Compare the output of
EXPLAIN for the indexed version of our database:
|users||ref||PRIMARY, email||1||Using where|
|articles_categories||ref||category_id, joint_index, article_id||article_id||1|
Notice that not only are the indices used, the order of operations is entirely different. The database server is smart enough to do the best it can with what it’s given, so in the first instance, it developed a plan to work as efficiently as possible without indices. Now that they’re available, the entire approach is different. Multiplying the values in the
rows column now gives us only 1,675 potential results—a tiny fraction of the original set.
Getting data from multiple tables means joining them together based on columns that link records. In the sample queries above, we specified an
INNER JOIN to select only rows matching all the specified conditions. Sometimes it’s helpful to use other join types, such as a
LEFT JOIN, where data can be returned even when there’s no matching row in one of the tables. In general, though, this second approach requires more work by the database.
For example, if we wanted to locate all the articles by a particular author and include category information, we might use a query like this with inner joins:
SELECT * FROM articles INNER JOIN users ON articles.user_id = users.id INNER JOIN articles_categories ON articles_categories.article_id = articles.id INNER JOIN categories ON articles_categories.category_id = categories.id WHERE users.email = 'firstname.lastname@example.org';
As long as our data integrity is strong, this will work well. But what if the application interface doesn’t require an author to list a category when publishing an article? In that case, this query might not give us all the articles we’re looking for by that author. Any articles that aren’t categorized will be left out of the results, because an inner join requires matching data in both tables. We could rewrite the query as follows:
SELECT * FROM articles INNER JOIN users ON articles.user_id = users.id LEFT JOIN articles_categories ON articles_categories.article_id = articles.id LEFT JOIN categories ON articles_categories.category_id = categories.id WHERE users.email = 'email@example.com';
This version will catch all the articles by the author in question, simply substituting
NULL values for missing category data. But it’s a much more intensive operation. In our sample database, the query using inner joins takes 0.06 seconds to complete, compared to 0.29 seconds for the left joins—nearly five times faster. Because they still include everything that’s returned by an inner join, left joins are sometimes used when they aren’t really needed.
Use only what you need
For simplicity, all the sample queries above used
SELECT * to load data, meaning we asked the database to return all the data for the rows we matched. This is almost certainly far more data than we actually need. This is particularly problematic with large text columns, such as blog posts in our
articles table. In the second sample query, we wanted to retrieve all the article comments for a particular author. The way the query is written, we’ll actually load the complete contents of the article (along with a lot of other data) not just once per article, but once per comment. So, if an article has 300 comments, the database will load the full text of that article 300 times. It would be much faster to select only the columns we’re interested in, like this:
SELECT comments.* FROM articles INNER JOIN [...]
In fact, making this change further speeds our earlier sample query by a factor of four, bringing it down to a respectable 0.08 seconds.
The basics are usually good enough
Fortunately, the easy changes often provide the greatest payoff. Deep analysis is usually only required for serious tuning. Following a handful of best practices in database design will improve speed and efficiency for most web applications.
Bear in mind that particular database servers, such as MySQL, PostgreSQL, or Microsoft SQL Server, differ in the way they execute queries and implement indices. It’s best to refer to your platform documentation for details, but the concepts remain the same in most cases. All the examples cited above were performed with MySQL 5.0. It’s also worth noting that there are different subtypes of index with specific properties—like forcing values to be unique or enabling full-text search.
It’s easy to blame sluggish performance on hardware or application code when all too often, the culprit lurks at the database level. If they’re implemented early, before responsiveness becomes a problem, a few small steps can make a huge improvement in your users’ experience.