Site search often receives the most traffic, yet it’s also the place where the user experience designer has the least influence. Few tools exist to appraise the quality of the search experience, much less strategize ways to improve it. When it comes to site search, user experience designers are often sidelined like the single person at an old flame’s wedding: Everything seems to be moving along without you, and if you slipped out halfway through, chances are no one would notice. But relevancy testing and precision testing offer hope. These are two tools you can use to analyze and improve the search user experience.
You’ve already got everything you need
The search engine itself provides the critical resource you need to run these tests—the report of the most commonly submitted queries. These are all the search strings users have entered, exactly as they typed them, in descending order of popularity. Figure one shows such a report from the Michigan State University (MSU) website.
Fig. 1: Report of the most commonly submitted queries.
When you look at your full report, you’ll notice a small number of searches were submitted very often, and an enormous number were submitted just a few times. This is the universally observed “short head” / “long tail” described by the Zipf curve (Figure two). This means that you already know what most people are trying to find. Simply type those searches into the search engine to find out what visitors get. All you need now is a method (or two) to evaluate that experience. Enter relevancy testing and precision testing.
Fig. 2: Diagram of a Zipf Curve showing unique search phrases.
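To check how much of your own traffic the short head covers, you can total the counts in the report. The following is a minimal Python sketch, not part of the original toolkit: the `head_coverage` function and the sample counts are hypothetical, and it assumes you can export the report as a mapping of query string to number of submissions.

```python
def head_coverage(query_counts, top_n):
    """Return the share of all searches covered by the top_n most popular queries.

    query_counts: dict mapping each query string to how often it was submitted
    (an assumed export format for the popular-queries report).
    """
    total = sum(query_counts.values())
    if total == 0:
        return 0.0
    # Sort the counts in descending order of popularity and keep the head.
    head = sorted(query_counts.values(), reverse=True)[:top_n]
    return sum(head) / total


# Hypothetical report data, for illustration only.
sample_report = {
    "map": 40,
    "library": 25,
    "parking": 15,
    "registrar": 10,
    "football": 5,
    "quilt tours": 3,
    "dorm wifi": 2,
}

share = head_coverage(sample_report, top_n=3)
print(f"Top 3 queries cover {share:.0%} of all searches")
```

If a small number of queries covers a large share of searches, testing only those queries already describes what most visitors experience.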
A relevancy evaluation measures how reliably the best result ranks at the very top of the search results. Think of this as a “simple question, simple answer” test. How well does the search engine do when people submit the clearest search queries? The best results should appear at the top of the list.
Relevancy testing: step one
To start, go down the list of popular searches and select the phrases where you feel very confident about the user’s intended target. Skip any searches for which:
- There’s more than one best target. For example, a search on the MSU website for “registrar” could either refer to the University registrar or to the Law School registrar, which have different pages. Neither is more plausible than the other.
- There is no relevant content on the site. “Football” has a single clear best target, but it’s hosted on a separate site that’s not accessible to the university’s search engine. That’s a problem, but you can’t hold it against the search engine.
- You’re not sure what the user is trying to find. “Parking” is a common search on MSU, but it’s vague. You don’t know if the user wants information on student parking, event parking, parking registration, visitor parking, or parking tickets.
Be sure to select enough phrases to demonstrate statistical significance. For our tests, we chose 80 phrases. (Remember that the searches at the top of the list account for a proportionally greater share of all searches.)
Actual intention vs. apparent intention
It’s important to understand that your subjective judgment of the user’s intention affects the test results. Isn’t that a problem? Does this subjective judgment make our tests less reliable?
Well, we can consider intention in two ways: First, there’s actual intention, which is the result the user really wanted when she typed in the search. And, of course, you can never glean actual intention from the search logs. Second, there’s apparent intention, which is how a reasonable person would interpret a search phrase. This is a critical point, because the search engine cannot be expected to do a better job than a human being. It’s not magic, but it’s fair to hold the search engine to that human standard, because that’s how well users expect it to perform.
This is why it’s important to keep only the phrases where you feel very confident about the user’s intended meaning—the well-phrased queries. If there’s any doubt about what the user wanted, skip it. There’s no shortage of phrases on the list.
Relevancy testing: step two
Next, submit each phrase to the site’s search engine. If the search engine worked perfectly, it would return that single best target as the very first result every time. Pinpoint where the best target actually falls in the list, and count how many places it sits from the first position.
To help you conduct relevancy testing on your site, I prepared a spreadsheet you can use to enter your test phrases, their targets, and the position of the best match in the search results. The “report” tab automatically calculates bottom-line relevance metrics including mean rank, median rank, and how many times the target falls below the first, fifth, and tenth positions in the list. Use the scores from MSU’s search engine as benchmarks. This is a great way to present simple quality measures to the development team and to management. Here’s an example of a completed spreadsheet for reference.
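If you would rather script the calculation, the same bottom-line metrics can be approximated in a few lines of Python. Treat this as a sketch under stated assumptions rather than a description of the spreadsheet’s actual formulas: the function name and sample ranks are hypothetical, positions are recorded as 1-based ranks, and every best target is assumed to appear somewhere in the results.

```python
from statistics import mean, median


def relevancy_report(ranks):
    """Summarize where the best target landed for each test phrase.

    ranks: list of 1-based positions of the best target, one per phrase
    (assumes every target was found somewhere in the results).
    """
    return {
        "mean_rank": mean(ranks),
        "median_rank": median(ranks),
        # How often the best target fell below a given position.
        "below_first": sum(1 for r in ranks if r > 1),
        "below_fifth": sum(1 for r in ranks if r > 5),
        "below_tenth": sum(1 for r in ranks if r > 10),
    }


# Hypothetical ranks for eight test phrases, for illustration only.
sample_ranks = [1, 1, 2, 1, 7, 3, 12, 1]

for metric, value in relevancy_report(sample_ranks).items():
    print(f"{metric}: {value}")
```

The median is worth reporting alongside the mean because a single badly ranked target can pull the mean rank up sharply.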
Shortcomings of relevancy testing
While relevancy testing is useful, it tells an incomplete story. It leaves out the many search phrases where you were uncertain of the user’s intention. Additionally, it focuses entirely on finding a single best target and ignores the quality of the other results returned. Precision testing closes these gaps and, along with relevancy testing, tells a compelling story of the search experience.
Think of a precision test as an archery competition: Each arrow counts, and no one expects them all to be dead on target (as long as they’re not flying off into the crowd). But the closer they come, the better the archer. Similarly, precision counts all of the results that the search engine returns, and asks how close they come to the target idea.
Within the context of information retrieval we define precision as:
Precision = Number of relevant results / Total number of results
Precision testing simply asks: “How many of the search engine results are of good quality?” So rather than look at the location of a single best target, precision testing judges how reasonable each returned result is. We don’t need to examine every result returned, just the few that most users will look at. For our testing, we limited the evaluation to the top five results.
Precision testing: step one
As with relevancy testing, start with the list of the most popular searches. But this time, don’t eliminate any of them. If we need 80 strings for statistical significance, then we take strings 1 through 80. The spreadsheet contains a tab called “precision,” where you can paste in your list.
Precision testing: step two
Try each string in the site’s search engine, and then click through to each of the top five results. In each case, ask yourself: “How reasonable was it for the search engine to return this page based on what I entered?” Remember, you’re not after the user’s actual intention, which may not even be knowable. Instead, you’re evaluating the extent to which the answer relates to the question.
Score the relevance of each of the results on a four-point scale, using one letter code per rating:
- Relevant: Based on the information the user provided, the page is completely relevant. This is the best score you can give, and means that the result is exactly right.
- Near: The page is not a perfect match, but it is clearly reasonable for it to be ranked highly. No one would be surprised that the search term brought back such a result.
- Misplaced: You can see why the search engine returned the result, but it clearly shouldn’t be ranked highly. For example, a search for “bookstore” on MSU’s website returns the biography of a person who once worked at the bookstore. (Figure three.) Right word, wrong idea.
- Irrelevant: The result has no apparent relationship to the user’s search. Searching the MSU site for “football schedule” returns information about behind-the-scenes tours of the Great Lakes Quilt Center collection. (Figure four.) A user could reasonably conclude that the search engine is off its rocker.
Fig. 3: The search engine returned this bookstore employee biography for the query “bookstore.” We rate this result “misplaced” according to our rating scale.
Fig. 4: The search engine returned this result for the query “football schedule.” We rate this result “irrelevant” according to our rating scale.
Use the letter codes R, N, M, and I to record your scores in the spreadsheet for each of the top five results for each string. You’ll find it helpful to use a mnemonic to remember them. (I’m partial to “Ralph Nader Makes Igloos,” but feel free to invent your own.)
You can evaluate precision in several ways, depending on what you consider to be acceptable. I apply three standards to reflect the range of tolerance:
- Strict: Accept only the results rated R, for completely relevant. A perfect score by this standard is ultimately impossible to attain, because perfect matches sometimes aren’t even available.
- Loose: Accept both Rs and Ns. This is more realistic, and a reasonable expectation to set for a search engine.
- Permissive: Accept Rs, Ns, and Ms. This is the bare minimum to which the search engine should perform, because it means that no crazy results were returned.
As you enter the scores into the spreadsheet, you’ll see that it automatically calculates precision by all three standards for each string inline, while the “report” tab aggregates the scores across the entire list.
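For readers who want to see the arithmetic, here is a minimal Python sketch of how the three standards turn letter codes into precision figures. The names (`ACCEPTED`, `precision`, `aggregate_precision`) and the sample scores are hypothetical. Averaging the per-query precision to get the overall figure is an assumption on my part; the spreadsheet’s “report” tab may aggregate differently.

```python
# Letter codes accepted as "relevant" under each standard.
ACCEPTED = {
    "strict": {"R"},
    "loose": {"R", "N"},
    "permissive": {"R", "N", "M"},
}


def precision(scores, standard):
    """Precision for one query: accepted results / total results scored."""
    if not scores:
        return 0.0
    accepted = ACCEPTED[standard]
    return sum(1 for s in scores if s in accepted) / len(scores)


def aggregate_precision(scored_queries, standard):
    """Average the per-query precision across all queries (assumed aggregation)."""
    if not scored_queries:
        return 0.0
    values = [precision(scores, standard) for scores in scored_queries.values()]
    return sum(values) / len(values)


# Hypothetical scores for the top five results of two queries.
sample_scores = {
    "query one": ["R", "N", "M", "N", "I"],
    "query two": ["R", "R", "N", "I", "I"],
}

for standard in ("strict", "loose", "permissive"):
    overall = aggregate_precision(sample_scores, standard)
    print(f"{standard}: {overall:.2f}")
```

Because each standard accepts a superset of the previous one, strict precision can never exceed loose precision, and loose can never exceed permissive. A large gap between the strict and permissive figures suggests many results that match the words of a query but not its idea.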
Taken together, relevancy and precision tell a compelling story of the quality of the search experience. Moreover, they bring user experience designers into search analysis where conventional qualitative methods leave us standing at the periphery.
At my organization, we used these metrics to identify weaknesses in the configuration of our search engine, and as a yardstick to track improvement as we implemented optimization, best bets, and a thesaurus. Using these tools, our designers were able to show the need for change and demonstrate the effectiveness of those changes as they were made.
Site search shouldn’t be viewed as purely a technology problem. Designers have a direct role to play in the marriage of search and user experience; we just need to apply techniques that expose the real problems we experience when we search for information so that we can fix them.