Testing Search for Relevancy and Precision
Issue № 292


Although site search often receives the most traffic, it’s also the place where the user experience designer has the least influence. Few tools exist to appraise the quality of the search experience, much less to strategize ways to improve it. When it comes to site search, user experience designers are often sidelined like the single person at an old flame’s wedding: Everything seems to be moving along without you, and if you slipped out halfway through, chances are no one would notice. But relevancy testing and precision testing offer hope. These are two tools you can use to analyze and improve the search user experience.


You’ve already got everything you need

The search engine itself provides the critical resource you need to run these tests—the report of the most commonly submitted queries. These are all the search strings users have entered, exactly as they typed them, in descending order of popularity. Figure one shows such a report from the Michigan State University (MSU) website.

Fig. 1: Report of the most commonly submitted queries.

When you look at your full report, you’ll notice that a small number of searches were submitted very often, and an enormous number were submitted only a few times each. This is the universally observed “short head” / “long tail” distribution described by the Zipf curve (Figure two). It means you already know what most people are trying to find. Simply type those searches into the search engine to see what visitors get. All you need now is a method (or two) to evaluate that experience. Enter relevancy testing and precision testing.

Fig. 2: Diagram of a Zipf Curve showing unique search phrases.
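If your search engine doesn’t generate this report for you, it’s straightforward to build one from a raw query log. The sketch below is only illustrative: it assumes a hypothetical plain-text log with one query per line, so adapt the parsing to whatever your engine actually records.

```python
from collections import Counter

def build_query_report(log_path, top_n=100):
    """Tally search strings exactly as users typed them,
    in descending order of popularity."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            query = line.strip()
            if query:
                counts[query] += 1
    return counts.most_common(top_n)

# Print the "short head" of the distribution
for query, hits in build_query_report("search-log.txt", top_n=20):
    print(f"{hits:6d}  {query}")
```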

Relevancy testing

A relevancy evaluation measures how reliably the best result ranks at the very top of the search results. Think of this as a “simple question, simple answer” test: How well does the search engine do when people submit the clearest search queries? The single best result should appear at the top of the list.

Relevancy testing: step one

To start, go down the list of popular searches and select the phrases where you feel very confident about the user’s intended target. Skip any searches for which:

  • There’s more than one best target. For example, a search on the MSU website for “registrar” could either refer to the University registrar or to the Law School registrar, which have different pages. Neither is more plausible than the other.
  • There is no relevant content on the site. “Football” has a single clear best target, but it’s hosted on a separate site that’s not accessible to the university’s search engine. That’s a problem, but you can’t hold it against the search engine.
  • You’re not sure what the user is trying to find. “Parking” is a common search on MSU, but it’s vague. You don’t know if the user wants information on student parking, event parking, parking registration, visitor parking, or parking tickets.

Be sure to select enough phrases for the results to be statistically meaningful. For our tests, we chose 80 phrases. (Remember that the searches at the top of the list account for a proportionally greater share of all searches.)

Actual intention vs. apparent intention

It’s important to understand that your subjective judgment of the user’s intention affects the test results. Isn’t that a problem? Doesn’t this subjectivity make the tests less reliable?

Well, we can consider intention in two ways. First, there’s actual intention: the result the user really wanted when she typed in the search. Of course, you can never glean actual intention from the search logs. Second, there’s apparent intention: how a reasonable person would interpret a search phrase. This is the critical distinction, because the search engine cannot be expected to interpret a query better than a human being would. It’s not magic, but it’s fair to hold the search engine to that human standard, because that’s how well users expect it to perform.

This is why it’s important to keep only the phrases where you feel very confident about the user’s intended meaning—the well-phrased queries. If there’s any doubt about what the user wanted, skip it. There’s no shortage of phrases on the list.

Relevancy testing: step two

Next, submit each phrase to the site’s search engine. If the search engine worked perfectly, it would return that single best target as the very first result every time. Pinpoint where the best target actually falls in the list, and count how many positions it is from the top.

To help you conduct relevancy testing on your site, I prepared a spreadsheet you can use to enter your test phrases, their targets, and the position of the best match in the search results. The “report” tab automatically calculates bottom-line relevance metrics including mean rank, median rank, and how many times the target falls below the first, fifth, and tenth positions in the list. Use the scores from MSU’s search engine as benchmarks. This is a great way to present simple quality measures to the development team and to management. Here’s an example of a completed spreadsheet for reference.
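If you’d rather script the summary than use the spreadsheet, here’s a minimal sketch of the same bottom-line metrics. It assumes you’ve recorded the 1-based position of each best target, using None when the target never appeared at all.

```python
from statistics import mean, median

def relevancy_report(ranks):
    """Summarize where the best target landed for each test phrase."""
    found = [r for r in ranks if r is not None]
    total = len(ranks)

    def share_below(cutoff):
        # Targets that rank below the cutoff, or never appear, count against us
        return sum(1 for r in ranks if r is None or r > cutoff) / total

    return {
        "mean rank": round(mean(found), 2),
        "median rank": median(found),
        "below 1st": share_below(1),
        "below 5th": share_below(5),
        "below 10th": share_below(10),
    }

# Example: positions of the best target for eight test phrases
print(relevancy_report([1, 1, 3, 7, 1, 12, None, 2]))
```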

Shortcomings of relevancy testing

While relevancy testing is useful, it tells an incomplete story. You’ll have skipped the many search phrases where you were uncertain of the user’s intention, and the test focuses entirely on finding a single best target, ignoring the quality of the other results returned. Precision testing closes these gaps and, along with relevancy testing, tells a compelling story of the search experience.

Precision testing

Think of a precision test as an archery competition: Each arrow counts, and no one expects them all to be dead on target (as long as they’re not flying off into the crowd).  But the closer they come, the better the archer. Similarly, precision counts all of the results that the search engine returns, and asks how close they come to the target idea.

Within the context of information retrieval we define precision as:

Precision = Number of relevant results / Total number of results

Precision testing simply asks: “How many of the search engine’s results are of good quality?” So rather than look at the location of a single best target, precision testing evaluates how reasonable each of the returned results is. That doesn’t mean examining every result the engine returns, just the few that most users will actually look at. For our testing, we limited it to the top five results.
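As a concrete illustration of the formula above, restricted to the top five results (the only assumption here is how you choose to record your own relevance judgments):

```python
def precision_at_k(relevant_flags, k=5):
    """Precision = relevant results / total results, limited to the top k."""
    top = relevant_flags[:k]
    return sum(top) / len(top) if top else 0.0

# Of the top five results, the 1st, 2nd, and 4th were judged relevant
print(precision_at_k([True, True, False, True, False]))  # 0.6
```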

Precision testing: step one

As with relevancy testing, start with the list of the most popular searches. But this time, don’t eliminate any of them. If we need 80 strings for statistical significance, then we take strings 1 through 80. The spreadsheet contains a tab called “precision,” where you can paste in your list.

Precision testing: step two

Try each string in the site’s search engine, and then click through to each of the top five results. In each case, ask yourself: “How reasonable was it for the search engine to return this page based on what I entered?” Remember, you’re not after the user’s actual intention, which may not even be knowable. Instead, you’re evaluating the extent to which the answer relates to the question.

Score the relevance of each of the results on a four-letter scale:

  • Relevant: Based on the information the user provided, the page’s ranking is completely relevant. This is the best score you can give, and means that the result is exactly right.
  • Near: The page is not a perfect match, but it is clearly reasonable for it to be ranked highly. No one would be surprised that the search term brought back such a result.
  • Misplaced: You can see why the search engine returned the result, but it clearly shouldn’t be ranked highly.  For example, a search for “bookstore” on MSU’s website returns the biography of a person who once worked at the bookstore. (Figure three.) Right word, wrong idea.
  • Irrelevant: The result has no apparent relationship to the user’s search. Searching the MSU site for “football schedule” returns information about behind-the-scenes tours of the Great Lakes Quilt Center collection. (Figure four.) A user could reasonably conclude that the search engine is off its rocker.

Fig. 3: The search engine returned this bookstore employee biography for the query “bookstore.” We rate this result “misplaced” according to our rating scale.

Fig. 4: The search engine returned this result for the query “football schedule.” We rate this result “irrelevant” according to our rating scale.

Use the letter codes R, N, M, and I to record your scores in the spreadsheet for each of the top five results for each string. You’ll find it helpful to use a mnemonic to remember them. (I’m partial to “Ralph Nader Makes Igloos,” but feel free to invent your own.)

Calculating precision

You can evaluate precision in several ways, depending on what you consider to be acceptable. I apply three standards to reflect the range of tolerance:

  • Strict: Accept only the results ranked R, for completely relevant. This is ultimately impossible to attain, because perfect matches sometimes aren’t even available.
  • Loose: Accept both Rs and Ns. This is more realistic, and a reasonable expectation to set for a search engine.
  • Permissive: Accept Rs, Ns, and Ms. This is the bare minimum the search engine should meet, because passing it only means that no crazy results were returned.

As you enter the scores into the spreadsheet, you’ll see that it automatically calculates precision by all three standards for each string inline, while the “report” tab aggregates the scores across the entire list.
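If you prefer to script this step, a small sketch like the one below mirrors the same calculation. It assumes you’ve recorded each test string’s top-five scores as a short string of letter codes, such as “RNNMI”.

```python
# Letter codes that count as acceptable under each standard
STANDARDS = {
    "strict": {"R"},
    "loose": {"R", "N"},
    "permissive": {"R", "N", "M"},
}

def precision_scores(codes):
    """Precision of one result set (e.g., "RNNMI") under all three standards."""
    codes = codes.upper()
    return {name: sum(1 for c in codes if c in ok) / len(codes)
            for name, ok in STANDARDS.items()}

def aggregate(all_codes):
    """Average the per-string scores across the whole test."""
    rows = [precision_scores(c) for c in all_codes]
    return {name: sum(row[name] for row in rows) / len(rows) for name in STANDARDS}

# Example: letter scores for three test strings, top five results each
print(aggregate(["RRNMI", "RNNNR", "MIRNN"]))
```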

Conclusion

Taken together, relevancy and precision tell a compelling story of the quality of the search experience. Moreover, they bring user experience designers into search analysis where conventional qualitative methods leave us standing at the periphery.

At my organization, we used these metrics to identify weaknesses in the configuration of our search engine, and as a yardstick to track improvement as we implemented optimization, best bets, and a thesaurus.  Using these tools, our designers were able to show the need for change and demonstrate the effectiveness of those changes as they were made.

Site search shouldn’t be viewed as purely a technology problem. Designers have a direct role to play in the marriage of search and user experience; we just need to apply techniques that expose the real problems we experience when we search for information so that we can fix them.

About the Author

John Ferrara

John Ferrara is a senior information architect at Vanguard in the Philadelphia suburbs. He has presented at the IA Summit, EuroIA, and the Enterprise Search Summit. He blogs at www.worldwideintertubes.com.

11 Reader Comments

  1. John,

    Shouldn’t a good search engine provide disambiguation help? In your example of “parking”, Google would suggest the top relevant related searches to automagically refine the user’s query.

    Jim

  2. Good article, thanks. I wish you could expand on “[…] we used these metrics to identify weaknesses in the configuration of our search engine, and as a yardstick to track improvement as we implemented optimization, best bets, and a thesaurus.”

  3. Hi Jim,

    Thanks for the question. Disambiguation or “narrow your results” is a technology extension that supplements the core function of the search engine. But the methods I discuss here focus specifically on the quality of the core relevancy calculation, because it’s the basis of a quality search experience.

    Functional add-ons (narrower, broader, filters, similar results, and so on) are often great additions to a search engine, but not always. They’re sometimes implemented as technology crutches without regard to whether they solve any actual problems. Sometimes they’re helpful in only a few circumstances and otherwise just clutter up the results page. What’s more, people often won’t use a function that requires additional work from them. Finally, not all designers are working with products that have those kinds of capabilities.

    At its heart, a search engine has to be good at judging the relevance of documents to a user’s query. That’s really what you bought it to do. So to measure that accurately, we need to temporarily set aside the effect of functions that might (or might not!) be used to improve the search once it’s already been submitted.

    John

  4. Thanks so much for the question.

    One of the great things about working with quantitative numbers is that they allow you to summarize a complex problem in a very concise way. While search is a qualitative experience, the methods I describe here explain it in clear, simple numbers.

    So for example, after completing the evaluation you can present the results like this:

    – Our mean relevancy score is currently 5.7, and we want to bring that down to 2.5.
    – Currently, 11% of the best matches fall below the 10th position. We want to reduce that to 5%.
    – By the loose standard, our precision score is currently 63%. We want to bring that up to 75%.
    – By the permissive standard, our precision score is currently 89%. We want to bring that up to 98%.

    These metrics create a compelling case for further work to improve the quality of the search experience, and suggest the type of work that needs to be done. For example, solutions like engine tuning, a thesaurus, and spellcheck improve the quality of all searches, while optimization and best bets fix stubborn outliers that remain problematic.

    In the past, I’ve used these methods to set objectives and create improvement plans in just this way. Not only has it been effective, but we’ve often significantly overshot the improvement target, resulting in a screamingly great search experience.

  5. Hi, that was a clear and helpful article! My site has very few visits, but still it’s good to know a good way to analyze site search experience.

    I would just like to point out that the first spreadsheet link appears to be broken, as it points to an invalid destination (http://d/).

  6. Thank you for this article! These were things I had not thought about until now, but I get the impression that I have learned something ;). I am sure it will be helpful for me when I have to maintain bigger websites.

  7. I’ve run into the issue of scoring a result set for usability evaluation before (using different interfaces for complex queries, but it’s the same problem). One of the things I relied on is the typical hit-and-run behavior we know Google users exhibit: if the right result is not within the first 10 results, users would rather re-query than go to the next page. So results after the tenth are typically unimportant. The good old “precision” used to tweak engines is less important for these reasons, and less useful in this kind of evaluation.

    In order to have a user-based score for the results to a given query, one could count the relevant results within the top 10 and use the position of each of those results to create an aggregate score for the result set. Say the 1st and 3rd results are relevant out of 100 results: you could give a score of (1/1 + 2/3)/10 ≈ 0.17. If the second relevant result had been at position 2 instead, the score would have been 0.2, and so on.

    It would be even better to have a couple of people evaluate the results for relevancy, instead of relying on a single, subjective evaluator.

  8. Thanks for the interesting article.
    Sometimes it is also a good idea to split the search results into one section that is manually predetermined when a certain keyword appears in the query, and another section that is generated entirely automatically by the search engine. That way, for keywords that are searched quite often, you can at least present the most relevant results.

  9. I think this is an excellent step-by-step explanation of how to evaluate recall, precision, and relevance. It’s not just metrics; it explains how to use the research. Very helpful.

    I believe Ledderman above is referring to Search Suggestions / Best Bets. These are particularly useful in cases like your Football example, where a static link to the sports department would serve the users.
