In 1998, usability expert Rolf Molich (co-inventor with Jakob Nielsen of the heuristic evaluation method) gave nine teams three weeks to evaluate the webmail application www.hotmail.com. The experiment was part of his series of Comparative Usability Evaluations (CUEs), through which he began to identify a set of standards and best practices for usability tests. In each segment of the series, Molich asked several usability teams to evaluate a single design using the method of their choice.
From the documented results of the second test, called CUE-2, a surprising trend appeared. Contrary to claims that usability professionals operate scientifically to identify problems in an interface, usability evaluations are at best less than scientific.
In an interview with Christine Perfetti published in User Interface Engineering, Molich said:
In CUE-4, run in 2003, 17 teams evaluated the Hotel Penn website, which featured a Flash-based reservation system developed by iHotelier. Of the 17 teams, nine ran usability tests, and the remaining eight performed expert reviews.
Collectively, the teams reported 340 usability problems. However, only nine of these problems were reported by more than half of the teams, and 205 problems (60% of all the findings reported) were identified only once. Of the 340 usability problems identified, 61 were classified as “serious” or “critical.”
Think about that for a moment.
For the Hotmail team to have identified all of the “serious” usability problems discovered in the evaluation process, it would have had to hire all nine usability teams. In CUE-4, to spot all 61 serious problems, the Hotel Penn team would have had to hire all 17 usability teams. Seventeen!
Asked how development teams could be confident they are addressing the right problems on their websites, Molich concluded, “It’s very simple: They can’t be sure!”
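To make the overlap figures concrete, here is a minimal sketch of how agreement between teams might be tallied. The team names and problem IDs are hypothetical toy data, not the CUE findings, and the function name `overlap_summary` is an assumption made for illustration. The last line simply rechecks the CUE-4 arithmetic quoted above (205 of 340 problems reported only once).

```python
from collections import Counter


def overlap_summary(findings_by_team):
    """Summarize how much teams agree on the problems they report.

    findings_by_team maps a team name to the set of problem IDs it reported.
    """
    team_count = len(findings_by_team)
    # Count how many teams reported each distinct problem.
    reports = Counter(
        problem
        for problems in findings_by_team.values()
        for problem in set(problems)
    )
    total = len(reports)
    found_once = sum(1 for n in reports.values() if n == 1)
    # "More than half of the teams" means strictly greater than team_count / 2.
    found_by_majority = sum(1 for n in reports.values() if n > team_count / 2)
    return {
        "distinct_problems": total,
        "found_once": found_once,
        "found_once_share": found_once / total if total else 0.0,
        "found_by_majority": found_by_majority,
    }


# Hypothetical toy data: three teams, five distinct problems.
toy_findings = {
    "team_a": {"P1", "P2", "P3"},
    "team_b": {"P1", "P4"},
    "team_c": {"P1", "P2", "P5"},
}
summary = overlap_summary(toy_findings)

# Recheck the CUE-4 figure from the article: 205 of 340 problems found once.
cue4_once_share = 205 / 340  # roughly 0.60
```

Even in this tiny example, three of the five problems are reported by a single team, which is the same pattern the CUE studies show at a larger scale.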
Why usability evaluation is unreliable
Usability evaluations are good for a lot of things, but determining what a team’s priorities should be is not one of them. Fortunately, there is an explanation for these counterintuitive outcomes that can help us choose a more appropriate evaluation course.
Right questions, wrong people, and vice versa
First, different teams get different results because tests and research are often performed poorly: teams either ask the right questions of the wrong people or ask the wrong questions of the right people.
In one recent case, the project goal was to improve usability for a site’s new users. A card-sorting session—a perfectly appropriate discovery method for planning information architecture changes—revealed that the existing, less-than-ideal terminology used throughout the site should be retained. This happened because the team ran the card-sort with existing site users instead of the new users it aimed to entice.
In another case, a team charged with improving the usability of a web application clearly in need of an overhaul ran usability tests to identify major problems. In the end, they determined that the rather poorly-designed existing task flows should not only be kept, but featured. This team, too, ran its tests with existing users, who had—as one might guess—become quite proficient at navigating the inadequate interaction model.
Usability teams also differ widely in experience, skill, talent, and knowledge. Although some research and testing methods have been standardized to the point that anyone should be able to perform them proficiently, a team’s savvy (or lack thereof) can affect the results it gets. That almost anyone can perform a heuristic evaluation doesn’t mean the outcome will always be useful or even accurate. Heuristics are not a checklist; they are guidelines a usability evaluator can use as a baseline from which to apply her expertise. They are a beginning, not an end.
Testing and evaluation are useless without context
Next, while usability testing is perhaps no more reliable a prioritization method than an expert-level, qualitative evaluation performed by a lone reviewer or a small group of reviewers, testing is like any other evaluation or discovery method: It must be put in context, but frequently is not. Page views and time-spent-per-page metrics, while often foolishly considered standard measures of site effectiveness, are meaningless until they are considered in the context of the goals of the pages being visited.
Is a user who visits a series of pages doing so because the task flow is effective, or because he can’t find the content he seeks? Are users spending a lot of time on a page because they’re engaged, or because they’re stuck? While NYTimes.com surely hopes readers will stay on a page long enough to read an article in full or scan all its headlines, Google’s goal is for users to find what they need and leave a search results page as quickly as possible. A lengthy time-spent metric on NYTimes.com could indicate a high-quality or high-value article. For Google’s search workflow, it could indicate a team’s utter failure.
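The point that one metric can mean opposite things on different pages can be sketched in a few lines. This is an illustration only: the `PageGoal` type, the `interpret_time_on_page` function, and the threshold values are all hypothetical assumptions, not measurements from NYTimes.com or Google.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageGoal:
    """What a page is for, which decides how to read time-on-page."""
    name: str
    longer_is_better: bool
    threshold_seconds: float  # hypothetical cutoff, chosen for illustration


def interpret_time_on_page(seconds, goal):
    """Read a time-on-page value against the page's goal."""
    exceeds = seconds >= goal.threshold_seconds
    # A long visit is good news only when the goal rewards long visits.
    if exceeds == goal.longer_is_better:
        return "consistent with goal"
    return "worth investigating"


# Hypothetical goals for two very different kinds of page.
article_page = PageGoal("article", longer_is_better=True, threshold_seconds=60)
search_results = PageGoal("search results", longer_is_better=False, threshold_seconds=30)

same_metric = 180  # three minutes on the page
article_reading = interpret_time_on_page(same_metric, article_page)
search_reading = interpret_time_on_page(same_metric, search_results)
```

The same three-minute visit reads as a success on the article page and as a warning sign on the search results page. Even then, the label is only a prompt for further investigation, since a long visit to an article could still mean the reader is stuck.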
I suspect that the teams Rolf Molich hired were asked to do their evaluations without first going through a discovery process to reveal business goals and user goals, or to determine success metrics. This lack of information may have been responsible for the skewed results. Regardless, these indications of the unreliability of evaluation methods allow us to identify more appropriate research and testing solutions.
What testing is really good for
Malcolm Gladwell’s bestselling book Blink begins with a story about a seemingly ancient marble statue. When several experts in Greek sculpture evaluated it, each pronounced the artifact a fake. Without a shred of scientific evidence, these experts simply looked over the object and saw that it couldn’t possibly have been created during the time period its finders claimed. These experts couldn’t, in most cases, explain their beliefs. They just knew.
The experts were able to do this because they had each spent thousands of hours sharpening their instincts through research and practice. They had studied their craft so much that they spotted the fraud almost instantly, though they were often unable to articulate what gave away the object as a fake.
Usability testing informs the designer and the design
A good usability professional must be able to identify high-priority problems and make appropriate recommendations—and the best evaluators do this quickly and reliably—but a good designer must also be able to design well in the first place. This is one area in which usability testing has real power. It can hone designers’ instincts so they can spot potential usability problems and improve the designs without the cost of formal testing on every project.
And interestingly, many of the most compelling usability test insights come not from the elements that are evaluated, but rather those not evaluated. They come from the almost unnoticeable moments when a user frowns at a button label, or obviously rates a task flow as easier than it appeared during completion, or claims to understand a concept while simultaneously misdefining it. The unintended conclusions—the peripheral insights—are often what feed a designer’s instincts most. Over time, testing sessions can strengthen a designer’s intuition so that she can spot troublesome design details with just a glance. Simply put, usability tests can provide huge insight into the patterns and nuances of human behavior.
This notion alone, however, is unlikely to justify the expense of testing to organizations struggling with profitability. It’s usually only after a company has become successful that testing becomes routine, so designers and usability professionals must rely on other justifications. Fortunately, there are several.
Usability testing justified
First, usability testing has high shock value. Teams invariably conclude their initial sessions surprised to learn they had not noticed glaringly obvious design problems. This shock alone is often enough to drive a team toward a more strategic approach in which it reverts to what should have been the earliest phase of the process: Determining the project’s goals and forming a comprehensive strategy for achieving them. In short, it convinces teams that something is wrong and motivates them to take action. As the saying goes, knowing is half the battle.
Second, testing helps establish trust with stakeholders. For an internal project, testing helps quell management and stakeholder concerns about the validity of a design team’s findings and recommendations. It’s not enough, in other words, to hire experienced practitioners—those practitioners must then prove themselves repeatedly until teams begin to trust their expertise. Testing offers a basis for that trust.
Finally, while testing alone is not a good indicator of where a team’s priorities should lie, it is most certainly part of the triangulation process. When put in the context of other data, such as project goals, user goals, user feedback, and usage metrics, testing helps establish a complete picture. Without this context, however, testing can be misleading or misunderstood at best, and outright damaging at worst. This is also true for non-testing-based evaluation methods, such as heuristic reviews.
Adapting to the reality
There is a catch to all of the preceding arguments, however: They revolve around the notion that testing should be used primarily to identify problems with existing designs. This is where teams get into trouble—they assume testing is worth more than it truly is, resolve to address problems based purely on testing data, and revise strategies based entirely on comments made by test participants. None of these things reliably lead to positive outcomes, nor do they ensure a team will emerge from the process any wiser than the day before.
As we’ve seen, test results and research can point teams toward solutions that are not only ill-advised, but in direct conflict with their goals. It’s only natural that existing users perform tasks capably and comfortably despite poor task design. After all, the most usable application is the one you already know. But this doesn’t mean poor designs should not be revamped. Rather, to adapt to and harness the power of usability testing, current users should be brought in to test new ideas—ideas that surface from expert evaluation and collaboration with designers to create new solutions.
What they should have done
The team that ran the card-sort in the earlier example should have devised a new set of terms and used testing to validate them, rather than ask users to determine which terms to apply in the first place.
The team that decided to feature poorly-designed task flows because its existing audience could proficiently use them should have prototyped new task flows and run test sessions to validate usability with existing and first-time users.
To identify problems on which to focus, these teams, and yours, can take a variety of approaches. Consider a revised workflow that begins with an expert-level heuristic evaluation used in conjunction with informal testing methods, followed by informal and formal testing. More specifically, consider using online tools and paid services to investigate hunches, then use more formal methods to test and validate revised solutions that involve a designer’s input.
Here are several tools that can be used with a heuristic evaluation to identify trouble spots:
- Five-second tests: Show a screen to a user for five seconds and ask her to write down everything she remembers. For task-focused screens, ask the user how she would perform a core task, then show her the screen and ask for her answer. Five-second tests can be run online using the free service www.fivesecondtest.com.
- Click stats: Use Crazy Egg to track clicks on specific pages on live sites. These metrics can shed light on whether or not an ad is effective, a task flow is clear, or a bit of instructive micro-copy is helpful.
- Usability testing services: User Testing locates participants according to demographic requirements you set, has them complete the tasks you identify, and sends you the results, complete with a screen recording of each test session, for $29 per participant.
- Click stats on screenshots: Chalkmark offers essentially the same service as Crazy Egg, but uses screenshots rather than live pages. This way, you can analyze a screen’s usability before the design goes live, which is, of course, the best time to do it.
In handling usability projects in this way, teams will identify priorities and achieve better outcomes, and can still gain all the benefits of being actively involved with usability tests.
The major caveat to all of these methods is that users who are invested in completing a task act very differently than those who are not. A test participant who really wants to buy a digital camera will behave differently on a commerce site than a participant whose only motivation is to be compensated. Those who are invested in the tasks will persevere through far more problems than those who are not. When using any of these methods, it’s important to try to find participants who actually want to complete the very tasks you wish to evaluate.
Obviously, not every team or organization can bear the expense of usability testing. In the end, you can do only what’s most feasible in your particular situation. But if testing is an option—whether as a one-time experiment or already part of your regular routine—be sure to use the tool for the right job, and be sure to approach the process with clear expectations.
Usability professionals may prefer that Molich’s story be kept quiet, not because it delegitimizes the profession, but because it can be easily misunderstood if told outside of its context. Usability testing fails wholly to do what many people think is its most relevant purpose, which is to identify problems and point a team in the right direction. But it does provide a direct path for observing human behavior, it does a brilliant job of informing a designer’s instincts over time, it builds trust with stakeholders, and it’s a very effective tool for validating design ideas.
Test for the right reasons and you stand a good chance of achieving a positive outcome. Test for the wrong ones, however, and you may not only produce misleading results, but also put your entire business at risk.