The Myth of Usability Testing
Issue № 294

In 1998, usability expert Rolf Molich (co-inventor with Jakob Nielsen of the heuristic evaluation method) gave nine teams three weeks to evaluate the webmail application www.hotmail.com. The experiment was part of his series of Comparative Usability Evaluations (CUEs), through which he began to identify a set of standards and best practices for usability tests. In each segment of the series, Molich asked several usability teams to evaluate a single design using the method of their choice.

From the documented results of the second test, called CUE-2, a surprising trend appeared. Contrary to claims that usability professionals operate scientifically to identify problems in an interface, usability evaluations are at best less than scientific.

In an interview with Christine Perfetti published in User Interface Engineering, Molich said:

The CUE-2 teams reported 310 different usability problems. The most frequently reported problem was reported by seven of the nine teams. Only six problems were reported by more than half of the teams, while 232 problems (75 percent) were reported only once. Many of the problems that were classified as “serious” were only reported by a single team. Even the tasks used by most or all teams produced very different results—around 70 percent of the findings for each of these common tasks were unique.

In CUE-4, run in 2003, 17 teams evaluated the Hotel Penn website, which featured a Flash-based reservation system developed by iHotelier. Of the 17 teams, nine ran usability tests, and the remaining eight performed expert reviews.

Collectively, the teams reported 340 usability problems. However, only nine of these problems were reported by more than half of the teams. And a total of 205 problems (60 percent of all the findings reported) were identified only once. Of the 340 usability problems identified, 61 were classified as “serious” or “critical.”

Think about that for a moment.

For the Hotmail team to have identified all of the “serious” usability problems discovered in the evaluation process, it would have to have hired all nine usability teams. In CUE-4, to spot all 61 serious problems, the Hotel Penn team would have to have hired all 17 usability teams. Seventeen!

Asked how development teams could be confident they are addressing the right problems on their websites, Molich concluded, “It’s very simple: They can’t be sure!”
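
The overlap figures above are easy to reproduce for any group of evaluations. What follows is a minimal Python sketch, not Molich’s own method: it assumes each team’s report has been reduced to a set of problem identifiers, and the team names, problem IDs, and the `overlap_summary` helper are all hypothetical. The last two lines simply recheck the percentages quoted above.

```python
from collections import Counter


def overlap_summary(reports):
    """Summarize how much the teams' findings overlap.

    reports: dict mapping a team name to the set of problem ids it reported.
    """
    # Count how many teams reported each distinct problem.
    counts = Counter(pid for problems in reports.values() for pid in problems)
    total = len(counts)
    n_teams = len(reports)
    reported_once = sum(1 for c in counts.values() if c == 1)
    by_majority = sum(1 for c in counts.values() if c > n_teams / 2)
    return {
        "total_problems": total,
        "reported_once": reported_once,
        "reported_once_share": reported_once / total if total else 0.0,
        "reported_by_majority": by_majority,
    }


# Hypothetical findings from three teams (not CUE data).
reports = {
    "team_a": {"P1", "P2", "P3"},
    "team_b": {"P1", "P4"},
    "team_c": {"P1", "P2", "P5"},
}
summary = overlap_summary(reports)

# Figures quoted in the article: problems reported by only one team.
cue2_share = 232 / 310  # CUE-2, roughly 75 percent
cue4_share = 205 / 340  # CUE-4, roughly 60 percent
```

Run against a team’s own findings, a summary like this shows how much of a problem list rests on a single evaluator’s report.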

Why usability evaluation is unreliable

Usability evaluations are good for a lot of things, but determining what a team’s priorities should be is not one of them. Fortunately, there is an explanation for these counterintuitive outcomes that can help us choose a more appropriate evaluation course.

Right questions, wrong people, and vice versa

First, different teams get different results because tests and research are often performed poorly: teams either ask the right questions of the wrong people or ask the wrong questions of the right people.

In one recent case, the project goal was to improve usability for a site’s new users. A card-sorting session—a perfectly appropriate discovery method for planning information architecture changes—revealed that the existing, less-than-ideal terminology used throughout the site should be retained. This happened because the team ran the card-sort with existing site users instead of the new users it aimed to entice.

In another case, a team charged with improving the usability of a web application clearly in need of an overhaul ran usability tests to identify major problems. In the end, they determined that the rather poorly-designed existing task flows should not only be kept, but featured. This team, too, ran its tests with existing users, who had—as one might guess—become quite proficient at navigating the inadequate interaction model.

Usability teams also differ widely in experience, skill, talent, and knowledge. Although some research and testing methods have been homogenized to the point that anyone should be able to perform them proficiently, a team’s savvy (or lack thereof) can affect the results it gets. That almost anyone can perform a heuristic evaluation doesn’t mean the outcome will always be useful or even accurate. Heuristics are not a checklist; they are guidelines a usability evaluator can use as a baseline from which to apply her expertise. They are a beginning, not an end.

Testing and evaluation is useless without context

Next, while usability testing is perhaps no more reliable a prioritization method than an expert-level, qualitative evaluation performed by a lone reviewer or a small group of reviewers, testing is like any other evaluation or discovery method: It must be, but frequently is not, put in context. Page views and time-spent-per-page metrics, while often foolishly considered standard measures of site effectiveness, are meaningless until they are considered in context of the goals of the pages being visited.

Is a user who visits a series of pages doing so because the task flow is effective, or because he can’t find the content he seeks? Are users spending a lot of time on a page because they’re engaged, or because they’re stuck? While NYTimes.com surely hopes readers will stay on a page long enough to read an article in full or scan all its headlines, Google’s goal is for users to find what they need and leave a search results page as quickly as possible. A lengthy time-spent metric on NYTimes.com could indicate a high-quality or high-value article. For Google’s search workflow, it could indicate a team’s utter failure.

I suspect that the teams Rolf Molich hired were asked to do their evaluations without first going through a discovery process to reveal business goals and user goals, or to determine success metrics. This lack of information may have been responsible for the skewed results. Regardless, these indications of the unreliability of evaluation methods allow us to identify more appropriate research and testing solutions.

What testing is really good for

Malcolm Gladwell’s bestselling book Blink begins with a story about a seemingly ancient marble statue. When several experts in Greek sculpture evaluated it, each pronounced the artifact was a fake. Without a shred of scientific evidence, these experts simply looked over the object and saw that it couldn’t possibly have been created during the time period its finders claimed. These experts couldn’t, in most cases, explain their beliefs. They just knew.

The experts were able to do this because they had each spent thousands of hours sharpening their instincts through research and practice. They had studied their craft so much that they spotted the fraud almost instantly, though they were often unable to articulate what gave away the object as a fake.

Usability testing informs the designer and the design

A good usability professional must be able to identify high-priority problems and make appropriate recommendations—and the best evaluators do this quickly and reliably—but a good designer must also be able to design well in the first place. This is one area in which usability testing has real power. It can hone designers’ instincts so they can spot potential usability problems and improve the designs without the cost of formal testing on every project.

And interestingly, many of the most compelling usability test insights come not from the elements that are evaluated, but rather those not evaluated. They come from the almost unnoticeable moments when a user frowns at a button label, or obviously rates a task flow as easier than it appeared during completion, or claims to understand a concept while simultaneously misdefining it. The unintended conclusions—the peripheral insights—are often what feed a designer’s instincts most. Over time, testing sessions can strengthen a designer’s intuition so that she can spot troublesome design details with just a glance. Simply put, usability tests can provide huge insight into the patterns and nuances of human behavior.

This notion alone, however, is unlikely to justify the expense of testing to organizations struggling with profitability. It’s usually only after a company has become successful that testing becomes routine, so designers and usability professionals must rely on other justifications. Fortunately, there are several.

Usability testing justified

First, usability testing has high shock value. Teams invariably conclude their initial sessions surprised to learn they had not noticed glaringly obvious design problems. This shock alone is often enough to drive a team toward a more strategic approach in which it reverts to what should have been the earliest phase of the process: Determining the project’s goals and forming a comprehensive strategy for achieving them. In short, it convinces teams that something is wrong and motivates them to take action. As the saying goes, knowing is half the battle.

Second, testing helps establish trust with stakeholders. For an internal project, testing helps quell management and stakeholder concerns about the validity of a design team’s findings and recommendations. It’s not enough, in other words, to hire experienced practitioners—those practitioners must then prove themselves repeatedly until teams begin to trust their expertise. Testing offers a basis for that trust.

Finally, while testing alone is not a good indicator of where a team’s priorities should lie, it is most certainly part of the triangulation process. When put in context of other data, such as project goals, user goals, user feedback, and usage metrics, testing helps establish a complete picture. Without this context, however, testing can be misleading or misunderstood at best, and outright damaging at worst. This is also true for non-testing-based evaluation methods, such as heuristic reviews.

Adapting to the reality

There is a catch to all of the preceding arguments, however: They revolve around the notion that testing should be used primarily to identify problems with existing designs. This is where teams get into trouble—they assume testing is worth more than it truly is, resolve to address problems based purely on testing data, and revise strategies based entirely on comments made by test participants. None of these things reliably lead to positive outcomes, nor do they ensure a team will emerge from the process any wiser than the day before.

As we’ve seen, test results and research can point teams toward solutions that are not only ill-advised, but in direct conflict with their goals. It’s only natural that existing users perform tasks capably and comfortably despite poor task design. After all, the most usable application is the one you already know. But this doesn’t mean poor designs should not be revamped. Rather, to adapt to and harness the power of usability testing, current users should be brought in to test new ideas—ideas that surface from expert evaluation and collaboration with designers to create new solutions.

What they should have done

The team that ran the card-sort in the earlier example should have devised a new set of terms and used testing to validate them, rather than ask users to determine which terms to apply in the first place.

The team that decided to feature poorly-designed task flows because its existing audience could proficiently use them should have prototyped new task flows and run test sessions to validate usability with existing and first-time users.

To identify problems on which to focus, these teams, and yours, can take a variety of approaches. Consider a revised workflow that begins with an expert-level heuristic evaluation used in conjunction with informal testing methods, followed by informal and formal testing. More specifically, consider using online tools and paid services to investigate hunches, then use more formal methods to test and validate revised solutions that involve a designer’s input.

Here are several tools that can be used with a heuristic evaluation to identify trouble spots:

  • Five-second tests: Show a screen to a user for five seconds and ask her to write down everything she remembers. In task-focused screens, ask the user how to perform a core task, and then show her the screen and ask her to tell you her answer. Five-second tests can be run online using the free service, www.fivesecondtest.com.
  • Click stats: Use Crazy Egg to track clicks on specific pages on live sites. These metrics can shed light on whether or not an ad is effective, a task flow is clear, or a bit of instructive micro-copy is helpful.
  • Usability testing services: User Testing locates participants according to demographic requirements you set, has them complete the tasks you identify, and sends you the results, complete with a screen recording of each test session, for $29 per participant.
  • Click stats on screenshots: Chalkmark offers essentially the same service as Crazy Egg, but uses screenshots rather than live pages. This way, you can analyze a screen’s usability before the design goes live, which is, of course, the best time to do it.
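
One way to keep these tools honest is to write the hunch down first and then see whether the responses bear it out. Here is a minimal Python sketch of that idea for a five-second test. The responses, the keywords, and the `share_mentioning` helper are invented for illustration and are not part of any service’s API.

```python
def share_mentioning(responses, keywords):
    """Return the fraction of free-text responses that mention any keyword."""
    if not responses:
        return 0.0
    lowered = [k.lower() for k in keywords]
    hits = sum(1 for r in responses if any(k in r.lower() for k in lowered))
    return hits / len(responses)


# Hypothetical hunch: visitors do not notice the reservation call to action.
# Hypothetical answers to "write down everything you remember".
responses = [
    "A big photo of a hotel lobby",
    "Hotel photo, a Reserve button at the top",
    "Something about rooms and a map",
    "Logo, photo, book now link",
]
share = share_mentioning(responses, ["reserve", "book"])
```

A low share would support the hunch that the call to action goes unnoticed. It does not say why, which is where a designer’s judgment and more formal testing come in.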

In handling usability projects in this way, teams will identify priorities and achieve better outcomes, and can still gain all the benefits of being actively involved with usability tests.

The major caveat to all of these methods is that users who are invested in completing a task act very differently than those who are not. A test participant who really wants to buy a digital camera will behave differently on a commerce site than a participant whose only motivation is to be compensated. Those who are invested in the tasks will persevere through far more problems than those who are not. When using any of these methods, it’s important to try to find participants who actually want to complete the very tasks you wish to evaluate.

Conclusions

Obviously, not every team or organization can bear the expense of usability testing. In the end, you can do only what’s most feasible in your particular situation. But if testing is an option—whether as a one-time experiment or already part of your regular routine—be sure to use the tool for the right job, and be sure to approach the process with clear expectations.

Usability professionals may prefer that Molich’s story be kept quiet. Not because it delegitimizes the profession, but because it can be easily misunderstood if told outside of its context. While usability testing fails wholly to do what many people think is its most pertinent and relevant purpose—to identify problems and point a team in the right direction—it does provide a direct path for observing human behavior, it does a brilliant job of informing a designer’s instincts over time, it builds trust with stakeholders, and it’s a very effective tool for validating design ideas.

Test for the right reasons and you stand a good chance of achieving a positive outcome. Test for the wrong ones, however, and you may not only produce misleading results, but also put your entire business at risk.

About the Author

Robert Hoekman Jr.

Robert Hoekman, Jr. is a user experience specialist and writer. He helps clients develop user experience strategies. He does usability evaluations, trains design teams and speaks at conferences. His latest book is Web Anatomy, coauthored by Jared Spool.

29 Reader Comments

  1. Thanks for a thought provoking article! I wasn’t aware of the Molich experiments, but am glad I am now, and enjoyed the tips on simpler ways to do user testing.

    I’m a data-driven design guy, so I appreciate the more evidence-based approaches to what we do in design and usability, but I worry here we’re replacing one flawed system with another.

    Firstly, as for the tools you recommend (most of which I would echo), aren’t they just as prone to the problems Molich found? I can’t imagine the five second test being particularly meaningful, for instance.

    Secondly, hasn’t this desire for more usability testing come about after recognizing the limitations of our intuition? Sure, it’s a nice ego boost to think that I, Joe or Jane Professional, can simply know, Gladwell-style, in the blink of an eye what the problems are (though I wouldn’t cite “Blink” as evidence for anything, really), but I don’t think it holds up in reality. If it did, I would spot problems costing businesses $millions, charge them $1 million for my blink-of-an-eye fix, they could fire their internal team, make millions, and everyone wins. Well, everyone except their internal team.

    What evidence is there that a well trained intuition isn’t just as faulty as the usability teams? Ask a dozen designers, get a dozen different answers.

    It’s like saying an experienced stock trader can just “know” which stocks are going up. If that were true, we’d all give them our money, and we’d all be rich.
    I think our intuition is brilliant for coming up with ideas and _possible_ solutions (our design is only ever going to be as good as our best ideas, after all), but we, as a profession, still need a formal way of measuring, testing, and publishing our results, in my opinion.
    I’m not for flawed usability testing as described (pseudo-science doesn’t help anyone), but nor am I for returning to expert intuition, as informed by testing or otherwise. But I do think we need a new way of thinking about design on the web.

  2. Luke, thanks for your comments. Yes, I do think intuition can be unreliable, and yes, I do think the evaluation methods I recommended here can be just as flawed as traditional testing. This is why I recommended a process that uses a combination of these methods. I said: “Consider a revised workflow that begins with an expert-level heuristic evaluation used in conjunction with informal testing methods, followed by informal and formal testing. More specifically, consider using online tools and paid services to investigate hunches, then use more formal methods to test and validate revised solutions that involve a designer’s input.”

    You should definitely do anything you can to validate design decisions once they’re implemented. That’s always true. Metrics analysis is one good way to do that (provided the metrics are put in context of goals, of course). I certainly didn’t mean to suggest you can spot important issues and automatically make millions of dollars as a result (though it has been done, albeit rarely and always with a little luck).

    Design is an ongoing experiment. It’s never truly finished. It’s never truly perfect. Put your best ideas out there, measure them, and revise or rethink as needed.

    To your question about online tools: I don’t think 5-second tests are meaningful in and of themselves; I’m not sure any evaluation method is meaningful all by itself. Here, I’ve recommended using them to investigate hunches. If you have already identified what you think is a usability issue and results from a 5-second test support or contradict your belief, that’s a useful insight. In other words, use these tools to bear out a hypothesis, not as a way to discover a hypothesis. The difference is important.

    Thanks again for your comments. Cheers!

  3. Great article, Robert. User experience engineering is an art just like writing and graphic design. You can follow all the rules, but it takes a subjective element to get it right for human consumption. Metrics help by making the subjective objective. They validate or invalidate subjective feelings and decisions.

    Silverback (http://silverbackapp.com/) is a tool worth mentioning because it aids in observing user experiences rather than just reporting metrics.

  4. I don’t agree with your conclusion of the Molich tests. It’s not usability testing that you should blame, but the usability companies that participated. Maybe they are not that good.

    Or maybe it is the method itself. Maybe usability as a science is not as exact as some people hope.

    Imagine you have a huge garden and you invite 10 companies to maintain it. Will the result be the same?

    Definitely not. Some differences will be the result of a bad knowledge of gardening. Or lazy employees. Or gardeners not knowing the business. I think most of the ‘mistakes’ made would be the result of one of those elements.

    Other differences will be the result of a different approach. And who is really capable of telling that those differences ‘are’ mistakes, things that the gardeners missed?

    Gardening is not an exact science.

    Neither is usability. (Whatever method you use.) There will always be differences.

  5. If you’re going to do usability testing, do it thoughtfully, thoroughly, and by collaborating with the team. There is a lot of art involved in user research — it isn’t science. But a usability test well done can reveal amazing insights that you can’t get any other way, even if you have observed hundreds of people using designs.

    I have a lot to say about usability testing. Some of what I have to say I said in this periodical, just a couple of weeks ago: http://www.alistapart.com/articles/preview/usability-testing-demystified/.

    (If that’s not enough for you, check out my blog: http://usabilitytestinghowto.blogspot.com/. If *that’s* not enough, come to my session at UI 14, ‘Mastering The Art of User Research’ http://www.uie.com/events/uiconf/2009/program/#chisnell)

  6. *ianlotinsky:* Thanks for the kind words, and yes, I think Silverback is a wonderful tool. Thanks for mentioning it.

    *Dana Chisnell:* I agree — usability testing can, and does, reveal amazing things you can’t get any other way. (Full disclosure: I consulted with Dana prior to writing this article to get her expert perspective on the subject.)

    *KarlGilis:* Interestingly, I think you and I agree more than disagree here. I wonder if you’d consider reading the article a second time with your initial reaction in mind.

    First, I didn’t “blame” usability testing for anything here (I’m not sure if you think I was trying to place blame or if you were only using that word for the sake of your gardening analogy). To the contrary, I frequently sing its praises. I’ve simply posited in this article that the main benefits of testing appear to be different than what is commonly proposed as its core purpose: determining what a team’s priorities should be. In 2003, when asked how teams could confidently conclude they were addressing the right issues, Molich himself said (http://www.uie.com/articles/molich_interview/), “It’s very simple: They can’t be sure!”

    Second, please note that I didn’t say anyone made “mistakes” in their evaluations, just that 17 teams in CUE-4 came up with 17 different evaluations, and 9 teams in CUE-2 came up with 9 different evaluations. Indeed, testing is not as scientific as one might hope — that was exactly my point. It is only natural multiple teams will achieve different results — perhaps even _obvious_ — but typical public opinion is that usability testing is a scientific method for determining all the usability issues in a design. The CUE tests clearly contradict that notion.

    I hope this clears up any confusion.

  7. i’m pretty skeptical about the 5-second test. what could you hope to discover from a test like this. i think it could only be useful as a very narrow test, ie what do users notice first. anyone found it genuinely useful?

  8. Thanks for the article. I agree with most of what you’ve said, but couldn’t you pick a better title? I’m getting linked to this article by people who know me (since I specialize in this field) who think that it’s against usability testing.

    Anyway, results of usability testings can vary greatly depending on few factors such as participants, tasks given to participants, the experience of the moderator…etc, so of course you’ll get different results. That doesn’t mean that usability testing is ineffective. It just means that you need to be careful and understand what you’re doing.

  9. roginator: That’s exactly what it’s for, actually. You can use it to determine what people think is the primary purpose of a page, to see if they can tell in five seconds, or ask them how they’d perform a task and then show the screen for 5 seconds and see if they can spot what to click. It’s very useful for quick and dirt cheap feedback on a page with a single purpose. And using http://www.fivesecondtest.com, you can get a ton of responses really quickly.

  10. Mashhoor: Ha! Sorry if it’s causing you trouble. I’ve heard a few people say they think the title is inappropriate. But the article is in fact about a myth: that usability testing is good for determining what to focus on next. It also talks about why different teams produce different results, and what a better (cheaper, more effective) process might look like, so the title is indeed appropriate. That said, I knew it would probably get exactly this type of reaction, and after measuring the pros and cons, I decided the potential controversy could only draw attention to it.

    Do please point out to your colleagues, though, that I never said testing was ineffective. To the contrary, I said it was very effective, just for different reasons. I wholeheartedly support running usability studies. If anything, do them more. Just don’t do them with the goal of determining what to focus on next, because they’re wholly ineffective at accomplishing that.

    Cheers!

  11. Robert, from some of your comments I understand that the “myth” you are referring to in the article title is that usability research is good for “determining what to focus on next.” But I didn’t really get that from the article itself. Are you saying that usability testing and evaluation (BTW, are you equating the two for the purposes of your thesis?) are bad methods for deriving high-level strategic guidance? I would definitely agree with that, but mainly because the method focuses on too-granular issues and is limited in terms of research participants. But then what is the significance of the Molich story and the other examples you cite of, quite frankly, faulty thinking / planning? These stories are worrisome, for sure, and worthy of further study / discussion in and of themselves. But I’m not sure that they say anything positive at all about usability methodologies – regardless of the purpose or intent. In fact, I hope none of my clients read your title and the first few paragraphs alone…and leave thinking that usability research is unreliable!

    Some other questions: Doesn’t your Blink reference support the idea that a good usability expert can provide value? Does that mean that the Molich evaluators were just incompetent? Also, what about just using usability research for its intended purpose – to identify specific design problems that impede user success and / or fail to encourage behaviors that the site wants to encourage (engagement, exploration, interaction, etc.)? Is this a “good” use or a “bad” use of the methodologies?

    I know there must be more to this than “use the right tool for the job” and use it properly… but I confess that I’m not seeing it. Help?

  12. The biggest question mark for me is how valid is the usability data when it was produced using subjects who are not the real users of the application.
    For instance, you can tell me to buy something from a website (me as a usability test subject), but even if the direction informs me I’m buying a product, if I’m not the real customer I may not really understand what I’m looking for. There are things that a true customer considers before deciding whether to make a purchase that a usability subject may bypass, in effect inventing the truth about the flow of a real-life user. As for spending additional funds and changing the direction of a site altogether based on the ‘data’ of non-user test subjects, how valuable has that proven to be?

  13. Thanks for a great article, Robert – highly interesting reading!

    I get the feeling that conclusions from tests often are drawn too early in order to find some kind of business short-cut. This should certainly be good and educative reading for them… 🙂

  14. But in this day of USA Today attention spans, and especially given our discipline’s struggle for respectability and acceptance, there is danger in the titillating but misleading article title or the carelessly arrived at, but well written, conclusion.

    I write in reaction to “The Myth of Usability Testing” by Robert Hoekman Jr. (http://www.alistapart.com/articles/the-myth-of-usability-testing/).

    Misleading title # 1: The article title implies either that a) all of usability testing is a myth, or b) there is only one myth associated with usability testing. As “Mashhoor” asked in his/her post to the discussion about the article, “. . . couldn’t you pick a better title? I’m getting linked to this article by people who know me (since I specialize in this field) who think that it’s against usability testing.” Indeed, Hoekman Jr. says, in discussion item #10, “I knew [the title] would probably get exactly this type of reaction, and . . . I decided the potential controversy could only draw attention to [usability testing]. . . . I wholeheartedly support running usability studies.” I hope all the readers who think the title might suggest otherwise choose to read this far.

    Sloppy or misleading conclusion # 1: In discussing Molich’s CUE-2 test, Hoekman Jr. says, “Collectively, the teams reported 340 usability problems. However, only nine of these problems were reported by more than half of the teams. And a total of 205 problems (60% of all the findings reported) were identified only once. Of the 340 usability problems identified, 61 problems were classified as ‘serious’ or ‘critical’ problems.”
    “Think about that for a moment.”
    “For the Hotmail team to have identified all of the ‘serious’ usability problems discovered in the evaluation process, it would have to have hired all nine usability teams.”
    In stark contrast to Hoekman Jr.’s conclusion that usability testing can’t possibly be cost-effective is Molich’s own conclusion: “Realize that single tests aren’t comprehensive. They’re still useful, however, and any problems detected in a single professionally conducted test should be corrected” (http://www.dialogdesign.dk/CUE-2.htm). Also, in summarizing CUE-4 (http://www.dialogdesign.dk/CUE-4.htm), Molich says: “Many of the teams obtained results that could effectively drive an iterative process in less than 25 person-hours. Teams A and L used 18 and 21 hours, respectively, to find more than half of the key problem issues, but with limited reporting requirements.”

    Misleading title # 2: First major header — “Why usability evaluation is unreliable.” Even if some usability evaluation is unreliable — and given the low barriers to entry for the field of usability engineering, who would be surprised? — that doesn’t mean all usability evaluation is unreliable. Indeed, Hoekman Jr. goes on in this section to describe BAD usability evaluation (e.g., “Right Questions, Wrong People, and Vice Versa”). With this I agree totally — bad usability evaluations are unreliable, and are just generally, um, bad. I wonder if a better header for this section might have been “Some things that lead to unreliability of usability evaluations”? Or maybe “Good methods gone bad”?

    Sloppy or misleading conclusion # 2: “Usability evaluations are good for a lot of things, but determining what a team’s priorities should be is not one of them.”

    Allow me to observe that usability evaluations are also poor for Julienning fries — for that I’d recommend a Veg-o-Matic. For establishing your team’s priorities, I’d recommend, oh, some sorta business process. But if your goal is to identify and prioritize potential problems your users may have with your product or site design — well then, usability evaluation can kick Veg-o-Matic ass. Which brings me to the best part of the Hoekman Jr. article . . .

    Great, representative illustration # 1 — the graphic at the head of the article, drawn by Kevin Cornell, showing a hammer resting against a bent and undriven screw. EXACTLY. Here, a hammer is the wrong tool for the job. There are many jobs for which usability evaluation is the wrong tool, but, as with the hammer, many for which it is the right tool.

    Sloppy or misleading conclusion # 3: “It’s only natural that existing users perform tasks capably and comfortably despite poor task design. After all, the most usable application is the one you already know. But this doesn’t mean poor designs should not be revamped. Rather, to adapt to and harness the power of usability testing, current users should be brought in to test new ideas—ideas that surface from expert evaluation and collaboration with designers to create new solutions.” Yes, and non-current-but-still-representative users may be brought in, at any time, to evaluate old and new interfaces. Why the focus on only current users? If one tested only current users, it would be another example of “the wrong people for the right question.”

    Wheel Rediscovery # 1: “To identify problems on which to focus, these teams, and yours, can take a variety of approaches. Consider a revised workflow that begins with an expert-level heuristic evaluation used in conjunction with informal testing methods, followed by informal and formal testing. More specifically, consider using online tools and paid services to investigate hunches, then use more formal methods to test and validate revised solutions that involve a designer’s input.” Yes, this sounds like a fairly thorough course of User-Centered Design (UCD) (see Vredenburg, Isensee, and Righi, 2002), though there are earlier steps of user-based requirements gathering that are also important. (Though it seems odd to parry “Usability evaluation may be too costly” with “Go with a heuristic evaluation and informal methods, plus some more informal and formal testing.”) Molich, in his CUE-2 summary, offers “Use an appropriate mix of methods.”

    Odd, unsubstantiated claim # 1: “Here are several tools that can be used with a heuristic evaluation to identify trouble spots: Five-second tests: . . . Click stats: . . . Usability testing services: . . .Click stats on screenshots: . . . . In handling usability projects in this way, teams will identify priorities and achieve better outcomes, and can still gain all the benefits of being actively involved with usability tests.” So, heuristic evaluation plus these remote, unmoderated testing tools yield the same benefits as usability testing? I wonder. It’s an empirical question, and in my opinion it’s the next big question for our field — the empirical comparison of the value of usability engineering methods; which methods at which points in the development cycle of which types of user interfaces? (Alas, so far the National Science Foundation doesn’t agree with me, that answering this question is worthy of funding.)

    Odd, but widely-shared misconception # 1: “Obviously, not every team or organization can bear the expense of usability testing.” Which teams would that be? For which teams is it OK to “just get something out there and let our first users be our first test participants”? (I am NOT quoting Hoekman Jr., here; rather, it’s a snarky but too-often-deserved characterization of development teams’ approach.) Which teams are OK with the potential costs of a post-ship rework of the product, PLUS the alienating of those users who struggled to learn how to interact with that first design, given that “After all, the most usable application is the one you already know”? Which teams (and ya’ gotta be able to identify ’em in advance, right?) are going to be those teams that happen to get the design right the first time?

    So, to summarize, in my should-be-humbler opinion:
    – yes, usability evaluations can be pursued at the wrong time, and can be performed poorly even when the timing is good;
    – but that is true of any method or tool in software (or any) engineering, and no reason for criticism of the method itself;
    – usability evaluation, applied and conducted well, IS a tried-and-true technique for identifying potential usability problems;
    – but maybe not all the problems; and so
    – yes, we need to get better at choosing and applying usability engineering methods.

    I’m workin’ on that.

  15. Though the discussant gets no feedback on this (and the title does not appear in the preview when it is cut-and-pasted from the comment itself), there’s a limit to the length of a message title. For my previous post the intended title was: “Fish gotta swim, birds gotta fly. And bloggers gotta blog.”

  16. Randolph: Thanks for your comments. Unfortunately, it would take more time than I have available to address everything you brought up, so I’ll have to leave it up to readers to parse it all and form their own opinions, but there are a few points I feel I must address.

    1. Regarding “Sloppy or misleading conclusion # 1”: Molich’s conclusion isn’t at all in contrast to my own conclusion. I said the Hotmail and Hotel Penn teams would’ve had to hire all 9 and 17 teams, respectively, to identify all the issues spotted during the CUE experiments. Molich, in different terms but with effectively the same message, said “single tests aren’t comprehensive.”

    2. Regarding “Sloppy or misleading conclusion # 2”: Of course you can identify problems with a design through testing, but by using the method to prove out a hypothesis, not by using it as a discovery tool. You took the statement out of context. In context of the surrounding paragraphs, you can see that the statement is about the ineffectiveness of determining what a team’s priorities should be _when testing is used as a discovery method_.

    3. Regarding “Odd, unsubstantiated claim # 1”: Again, you’ve taken this out of context. A benefit of testing I spoke of in the article is that of feeding a designer’s instincts. Informal testing methods can most definitely provide that benefit. And as Molich clearly demonstrated, full usability studies are no more predictable or consistent between teams than heuristic evaluations, so yes, teams who are trying to determine priorities can gain exactly that benefit through informal methods just the same as with formal methods. Neither is more correct than the other.

    There are many myths of usability testing — I’d need a much, much longer article to cover them all. And the fact is, since human beings are involved in every last usability study performed, and no study is an exact or perfect process (because it can’t be), the results are bound to be wildly inconsistent. A widely-held belief, though, is that testing is _scientific_. Clearly, it’s not. It’s important that people understand myths like this one before throwing significant amounts of money and time at a method that won’t necessarily work for them.

    Thanks very much for joining in the discussion. I love that this topic has sparked so much debate. It’s exactly what we need.

  17. As a certified human factors engineering professional (CHFP) with over 30 years’ experience with complex usability issues, I find that your piece grossly misrepresents the intent and structure of that form of usability analysis known as “heuristics”. To those with a serious background in usability, the studies you mention are known to be grossly misleading and poorly executed. Finally, for the record, Jakob Nielsen DID NOT invent heuristics. The process was well understood and used successfully in many military applications before JN was born.

    Charles L. Mauro CHFP
    President/Founder
    MauroNewMedia

  18. Thanks for writing such a provocative article. While I agree that usability testing isn’t the right tool to identify the answer to every question about an interface, I’m not really sure I follow your logic here.

    In the examples you cite, the research teams made some pretty huge recruitment gaffes. Clearly, if you test with the wrong audience, and ask them the wrong questions, your findings aren’t going to be worth shit. But that doesn’t mean the method is lacking; the implementation is.

    Do you have any anecdotes illustrating the method’s shortcomings that DON’T involve research teams that made some serious newbie failures?

    I also don’t understand your assertion that:

    “… usability testing fails wholly to do what many people think is its most pertinent and relevant purpose—to identify problems and point a team in the right direction …”

    Setting aside your examples of poorly run testing scenarios, I really don’t see how you can make this kind of assertion. It’s been my experience that usability testing is a terrific method to identify problems in an existing interface. Am I misunderstanding your point? Could you clarify what you mean?

    After reading this article multiple times now, the main point I am left with is this: Don’t hire usability testers who don’t know what they’re doing. I wholeheartedly agree with this sentiment. But as far as I can tell, there’s no clear evidence given here to justify using inflammatory phrases like “the myth of usability testing” or “why usability evaluation is unreliable”.

    I just don’t buy it.

  19. Nice article. Having studied psychology, I know a lot about field studies, experiments and what to look for in them – for example, these usability tests may not have been run in the right conditions, so whilst they _may_ be correct to a degree, some of these usability problems may not be _real_ problems when the product is used by a normal person.

    I’m not saying that the job these people do is unnecessary, I’m saying that it should always be taken with a pinch of salt and analysed further.

  20. I am not sure what you mean by usability testing cannot drive team priorities? Your explanation is all about why BAD usability testing cannot drive priorities. I think I failed to see where you spoke of GOOD usability testing to drive priorities.

    But I do agree with you, usability testing should be in context. If your usability test cases capture the business needs properly, then it can direct the development efforts in making that test pass (TDD).

  21. Usability tests are great at telling a team what direction they should not pursue, but probably not much else. Unfortunately that is perhaps the most important information regarding creative direction that “experts” may ever hope to receive that is not often appreciated strictly with this regard. Web developers are, by the way, experts at knowing what the customer wants, which is why they are so good at telling the customer what they want.

    My employer subscribes to usability tests that provide incredible feedback. How scientific is that information and how wonderful are those picky details? I don’t know. The evidence does not suggest a decisive direction to pursue, but it is quick to tell you when you are wrong in comparison to nearly identical expectations from competitor websites. If you are wrong over and over… eventually a pattern should emerge of what you should NOT be doing. I find that information to be of profound value, although it is commonly in direct conflict with expectations of what usability should be.

  22. I agree with this article that Usability Testing isn’t the ultimate standard in catching issues and solving problems on the web. However, I’ve found in my own experience that testing can provide some insights into how your website is perceived, as well as being able to let you separate yourself from your own company jargon.

    We used testing on a redesign for a university in Philadelphia, and some of the best insight we gained was from what users wanted to get to first, second, etc. We also gained insight into the application process we had built, and were able to rewrite instructions in layman’s terms, instead of the “advertising” jargon we had been using.

  23. You can’t just sit there listening to everyone’s comments. Many users have a terrible sense of aesthetics or want the application to be tailor-made for them. _Every_ site and application needs to undergo usability testing of some sort, but more important is having the services of a designer (and hopefully developers too) who has usability patterns and standards down to a T. A designer with the right eye can pick the true usability flaws out of the sea of personal preferences expressed during usability testing.

  24. Aside from the fact that I would absolutely believe that a typical Microsoft application like Hotmail has AT LEAST 300 usability problems, there are some other oddities about this study.

    The reviewers of the study mention that there were reporting problems from the teams, as well as being pretty fine-grained about whether ‘problems’ were in fact the same or not. It is almost difficult to trust the outcomes of these studies.

    Also I find it hard to believe that if you continue to expand usability studies you won’t be able to find a correlation with major problems. Most studies are able to zoom right in to 1-4 major problems right away. This study did in fact find that there were 9(?) problems that were reported by more than a few teams. This makes perfect sense. You need to prioritize and fix major problems.

    I agree that you need to understand what your expectations are from these kinds of studies. You’re looking for ‘usability’ problems, not people’s opinions. If you want opinions go have a code review. If people are unable to complete a well written task, well, then you have a problem – which is why these studies are run.

    It’s also my opinion that bringing in current users of a system is a problem. Even sites that require a lot of domain knowledge are able to be tested with first time users and a well written script. Bringing people back to test again is usually not a good idea because (like you mentioned) they’ve become familiar already with your screwed up navigation.

  25. Nice article and I do agree with many points. We find that the best method is to have a group of 10-15 external users all with different tasks to perform, e.g. buy a pair of jeans, sign up to the newsletter, find a course, etc. This way you get a good cross section and then you sit down with your creative and development teams and analyse the data. Then we would make any recommendations for design/functionality changes going forward.

  26. If you’re testing an interface for a product that is not strongly influenced by the user’s personal context or emotional state, then testing in a lab will yield decent results. If you’re trying to measure the persuasive power or conversion potential of your site (which should be pretty high on your list of research objectives if you’re running an ecommerce site), then lab-based testing is a complete waste of time. There are a bunch of cost-effective alternatives that can help you identify roadblocks to conversion in real-time.

  27. “Page views and time-spent-per-page metrics, while often foolishly considered standard measures of site effectiveness, are meaningless until they are considered in context of the goals of the pages being visited.

    Is a user who visits a series of pages doing so because the task flow is effective, or because he can’t find the content he seeks? Are users spending a lot of time on a page because they’re engaged, or because they’re stuck? While NYTimes.com surely hopes readers will stay on a page long enough to read an article in full or scan all its headlines, Google’s goal is for users to find what they need and leave a search results page as quickly as possible. A lengthy time-spent metric on NYTimes.com could indicate a high-quality or high-value article. For Google’s search workflow, it could indicate a team’s utter failure.”

    > You seem to be confusing analytics with usability. The whole purpose behind a think-aloud test is to uncover what’s behind this sort of thing. BTW, that’s all that Crazy Egg, Chalkmark, and five second test tell you too. User Testing is a real usability test, but without any ability to follow up. In that sense, it’s inferior to a moderated test.

    “And interestingly, many of the most compelling usability test insights come not from the elements that are evaluated, but rather those not evaluated. They come from the almost unnoticeable moments when a user frowns at a button label, or obviously rates a task flow as easier than it appeared during completion, or claims to understand a concept while simultaneously misdefining it. The unintended conclusions—the peripheral insights—are often what feed a designer’s instincts most.”

    > Most experienced facilitators ignore these in favor of verbalizations. And if users exhibit some sort of body language and don’t verbalize, experienced facilitators prompt them to do so (“What are you thinking?”)

    “Finally, while testing alone is not a good indicator of where a team’s priorities should lie, it is most certainly part of the triangulation process. When put in context of other data, such as project goals, user goals, user feedback, and usage metrics, testing helps establish a complete picture.”

    > User feedback and usage metrics cannot be used if the system hasn’t been put into production yet. Usability testing is usually done pre-release, to get feedback on something without its being exposed to the whole world. Also, evals and tests, if done properly, consider business and user goals in the tasks they test and the things they look for.

    “Without this context, however, testing can be misleading or misunderstood at best, and outright damaging at worst. This is also true for non-testing-based evaluation methods, such as heuristic reviews.”

    > Actually, it’s not the context so much as the things you covered earlier — basically, inexperienced usability practitioners.

    “There is a catch to all of the preceding arguments, however: They revolve around the notion that testing should be used primarily to identify problems with existing designs. This is where teams get into trouble—they assume testing is worth more than it truly is, resolve to address problems based purely on testing data, and revise strategies based entirely on comments made by test participants.”

    > Usability testing doesn’t prove anything. It’s meant to inform your judgment. You still have to make a decision. Hopefully, there is now evidence (numbers, quotes, etc.) that you can use to make an *informed* decision.

    “As we’ve seen, test results and research can point teams toward solutions that are not only ill-advised, but in direct conflict with their goals.”

    “usability evaluation is unreliable”

    “While usability testing fails wholly to do what many people think is its most pertinent and relevant purpose—to identify problems and point a team in the right direction.”

    “Test for the right reasons and you stand a good chance of achieving a positive outcome. Test for the wrong ones, however, and you may not only produce misleading results, but also put your entire business at risk.”

    > These are pretty strong claims. Given that, you really need to back them up. You cite the CUE studies, but there’s a lot to these studies. It’s actually rather complicated what he’s really saying, and trying to present it all in this way is (yes) very attention-getting, but also very wrong.

    “Asked how development teams could be confident they are addressing the right problems on their websites, Molich concluded, “It’s very simple: They can’t be sure!””

    > Well, here’s what Molich has also said, from the CUE site:

    >> Six – or even 15 – test participants are nowhere near enough to find 80% of the usability problems. Six test participants will, however, provide sufficient information to drive a useful iterative development process.

    >> The limited overlap may be a result of the large number of usability problems in [the system being tested]. It could also be due to the different approaches to usability testing that the participating teams took – in particular, the selection of different usability test scenarios.

    >> Realize that there is no foolproof way to identify usability flaws. Usability testing by itself can’t develop a comprehensive list of defects. Use an appropriate mix of methods.

    >> Place less focus on finding “all” problems. Realize that the number of usability problems is much larger than you can hope to find in one or even a few tests. Choose smaller sets of features to test iteratively and concentrate on the most important ones.

    >> Realize that single tests aren’t comprehensive. They’re still useful, however, and any problems detected in a single professionally conducted test should be corrected.

    >> Increase focus on quality and quality assurance. Prevent methodological mistakes in usability testing such as skipping high-priority features, giving hidden clues, or writing usability test reports that aren’t fully usable.
