A List Apart


Illustration by Geri Coady

Collaborative User Testing: Less Bias, Better Research

I’ve always worked in small product teams that relied on guerrilla user testing. We’d aim to recruit the optimal number of participants for these tests. We’d make sure the demographic reflected our target audience. We’d use an informal approach to encourage more natural behavior and reduce the effect of biases participants could be prone to.


But you know what we almost never talked about? Ourselves. After all, we were evaluating work we had a personal and emotional involvement in. I sometimes found myself wondering: how objective were our findings, really?

It turns out, they may not have been.

In “Usability Problem Description and the Evaluator Effect in Usability Testing,” Miranda G. Capra identifies a tendency in the UX community to focus on users when talking about testing, while seldom talking about the role of evaluator. The assumption is that if the same users perform the same tasks, the reported problems should be the same—regardless of who evaluates them.

But when Capra studied 44 usability practitioners’ evaluations of pre-recorded sessions, this wasn’t observed. The evaluators, made up of experienced researchers and graduate students, reported problems that overlapped at an unexpectedly low rate—just 22 percent. Different evaluators found different problems, and assigned different levels of severity to them. She concluded that the role of evaluator was more important than previously acknowledged in the design and UX community.

If complete and objective results couldn’t be achieved even by usability professionals who were evaluating the same recordings, what can we expect from unspecialized teams planning, conducting, and evaluating user testing?

Bias is unavoidable

As people fully immersed in the project, we are susceptible to many cognitive biases that can affect outcomes at any stage of research—from planning to analysis. Confirmation bias among inexperienced evaluators is a common one. This leads us to phrase questions in a way that is more likely to confirm our own beliefs, or subconsciously prioritize certain responses and ignore others. I’ve done it myself, and seen it in my colleagues, too. For example, I once had a colleague who was particularly keen on introducing search functionality. Despite the fact that only one respondent commented on the lack of search, they finished the testing process genuinely convinced that “most people” had been looking for search.

We all want our research to provide reliable guidance for our teams. Most of us wouldn’t deliberately distort data. But bias is often introduced unknowingly, without the researcher being aware of it. In the worst-case scenario, distorted or misleading results can misinform the direction of the product and provide the team with false confidence in their decisions.

Capra’s research and other studies have shown that bias commonly occurs at the planning stage (when drafting test tasks and scenarios), during the session itself (when interacting with the participants and observing their behavior), and at the analysis stage (when interpreting data and drawing conclusions). Knowing this, my team at FutureLearn, an online learning platform, set out to reduce the chance of bias in our own research—while still doing the quick, efficient research our team needs to move forward. I’d like to share the process and techniques we’ve established.

Take stock of your beliefs and assumptions

Before you begin, honestly acknowledge your personal beliefs, particularly if you’re testing something you have strong feelings about. Identify those beliefs and assumptions, and write them down.

Do you think the Save button should be at the top of the form, rather than at the end, where you can’t see it? Have you always found collapsing side menus annoying? Are you particularly pleased and proud of the sleek new control you designed? Are you convinced that this label is confusing and that it will be misinterpreted? By taking note of them, you’ll stay more aware of them. If possible, let someone else lead when these areas are being tested.

Involve multiple reviewers during planning

At FutureLearn, our research is highly collaborative—everyone in the product team (and often other teams) is actively involved. We try to invite different people to each research activity, and include mixed roles and backgrounds: designers, developers, project managers, content producers, support, and marketing.

We start by sharing a two-part testing plan in a Google Doc with everyone who volunteered to take part. It includes:

  • Testing goals: Here we write one to three questions we hope the testing will help us answer. Our tests are typically short and focused on specific research objectives. For example, instead of saying, “See how people get on with the new categories filter design,” we aim for objective phrasing that encourages measurable outcomes, like: “Find out how the presence of category filters affects the use of sorting tabs on the course list.” Phrasing the goals in this way keeps evaluators focused and leaves less room for misinterpretation.
  • Test scenarios: Based on the goals, we write three or four tasks and scenarios to go through with participants. We make the tasks actionable and as close as possible to expected real-life behavior, and ensure that instructions are specific. With each scenario, we also provide context to help participants engage with the interface. For example, instead of saying: “Find courses that start in June,” we say something along the lines of: “Imagine you’ll be on holiday next month and would like to see if there are any courses around that time that interest you.”

In one past session, where participants were required to find specific courses, we used the verbs “find” and “search” in the first draft. A colleague noticed that by asking participants to “search for a course,” we could be leading them toward looking for a search field, rather than observing how they would naturally go about finding a course on the platform. It may seem obvious now that “search” was the wrong word choice, but it can be easy for a scenario drafter who is also involved in the project to overlook these subtle differences. To avoid this, we now have several people read the scenarios independently to make sure the language used doesn’t steer responses in a particular direction.

Perform testing with multiple evaluators

In her paper, Capra argues that having multiple observers reduces the chance of biased results, and that “having more evaluators spend fewer hours is more effective than having fewer evaluators spend more hours.” She notes that:

Adding a second evaluator results in a 30-43% increase in problem detection… Gains decreased with each additional evaluator, with a 12-20% increase from adding a third evaluator, and a 7-12% increase for adding a fourth evaluator.

In my past experience, the same small group of people (or a single person) was always responsible for user testing. Typically, they were also working on the project being tested. This sometimes led evaluators to be defensive—to the point that the observer would try to blame a participant for a design flaw. It also sometimes made the team members who weren’t involved in research skeptical about undesirable or unexpected results.

To avoid this, we have several people oversee all stages of the process, including moderating the sessions. Usually, four of us conduct the actual session—two designers, a developer, and someone from another discipline (e.g., a product manager or copywriter). Crucially, only one of the designers is directly involved in the project, so the other three evaluators can offer a fresh perspective.

Most importantly, everyone is actively involved, not merely a passive observer. We all talk to participants, take notes, and have a go at leading the session.

During a session, we typically set up two testing “stations” that work independently. This helps us to collect more diverse data, since it allows two pairs of people to interview participants.

FutureLearn staff gathered around a research participant at a user testing station.
Multiple evaluators participate in each user research session, which takes place at stations like these.

The sessions tend to be short and structured around the specific goals identified in the plan. The whole process lasts no more than two hours, during which the two stations combined talk to 10 to 12 participants, for about 10 minutes each.

Bias can take many forms, including the manipulation of participants through unconscious suggestion, or selection of people who are more likely to exhibit the expected behavior. Conducting testing in a public place, like the British Library, where our office is conveniently located, helps us ensure a broad selection of respondents who fit our target demographic: students, professionals, academics, and general-interest learners.

Have multiple people analyze results

Data interpretation is also prone to bias: cherry-picking findings and being fixated on some responses while being blind to others are common among inexperienced evaluators.

Analyzing the data we gather is also a shared task in our team. At least two of us write up the notes in Google Docs and rewatch the session videos, which we record using Silverback.

Most of our team members don’t have experience in user testing. Handed a blank sheet of paper and asked to make sense of their findings, they’d find the task intimidating and time-consuming—they wouldn’t know what to look for. Therefore, the designer responsible for the testing typically sets up a basic Google form that asks evaluators a series of fact-based questions. We use the following structure:

  • General questions: The participant’s name, age group, level of technical competence, familiarity with our product, and occupation. We ask these questions right at the beginning, along with having people sign a consent form.
  • Scenario performance: This section contains specific questions related to participants’ performance in each scenario. We typically use a few brief multiple-choice questions. Since our tests are short, we usually provide two to four options for each answer, rather than complex rating scales. Evaluators can then provide additional information or comments in an open text field.
Two sample evaluator questions: “Found courses starting in June?” (options: found easily, struggled, or gave up or ran out of time) and “Which course did they select?” (options: future course, current course, or neither—didn’t join).
Excerpts from the Google form each evaluator fills out while watching session videos.

These simple forms help us reduce the chance of misinterpretation by the evaluator, and make it easier for inexperienced evaluators to share their observations. They also allow us to support our analysis with quantitative data—e.g., how many people experienced a problem and how often? How easy or difficult was a particular task to complete? How often was a particular element used as expected, versus ignored or misinterpreted?
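To illustrate how these counts roll up into quantitative answers, here is a minimal Python sketch that tallies responses to one multiple-choice question from the evaluators’ form. The option labels mirror the sample question above, but the response data itself is invented for the example:

```python
from collections import Counter

# Hypothetical evaluator answers to one multiple-choice question
# ("Found courses starting in June?"); not real FutureLearn data.
responses = [
    "found easily", "found easily", "struggled", "found easily",
    "gave up or ran out of time", "struggled", "found easily",
    "found easily", "struggled", "found easily",
]

def summarize(answers):
    """Return each option's count and its share of all responses (percent)."""
    counts = Counter(answers)
    total = len(answers)
    return {option: (n, round(n / total * 100)) for option, n in counts.items()}

summary = summarize(responses)
# summary["found easily"] -> (6, 60): 6 of 10 participants found courses easily
```

A tally like this directly answers questions such as “how many people experienced a problem, and how often?” without any manual counting.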

Using these forms, an evaluator can typically review all five of a station’s participants in about an hour. We do this as soon as possible—ideally on the same day as the sessions, while the observations are still fresh in our memories, and before we get a chance to overanalyze them.

Once evaluators submit their forms, Google Forms creates an automatic response summary, which includes raw data with metrics, quotes, performance for each task, and other details.

Based on these responses, recorded videos, and everyone’s written notes, the designer responsible synthesizes the team’s findings. I usually start by grouping all the collected data into related themes in another spreadsheet, which helps me see everything at a glance and ensures nothing gets lost or ignored.

An excerpt of a spreadsheet that groups data from research sessions into themes, such as “Start Dates” and “Terminology.”
Grouping and organizing the data in a spreadsheet makes it easier to see themes and patterns.
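The grouping step amounts to bucketing each noted observation under a theme label. A minimal sketch in Python, with invented observations and theme names echoing the spreadsheet above:

```python
from collections import defaultdict

# Hypothetical (theme, observation) pairs pulled from evaluators' notes;
# the theme names are illustrative, not actual study data.
observations = [
    ("Start dates", "P3 expected upcoming runs to be listed first"),
    ("Terminology", "P1 read 'join' as requiring payment"),
    ("Start dates", "P7 missed the start date on the course card"),
    ("Terminology", "P5 was unsure what a 'run' meant"),
]

def group_by_theme(pairs):
    """Collect observations under each theme so patterns are easy to scan."""
    themes = defaultdict(list)
    for theme, note in pairs:
        themes[theme].append(note)
    return dict(themes)

grouped = group_by_theme(observations)
# grouped["Start dates"] now holds two related observations
```

Whether you do this in a spreadsheet or in code, the point is the same: once every observation sits under a theme, patterns (and outliers) become visible at a glance.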

At this stage we look for general patterns in observed behavior. Inevitably some outliers and contradictions come up. We keep track of those separately. Since we do research regularly, over time these outliers add up, revealing new and interesting patterns, too.

We then write up a summary of results—a short document that outlines these patterns and explains how they address our research goals. It also contains task performance metrics, memorable quotes, interesting details, and other things that stood out to the team.

The summary is shared with the research team to make sure their notes were included and interpreted correctly. The researcher responsible then puts everything together into a user testing report, which is shared with the rest of the company. These reports are typically short PDFs (no longer than 12 pages) with a simple structure:

  • Goals of testing and tasks and scenarios: Content from the testing plan.
  • Respondents: A brief overview of the respondents’ demographics (based on the General Questions section).
  • Results and observations: Based on the results summary recorded earlier.
  • Conclusions: Next steps or suggestions for how we’ll use this information.

Some teams avoid investing time in writing reports, but we find them useful. We often refer back to them in later stages and share them with people outside the project so they can learn from our findings, too. We also share the results in a shorter presentation format at sprint reviews.

Keep it simple, but regular

Conducting short, light sessions regularly is better than doing long, detailed testing only once in a blue moon. Keeping it quick and iterative also prevents us from getting attached to one specific idea. Research has suggested (PDF) that the more you invest in a particular route, the less likely you are to consider alternatives—which can also increase your chances of turning user testing into a confirmation of your existing beliefs.

We also had to learn to make testing efficient, so that it fits into our ongoing process. We now spend no more than two or three days on user testing during a two-week sprint—including writing the plan, preparing a prototype in Axure or Proto.io, testing, analyzing data, and writing the report. Collaborative research helps us keep each individual contributor’s time focused, saving us from spending time filtering information through deliverables and handoffs, and increasing the quality of our learning.

Make time for research

Fitting research into every sprint isn’t easy. Sometimes I wish someone would just hand me the research results so I could focus on designing, rather than data-gathering. But testing your own work regularly can be one of the most effective ways to overcome bias.

The hindsight bias is an interesting example. We become more prone to thinking we “knew things all along” as we grow more experienced, and as our perception of the level of our past knowledge increases. This can lead some designers to believe that experience “reduces the need for usability tests.” The risk, however, is that our design experience can make it harder for us to connect empathetically with our target audience—to relate to the struggles they’re going through as they use our product (that’s also why it’s so hard to teach a subject you’ve gained mastery of).

According to researchers like Paul Goodwin, a professor of management science at the University of Bath, the most effective way to overcome hindsight bias is continuous education (PDF)—particularly when we work hard to gain new knowledge.

Having invested effort to acquire new knowledge, you’re less likely to conclude that you “knew it all along.” In contrast, people perceived they had more prior knowledge when they received new knowledge passively and effortlessly.

Actively engaging in user testing is the most effective way of learning I know. It’s also a great way to avoid arrogance and relate to the people we are building for. Minimizing bias takes practice, honesty, and collaboration. But it’s worth it.
