Task Performance Indicator: A Management Metric for Customer Experience

It’s hard to quantify the customer experience. “Simpler and faster for users” is a tough sell when the value of our work doesn’t make sense to management. We have to prove we’re delivering real value (an increased success rate or a reduced time-on-task, for example) to get their attention. Management understands metrics that link with other organizational metrics, such as lost revenue, support calls, or repeat visits. So we need to describe our environment with metrics of our own.


For the team I work with, that meant developing a remote testing method that would measure the impact of changes on customer experience—assessing alterations to an app or website in relation to a defined set of customer “top tasks.” The resulting metric is stable, reliable, and repeatable over time. We call it the Task Performance Indicator (TPI).

For example, if a task has a TPI score of 40 (out of 100), it has major issues. If you measure again in six months’ time and nothing has been done to address those issues, the test will again produce a TPI of 40.

In traditional usability testing, it has long been established that if you test with between three and eight people, you’ll find out if significant problems exist. Unfortunately, that’s not enough to reveal precise success rates or time-on-task measurements. What we’ve discovered from hundreds of tests over many years is that reliable and stable patterns aren’t apparent until you’re testing with between 13 and 18 people. At that point the results begin to stabilize, and you’re left with a reliable baseline TPI metric.

The following chart shows why we can do this (Fig. 1).

A graph showing how TPI scores essentially leveled out as more participants were included. Fig 1: TPI scores start to level out and stabilize as more participants are tested.
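One way to look for this kind of stabilization in your own data is to track the running mean of per-participant scores as each participant is added. The sketch below is an illustration only, not the method used to produce Fig. 1: the function names, the 2-point tolerance, and the 3-participant window are all assumptions chosen for the example.

```python
from itertools import accumulate


def running_means(scores):
    """Mean of the first n scores, for n = 1 .. len(scores)."""
    return [total / n for n, total in enumerate(accumulate(scores), start=1)]


def participants_until_stable(scores, tolerance=2.0, window=3):
    """First participant count at which the running mean stayed inside a
    `tolerance`-point band across the last `window` added participants.

    Returns None if that never happens. Both defaults are hypothetical.
    """
    means = running_means(scores)
    for n in range(window + 1, len(means) + 1):
        recent = means[n - window - 1:n]  # the mean before and after each addition
        if max(recent) - min(recent) <= tolerance:
            return n
    return None


# Made-up per-participant scores, for illustration only.
example_scores = [100, 0, 100, 0, 50, 50, 50, 50]
stable_at = participants_until_stable(example_scores)
```

With these made-up scores the running mean swings widely at first and then holds at 50, so `stable_at` is 7. Real data will be noisier, and a first stable stretch does not guarantee that the mean stays put afterward.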

How TPI scores are calculated

We’ve spent years developing a single score that we believe is a true reflection of the customer experience when completing a task.

For each task, we present the user with a “task question” via live chat. Once they understand what they have to do, the user indicates that they are starting the task. At the end of the task, they must provide an answer to the question. We then ask people how confident they are in their answer.

A number of factors affect the resulting TPI score.

Time: We establish what we call the “Target Time,” which is how long it should take to complete the task under best practice conditions. The more the participant exceeds the target time, the more it affects the TPI.

Time out: The person takes longer than the maximum time allocated. We set it at 5 minutes.

Confidence: At the end of each task, people are asked how confident they are. For example, low confidence in a correct answer would have a slight negative impact on the TPI score.

Minor wrong: The person is unsure; their answer is almost correct.

Disaster: The person has high confidence, but the wrong result; acting on this wrong answer could have serious consequences.

Gives up: The person gives up on the task.

A TPI of 100 means that the user has successfully completed the task within the agreed target times.
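The actual TPI formula is not published, so the following is only a rough sketch of how factors like these could combine into a score. The 5-minute time out comes from the list above, and the 40-point cap on the time penalty comes from the author’s reply in the comments; every other number here (the outcome weights, the low-confidence penalty, and the slope of the time penalty) is a hypothetical value invented for illustration, as are all of the names.

```python
from dataclasses import dataclass
from statistics import mean

TIMEOUT_SECONDS = 300     # from the article: the time out is set at 5 minutes
MAX_TIME_PENALTY = 40.0   # from the author's comment: time penalty is capped at 40

# HYPOTHETICAL weights: the article names these outcomes but not their values.
OUTCOME_BASE = {"success": 100.0, "minor_wrong": 50.0, "disaster": 0.0, "gave_up": 0.0}
LOW_CONFIDENCE_PENALTY = 5.0  # hypothetical "slight negative impact"


@dataclass
class Attempt:
    outcome: str      # one of the OUTCOME_BASE keys
    seconds: float    # time the participant spent on the task
    confident: bool   # whether the participant was confident in the answer


def time_penalty(seconds, target_seconds):
    """Hypothetical curve: gentle up to 3x the target time, steeper after, capped."""
    ratio = seconds / target_seconds
    if ratio <= 1.0:
        return 0.0
    if ratio <= 3.0:
        return (ratio - 1.0) * 5.0
    return min(MAX_TIME_PENALTY, 10.0 + (ratio - 3.0) * 15.0)


def attempt_score(attempt, target_seconds):
    """Score one participant's attempt on a 0-100 scale."""
    if attempt.seconds > TIMEOUT_SECONDS:
        return 0.0  # timed out
    score = OUTCOME_BASE[attempt.outcome]
    if attempt.outcome == "success":
        score -= time_penalty(attempt.seconds, target_seconds)
        if not attempt.confident:
            score -= LOW_CONFIDENCE_PENALTY
    return max(score, 0.0)


def task_tpi(attempts, target_seconds):
    """Average the per-participant scores for one task."""
    return mean(attempt_score(a, target_seconds) for a in attempts)
```

Under these made-up weights, a confident success at the target time scores 100, and a success that takes seven times the target time scores 60, which matches the example the author gives in the comments.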

In the following chart, the TPI score is 61 (Fig. 2).

A pie chart illustrating sample results for Overall Task Performance, and a vertical bar showing Mean Completion Times in comparison with Mean Target Times. — Fig 2: A visual breakdown of sample results for Overall Task Performance, Mean Completion Times, and Mean Target Times.

Developing task questions

Questions are the greatest source of potential noise in TPI testing. If a question is not worded correctly, it will invalidate the results. To get an overall TPI for a particular website or app, we typically test 10-12 task questions. In choosing a question, keep in mind the following:

Based on customer top tasks. You must choose task questions that are examples of top tasks. If you measure and then seek to improve the performance of tiny tasks (low demand tasks) you may be contributing to a decline in the overall customer experience.

Repeatable. Create task questions that you can test again in 6 to 12 months.

Representative and typical. Don’t make the task questions particularly difficult. Start off with reasonably basic, typical questions.

Universal, everyone can do it. Every one of your test participants must be able to do each task. If you’re going to be testing a mixture of technical, marketing, and sales people, don’t choose a task question that only a salesperson can do.

One task, one unique answer. Limit each task question to only one actual thing you want people to do, and one unique answer.

Does not contain clues. The participant will examine the task question like Sherlock Holmes hunting for a clue. Make sure it doesn’t contain any obvious keywords that would let the participant find the answer simply by running a search.

Short (30 words or fewer). Remember, the participant is seeing each task question for the first time, so aim to keep its length under 20 words (and definitely under 30).

No change within testing period. Choose questions where the website or app is not likely to change during the testing period. Otherwise, you’re not going to be testing like with like.

Case Study: Task questions for OECD

Let’s look at some top tasks for the customers of the Organisation for Economic Co-operation and Development (OECD), an economic and policy advice organization.

Access and submit country surveys, reviews, and reports.
Compare country statistical data.
Retrieve statistics on a particular topic.
Browse a publication online for free.
Access, submit, and review working papers.

Based on that list, these task questions were developed:

What are OECD’s latest recommendations regarding Japan’s healthcare system?
In 2008, was Vietnam on the list of countries that received official development assistance?
Did more males per capita die of heart attacks in Canada than in France in 2004?
What is the latest average starting salary, in US dollars, of a primary school teacher across OECD countries?
What is the title of Box 1.2 on page 73 of OECD Employment Outlook 2009?
Find the title of the latest working paper about improvements to New Zealand’s tax system.
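The length guideline from the previous section (aim for under 20 words, and stay within 30) is easy to check mechanically. The sketch below runs that check against the six OECD questions; the function names and the exact boundary handling are assumptions made for this example.

```python
def word_count(question):
    """Count whitespace-separated words in a task question."""
    return len(question.split())


def length_check(question):
    """Classify a question against the article's length guideline.

    Under 20 words is the target; up to 30 is treated as acceptable but long.
    """
    words = word_count(question)
    if words < 20:
        return "ok"
    if words <= 30:
        return "long"
    return "too long"


oecd_questions = [
    "What are OECD's latest recommendations regarding Japan's healthcare system?",
    "In 2008, was Vietnam on the list of countries that received official development assistance?",
    "Did more males per capita die of heart attacks in Canada than in France in 2004?",
    "What is the latest average starting salary, in US dollars, of a primary school teacher across OECD countries?",
    "What is the title of Box 1.2 on page 73 of OECD Employment Outlook 2009?",
    "Find the title of the latest working paper about improvements to New Zealand's tax system.",
]

for question in oecd_questions:
    print(f"{word_count(question):2d} words  {length_check(question):8s}  {question}")
```

All six questions come in under 20 words; the longest, the teacher salary question, has 18. A check like this covers only length, so wording, clues, and answer uniqueness still need a human review.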

Running the test

To test 10-12 task questions usually takes about one hour, and you’ll need between 13 and 18 participants (we average 15). Make sure that they’re representative of your typical customers.

We’ve found that remote testing is better, faster, and cheaper than traditional lab-based measurement for TPI testing. With remote testing, people are more likely to behave in a natural way because they are in their normal environment (at home or in the office) and using their own computer. It is also much easier for someone to give you an hour of their time than to spend a morning at your lab. And since the cost is much lower than for lab-based tests, we can set them up more quickly and more often. They are also convenient to schedule using Webex, GoToMeeting, Skype, etc.

The key to a successful test is that you are confident, calm, and quiet. You’re there to facilitate the test—not to guide it or give opinions. Aim to become as invisible as possible.

Prior to beginning the test, introduce yourself and make sure the participant gives you permission to record the session. Next, ask that they share their screen. Remember to stress that you are only testing the website or app—not them. Ask them to go to an agreed start point where all the tasks will originate. (We typically choose the homepage for the site/app, or a blank tab in the browser.)

Explain that for each task, you will paste a question into the chat box found on their screen. Test the chat box to confirm that the participant can read it, and tell them that you will also read the task aloud a couple of times. Once they understand what they have to do, ask them to indicate when they start the task, and that they must give an answer once they’ve finished. After they’ve completed the task, ask the participant how confident they are in their answer.

Analyzing the results

As you observe the tests, you’re looking for patterns. In particular, look for the major reasons people give for selecting the wrong answer or exceeding the target time.

Video recordings of your customers as they try (and often fail) to complete their tasks have powerful potential. They are the raw material of empathy. When we identify a major problem area during a particular test, we compile a video of three to six participants who were affected. For each participant, we select less than a minute of video showing how the problem affected them. We then edit these snippets into a combined video (which we try to keep under three minutes) and get as many stakeholders as possible to watch it. You should seek to distribute these videos as widely and as often as possible.

How Cisco uses the Task Performance Indicator

Every six months or so, we measure several tasks for Cisco, including the following:

Task: Download the latest firmware for the RV042 router.

The top task of Cisco customers is downloading software. When we started the Task Performance Indicator for software downloads in 2010, a typical customer might take 15 steps and more than 300 seconds to download a piece of software. It was a very frustrating and annoying experience. The Cisco team implemented a continuous improvement process based on the TPI results. Every six months, the Task Performance Indicator was carried out again to see what had been improved and what still needed fixing. By 2012—for a significant percentage of software—the number of steps to download software had been reduced from 15 to 4, and the time on task had dropped from 300 seconds to 40 seconds. Customers were getting a much faster and better experience.

According to Bill Skeet, Senior Manager of Customer Experience for Cisco Digital Support, implementing the TPI has had a dramatic impact on how people think about their jobs:

We now track the score of each task and set goals for each task. We have assigned tasks and goals to product managers to make sure we have a person responsible for managing the quality of the experience … Decisions in the past were driven primarily by what customers said and not what they did. Of course, that sometimes didn’t yield great results because what users say and what they do can be quite different.

Troubleshooting and bug fixing are also top tasks for Cisco customers. Since 2012, we’ve tested the following.

Task: Ports 2 and 3 on your ASR 9001 router, running v4.3.0 software, intermittently stop functioning for no apparent reason. Find the Cisco recommended fix or workaround for this issue.

Combination of pie charts and browser screenshots, depicting progression of change to the Bug Task Success Rate from February 2012 through December 2014. — Fig 3: Bug Task Success Rate Comparisons, February 2012 through December 2014.

For a variety of reasons, it was difficult to solve the underlying problems connected with finding the right bug fix information on the Cisco website. Thus, the scores from February 2012 to February 2013 did not improve in any significant way.

For the May 2013 measurement, the team ran a pilot to show how (with the proper investment) it could be much easier to find bug fix information. As we can see in the preceding image, the success rate jumped. However, it was only a pilot, and by the next measurement it had been removed and the score dropped again. The evidence was there, though, and the team soon obtained resources to work on a permanent fix. The initial implementation was in place for the July 2014 measurement, where we see a significant improvement. After further refinements, we see a major turnaround by December 2014.

Task: Create a new guest account to access the Cisco.com website and log in with this new account.

Graph depicting Success/Failure rates from March 2014 through June 2015. Fig 4: Success/Failure rates from March 2014 through June 2015.

This task was initially measured in 2014; the results were not good.

In fact, nobody succeeded in completing the task during the March 2014 measurements, which led to three specific design improvements to the sign-up form. These involved:

Clearly labelling mandatory fields
Improving password guidance
Eliminating address mismatch errors.

A shorter pilot form was also launched as a proof of concept. The success rate jumped from zero to 50% in the July 2014 measurements, but dropped by 21 percentage points by December 2014 because the pilot form was no longer there. By June 2015, a shorter, simpler form was fully implemented, and the success rate again reached 50%.

The team was able to show that because of their work:

The three design improvements improved the success rate by 29%.
The shorter form improved the success rate by 21%.

That’s very powerful. You can isolate a piece of work and link it to a specific increase in the TPI. You can start predicting that if a company invests X it will get a Y TPI increase. This is control and the route to power and respect within your organization, or to trust and credibility with your client.
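The attribution above is simple subtraction between measurements, which the sketch below spells out. It assumes the percentages in the text are percentage points of success rate, and it derives the December 2014 figure (29%) as 50 minus the reported 21-point drop; the dictionary keys and function name are made up for the example.

```python
# Success rates (%) for the guest account task, taken from the text above.
success_rate = {
    "2014-03": 0,   # nobody succeeded
    "2014-07": 50,  # three design improvements plus the pilot short form
    "2014-12": 29,  # pilot form removed: 50 minus the reported 21-point drop
    "2015-06": 50,  # shorter form fully implemented
}


def attribute_gains(rates):
    """Split the overall gain between the two pieces of work.

    The design improvements were live in December 2014 without the short
    form, so that measurement isolates their effect. The short form accounts
    for the remaining gain once it was fully implemented.
    """
    design = rates["2014-12"] - rates["2014-03"]
    short_form = rates["2015-06"] - rates["2014-12"]
    return {"design_improvements": design, "shorter_form": short_form}


gains = attribute_gains(success_rate)
print(gains)  # {'design_improvements': 29, 'shorter_form': 21}
```

The split only works because the pilot form was removed between two measurements. Without that accidental control, the 50-point gain could not be divided between the two changes.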

If you can link it with other key performance indicators, that’s even more powerful.

The following chart shows that improvements to the registration form more than halved the support requests connected with guest account registration (Fig. 5).

Bar chart illustrating registration support request numbers for Q1 2014 (1,500), Q2 2015 (679), and Q3 2015 (689). — Fig 5: Registration Support Requests, Q1 2014, Q2 2015, and Q3 2015.

A simpler guest registration process resulted in:

A reduction in support requests, from 1,500 a quarter to fewer than 700
Three fewer people required to support customer registration
An 80% productivity improvement
Registration time down from 3:25 to 2 minutes.
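To put these results on a common scale, the reported before-and-after figures can be expressed as percentage reductions. This is a small worked calculation using only the numbers given above and in Fig. 5; the helper name is an assumption.

```python
def percent_reduction(before, after):
    """Reduction from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100


# Support requests per quarter (Fig. 5).
requests_q2_2015 = round(percent_reduction(1500, 679), 1)  # 54.7
requests_q3_2015 = round(percent_reduction(1500, 689), 1)  # 54.1

# Registration time: 3:25 is 205 seconds, 2 minutes is 120 seconds.
registration_time = round(percent_reduction(205, 120), 1)  # 41.5
```

So support requests fell by roughly 54 to 55 percent against the Q1 2014 baseline, and registration time fell by about 41 percent.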

Task: Pretend you have forgotten the password for the Cisco account and take whatever actions are required to log in.

When we first measured the forgotten password task, we found that there was a 37% failure rate.

A process of improvement was undertaken, as can be seen by the following chart, and by December 2013, we had a 100% success rate (Fig. 6).

Four pie charts illustrating the progression of improvement in success rate from November 2012 (63%), May 2013 (77%), August 2013 (88%), and December 2013 (100%). Fig 6: Progression of success rate improvement from November 2012 to December 2013.

A 100% success rate is a fantastic result. Job done, right? Wrong. In digital, the job is never done. It is always an evolving environment. You must keep measuring the top tasks because the digital environment that they exist within is constantly changing. Stuff is getting added, stuff is getting removed, and stuff just breaks (Fig. 7).

Two pie charts, one reporting a success rate of 59% for March 2014 and the other a 100% success rate for July 2014. Fig 7: Comparison of success rates, March 2014 and July 2014.

When we measured again in March 2014, the success rate had dropped to 59% because of a technical glitch. It was quickly dealt with, so the rate shot back up to 100% by July.

At every step of the way, the TPI gave us evidence about how well we were doing our job. It’s really helped us fight against some of the “bright shiny object” disease and the tendency for everyone to have an opinion on what we put on our webpages … because we have data to back it up. It gave us more insight into how content organization played a role in our work for Cisco, something that Jeanne Quinn (senior manager responsible for the Cisco Partner) told us kept things clear and simple while working with the client.

The TPI allows you to express the value of your work in ways that make sense to management. If it makes sense to management, and if you can prove you’re delivering value, then you get more resources and more respect.


16 Reader Comments

Hans Spieß says:

September 29, 2016 at 2:53 am

thanks for sharing valuable insights! one question: how do you calculate the tpi? i found a simple calculation including only success rate, optimal completion time and median completion time: success rate x ( optimal ct / median ct ), but the factors mentioned in this article need a far more advanced formula?
Gerry McGovern says:

September 29, 2016 at 7:30 am

Glad you found it useful, Mark.

Yes, Hans. It’s a good bit more complicated, a formula that we’ve been evolving over the years. The formula doesn’t only belong to me so I can’t share it directly. For example, we establish a target time for the task. If the actual time for the participant is 2X the target time, then the impact on the overall score is small. But once it goes above 3 times, the impact becomes more severe. But the maximum time penalty is 40. So, for example, if someone completes the task but it takes them a very long time, then they could get a TPI score of 60.

Hope this helps.
Harvard Kid says:

October 1, 2016 at 6:20 pm

I just found this site. It’s awesome!
Bansal says:

October 12, 2016 at 10:03 am

Good read! I would have loved to see more examples/case studies of how it was used. A great point mentioned here is about focussing on important high-priority tasks (like the 80-20 principle).
BeLove says:

October 20, 2016 at 9:57 am

Do you think using un-moderated testing that is recorded and task based (for instance, usertesting.com) where they are encouraged to “think aloud” would be able to achieve the same thing?
Gerry McGovern says:

October 24, 2016 at 12:18 pm

Hi Bansal,
You’ll get some more examples here.
http://alistapart.com/article/what-really-matters-focusing-on-top-tasks

And I’ve written a book with lots of case studies: The Stranger’s Long Neck
https://www.amazon.com/Strangers-Long-Neck-Deliver-Customers/dp/1408114429/ref=sr_1_2?s=books&ie=UTF8&qid=1477326101&sr=1-2
Gerry McGovern says:

October 24, 2016 at 12:19 pm

Hi BeLove,
Unmoderated testing has many benefits but would not be so appropriate for this type of testing. We’re trying to create a management metric here–something you can communicate with confidence to management.

We find unmoderated testing can bring quite a bit of noise into the data. People who are not really the right target audience, even though they might say they are. People who are not that committed to the testing and just run through it. And ‘professional’ test participants–those who take a lot of tests. You have to be very careful.

In unmoderated remote testing, it is very difficult to ascertain whether someone has successfully completed the task or not, and that is a major disadvantage from a design and continuous improvement point of view. In a 2015 study, Measuring Usability found that while 93% of participants said they had completed a set of tasks successfully, only 33% of these tasks were verified as being actual successes.

Also, you still need to analyze the videos / results because the most important thing you do is figure out what’s not working, and how to fix it. You will need an expert carefully watching the participants to see where they’re stumbling, where they’re having trouble.

Best

Gerry
neel patel says:

November 1, 2016 at 11:04 am

Do you think using un-moderated testing that is recorded and task based (for instance, usertesting.com) where they are encouraged to “think aloud” would be able to achieve the same thing?
Gerry McGovern says:

November 2, 2016 at 12:01 pm

Hi Neel,
If it’s very carefully managed, then unmoderated can work. However, you still need an expert to go over all the videos and identify the customer journeys and patterns, the things that need fixing and improving. The core of the Task Performance Indicator is to identify how to improve: what are the most essential things you need to do to improve.