A Primer on A/B Testing

by Lara Hogan, August 23, 2011

Published in JavaScript, Usability, User Research

Data is an invaluable tool for web designers who are making decisions about the user experience. A/B tests, or split tests, are one of the easiest ways to measure the effect of different design, content, or functionality choices. A/B tests allow you to create high-performing user experience elements that you can implement across your site.

But it’s important to make sure you reach statistically significant results and avoid red herrings. Let’s talk about how to do that.

What is an A/B test?

In an A/B test, you compare two versions of a page element for a length of time to see which performs better. Users will see one version or the other, and you’ll measure conversions from each set of users. A/B tests help designers compare content such as different headlines, call to action text, or length of body copy. Design and style choices can be tested, too; for example, you could test where to place a sign-in button or how big it should be. A/B tests can even help you measure changes in functionality, such as how and when error messages are shown.

Split testing can also help when you’re making drastic design changes that need to be tempered, such as a homepage redesign. You can pick pieces of the change and test them as you ramp up to the final design, without worrying that a massive change will alienate a user base or cause a large drop in conversions.

Results of A/B tests have lasting impact. It’s important to know which design patterns work best for your users so you can repeat “winning” A/B test results across the site. Whether you learn how users respond to the tone of your content, calls to action, or design layout, you can apply what you learn as you create new content.

Data also plays very well with decision-makers who are not designers. A/B tests can help prevent drops in conversion rate, alienation of a user base, and decreases in revenue; clients appreciate this kind of data. The conversions that you measure could be actual product purchases, clicks on a link, the rate of return visits to the site, account creations, or any other measurable action. Split testing can help your team make decisions based on fact rather than opinion.

Decide what to test

First, you need to decide which page element you would like to test. The differences between A/B versions should be distinct. A small change in color, a minor reordering of words, or negligible changes in functionality may not make good A/B tests, as they would likely not register major differences in the user experience, depending on the size of your user base. The difference between versions should influence conversion rates, and it should be something you can learn from for future designs. Great A/B tests could compare:

completely different email subject lines,
offering a package or bulk deal in one version, or
requiring sign-up for one user set and leaving it optional for the other.

Which Test Won offers great inspiration for A/B tests, and includes results as well as the testers’ assessment of why a particular version won. A/B tests should only be done on one variable at a time; if you test more than one difference between versions, it’s impossible to tell how each variable influenced conversions.

At this time, you should also figure out what metric you’ll be comparing between the two versions. A conversion rate is the most-used metric for A/B tests, but you may be interested in other data points as well. The conversion rate you measure could be the percentage of users who clicked on a button, signed up on a form, or opened an email.

Implement your test

Once you’ve decided on the differences between the A and B versions, you need to set up your A/B test to run on your site. There are many A/B testing tools that you can try, depending upon your medium (website, email), platform (static HTML, dynamic content), or comfort with releasing your site metrics to third-party tools. Which Test Won has a solid list of tools that you can use to create your own A/B tests. You can also create your own home-grown solution. You’ll want to be able to control:

the number of visitors who see each version of the test,
the difference between each version, and
how you measure the effect of each test.
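
For a home-grown solution, one way to control who sees which version is to derive the version from a stable visitor identifier. The following is a minimal sketch, not a definitive implementation: `hashString`, `assignVariant`, the `visitorId` value (for example, an ID stored in a first-party cookie), and the `shareA` parameter are all hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: a stable 32-bit string hash (FNV-1a-style loop).
function hashString(input) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV 32-bit prime
  }
  // Extra mixing step (the MurmurHash3 32-bit finalizer) so similar IDs spread out.
  hash ^= hash >>> 16;
  hash = Math.imul(hash, 0x85ebca6b);
  hash ^= hash >>> 13;
  hash = Math.imul(hash, 0xc2b2ae35);
  hash ^= hash >>> 16;
  return hash >>> 0; // unsigned 32-bit integer
}

// shareA is the fraction of visitors (0 to 1) who should see version A.
// The same visitorId and testName always produce the same version.
function assignVariant(visitorId, testName, shareA = 0.5) {
  const bucket = hashString(testName + ":" + visitorId) / 0x100000000; // 0 <= bucket < 1
  return bucket < shareA ? "A" : "B";
}

const variant = assignVariant("visitor-42", "Homepage Content Test");
console.log(variant);
```

Because the result depends only on the test name and the visitor ID, a returning visitor keeps seeing the same version for as long as the ID is stable, and changing `shareA` adjusts how much traffic each version receives.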

Tracking events with Google Analytics can be helpful if you’re using your own split testing solution. You can set custom variables using Google Analytics that help you track the users that see version A of your test against those who see version B. This may help you decipher additional data beyond your primary conversion rate. For example, did users in different countries have different results than the average user?

To set the custom variables in Google Analytics, add the following line of JavaScript to your page:

_gaq.push(['_setCustomVar',1,'testname','testversion',2]);

There’s more information on creating custom variables in Google’s documentation. The parts of the above that you want to replace are testname, which will be an identifier for the A/B test you’re running, and testversion, which will indicate whether this is version A or version B. Use names that will be intuitive for you. For example, if I were to run a home page experiment to compare short text to long text, on version A I would use:

_gaq.push(['_setCustomVar',1,'Homepage Content Test','Short',2]);

On version B I would use:

_gaq.push(['_setCustomVar',1,'Homepage Content Test','Long',2]);

Collecting this information in Google Analytics will allow you to see more data on the users that see your test than just conversion rate, such as their time on site, number of account creations, and more. To see these variables in Google Analytics once you start collecting data, go to Visitors > Custom Variables and select the test name that you chose earlier.
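
If one template serves both versions, the custom variable can be set from whichever version the visitor was given. This is a rough sketch that assumes the classic asynchronous ga.js snippet is on the page; `customVarCommand` and `recordTestVersion` are hypothetical helper names, not part of Google Analytics.

```javascript
// Assumption: in a browser, _gaq is the global ga.js command queue.
// Outside a browser this falls back to a plain array so the sketch still runs.
var _gaq = (typeof window !== "undefined" && window._gaq) || [];

// Hypothetical helper that builds the same command shown in the snippets above.
function customVarCommand(testName, testVersion) {
  var slot = 1;  // custom variable slot, as in the snippets above
  var scope = 2; // 2 = session-level scope
  return ["_setCustomVar", slot, testName, testVersion, scope];
}

// Hypothetical helper: look up the label for the assigned version and queue it.
function recordTestVersion(queue, testName, variant, labels) {
  var command = customVarCommand(testName, labels[variant]);
  queue.push(command);
  return command;
}

recordTestVersion(_gaq, "Homepage Content Test", "A", { A: "Short", B: "Long" });
```

In ga.js the custom variable travels with the next tracking call, so this would normally run before the `_trackPageview` command is pushed.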

Measure the results

After some time (typically a few weeks, depending upon the traffic to the test), check in on the results of your test and compare the conversion rate of each version. Each A/B test should reach statistical significance before you can trust its result. You can find different calculators online to see if you’ve reached a 95% confidence level in your test. Significance is calculated using the total number of users who participated in each version of the test and the number of conversions in each version; too few users or conversions and you’ll need more data before confirming the winner. Usereffect.com’s calculator can help you understand how many more users you’ll need before reaching 95% confidence. Ending a test too early can mean that your “winning” version isn’t actually the best choice, so measure carefully.
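
To make the significance check concrete, here is a minimal sketch of a two-proportion z-test, a standard textbook method; the online calculators mentioned above may use a different approach. The function name `zTest` and the visitor and conversion counts are made up for illustration.

```javascript
// Two-proportion z-test on raw counts from each version.
function zTest(visitorsA, conversionsA, visitorsB, conversionsB) {
  const rateA = conversionsA / visitorsA;
  const rateB = conversionsB / visitorsB;
  // Pooled rate: conversion rate if both versions performed identically.
  const pooled = (conversionsA + conversionsB) / (visitorsA + visitorsB);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / visitorsA + 1 / visitorsB)
  );
  const z = (rateB - rateA) / standardError;
  // |z| >= 1.96 corresponds to a two-sided 95% confidence level.
  return { rateA, rateB, z, significantAt95: Math.abs(z) >= 1.96 };
}

// Made-up counts: the same 10% vs. 13% rates at two different traffic levels.
const smallTest = zTest(100, 10, 100, 13);
const largeTest = zTest(1000, 100, 1000, 130);
console.log(smallTest.significantAt95, largeTest.significantAt95);
```

With these hypothetical numbers, the identical difference in conversion rate is not significant at 100 visitors per version but is at 1,000, which is why ending a test early can crown the wrong winner.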

The more visitors that see your test, the faster the test will go. It’s important to run A/B tests on high-traffic areas of your site so that you can more quickly reach statistical significance. As you get more practice with split testing, you’ll find that the more visitors who see the test, the easier it will be to reach a 95% confidence level.
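
The link between traffic and test length can be estimated before the test starts. The sketch below uses a common normal-approximation formula for comparing two rates; the 80% power setting, the function names, and the example rates are assumptions added here, not figures from the tests described in this article.

```javascript
// Rough number of visitors needed per version to detect a change
// from baselineRate to expectedRate.
// Assumptions: two-sided 95% confidence (z = 1.96) and 80% power (z = 0.8416).
function visitorsNeededPerVersion(baselineRate, expectedRate) {
  const zConfidence = 1.96;
  const zPower = 0.8416;
  const variance =
    baselineRate * (1 - baselineRate) + expectedRate * (1 - expectedRate);
  const difference = expectedRate - baselineRate;
  return Math.ceil(((zConfidence + zPower) ** 2 * variance) / difference ** 2);
}

// Hypothetical helper: days needed when daily traffic is split across two versions.
function daysToFinish(visitorsPerVersion, dailyVisitors) {
  return Math.ceil((visitorsPerVersion * 2) / dailyVisitors);
}

const needed = visitorsNeededPerVersion(0.10, 0.13);
console.log(needed, daysToFinish(needed, 500), daysToFinish(needed, 100));
```

Smaller expected differences push the required sample up quickly, so a subtle change on a low-traffic page may take far longer than a few weeks to resolve.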

A/B test examples

Say I’m a developer for an e-commerce site. As A/B tests are perfect for testing one page element at a time, I created an A/B test to solve a disagreement over whether we wanted to bold a part of a product name in a user’s account. We had a long list of products in the user interface to help users manage their product renewals, and we weren’t sure how easy it was for users to scan. In Version A, the list items appeared with a bolded domain name:

service name, yourdomainname.com (with yourdomainname.com in bold)

While Version B looked like this:

service name, yourdomainname.com

After reaching enough conversions to reach a 95% confidence level, here were the results:

	E-commerce Conversion Rate	Per Visit Value
Version A	26.87%	$11.28
Version B	23.26%	$10.62

Version A was our clear winner, and it helped us to understand that users likely scanned for their domain name in a list of products.

User interaction is another metric to check as you’re creating A/B tests. We compared levels of aggression in content tone in one test, and watched to see how visitor patterns changed.

Version A’s text:

Don’t miss out on becoming a VIP user. Sign up now.

Version B’s text:

Don’t be an idiot; become a VIP!

Bounce rates can be a good A/B test metric to watch for landing pages. As we watched the numbers, the versions’ bounce rates were significantly different:

	Bounce Rate
Version A	0.05%
Version B	0.13%

We naturally wanted to be cautious about too-aggressive text, and the bounce rate indicated that the more aggressive version could be alienating users. Occasionally, you may want to dig more deeply into this data once you’ve reached statistical significance, especially if you have a diverse user base. In another content test, I separated the bounce rate data by country using Google Analytics.

	Version A Bounce Rate	Version B Bounce Rate
United States	13.20%	16.50%
Non-US	15.64%	16.01%

Version B had a more consistent bounce rate between the two user groups, and we realized we needed to do more tests to see why version A was performing so differently for US and non-US visitors.
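
If you export raw visit data rather than reading the segments from a report, the same breakdown can be computed by grouping visits by version and country. This is an illustrative sketch only: the record shape (`version`, `country`, `bounced`) and the sample visits are hypothetical, not an export format from Google Analytics.

```javascript
// Group visits by version and by US / Non-US, then compute each group's bounce rate.
function bounceRatesBySegment(visits) {
  const totals = {};
  for (const visit of visits) {
    const segment = visit.country === "US" ? "United States" : "Non-US";
    const key = visit.version + " / " + segment;
    totals[key] = totals[key] || { visits: 0, bounces: 0 };
    totals[key].visits += 1;
    if (visit.bounced) totals[key].bounces += 1;
  }
  const rates = {};
  for (const key of Object.keys(totals)) {
    rates[key] = totals[key].bounces / totals[key].visits;
  }
  return rates;
}

// Made-up visits for illustration.
const sampleVisits = [
  { version: "A", country: "US", bounced: true },
  { version: "A", country: "US", bounced: false },
  { version: "A", country: "DE", bounced: false },
  { version: "B", country: "US", bounced: false },
  { version: "B", country: "FR", bounced: true },
  { version: "B", country: "FR", bounced: true },
];
const segmentRates = bounceRatesBySegment(sampleVisits);
console.log(segmentRates);
```

Each segment contains only a share of the total traffic, so a segment-level difference needs its own significance check before you act on it.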

In addition to design and content tests, you can also run experiments on functionality. We had a button that simply added a product to the user’s cart. In both versions of our A/B test, we used the same button language and style. The only difference between the two versions was that version A’s button added the product to the cart with the one-year price. Version B added it to the cart with the two-year price.

Our goal was to compare the e-commerce conversion rate and average order value between the two versions. We weren’t sure if users who got version B would reduce the number of years in the cart down to one year, or if seeing a higher price in the cart would turn them off and prompt them to abandon the cart. We hoped that we’d earn more revenue with version B, but we needed to test it. After we reached the number of conversions necessary to make the test statistically significant, we found the following:

	Average Order Value	E-commerce Conversion Rate
Version A	$17.13	8.33%
Version B	$18.61	9.60%

Version B—the button that added the two-year version of the product to the cart—was the clear winner. We’re able to use this information to create other add-to-cart buttons across the site as well.
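
The two metrics in the table can be combined into a single revenue-per-visit figure, which makes the size of the win easier to see. This is a back-of-the-envelope sketch that assumes the conversion rate is measured per visit; `revenuePerVisit` is a hypothetical helper name.

```javascript
// Figures from the table above.
const versions = {
  A: { averageOrderValue: 17.13, conversionRate: 0.0833 },
  B: { averageOrderValue: 18.61, conversionRate: 0.0960 },
};

// Revenue per visit = average order value x conversion rate.
function revenuePerVisit(version) {
  return version.averageOrderValue * version.conversionRate;
}

const perVisitA = revenuePerVisit(versions.A); // about $1.43
const perVisitB = revenuePerVisit(versions.B); // about $1.79
const lift = perVisitB / perVisitA - 1;        // about 0.25, roughly 25% more per visit
console.log(perVisitA.toFixed(2), perVisitB.toFixed(2), (lift * 100).toFixed(0) + "%");
```

Under that assumption, version B improved on both order value and conversion rate, so the gains compound rather than trade off against each other.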

Red herrings

Sometimes, your A/B test data will be inconclusive. We recently ran a test on our homepage to determine which content performed better; I was sure that one version would be an absolute winner. However, both versions yielded the same e-commerce conversion rate, pages per visit, and average order value. After running the test for weeks, we realized that we would likely never get significant data to make a change, so we ended the test and moved on to the next one. After a neutral result, you could choose either version to use on your site, but there will be no statistically significant data that indicates one version is “better” than the other.

Remember to not get caught up with your A/B tests; sometimes they just won’t show a difference. Give your tests enough time to make sure you’ve given it your best shot (depending upon the number of visitors who see a page, I like to let tests run for at least three weeks before checking the data). If you think the test may not be successful, end it and try something else.

Keep a running list of the different things you want to test; it’ll help you keep learning new things, and it also serves as an easy way to solve disagreements over design decisions. “I’ll add it to the A/B test list” comes in handy when appeasing decision-makers.

13 Reader Comments

murraytodd says:

August 23, 2011 at 10:35 am

Although A/B testing is arguably simpler and easier to grasp than multivariate testing (MVT), I’m curious why you would advocate such a home-grown solution when Google Optimizer is better equipped, will handle the statistical analysis for you, automatically hooks into Google Analytics for your outcome tracking, and would allow you to graduate from simple A/B testing to simultaneously handling multiple hypotheses with almost negligible sample requirement cost.

(The advantage of MVT is that if you can think of 4 or 5 test ideas, the chances that one of them will be significant are much higher, whereas a single A/B test may be disappointing and fail to “move the needle.”)

Lara Hogan says:

August 23, 2011 at 4:01 pm

@murraytodd – excellent question. I tested Google Optimizer about eight months ago, and found that the amount of page load time that it added for the test counteracted the pros of the tool for me. It may be a great tool to help people get started, but if you use it, be sure to check to see how much page load time it’s adding.

MVT is great for developers who are comfortable with A/B testing. It can add more unknowns, though, depending on the test. If you’re comfortable finding statistical significance in an MVT, that’s great! If someone is just getting started with MVT, one really stellar tool for finding winners of a multivariate test is the Ad Comparator (http://adcomparator.com/). You can use it for any type of conversion, not just ads.

Reedge says:

August 26, 2011 at 10:57 pm

I see you used the Taguchi method to test this. Did you hard-code all these changes?

I think the best part of this article is that you are clear about not waiting forever on results and killing the test instead. At the same time, realise that the best results come from trying some aggressive changes, like your text change.

Two excellent lessons for anyone trying to understand A/B testing.

Lara Hogan says:

August 27, 2011 at 10:35 am

@Reedge yes, I coded these changes by hand instead of using a third-party tool to serve the versions. Hand-coding the tests saved quite a bit of page load time for me, but it may be easier for others to get started with A/B testing using a third-party tool.

Reedge says:

August 27, 2011 at 1:35 pm

Hi Lara,

Indeed, if you know how, it’s always good to save a couple of ms on load time. Nice link to the Taguchi method; I’ll bookmark it. It’s interesting to add that to Reedge as an alternative to what we’ve got now.

Regards,

Dennis

Will Martin says:

August 29, 2011 at 1:22 pm

I’ve been thinking of trying A/B testing, but struggle with defining goals to test and ways to measure success.

Goals are the first problem. The examples given in the article revolve around persuading the user to take some concrete action: click this button, sign up for that newsletter, buy that thingy.

I work in an academic library. At base, we want our users to find reputable academic information that supports their research. The problem is that it’s very difficult to reduce that to simple, concrete actions. Although there are definite actions that go into the research process, deciding which of them to try (and in what order) is heavily contingent on the topic and purpose of the research. The approach that works well for an undergraduate writing a five page paper will be inadequate for a graduate student assembling a hundred-page annotated bibliography. But in most cases, the site traffic is anonymous – we have no way to distinguish freshmen from faculty, which makes it hard to come up with test goals that make sense.

The other problem is measuring success. Ideally, we could tell whether users found useful information based on whether they make use of it. Did they check out that book? Did they cite that article? But we mostly have no way of tracing the user in such detail. If Susy Q. Student looks up books in the catalog and then borrows one, we have no way of connecting the search she did with the checkout. We can reproduce her search easily enough, but without knowing the research question that brought her to the site, it’s hard to assess whether the results met her needs or not.

How would you go about designing an A/B test in support of a more abstract goal like this? Or would you use some technique other than A/B testing for approaching this problem? I’d be interested to hear any comments.

Lara Hogan says:

August 29, 2011 at 3:10 pm

@Will – really thoughtful question. It sounds like the first thing you may need to do is set up a better analytics solution. Are you able to figure out the total books checked out in a certain time period, number of users on your system, and number of users who have books checked out? If you’re currently not able to track basic user workflows and these metrics, then it’ll be really difficult to do any A/B testing, since you won’t have a baseline or a way to measure success.

Once you do have a better data collection solution, then you can start looking at the problems you’re looking to solve. Are there students that do tons of searches but never check anything out? You could test different tweaks to the search results pages to see what helps people find what they’re looking for. Or, do lots of students log on but never visit helpful parts of your system outside of search? You can A/B test ways to better highlight the different useful tools you offer.

Note that most basic solutions, like Google Analytics, will tell you what brought your users to your site (search, referring link, search terms, etc.). You can also set cookies or another method of tracking your users between search and checkout. The key here is that you need more data – I hope that helps!

Will Martin says:

August 31, 2011 at 4:40 am

Tracking user workflows is fiendishly difficult. Like most academic libraries, most of our site is actually just a connection point with third party services. The following are run by third parties:

* Our catalog (which we share with fifty or so other libraries)
* Our databases of articles (about 300 of these, from a few dozen vendors)
* The link resolver (which checks whether a given article is available in the databases)

In all of these cases, we have little or no control over the UI that is presented to the user. Once the user has initiated a search in our holdings, they are to all intents and purposes no longer on “our” site even though it’s our data they’re searching. And one user on a moderately intense research session could very easily hit the catalog, three different article databases, and the link resolver, resulting in usage data which is split across multiple silos.

The catalog is particularly vexing, because even if we did manage to get analytics out of it, our traffic would be all mixed up with the traffic from every other library in the consortium. Most of these third party vendors (Ebsco, ProQuest, Elsevier to name a few big ones) can provide usage statistics; but these are mostly pre-made reports rather than raw data, they all report slightly different things, and it’s hard to tell whether the stats from vendor A are comparable with those from vendor B.

We’ve had Google Analytics installed and running for years. Some of the data it provides is very useful. But that data has distinct limits. 68% of our visitors hit the home page and immediately depart for a third party site. I’ve put in some code to track *where* they go, but I cannot track what they do or where they go on a third-party site.

The more I think about it, the more I think that I really need to do some traditional usability testing. Under those kinds of controlled circumstances I can at least sit at their elbow and watch what they do.

Maybe I could use A/B testing for some more fine-grained stuff which has to do with our own site, for example labeling choices. Hmm. Have to put some more thought into that.

micahn says:

September 1, 2011 at 11:57 am

I’ve noticed a difference in the way Hubspot and Google Optimizer run A/B tests, which leads me to a question about A/B testing in general. It doesn’t look like Hubspot plants a cookie in the user’s browser, and so over the course of many visits the user will see both A and B served up randomly. On the other hand, Google Optimizer plants a cookie, and either A *or* B is served up persistently. In other words, if a user sees A once, it’s A for the length of the experiment, no matter how many times he/she visits the page(s) where the test is.

My hunch is that Google Optimizer does it the better way. Users should be given one variable (one chance to vote with their click over time) and that’s it. Are both approaches valid? Is one preferred over another?

Lara Hogan says:

September 1, 2011 at 2:07 pm

I prefer persistent A/B versions (the way Google Optimizer runs). Both ways are valid, but when you look at the results of your test, be sure to note which way (persistent or not) the test was run.

For example, if I’m testing an account setting, I want it to look the same for a user each time they log in. This will help me measure the effects of the setting and its success. If it changes each time, it may confuse the user, which will add a new variable to your test results (and may invalidate the results, depending upon how you’re measuring success of each version).

Hubspot’s way could still be helpful in some cases, but I wouldn’t use it for any A/B test that examines user workflows or other actions that may be repeated by the same user. Hubspot’s way could work for things like sidebar ad text or other content the user may see once – but it really depends on the test.

digitalark says:

September 2, 2011 at 9:17 am

Hi Lara

This is a new area to me. I have used Adwords for a while, but for some reason the concept of testing different scenarios in a systematic way never really clicked with me until the last couple of weeks.

Now I am on a mission to learn as quickly as possible.

Thanks for a great article.

Simon