A/B testing – for all the content out there about testing, a huge number of people still mess it up. From testing the wrong things to running the tests incorrectly, there are lots of ways to get it wrong.
Here’s what we’ll cover in this guide:
- What is A/B testing and How Does It Work?
- What to Test to Improve Our Chances of Winning?
- How to Prioritize Test Hypotheses?
- How Long to A/B Test?
- How to Set up A/B Tests?
- How to Analyze A/B Test Results?
- How to Archive Past Tests?
- What You Need to Know About A/B Testing Statistics
- A/B Testing Tools and Resources
Note: this post is almost 6000 words, so you can download it in e-book format. Another benefit of doing this is you’ll get a few follow up emails with other helpful A/B testing content…
What is A/B testing and How Does It Work?
An A/B/n test is a controlled online experiment that splits your traffic evenly between a control and a variation (or multiple variations).
That’s it. For example, if you ran a simple A/B test, it would be a 50/50 traffic split between the original page and a variation:
A/B split testing is a new term for an old technique – controlled experimentation. When researchers are testing the efficacy of new drugs, they use a ‘split test.’ In fact, most research experiments could be considered a ‘split test,’ complete with a hypothesis, a control and variation, and a statistically calculated result.
The main difference, however, lies in the variability of internet traffic. In a lab, it’s easier to control for external variables. Online, you can mitigate them, but it’s truly difficult to operate a purely controlled test.
In addition, testing new drugs requires near-certain accuracy. Lives are on the line. In technical terms, your period of ‘exploration’ can be much longer, as you want to be damned sure during your period of ‘exploitation’ that you didn’t commit a Type I error (false positive).
A/B split testing online is primarily a business decision. It’s a weighing of risk vs reward, exploration vs exploitation, science vs business. Therefore, we view results with a different lens and make decisions slightly differently than tests in a pure lab setting.
You can, of course, create more than two variations. This is broadly known as an A/B/n test: if you have the traffic that allows it, you can test as many variations as you’d like. Here’s an example of an A/B/C/D test, and how much traffic each variation is allocated:
A/B/n tests are great for implementing more variations of the same hypothesis, but of course they require more traffic because they have to split it between more pages and be statistically valid.
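To make the even traffic split concrete, here is a minimal Python sketch of one common way to bucket visitors: hash a visitor ID together with an experiment name and take the result modulo the number of variations. The function name, the visitor IDs, and the experiment name are all hypothetical, and real testing tools may assign traffic differently.

```python
import hashlib
from collections import Counter


def assign_variation(user_id: str, experiment: str, variations: list[str]) -> str:
    """Deterministically map a visitor to one variation, each equally likely."""
    # Hashing the experiment name with the visitor ID means the same visitor
    # always lands in the same variation for this experiment.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variations)
    return variations[bucket]


# Simulate 20,000 hypothetical visitors in an A/B/C/D test.
counts = Counter(
    assign_variation(f"visitor-{i}", "homepage-headline", ["A", "B", "C", "D"])
    for i in range(20_000)
)
print(counts)
```

Each variation ends up with roughly a quarter of the visitors, which is also why every extra variation stretches the time needed to reach a given sample size per variation.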
A/B tests, while most popular to talk about, are just one type of online experiment – you can also run Multivariate and Bandit tests.
A/B Testing, Multivariate, and Bandit Algorithms: What’s the Difference?
A/B/n tests are controlled experiments run on 1 or more variations + the original page that directly compare conversion rate means based on the changes made between variations.
While it sounds similar, multivariate tests are controlled experiments that test multiple versions of a page and attempt to isolate which attributes cause the largest impact. In other words, multivariate tests are like A/B/n tests in that they test an original against variations, but each variation contains different design elements. For example:
Each one has a different and specific impact and use case and can help you get the most out of your site. Here’s how:
- Use A/B testing to determine best layouts
- Use MVT to polish the layouts to make sure all the elements interact with each other in the best possible way.
You need to get a ton of traffic to the page you’re testing before even considering MVT. But if you have enough traffic, you should use both types of tests to maximize the output of your optimization program.
Most agencies place a priority on A/B testing because you’re usually testing more significant changes (bigger impacts possible), and because they’re simpler to run. Peep once said, “most top agencies that I’ve talked to about this run ~10 A/B tests for every 1 MVT.”
As for bandit algorithms, you can almost think of them as A/B/n tests that update in real time based on the performance of each variation.
In essence, a bandit algorithm starts by sending traffic to two (or more) pages: the original and the variation(s). Then, in an attempt to ‘pull the winning slot machine arm most often,’ the algorithm updates based on whether or not a variation is ‘winning.’ Eventually, the algorithm fully exploits the best option:
One of the big benefits of bandit testing is that bandits mitigate ‘regret,’ which is basically the lost conversion you experience while exploring a potentially worse variation in a test. This chart from Google explains that very well:
By the way, try not to think of bandits and A/B/n tests as a ‘this or that’ scenario; they’re tools that each have their purposes. In general, bandits are great for:
- Headlines and Short-Term Campaigns
- Automation for Scale
- Blending Optimization with Attribution
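For a feel of how a bandit shifts traffic, here is a small sketch of epsilon-greedy, one of the simplest bandit strategies. It is an illustration only: the conversion rates are made up, the function name is hypothetical, and production tools often use other strategies (Thompson sampling, for example). Note that epsilon-greedy keeps exploring a fixed share of the time rather than ever fully committing.

```python
import random


def epsilon_greedy(true_rates, visitors=20_000, epsilon=0.1, seed=42):
    """Simulate an epsilon-greedy bandit over pages with the given true rates."""
    rng = random.Random(seed)
    shows = [0] * len(true_rates)
    conversions = [0] * len(true_rates)
    for _ in range(visitors):
        if rng.random() < epsilon or 0 in shows:
            # Explore: pick a page at random (also used until every page
            # has been shown at least once).
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: pick the page with the best observed conversion rate.
            arm = max(range(len(true_rates)), key=lambda i: conversions[i] / shows[i])
        shows[arm] += 1
        if rng.random() < true_rates[arm]:
            conversions[arm] += 1
    return shows, conversions


# Hypothetical original (3%) vs variation (5%).
shows, conversions = epsilon_greedy([0.03, 0.05])
print("visitors per page:", shows)
print("conversions per page:", conversions)
```

The share of traffic sent to each page is no longer fixed up front, which is the practical difference from an A/B/n test.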
No matter what type of test you run, it’s important to have a process that improves your chances of success. This means running more tests, winning more tests, and making bigger lifts. How do we do that? How do we know what to test?
What to Test to Improve Our Chances of Winning?
Don’t listen to any blog posts that tell you “99 Things You Can A/B Test Right Now.” That’s a waste of time and traffic. Being a bit more process-minded will make you more money.
In a survey done by Econsultancy and RedEye, 74% of the survey respondents who reported having a structured approach to conversion also stated they had improved their sales. Those that don’t have a structured approach stay in what Craig Sullivan calls the “Trough of Disillusionment” (unless their results are littered with false positives, which we’ll get into later).
To simplify a winning process, the structure goes something like this:
- Analyze, Learn, Repeat
Research: Getting Data-Driven Insights
To begin optimization, you need to know what your users are doing and why – so start with research.
Before you think about optimization and testing, however, solidify your high-level strategy and move down from there to the granular. So think in this order:
- Define your business objectives
- Define your website goals
- Define your Key Performance Indicators
- Define your target metrics
Once you know where you want to go, you can collect the data necessary to get there. To do this, we recommend the ResearchXL Framework.
Here’s the executive summary of the process we use at ConversionXL:
- Heuristic Analysis
- Technical Analysis
- Web Analytics Analysis
- Mouse Tracking Analysis
- Qualitative Surveys
- User Testing
Heuristic analysis is about as close as we get to ‘best practices.’ However, after years of experience, you still can’t tell what exactly will work, but you can more easily point out opportunity areas. As Craig Sullivan put it:
“My experience in observing and fixing things — these patterns do make me a better diagnostician but they don’t function as truths — they guide and inform my work but they don’t provide guarantees.”
So humility is crucial, but it also helps to have a framework. When doing heuristic analysis, we assess each page based on the following:
Technical analysis is low-hanging fruit, one that you can make a lot of money on (think 12 month perspective). So start by:
- Conducting cross-browser and cross-device testing
- Doing speed analysis
Web analytics analysis is next. First things first – make sure everything is working. You’d be surprised how many analytics setups are broken.
Google Analytics (and other analytics setups) are a course in themselves, so I’ll leave you with some helpful links to read:
- Google Analytics 101: How To Configure Google Analytics To Get Actionable Data
- Google Analytics 102: How To Set Up Goals, Segments & Events in Google Analytics
Next is mouse tracking analysis, which includes heat maps, scroll maps, click maps, form analytics, and user session replays. One point of advice here is to not get carried away with pretty visualizations of click maps, etc. Make sure you’re informing your larger goals with the analytics in this step.
Qualitative research is an important part of measurement as well, because it tells you the why that quantitative analysis misses. Many people think that qualitative analysis is “softer” or easier than quantitative, but it should be just as rigorous and can provide insights just as important as your GA data.
For qualitative research, use things like:
Finally there’s user testing. The premise is simple: observe actual people use and interact with your website while they’re commenting their thought process out loud. Pay attention to what they say and experience.
After the heavy-ass conversion research, you’ll have lots of data and need to do some prioritization.
How to Prioritize Test Hypotheses?
There are many frameworks to prioritize your A/B tests, and you could even innovate with your own formula. Here’s a way to prioritize and stream work shared by Craig Sullivan. Once you go through all 6 steps, you will find issues – some of them severe, some minor. You’ll want to allocate every finding into one of these 5 buckets:
- Test. (This bucket is where you place stuff for testing.)
- Instrument. (This can involve fixing, adding or improving tag or event handling on the analytics configuration.)
- Hypothesize. (This is where you’ve found a page, widget or process that’s just not working well but we don’t see a clear single solution.)
- Just Do It – JFDI. (Here’s the bucket for no-brainers. Just do it.)
- Investigate. (If an item is in this bucket, you need to ask questions or do further digging.)
Then we rank them from 1 to 5 stars (1 = minor issue, 5 = critically important). There are 2 criteria that are more important than others when giving a score:
- Ease of implementation (time/complexity/risk). Sometimes the data tells you to build a feature, but it takes months to do it. So it’s not something you’d start with.
- Opportunity score (subjective opinion on how big of a lift you might get).
Then create a spreadsheet with all of your data and you’ll have a prioritized testing roadmap, more rigorous than most of your competitors will have.
We also created our own prioritization model that attempts to weed out as much subjectivity as possible. It’s predicated on the necessity of bringing data to the table. It’s called PXL and looks like this:
Grab your own copy of this spreadsheet template here. Just click File > Make a Copy to have your own customizable spreadsheet.
- Instead of guessing what the impact might be, this framework asks you a set of questions about it.
- Is the change above the fold? → Changes above the fold are noticed by more people, thus increasing the likelihood of the test having an impact
- Is the change noticeable in under 5 seconds? → Show a group of people control and then variation(s), can they tell the difference after seeing it for 5 seconds? If not, it’s likely to have less impact
- Does it add or remove anything? → Bigger changes like removing distractions or adding key information tend to have more impact
- Does the test run on high traffic pages? → Relative improvement on a high traffic page results in more absolute dollars.
We’ve seen the power of solid conversion research, so many of the variables specifically require you to bring data to the table to prioritize your hypotheses.
- Is it addressing an issue discovered via user testing?
- Is it addressing an issue discovered via qualitative feedback (surveys, polls, interviews)?
- Is the hypothesis supported by mouse tracking heat maps or eye tracking?
- Is it addressing insights found via digital analytics?
Having weekly discussions on tests with these 4 questions asked of everyone will quickly make people stop relying on just opinions.
Then we also put bounds on Ease of implementation by bracketing answers according to the estimated time. Ideally you’d have a test developer be part of prioritization discussions.
We made this under the assumption of a binary scale – you have to choose one or the other. So for most variables (unless otherwise noted), you choose either a 0 or a 1.
But we also wanted to weight certain variables because of their importance – how noticeable the change is, if something is added/removed, ease of implementation. So on these variables, we specifically say how things change. For instance, on the Noticeability of the Change variable, you either mark it a 2 or a 0.
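The scoring logic described above can be sketched in a few lines of Python. Treat this as an illustration rather than the actual PXL spreadsheet: the variable names, the test ideas, and most weights are assumptions. Only the "2 or 0" weight on noticeability comes from the text, and the ease score is passed in as a plain integer standing in for the time brackets.

```python
# Hypothetical weights. Only the 2-or-0 noticeability weight is from the text;
# the rest are assumptions for illustration.
PXL_WEIGHTS = {
    "above_the_fold": 1,
    "noticeable_in_5_seconds": 2,
    "adds_or_removes_element": 2,
    "high_traffic_page": 1,
    "found_in_user_testing": 1,
    "found_in_qualitative_feedback": 1,
    "supported_by_mouse_tracking": 1,
    "found_in_digital_analytics": 1,
}


def pxl_score(answers: dict[str, bool], ease: int = 0) -> int:
    """Sum the weights of every 'yes' answer, then add the ease-of-implementation score."""
    score = sum(weight for key, weight in PXL_WEIGHTS.items() if answers.get(key, False))
    return score + ease


# Made-up test ideas: (yes/no answers, ease score).
ideas = {
    "Rewrite hero headline": (
        {"above_the_fold": True, "noticeable_in_5_seconds": True,
         "high_traffic_page": True, "found_in_user_testing": True},
        3,
    ),
    "Add trust badges to checkout": (
        {"noticeable_in_5_seconds": True, "adds_or_removes_element": True,
         "found_in_qualitative_feedback": True},
        2,
    ),
    "Change footer link color": ({}, 3),
}

ranked = sorted(ideas, key=lambda name: pxl_score(*ideas[name]), reverse=True)
for name in ranked:
    print(pxl_score(*ideas[name]), name)
```

Adding a business-specific variable is just one more entry in the weights dictionary.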
We built this model with the belief that you can and should customize the variable based on what matters to your business.
For example, maybe you’re operating in tandem with a branding or user experience team, and it’s very important that the hypothesis conforms to brand guidelines. Add it as a variable.
Maybe you’re at a startup whose acquisition engine is fueled primarily by SEO. Maybe your funding depends on that stream of customers. So you could add a category like, “doesn’t interfere with SEO,” which might alter some headline or copy tests.
Point is, all organizations operate under different assumptions, but by customizing the template, you can account for them, and optimize your optimization program.
Whatever framework you use, try to make it systematic and understandable to anyone on the team as well as stakeholders involved.
How Long to A/B Test?
First rule: don’t stop a test just because it reaches statistical significance. This is probably the most common error committed by beginning optimizers with good intentions.
If you’re calling your tests when you hit significance, you’ll find that most of your lifts don’t translate to increased revenue (that’s the goal, after all). You’ll find that the lifts were in fact imaginary.
Consider this: When one thousand A/A tests (two identical pages tested against each other) were run:
- 771 experiments out of 1,000 reached 90% significance at some point
- 531 experiments out of 1,000 reached 95% significance at some point
Stopping tests at significance breeds the risk of false positives and excludes possible external validity threats like seasonality.
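You can reproduce the spirit of that A/A result with a small simulation. The sketch below is an assumption-laden illustration, not the original study: the 5% conversion rate, the number of experiments, and the peeking schedule are all made up, and it uses a plain two-proportion z-test. It counts how many A/A experiments look "significant" at any peek versus only at the planned end.

```python
import random
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(experiments=200, visitors_per_arm=2_000, peek_every=100,
                      rate=0.05, alpha=0.05, seed=1):
    """Return (share significant at ANY peek, share significant at the final look)."""
    rng = random.Random(seed)
    ever_significant = 0
    significant_at_end = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        peeked_significant = False
        p_value = 1.0
        for n in range(1, visitors_per_arm + 1):
            # Both arms share the same true rate: any "winner" is a false positive.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0:
                p_value = z_test_p_value(conv_a, n, conv_b, n)
                peeked_significant = peeked_significant or p_value < alpha
        ever_significant += peeked_significant
        # visitors_per_arm is a multiple of peek_every here, so the last
        # p_value computed is the one at the planned end of the test.
        significant_at_end += p_value < alpha
    return ever_significant / experiments, significant_at_end / experiments


peek_rate, final_rate = simulate_aa_tests()
print(f"significant at some peek: {peek_rate:.0%}, at the planned end: {final_rate:.0%}")
```

Every experiment that is significant at the end is also counted among the peeks, so the first number can only be equal or higher, and the more often you check, the more chances noise gets to cross the line.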
Instead, you’ll want to predetermine a sample size and run the test for full weeks, usually for at least two business cycles.
How do you predetermine sample size? There are lots of great tools out there for that, including tools within your favorite testing tool. Here’s how you’d calculate your sample size with Evan Miller’s tool:
In this case we told the tool that we have a 3% conversion rate, and want to detect at least 10% uplift. The tool tells us that we need 51,486 visitors per variation before we can look at the statistical significance levels and statistical power.
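If you’d rather see the arithmetic than trust a black box, here is a sketch of one common normal-approximation formula for the sample size of a two-proportion test, with a two-sided 5% significance level and 80% power. Whether any particular calculator uses exactly this variant is an assumption on my part, and the function name is hypothetical, but for the inputs above it lands in the same ballpark as the figure quoted.

```python
from math import sqrt
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)  # a 10% relative lift on 3% is 3.3%
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p1 * (1 - p1))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return numerator / (p2 - p1) ** 2


n = sample_size_per_variation(0.03, 0.10)
print(f"about {n:,.0f} visitors per variation")
```

Notice how the required sample scales with the inverse square of the absolute difference: halve the lift you want to detect and you need roughly four times the traffic.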
Oh, and you’ll notice in addition to significance level, there’s something called ‘statistical power’ in the photo above as well.
Statistical power is another important factor in running your A/B test, as it guards against Type II errors (false negatives). In other words, it is the probability that you detect an effect if there actually is one.
For practical purposes, know that 80% power is the standard for testing tools. To reach such a level, you need either a large sample size, a large effect size, or a longer duration test.
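To see how power depends on sample size, here is a rough sketch using the same kind of normal approximation for a two-proportion test. The function name and the sample sizes in the loop are hypothetical, and the approximation ignores the tiny chance of a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def power_for_sample_size(baseline, relative_lift, n_per_variation, alpha=0.05):
    """Approximate probability of detecting the lift with a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # How far the expected difference sits from zero, scaled by sample size.
    shift = abs(p2 - p1) * sqrt(n_per_variation)
    z = (shift - z_alpha * sqrt(2 * p1 * (1 - p1))) / sqrt(
        p1 * (1 - p1) + p2 * (1 - p2)
    )
    return NormalDist().cdf(z)


# 3% baseline, 10% relative lift, at three hypothetical sample sizes.
for n in (5_000, 20_000, 51_486):
    print(n, round(power_for_sample_size(0.03, 0.10, n), 2))
```

With a small sample, a real 10% lift would go undetected most of the time, which is exactly the false negative that power is meant to guard against.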
There Are No Magic Numbers
You’ll read a lot of blog posts that have magic numbers like “100 conversions” or “1,000 visitors” as their stopping points. Math is not magic, math is math, and what we’re dealing with is slightly more complex than simplistic heuristics like that. Andrew Anderson from Malwarebytes put it well:
“It is never about how many conversions, it is about having enough data to validate based on representative samples and representative behavior.
100 conversions is possible in only the most remote cases and with an incredibly high delta in behavior, but only if other requirements like behavior over time, consistency, and normal distribution take place. Even then it has a really high chance of a type I error, false positive.”
What we’re worried about is the representativeness of our sample. How can we ensure that in basic terms? Your test should run for 2 business cycles, so it includes everything external that’s going on:
- every day of the week (and tested one week at a time as your daily traffic can vary a lot)
- various different traffic sources (unless you want to personalize the experience for a dedicated source)
- your blog post and newsletter publishing schedule
- people who visited your site, thought about it, and then came back 10 days later to buy it
- any external event that might affect purchasing (e.g. pay day)
Another (very important) note: be careful with low sample size. The internet is full of case studies steeped in shitty math, and most of it (if they even release full numbers) is because they judged a test on like 100 visitors per variation and 12 vs 22 conversions.
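To see why a test like that tells you so little, here is a quick sketch that takes the 100-visitors-per-variation, 12 vs 22 conversions example and computes a simple 95% interval for each conversion rate, plus a two-sided p-value from a pooled z-test. The helper names are hypothetical, and the normal approximation is itself shaky at samples this small, which is part of the point.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(conversions, visitors, confidence=0.95):
    """Simple normal-approximation interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


control = wald_interval(12, 100)
variation = wald_interval(22, 100)
p_value = two_sided_p_value(12, 100, 22, 100)

print(f"control:   {control[0]:.1%} to {control[1]:.1%}")
print(f"variation: {variation[0]:.1%} to {variation[1]:.1%}")
print(f"p-value:   {p_value:.3f}")
```

An "83% lift" sounds great in a case study, but the two intervals overlap and the difference doesn’t even clear the usual .05 threshold.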
If you’ve set everything up correctly so far, then you’ll just want to avoid peeking (or letting your boss peek) at test results multiple times before the test is finished. This can result in calling a result early due to ‘spotting a trend’ (impossible). What you’ll find is that many test results regress to the mean.
Regression to the Mean
Often, you’ll see results vary wildly in the first few days of the test. Sure enough, they tend to converge as the test continues for the next few weeks. Here’s an example Peep gave in an older blog post of an eCommerce client:
Here’s what we’re looking at:
- First couple of days, blue (variation #3) is winning big – like $16 per visitor vs $12.5 for Control. Lots of people would end the test here. (Fail).
- After 7 days: blue still winning – and the relative difference is big.
- After 14 days: orange (#4) is winning!
- After 21 days: orange still winning!
- End: no difference
So if you’d called the test at less than four weeks, you would have made an erroneous conclusion.
Something related, that the internet always gets confused on, is called the novelty effect. That’s when the novelty of your changes (bigger blue button) brings more attention to the variation. With time, the lift disappears because the change is no longer novel.
All of this stuff is some of the more complex A/B testing information. We have a bunch of blog posts devoted to the various topics covered above. Dive in if you’d like to learn more:
- Stopping A/B Tests: How Many Conversions Do I Need?
- Statistical Significance Does Not Equal Validity (or Why You Get Imaginary Lifts)
Can You Run Multiple A/B Tests Simultaneously?
You want to speed up your testing program and run more tests. High tempo testing, yeah? So a common question is: can you run more than one A/B test at the same time on your site?
Will this increase your growth potential, or will it pollute the data because each test interacts with the other?
Look, this is a complicated issue. Some experts say you shouldn’t do multiple tests simultaneously, and some say it’s fine.
In most cases you will be fine running multiple simultaneous tests, and extreme interactions are unlikely. Unless you’re testing really important stuff (e.g. something that impacts your business model, future of the company), the benefits of testing volume will most likely outweigh the noise in your data and occasional false positives.
If based on your assessment there’s a high risk of interaction between multiple tests, reduce the number of simultaneous tests and/or let the tests run longer for improved accuracy.
If you want to read more on this, read these posts:
How to Set up A/B Tests?
Once you’ve got a prioritized list of test ideas, it’s time to form a hypothesis and run an experiment. Basically, a hypothesis will define why you believe a problem occurs. Furthermore, a good hypothesis:
- Is testable – It needs to be measurable, so that it can be used in testing.
- Has a goal of solving conversion problems – Split testing is done to solve specific conversion problems
- Gains market insights – A well-articulated hypothesis will let your split testing results give you information about your customers, whether the test ‘wins’ or ‘loses’ or whatever.
Craig Sullivan has put together a hypothesis kit to simplify the process. Here’s his simple version:
- Because we saw (data/feedback)
- We expect that (change) will cause (impact)
- We’ll measure this using (data metric)
And the advanced one:
- Because we saw (qual & quant data)
- We expect that (change) for (population) will cause (impact(s))
- We expect to see (data metric(s) change) over a period of (x business cycles)
Here’s the fun part: you can finally think about picking a tool.
While this is the first thing many people think about, it’s not actually the most important, by any means. The strategy and statistical knowledge aspects come first, and only then should you worry about picking a tool.
That said, there are a few differences you should bear in mind.
One major categorization in tools is whether they are server side or client side testing tools.
Client-side testing tools are things like Optimizely, VWO, and Adobe Target. Conductrics has capabilities of both, and SiteSpect does a proxy server-side method.
What does all this mean for you? If you’d like to save time up front, or if your team is small or lacks development resources, client-side tools can get you up and running faster. Server-side requires development resources but can often be more robust.
You’ll basically want to set up goals (something that lets you know a conversion has been made, like a ‘thank you for purchasing’ page), and your testing tool will track when each variation converts visitors into customers.
Or you could use something like Testing.Agency to set up your tests for you.
How to Analyze A/B Test Results?
Alright. You’ve done your research, set up your test correctly, and the test is finally cooked. Now, on to analysis – and it’s not always as simple as glancing at the graph your testing tool gives you.
One thing you should always do is analyze your test results in Google Analytics.
It doesn’t just enhance your analysis capabilities, but it allows you to be more confident in your data and decision making.
The point is, it’s possible that your testing tool could be recording the data incorrectly, and if you have no other source for your test data, you can never be sure whether to trust it or not. Create multiple sources of data (won’t go too far into detail, but read this post for how to set it all up).
But what happens if, after analyzing the results in GA, there is no difference at all between variations?
Don’t move on too quickly. First, realize these two things:
1. Your test hypothesis might have been right, but the implementation sucked.
Let’s say your qualitative research says that concern about security is an issue. How many ways do we have to beef up the perception of security? Unlimited.
The name of the game is iterative testing, so if you were onto something, then try a few iterations that attempt to solve the problem.
2. Just because there was no difference overall, the variation might have beat control in a segment or two.
If you got a lift in returning visitors and mobile visitors, but a drop for new visitors and desktop users – those segments might cancel each other out, and it seems like it’s a case of “no difference.” Analyze your test across key segments to see this.
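Here is a tiny worked example of segments cancelling each other out. The numbers are entirely made up for illustration, and the data layout is just one convenient way to hold them.

```python
# Hypothetical results: (visitors, conversions) per segment and arm.
results = {
    "returning": {"control": (5_000, 250), "variation": (5_000, 300)},
    "new": {"control": (5_000, 150), "variation": (5_000, 100)},
}


def rate(visitors, conversions):
    """Conversion rate as a fraction."""
    return conversions / visitors


def overall(arm):
    """Conversion rate for one arm across all segments combined."""
    visitors = sum(results[segment][arm][0] for segment in results)
    conversions = sum(results[segment][arm][1] for segment in results)
    return rate(visitors, conversions)


print(f"overall: control {overall('control'):.1%}, variation {overall('variation'):.1%}")
for segment, arms in results.items():
    print(
        f"{segment}: control {rate(*arms['control']):.1%}, "
        f"variation {rate(*arms['variation']):.1%}"
    )
```

Overall both arms convert at the same rate, yet the variation is ahead for returning visitors and behind for new ones.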
All About Data Segmentation
The key to learning in A/B testing is segmenting. Even though B might lose to A in the overall results, B might beat A in certain segments (organic, Facebook, mobile, etc).
There are a ton of segments you can analyze. Optimizely lists the following possibilities:
- Browser type
- Source type
- Mobile vs. desktop, or by device
- Logged-in vs. logged-out visitors
- PPC/SEM campaign
- Geographical regions (City, State/Province, Country)
- New vs. returning visitors
- New vs. repeat purchasers
- Power users vs. casual visitors
- Men vs. women
- Age range
- New vs. already-submitted leads
- Plan types or loyalty program levels
- Current, prospective, and former subscribers
- Roles (if your site has, for instance, both a Buyer and Seller role)
But definitely look at your test results at least across these segments (making sure of adequate sample size):
- Desktop vs Tablet/Mobile
- New vs Returning
- Traffic that lands directly on the page you’re testing vs came via internal link
For segments, the same stopping rules apply.
Make sure that you have enough sample size within the segment itself as well (calculate it in advance, be wary if it’s less than 250-350 conversions PER variation within that one segment you’re looking at).
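A quick way to keep yourself honest here is to filter out segments that don’t have enough conversions before you look at them. This sketch uses the lower end of the rule of thumb above as a threshold; the function name and the segment counts are hypothetical.

```python
MIN_CONVERSIONS = 250  # lower bound of the 250-350 rule of thumb above


def segments_with_enough_data(segment_conversions, minimum=MIN_CONVERSIONS):
    """Return segments where every variation has at least `minimum` conversions."""
    return [
        segment
        for segment, per_variation in segment_conversions.items()
        if all(count >= minimum for count in per_variation.values())
    ]


# Made-up conversion counts per segment and variation.
example = {
    "mobile": {"control": 410, "variation": 455},
    "tablet": {"control": 40, "variation": 62},
    "returning": {"control": 300, "variation": 240},
}
print(segments_with_enough_data(example))
```

In this made-up data only the mobile segment clears the bar in both arms, so that is the only one worth drawing conclusions from.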
If your treatment performed well for a specific segment, it’s time to consider a personalized approach for that particular segment.
How to Archive Past Tests?
A/B testing isn’t just about lifts, wins, losses, and testing random shit. As Matt Gershoff said, optimization is about “gathering information to inform decisions,” and the learnings from statistically valid A/B test results contribute to the greater goals of growth and optimization.
Smart organizations archive their test results and plan their approach to testing systematically. There’s a reason organizations with a structured approach to optimization see greater growth and are limited less often by local maxima.
So here’s the tough part: there’s no single best way to structure your knowledge management.
We wrote an article on how effective organizations archive their results (read it), and as it turns out, many of them do it slightly differently. Some use sophisticated internally-built tools, some use 3rd party tools, and some use good ol’ Excel and Trello.
If it helps, here are 4 tools built specifically for conversion optimization project management:
On a similar note, in larger organizations (or hell, in smaller as well), it’s important to be able to communicate across departments and to the executives above. Often, A/B test results aren’t super intuitive to the layperson (and most people haven’t read guides as long as this one). So what helps is visualization.
This is another area where, sadly, there is no real right way to do it. That said, Annemarie Klaassen and Ton Wesseling wrote an awesome post on our blog detailing their journey to great visualizations. Sneak peek, here’s what they ended up with:
What You Need to Know About A/B Testing Statistics
There’s a certain level of statistical knowledge that comes in handy when analyzing A/B test results. Some of it we went over in the above section on setting up A/B tests, but there is still more to be covered when it comes to analysis.
Why do you need to know all of this statistics stuff? We’re dealing with inference here – means and probability – and therefore cannot go without some basic understanding of stats.
Or as Matt Gershoff put it (quoting his college math professor), “how can you make cheese if you don’t know where milk comes from?!”
There are three terms you should know before we dive into the nitty gritty of A/B testing statistics:
- Mean (we’re not measuring all conversion rates, just a sample, and finding an average of them that is representative of the whole)
- Variance (what is the natural variability of a population? That will affect our results and how we take action with them)
- Sampling (again, we can’t measure true conversion rate, so we select a sample that is hopefully representative of the whole)
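These three ideas are easier to grasp with a simulation. In the sketch below we pretend to know the true conversion rate (5%, an assumption you never get in real life), draw many samples of visitors from it, and look at how much the measured conversion rate bounces around at two sample sizes. All names and numbers are hypothetical.

```python
import random
from statistics import mean, pstdev


def sample_conversion_rate(true_rate, visitors, rng):
    """Measure the conversion rate of one random sample of visitors."""
    return sum(rng.random() < true_rate for _ in range(visitors)) / visitors


rng = random.Random(7)
true_rate = 0.05  # known only because this is a simulation

# 300 repeated samples at each size.
small = [sample_conversion_rate(true_rate, 200, rng) for _ in range(300)]
large = [sample_conversion_rate(true_rate, 5_000, rng) for _ in range(300)]

print(f"200 visitors:   mean {mean(small):.3f}, spread {pstdev(small):.4f}")
print(f"5,000 visitors: mean {mean(large):.3f}, spread {pstdev(large):.4f}")
```

Both sample sizes average out near the true rate, but any single small sample can land far from it, which is why variance and sample size matter so much when you read test results.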
What The Hell Is a P-Value?
There’s a large number of bloggers writing about conversion optimization who use the term “statistical significance” inaccurately.
We talked a bit above about how statistical significance by itself is not a stopping rule, so what is it and why is it important?
To start with, let’s go over P-Values, which are also very misunderstood. As FiveThirtyEight recently pointed out, even scientists can’t easily explain what P-Values are.
The P-Value is basically a measure of evidence against the null hypothesis (the control in A/B Testing parlance).
Very important: P-value does not tell us the probability that B is better than A.
Similarly, it doesn’t tell us the probability that we will make a mistake in selecting B over A. These are both extraordinarily common misconceptions, but they are false.
The p-value is just the probability of seeing a result at least as extreme as the one observed, given that the null hypothesis is true. Or, “How surprising is that result?”
So to sum it up, statistical significance (or a statistically significant result) is attained when a p-value is less than the significance level (which is usually set at .05). By the way, significance with regard to statistical hypothesis testing is where the whole one-tail vs two-tail issue comes up.
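The definition of a p-value can be turned into code almost word for word. The sketch below assumes the null hypothesis is true (both pages share one conversion rate), simulates many tests under that assumption, and counts how often the gap between the pages is at least as big as the one we observed. The conversion counts are made up and the function name is hypothetical; in practice your testing tool uses a formula rather than brute force.

```python
import random


def simulated_p_value(conv_a, n_a, conv_b, n_b, simulations=2_000, seed=3):
    """Estimate a two-sided p-value by simulating the null hypothesis."""
    rng = random.Random(seed)
    # Under the null, both pages convert at the same (pooled) rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Compare gaps using integer cross-multiplication to avoid float rounding.
    observed_gap = abs(conv_b * n_a - conv_a * n_b)
    at_least_as_extreme = 0
    for _ in range(simulations):
        sim_a = sum(rng.random() < pooled for _ in range(n_a))
        sim_b = sum(rng.random() < pooled for _ in range(n_b))
        if abs(sim_b * n_a - sim_a * n_b) >= observed_gap:
            at_least_as_extreme += 1
    return at_least_as_extreme / simulations


# Hypothetical test: 50 vs 70 conversions on 1,000 visitors each.
p = simulated_p_value(50, 1_000, 70, 1_000)
print(f"estimated p-value: {p:.3f}")
```

Nothing in this calculation says how likely it is that B beats A. It only says how often pure chance would produce a gap this large if there were no real difference.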
One-Tail vs Two-Tail A/B Tests
I promise you, this is a much smaller issue than some people think.
One-tailed tests allow for the possibility of an effect in just one direction, whereas with two-tailed tests, you are testing for the possibility of an effect in two directions – both positive and negative.
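In code, the difference between the two comes down to which tail(s) of the distribution you count. Here is a sketch with a pooled two-proportion z-test on made-up numbers (50 vs 70 conversions on 1,000 visitors each); the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic (positive when B converts better)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (conv_b / n_b - conv_a / n_a) / se


z = z_statistic(50, 1_000, 70, 1_000)

# One-tailed: only "B is better than A" counts as evidence.
one_tailed = 1 - NormalDist().cdf(z)
# Two-tailed: a gap this large in either direction counts.
two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

When the observed effect is in the direction you hypothesized, the one-tailed p-value is simply half the two-tailed one, so what matters most is knowing which one your tool reports.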