Posted in A/B Split Testing on January 26th, 2010

Recently, I published an A/B split testing case study in which an eCommerce store reduced its bounce rate by 20%. Some readers questioned the statistical significance of the results; their main concern was that 125-150 visitors per variation is not enough to produce reliable results. This concern is a typical by-product of a superficial knowledge of the statistics that power A/B (and multivariate) testing. I’m writing this post as an essential primer on the mathematics of testing, so that you never judge the reliability of a test’s results simply on the basis of the number of visitors.

**What exactly goes on behind A/B split testing?**

Imagine your website as a black box containing balls of two colors (red and green) in unequal proportions. Every time a visitor arrives on your website, he takes a ball out of the box: if it is green, he makes a purchase; if it is red, he leaves the website. In this way, the black box essentially decides the conversion rate of your website.

A key point to note here is that you cannot look inside the box and count the balls of each color to determine the true conversion rate. You can only *estimate* the conversion rate based on the balls you see coming out of the box. Because the conversion rate is an estimate (or a guess), you always have a range for it, never a single value. For example, mathematically, this is how you describe such a range:

“Based on the information I have, 95% of the time the conversion rate of my website ranges from 4.5% to 7%.”

As you would expect, with more visitors you get to observe more balls. Hence, your range gets narrower and your estimate approaches the true conversion rate.
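This narrowing is easy to see in a quick simulation. The sketch below (mine, not from the original post) assumes a hypothetical "true" rate of 10% hidden in the black box and shows how observed rates scatter less as the number of visitors grows:

```python
import random

def observed_rate(true_p, visitors):
    """Draw one ball per visitor from the 'black box' and return
    the observed conversion rate."""
    conversions = sum(1 for _ in range(visitors) if random.random() < true_p)
    return conversions / visitors

random.seed(1)
# With more visitors, the estimates cluster more tightly around the true 10%
for n in (100, 1000, 10000):
    estimates = [observed_rate(0.10, n) for _ in range(5)]
    print(n, [round(e, 3) for e in estimates])
```

At n = 100 the estimates swing by several percentage points; at n = 10,000 they hug the true rate.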

**The maths of A/B split testing**

Mathematically, each visit is modeled as a binomial (Bernoulli) trial, which is a fancy way of saying that it can have two possible outcomes: conversion or non-conversion. Let’s call the true conversion rate *p*. Our job is to estimate the value of *p*, and for that we do *n* trials (or observe *n* visits to the website). After observing those *n* visits, we calculate what fraction of them resulted in a conversion. That value (which we represent from 0 to 1 instead of 0% to 100%) is the observed conversion rate of your website.

Now imagine that you repeat this experiment multiple times. It is very likely that, due to chance, you will calculate a different value of *p* each time. Having all these (different) values of *p*, you get a range for the conversion rate (which is what we want for the next step of the analysis). To avoid doing repeated experiments, statistics has a neat trick in its toolbox: a concept called *standard error*, which tells you how much deviation from the average conversion rate (*p*) can be expected if the experiment is repeated multiple times. The smaller the deviation, the more confident you can be in your estimate of the true conversion rate. For a given conversion rate (*p*) and number of trials (*n*), the standard error is calculated as:

Standard Error (SE) = Square root of (p * (1-p) / n)

Without going into much detail: to get a 95% range for the conversion rate, multiply the standard error by 2 (or 1.96, to be precise). In other words, you can be 95% confident that your true conversion rate lies within the range *p* ± 2 × *SE*.

(In Visual Website Optimizer, when we show the conversion rate range in reports, we show it at 80% confidence, not 95%, so we multiply the standard error by 1.28.)
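To make the arithmetic concrete, here is a small sketch (mine, not from the original post) that computes the standard error and a confidence interval from raw counts, using either multiplier:

```python
import math

def conversion_interval(conversions, visitors, z=1.96):
    """Confidence interval for a conversion rate.

    z = 1.96 gives a ~95% interval; z = 1.28 gives the ~80%
    interval that Visual Website Optimizer reports.
    """
    p = conversions / visitors               # observed conversion rate
    se = math.sqrt(p * (1 - p) / visitors)   # standard error
    return p - z * se, p + z * se

# 100 conversions out of 1000 visitors: roughly 8.1% to 11.9%
low, high = conversion_interval(100, 1000)
print(f"{low:.1%} to {high:.1%}")
```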

**What does this have to do with the reliability of results?**

In an A/B split test, in addition to calculating the conversion rate range of the website (the control), we also calculate a range for each of its variations. Because we have already established (with 95% confidence) that the true conversion rate lies within its range, all we have to observe is the overlap between the range of the control and the range of the variation. If there is no overlap, the variation is, with high confidence, better (or worse, if the variation has the lower conversion rate) than the control. It is that simple.

As an example, suppose the control conversion rate has a range of 6.5% ± 1.5% (i.e., 5%-8%) and a variation has a range of 9% ± 1% (i.e., 8%-10%). In this case, there is no overlap, and you can be confident about the reliability of the results.
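The overlap check itself is just interval arithmetic. A minimal sketch (names are mine, not from the post; ranges that merely touch at an endpoint are treated as non-overlapping):

```python
def ranges_overlap(center_a, margin_a, center_b, margin_b):
    """True if two conversion-rate ranges share any interior values."""
    low_a, high_a = center_a - margin_a, center_a + margin_a
    low_b, high_b = center_b - margin_b, center_b + margin_b
    # Strict inequalities: ranges touching at a single point don't overlap
    return low_a < high_b and low_b < high_a

# Control 6.5% +/- 1.5% vs. variation 9% +/- 1%: no overlap
print(ranges_overlap(0.065, 0.015, 0.09, 0.01))  # False
```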

**Do you call all that math simple?**

Okay, not really simple, but it is definitely intuitive. To save yourself the trouble of doing all the math, either use a tool like Visual Website Optimizer, which automatically does all the number crunching for you, or, if you are running a test manually (such as for AdWords), use our free A/B split test significance calculator.

**So, what is the take-home lesson here?**

*Always, always, always* use an A/B split testing calculator to determine the significance of your results before jumping to conclusions. Sometimes you may discount significant results as non-significant solely on the basis of the number of visitors (as you might for this case study). Sometimes you may think results are significant because of a large number of visitors when in fact they are not. You really want to avoid both scenarios, don’t you?


### Paras Chopra

CEO and Founder of Wingify by the day, startups, marketing and analytics enthusiast by the afternoon, and a nihilist philosopher/writer by the evening!

## 48 Comments

Brian Cray · January 26, 2010

Great look at the reliability of A/B test results. When you get into quantification and accountability, many designers (the very people who need to be running A/B tests) get discouraged and never take the time to do tests.

Duane · January 28, 2010

Can I confirm the maths for this formula with an example? Suppose the control web page has 1000 visitors of which 100 convert (10% conversion rate) while the variation has 1000 visitors and 150 convert (15%). Would the respective SE be:

Control:

SQRT(0.1 * (1-0.1) / 1000)= 0.00949

SE = 0.00949 * 1.96 = 0.0186

thus 10% ± 1.9% = 8.1% to 11.9%

Variation:

SQRT(0.15 * (1-0.15) / 1000)= 0.01129

SE = 0.01129 * 1.96 = 0.02213

thus 15% ± 2.2% = 12.8% to 17.2%

Thus, since there is no overlap, the variation results are reliable.

Is this correct (or is another number used for n)?

Paras Chopra · January 28, 2010

Yes, your calculations look fine to me.

Duane · January 28, 2010

Thanks Paras.

I think what was (and still is) confusing me is that when I tried to verify it using an online calculator (e.g. http://www.dimensionsintl.com/error_calculator.html) for 95% confidence with 1000 for population, 0.1 for proportion and 100 for sample size, it gives me double the ‘standard error’ as my calculations above.

…so I suspect I am misunderstanding something either here or in using other online calculators.

Paras Chopra · January 28, 2010

@Duane. That standard error is double in other online calculators because it is +/-. I think they are probably reporting the error around the mean, while in this article I give a range. It is a matter of reporting -x to +x vs. 2x.

Duane · January 28, 2010

@Paras. Thanks, makes sense now. The fact that the difference was half/double gave me a suspicion it was something like that, but as I am currently sleep deprived, it wasn’t clicking :-)

Thanks for a great post and an interesting service – I remember a few years back when services like the Visual Website Optimiser were so expensive, individuals and small companies couldn’t afford them. So nice to see that changing.

Clint · January 30, 2010

Thanks for this great explanation, really helps!

Mukul · February 16, 2010

(1) I think there is an error in the formula Duane used for SE (standard error): there is NO 1.96 (the critical value for 95% confidence) in the SE formula.

(2) SE is one standard deviation.

(3) The range is typically reported as ± t multiplied by SE.

(4) The critical t value 1.96 is an approximation assuming normality for n = 30; the critical t value changes as n changes.

(5) People use two-sided tests (“Are the two conversion rates different?”, as here) versus one-sided tests (“Is the new change better than the control?”). Different types of tests result in different t values. Duane probably sees the effects of different n values (resulting in different t values) and of different test types, one-sided vs. two-sided.

(6) The normality approximation is reasonable when (a) p is small and (b) the number of conversions exceeds 30, not the number of trials.

Hope this helps,

Inventov · March 2, 2010

Hi Paras,

If I use the standard error formula given in this post, the numbers I get do not match the standard error in your image.

I have created an excel spreadsheet here. If there is something wrong with the formula please feel free to make changes: http://spreadsheets.google.com/ccc?key=0AlNACDtsQ-AzdFNzNHBaWHo4aktfUjRIcTJmek9VZXc&hl=en

Inventov · March 2, 2010

for the comment above, by image I mean http://visualwebsiteoptimizer.com/split-testing-blog/wp-content/uploads/2010/01/result.png

Paras Chopra · March 2, 2010

Hi Inventov,

Actually, you just calculated the SE; remember you need to multiply it by 1.96 to get the 95% range of the conversion rate. In the image, we show the 80% range, which corresponds to a z-score of 1.28.

I have made modifications to your excel sheet and numbers do match now. If you make a great A/B testing spreadsheet, I’d love to share it here on this blog.

-Paras

Inventov · March 2, 2010

Thanks. I’ve updated the file. Paras, can you update the column on chances to beat the original with your formula?

Clay · June 2, 2010

I’m curious to know what the math is for the “chance to beat original”. How does that get decided and is it really accurate?

Paras Chopra · June 2, 2010

“Chance to beat original” simply measures the overlap between the two distributions. If there is a 1% overlap between the conversion rate distributions of the control and the variation, then there is a 99% chance of the variation beating the control.
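(Editor's note: the post does not give VWO's exact formula, but a common textbook approximation of "chance to beat original" compares the two estimates under a normal approximation, using the same z-score construction that comes up later in this thread. The function name and numbers below are illustrative, not VWO's implementation.)

```python
import math

def chance_to_beat(p_control, n_control, p_variation, n_variation):
    """Approximate P(variation's true rate > control's true rate)
    using a normal approximation of the two conversion rates."""
    se_c = math.sqrt(p_control * (1 - p_control) / n_control)
    se_v = math.sqrt(p_variation * (1 - p_variation) / n_variation)
    z = (p_variation - p_control) / math.hypot(se_c, se_v)
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# 10% vs. 12% on 1000 visitors each: about a 92% chance to beat control
print(round(chance_to_beat(0.10, 1000, 0.12, 1000), 2))
```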

John · July 8, 2010

Good stuff, thanks! I have an additional complication however. How do you define p and n when the conversion event may take place at some future point (i.e. not on the same day)?

So let’s say you get 10,000 visitors to your site per day who register. Then, at some future point, they may decide to ‘convert’ (make a purchase for example – can only happen once) at any time between the registration date and a year from that date or more. Of those who will convert, most do so within the first 6 months, and then the conversions trail off. How do you set up this experiment?

Say you expose two groups, Pick-A and Pick-B to two different landing pages and you want to determine the effect of the landing page on the ultimate conversion. So you create a “class” which you define as anyone who visits the landing pages for one month. At that point, the class is defined, but the test continues because they have not yet converted.

My questions are, how do you define the conversion rate (do you average the total conversions over the exposure time of one month?), how do you define the trials (is one trial the first visit to the landing page so that a trial is a unique visitor?), and how long do you wait before you stop the test and decide that you have enough conversion data?

Paras Chopra · July 8, 2010

Hi John,

No matter how long your test runs, it won’t affect your conversions. If a visitor converts 6 months after first being included in the test, it will still count as a conversion (assuming the test is still running). There are several calculators available on the Internet, including one on our site: http://visualwebsiteoptimizer.com/ab-split-test-duration/

Using these calculators you can calculate how long to wait for the results before giving up.

John · July 8, 2010

Thanks Paras!

So just to clarify with an example, let’s say I get 10,000 visitors per day for thirty days, and so I have a total of 300K in my test population. Then, over the next 6-8 months, I get different conversions per month, but in the end I get a total of 3,000 conversions. Do I then use n=300K and p=1%? i.e. do I average the TOTAL conversions over the 30 days I created my population even though they take place on very different timelines?

On a related note, are there rules of thumb about the proximity of conversion events to the page affected? To clarify, in my example I am making a cosmetic change to a landing page. The nearest conversion event is registration, where they create an account on the landing page itself. That is a same-day event, and it makes a lot of sense that my changes in Pick-B might affect the conversion rate. However, if we go out 6 months, the user has interacted with many different parts of my site, logging in and out, researching, etc. There are many exogenous factors that affect their purchase decision in that time that I have no influence over: life factors, income, age, competitors, etc. Is it really still valid to test to a conversion so far out based on the color of a button (or similar) far upstream?

My hypothesis is that if there is enough separation between the two events – interaction with the landing page and conversion – that even if Pick-A and Pick-B were exactly the same, that I would still likely see a slight difference in conversions between the picks. Are there tests that just don’t make sense to run?

Paras Chopra · July 8, 2010

Hi John,

This is interesting. I think it is ultimately up to the test creator to be aware of what his conversion goals are actually going to mean. Six months is a very long period; however, if your test is designed with such a goal in mind, then you could of course take it as a valid goal.

Theoretically if your variations do not have any effect on the six-month goal, you should see no statistical significance in the difference between conversion rates (because visitors were randomly distributed).

But you raise an interesting point: the time horizon must make an impact. Perhaps due to sheer chance group A experienced better customer service than group B, and that is why they converted (and not because of the test variations). The more you lengthen the period, the greater the chance of such unknown variables impacting the different groups.

I don’t have a mathematical theory for this (yet), but it is a very interesting point for sure.

-Paras

Anne Stahl · August 20, 2011

I think there is one basic flaw with the purely mathematical approach, or at least with this approach: it doesn’t take trending into account! If I see a test graph with a lot of noise (both graphs cutting across each other), in other words, if one day one variation is winning and the next day the other is winning, and so on, despite using cumulative data, then I don’t trust the result. For a result to be truly trustworthy or significant, the noise must have subsided and the trend must remain stable. In this way, I’d say there are a lot of folks calling tests ‘significant’ when in fact they are not. There is a lot of noise caused by time of day, day of week, holidays, news, etc., and this will muddy your results. I’d love to see a mathematical calculation that takes time/trending into account!

Paras Chopra · August 23, 2011

@Anne: you make a good point, and it would be great to capture trending in a mathematical number. However, ‘chance to beat original’ or ‘statistical significance’ describes results in an overall context. With these metrics we want to understand the likelihood that the variation is performing better than the control, given a specific sample (collected over a number of days).

What you are asking for is a number that says how consistent the performance is. Those are two different things, but consistency can be important too.

Andi · December 12, 2011

Can you explain how you get to this formula, please?

Standard Error (SE) = Square root of (p * (1-p) / n)

I don’t understand how you can calculate the standard error without knowing anything about the variance.

That would be really helpful, thank you!

Paras Chopra · December 12, 2011

@Andi: it is a binomial distribution, and for a binomial distribution the per-trial variance is p * (1 – p); dividing by n and taking the square root gives the standard error formula above.

Rafael · December 13, 2011

How much overlap is allowed between the two distributions to be confident that version B is better?

You said that if there is 1% overlap between conversion rate distribution of control and variation, then there is 99% chance of variation beating the control.

What if there is a 5% overlap? In this case, is there a 95% chance of the variation beating the original?

What about a 6% overlap?

Thanks!

Paras Chopra · December 13, 2011

@Rafael: it depends on how important the results are for an organization and how much risk (of being wrong) it is willing to accept. A 99% chance to beat original is always better, but if the stakes aren’t high, some organizations are okay with a 95% chance to beat original too.

Rafael · December 13, 2011

Hey Paras, thank you for the answer!

So the “chance to beat the control” can be measured just by measuring this overlap?

For instance, does a 10% overlap mean a 90% chance to beat the control?

And a 15% overlap -> 85% chance

20% overlap -> 80% chance

And so on, so forth. Is this the case, or did I misinterpret it?

Paras Chopra · December 13, 2011

@Rafael: yes, your understanding is correct.

Komposit · April 11, 2012

Thanks a lot, this was just what I was looking for!

Keep up the good work.

Juan · July 1, 2013

Hi,

Great article! Thanks for sharing. I have one question, what if I want to use metrics not represented by a binomial or normal distribution?

For instance, what happens if I want to compare control vs variation looking at the metric: visits/user?

Thanks,

J

dragonpanda · July 19, 2013

I have a question about the math that goes into finding the z-score. On the Excel sheet, you used the equation: =(control_p-variation_p)/SQRT(POWER(control_se,2)+POWER(variation_se,2)) which = 1.721671363

However, shouldn’t we use the difference between two proportions (the conversion rates) to find the z-score and see if the difference is not 0? This formula would involve calculating the pooled p (conversion rate)…etc.

Also, to find whether or not it’s significantly different, don’t you have to do 1-p or 2(1-p) (for 2 tails) to find alpha and see if alpha is <= 0.05, 0.01, etc?

Thanks!

Levi · August 28, 2013

Great blog, thanks for sharing.

What considerations should be taken into account when an A/B/n test has an uneven traffic split, for example an A/B/C/D test with a traffic split of 70% (existing site), 10%, 10%, 10% respectively?
