What you really need to know about mathematics of A/B split testing

Posted in A/B Split Testing on January 26th, 2010

Recently, I published an A/B split testing case study in which an eCommerce store reduced its bounce rate by 20%. Some readers were worried about the statistical significance of the results. Their main concern was that 125-150 visitors per variation is not enough to produce reliable results. This concern is a typical by-product of having only a superficial knowledge of the statistics that power A/B (and multivariate) testing. I’m writing this post to provide an essential primer on the mathematics of testing, so that you never jump to a conclusion about the reliability of test results simply on the basis of the number of visitors.

What exactly goes behind A/B split testing?

Imagine your website as a black box containing balls of two colors (red and green) in unequal proportions. Every time a visitor arrives on your website, he takes a ball out of that box: if it is green, he makes a purchase; if it is red, he leaves the website. In this way, essentially, that black box decides the conversion rate of your website.

A key point to note here is that you cannot look inside the box and count the balls of each color to determine the true conversion rate. You can only estimate the conversion rate based on the balls you see coming out of the box. Because the conversion rate is an estimate (or a guess), you always have a range for it, never a single value. For example, mathematically, the way you describe such a range is:

“Based on the information I have, 95% of the time the conversion rate of my website lies between 4.5% and 7%.”

As you would expect, with more visitors you get to observe more balls. Hence, your range gets narrower and your estimate approaches the true conversion rate.

The maths of A/B split testing

Mathematically, each visit is a binomial (Bernoulli) trial, which is a fancy way of saying that it can have one of two possible outcomes: conversion or non-conversion. The probability of conversion is the quantity we care about; let’s call it p. Our job is to estimate the value of p, and for that we observe n trials (n visits to the website). After observing those n visits, we calculate what fraction of them resulted in a conversion. That fraction (which we represent from 0 to 1 instead of 0% to 100%) is the estimated conversion rate of your website.

Now imagine that you repeat this experiment multiple times. It is very likely that, due to chance, you will calculate a different value of p every single time. Having all these (different) values of p, you get a range for the conversion rate (which is what we want for the next step of the analysis). To avoid doing repeated experiments, statistics has a neat trick in its toolbox: a concept called standard error, which tells you how much deviation from the average conversion rate (p) can be expected if the experiment is repeated multiple times. The smaller the deviation, the more confident you can be about your estimate of the true conversion rate. For a given conversion rate (p) and number of trials (n), the standard error is calculated as:

Standard Error (SE) = Square root of (p * (1-p) / n)

Without going into much detail: to get the 95% range for the conversion rate, multiply the standard error by 2 (or 1.96, to be precise). In other words, you can be 95% confident that the true conversion rate lies within this range: p ± 2 * SE

(In Visual Website Optimizer, when we show the conversion rate range in reports, we show the 80% range, not the 95% one, so we multiply the standard error by 1.28.)
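
To make the calculation concrete, here is a minimal Python sketch of the same arithmetic (the visitor and conversion counts are made-up example numbers, not from any real test):

    from math import sqrt

    def conversion_range(conversions, visitors, z=1.96):
        """Return (rate, low, high) for the observed conversion rate.

        z = 1.96 gives the ~95% range; z = 1.28 gives the ~80% range
        shown in Visual Website Optimizer reports.
        """
        p = conversions / visitors            # observed conversion rate (0 to 1)
        se = sqrt(p * (1 - p) / visitors)     # standard error of the estimate
        return p, p - z * se, p + z * se

    # Example: 60 conversions out of 1000 visits
    rate, low, high = conversion_range(60, 1000)
    print("95%% range: %.2f%% to %.2f%%" % (low * 100, high * 100))   # ~4.53% to 7.47%

    rate, low, high = conversion_range(60, 1000, z=1.28)
    print("80%% range: %.2f%% to %.2f%%" % (low * 100, high * 100))   # ~5.04% to 6.96%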

What does it have to do with reliability of results?

In an A/B split test, in addition to calculating the conversion rate range for the website (the control), we also calculate a range for each of its variations. Because we have already established (with 95% confidence) that the true conversion rate lies within its range, all we have to observe now is the overlap between the conversion rate range of the control and that of the variation. If there is no overlap, the variation is definitely better (or worse, if the variation has a lower conversion rate) than the control. It is that simple.

As an example, suppose the control conversion rate has a range of 6.5% ± 1.5% and a variation has a range of 9% ± 1%. In this case, there is no overlap and you can be sure about the reliability of the results.
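
The overlap check itself is trivial to code. A minimal sketch using the illustrative numbers above (ranges that merely touch at an endpoint are treated as non-overlapping):

    def ranges_overlap(center_a, half_a, center_b, half_b):
        """True if the two ranges (center ± half_width) overlap."""
        return (center_a - half_a) < (center_b + half_b) and \
               (center_b - half_b) < (center_a + half_a)

    # Control: 6.5% ± 1.5% -> 5.0% to 8.0%; Variation: 9% ± 1% -> 8.0% to 10.0%
    if not ranges_overlap(6.5, 1.5, 9.0, 1.0):
        print("No overlap: the variation's improvement looks reliable")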

Do you call all that math simple?

Okay, not really simple, but it is definitely intuitive. To save yourself the trouble of doing all the math, either use a tool like Visual Website Optimizer, which automatically does all the number crunching for you, or, if you are running a test manually (such as for AdWords), use our free A/B split test significance calculator.

So, what is the take-home lesson here?

Always, always, always use an A/B split testing calculator to determine the significance of results before jumping to conclusions. Sometimes you may discount significant results as non-significant solely on the basis of the number of visitors (as you might for this case study). Sometimes you may think results are significant because of a large number of visitors when in fact they are not (as here). You really want to avoid both scenarios, don’t you?

Paras Chopra

CEO and Founder of Wingify by the day, startups, marketing and analytics enthusiast by the afternoon, and a nihilist philosopher/writer by the evening!

48 Comments
Brian Cray
January 26, 2010

Great look at the reliability of A/B test results. When you get into quantification and accountability, many designers (the very people who need to be running A/B tests) get discouraged and never take the time to do tests.

Duane
January 28, 2010

Can I confirm the maths for this formula with an example? Suppose the control web page has 1000 visitors of which 100 convert (10% conversion rate), while the variation has 1000 visitors and 150 convert (15%). Would the respective SEs be:

Control:
SQRT(0.1 * (1-0.1) / 1000)= 0.00949
SE = 0.00949 * 1.96 = 0.0186
thus 10% ± 1.9% = 8.1% to 11.9%

Variation:
SQRT(0.15 * (1-0.15) / 1000)= 0.01129
SE = 0.01129 * 1.96 = 0.02213
thus 15% ± 2.2% = 12.8% to 17.2%

Thus, since there is no overlap, the variation results are reliable.

Is this correct (or is another number used for n)?

Paras Chopra
January 28, 2010

Yes, your calculations look fine to me.
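
(For anyone who wants to double-check, a minimal Python sketch reproducing Duane’s numbers with the formula from the post:)

    from math import sqrt

    for label, conversions, visitors in [("Control", 100, 1000), ("Variation", 150, 1000)]:
        p = conversions / visitors
        se = sqrt(p * (1 - p) / visitors)
        print("%s: %.1f%% +/- %.1f%%" % (label, p * 100, 1.96 * se * 100))
    # Control: 10.0% +/- 1.9%, Variation: 15.0% +/- 2.2%  -> the ranges do not overlap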

Duane
January 28, 2010

Thanks Paras.

I think what was (and still is) confusing me is that when I tried to verify it using an online calculator (e.g. http://www.dimensionsintl.com/error_calculator.html) for 95% confidence with 1000 for population, 0.1 for proportion and 100 for sample size, it gives me double the ‘standard error’ of my calculations above.

…so I suspect I am misunderstanding something either here or in using other online calculators.

Paras Chopra
January 28, 2010

@Duane. The standard error looks double in other online calculators because of how the ± is reported. I think they are probably reporting the full width of the interval around the mean, while in this article I give the half-width. It is a matter of reporting −x to +x versus 2x.

Duane
January 28, 2010

@Paras. Thanks, it makes sense now. The fact that the difference was exactly half/double gave me a suspicion it was something like that, but as I am currently sleep-deprived, it wasn’t clicking :-)

Thanks for a great post and an interesting service – I remember a few years back when services like the Visual Website Optimiser were so expensive, individuals and small companies couldn’t afford them. So nice to see that changing.

Clint
January 30, 2010

Thanks for this great explanation, really helps!

Mukul
February 16, 2010

(1) I think there is an error in the formula Duane used for SE (standard error): there is no 1.96 (the critical value for 95% confidence) in the SE formula itself.
(2) SE is one standard deviation.
(3) The range is typically reported as ± t multiplied by SE.
(4) The critical value 1.96 is an approximation that assumes normality for n ≥ 30; the critical t value changes as n changes.
(5) People use two-sided tests (are the two conversion rates different?, as here) versus one-sided tests (is the new change better than the control?); different types of tests give different critical values. Duane is probably seeing the effects of different n values, or of one-sided versus two-sided tests, producing different critical values.
(6) The normality approximation is reasonable when (a) p is small and (b) the number of conversions exceeds 30, not the number of trials.
Hope this helps,

Inventov
March 2, 2010

hi Paras,

if I use the standard error formula given in this post, the numbers I get do not match the standard error in your image.

I have created an excel spreadsheet here. If there is something wrong with the formula please feel free to make changes: http://spreadsheets.google.com/ccc?key=0AlNACDtsQ-AzdFNzNHBaWHo4aktfUjRIcTJmek9VZXc&hl=en

Paras Chopra
March 2, 2010

Hi Inventov,

Actually, you have just calculated the SE. Remember, you need to multiply it by 1.96 to get the 95% range of the conversion rate. In the image, we show the 80% range, which corresponds to a z-score of 1.28.

I have made modifications to your Excel sheet and the numbers do match now. If you make a great A/B testing spreadsheet, I’d love to share it here on this blog.

-Paras

Inventov
March 2, 2010

Thanks. I’ve updated the file. Paras, can you update the column on chances to beat the original with your formula?

Clay
June 2, 2010

I’m curious to know what the math is for the “chance to beat original”. How does that get decided and is it really accurate?

Paras Chopra
June 2, 2010

“Chance to beat original” simply measures the overlap between the two distributions. If there is a 1% overlap between the conversion rate distributions of the control and the variation, then there is a 99% chance of the variation beating the control.
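
One common way to turn this into a number is to approximate both conversion rates as normal distributions and compute the probability that the variation’s rate exceeds the control’s (this matches the z-score formula from the spreadsheet discussed in a later comment, though it may not be exactly what the tool does internally). A minimal Python sketch, using Duane’s figures from above:

    from math import erf, sqrt

    def chance_to_beat_control(conv_a, n_a, conv_b, n_b):
        """Approximate P(variation rate > control rate) via the normal approximation."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        se_a = sqrt(p_a * (1 - p_a) / n_a)
        se_b = sqrt(p_b * (1 - p_b) / n_b)
        z = (p_b - p_a) / sqrt(se_a ** 2 + se_b ** 2)   # z-score of the difference
        return 0.5 * (1 + erf(z / sqrt(2)))             # standard normal CDF at z

    # Control: 100/1000 conversions, Variation: 150/1000 (Duane's example)
    print("Chance to beat control: %.2f%%" % (100 * chance_to_beat_control(100, 1000, 150, 1000)))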

John
July 8, 2010

Good stuff, thanks! I have an additional complication however. How do you define p and n when the conversion event may take place at some future point (i.e. not on the same day)?

So let’s say you get 10,000 visitors to your site per day who register. Then, at some future point, they may decide to ‘convert’ (make a purchase for example – can only happen once) at any time between the registration date and a year from that date or more. Of those who will convert, most do so within the first 6 months, and then the conversions trail off. How do you set up this experiment?

Say you expose two groups, Pick-A and Pick-B, to two different landing pages and you want to determine the effect of the landing page on the ultimate conversion. So you create a “class”, which you define as everyone who visits the landing pages during a one-month window. At that point, the class is defined, but the test continues because they have not yet converted.

My questions are, how do you define the conversion rate (do you average the total conversions over the exposure time of one month?), how do you define the trials (is one trial the first visit to the landing page so that a trial is a unique visitor?), and how long do you wait before you stop the test and decide that you have enough conversion data?

Paras Chopra
July 8, 2010

Hi John,

No matter how long your test runs, it won’t affect your conversions. If a visitor converts 6 months after first getting included in the test, it will still count as a conversion (assuming the test is still running). There are several calculators available on the Internet; one is on our site: http://visualwebsiteoptimizer.com/ab-split-test-duration/

Using these calculators you can calculate how long to wait for the results before giving up.

John
July 8, 2010

Thanks Paras!

So just to clarify with an example, let’s say I get 10,000 visitors per day for thirty days, and so I have a total of 300K in my test population. Then, over the next 6-8 months, I get different conversions per month, but in the end I get a total of 3,000 conversions. Do I then use n=300K and p=1%? i.e. do I average the TOTAL conversions over the 30 days I created my population even though they take place on very different timelines?

On a related note, are there rules of thumb about the proximity of the conversion event to the page affected? To clarify: in my example, I am making a cosmetic change to a landing page. The nearest conversion event is registration, where they create an account on the landing page itself. That is a same-day event, and it makes a lot of sense that my changes in Pick-B might affect that conversion rate. However, if we now go out 6 months, where the user has interacted with many different parts of my site (logging in and out, researching, etc.), there are many exogenous factors that affect their purchase decision in that time that I have no influence over: life factors, income, age, competitors, etc. Is it really still valid to test a conversion so far out based on the color of a button (or similar) far upstream?

My hypothesis is that if there is enough separation between the two events (interaction with the landing page and conversion), then even if Pick-A and Pick-B were exactly the same, I would still likely see a slight difference in conversions between the picks. Are there tests that just don’t make sense to run?

Paras Chopra
July 8, 2010

Hi John,

This is interesting. I think ultimately it is up to the test creator to be aware of what his conversion goals are actually going to mean. Six months is a long period; however, if your test is designed with such a goal in mind, then you could of course take it as a valid goal.

Theoretically, if your variations do not have any effect on the six-month goal, you should see no statistically significant difference between the conversion rates (because visitors were randomly distributed).

But you raise an interesting point that the time horizon must make an impact: perhaps, due to sheer chance, group A experienced better customer service than group B and that is why they converted (and not because of the test variations). The more you lengthen the period, the more chance there is of such unknown variables impacting the two groups differently.

I don’t have a mathematical theory for this (yet), but it is a very interesting point for sure.

-Paras

Anne Stahl
August 20, 2011

I think there is one basic flaw with the pure mathematical approach, or at least with this approach: it doesn’t take trending into account! If I see a test “graph” with a lot of “noise” (both graphs cutting across each other), in other words, if on one day one variation is winning and the next day the other is winning, and so on, despite using cumulative data, then I don’t trust the result. For a result to be truly trustworthy or significant, the “noise” must have subsided and the trend must remain the same. In this way, I’d say there are a lot of folks calling tests “significant” when in fact they are not. There is a lot of noise caused by time of day, day of week, holidays, news, etc., and this will muddy your results. I’d love to see a mathematical calculation that takes time/trending into account!

Paras Chopra
August 23, 2011

@Anne: you make a good point, and it would be great to capture trending in a mathematical number. However, “chance to beat original” or “statistical significance” describes results in an overall context. With these metrics we want to understand the likelihood that the variation is performing better than the control, given a specific sample (collected over a number of days).

What you are asking for is a number that says how consistent the performance is. Those are two different things, but consistency can nevertheless be important too.

Andi
December 12, 2011

Can you explain how you get to this formula, please?
Standard Error (SE) = Square root of (p * (1-p) / n)

I don’t understand how you can calculate the standard error without knowing anything about the variance.
That would be really helpful, thank you!

Paras Chopra
December 12, 2011

@Andi: it is a binomial distribution, and for a binomial distribution the variance of a single trial is p * (1 − p). Dividing by n (for n independent trials) and taking the square root gives the standard error formula above.
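
If the jump from the binomial variance to the standard error formula feels abrupt, a quick simulation makes the connection concrete: repeat the “n visitors” experiment many times and compare the spread of the estimated conversion rates against the formula. A minimal sketch with illustrative values:

    import random
    from math import sqrt

    true_p, n, experiments = 0.06, 1000, 5000

    # Estimated conversion rate from each simulated experiment of n visits
    estimates = [sum(random.random() < true_p for _ in range(n)) / n
                 for _ in range(experiments)]

    mean = sum(estimates) / experiments
    empirical_se = sqrt(sum((e - mean) ** 2 for e in estimates) / experiments)

    print("Empirical SE: %.5f" % empirical_se)                      # close to the formula value
    print("Formula SE:   %.5f" % sqrt(true_p * (1 - true_p) / n))   # ~0.00751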

Rafael
December 13, 2011

How much overlap is allowed between the two distributions to be confident that version B is better?

You said that if there is 1% overlap between conversion rate distribution of control and variation, then there is 99% chance of variation beating the control.

What if there is a 5% overlap? In this case, is there a 95% chance of the variation beating the original?

What about a 6% overlap?

Thanks!

Paras Chopra
December 13, 2011

@Rafael: it depends on how important the results are for an organization and how much risk (of being wrong) it is willing to accept. A 99% chance to beat the original is always better, but if the stakes aren’t high, some organizations are okay with a 95% chance to beat the original too.

Rafael
December 13, 2011

Hey Paras, thank you for the answer!

So the “chance to beat the control” can be measured just by measuring this overlap?

For instance, does a 10% overlap mean a 90% chance to beat the control?

And a 15% overlap -> 85% chance
20% overlap -> 80% chance

And so on, so forth. Is this the case, or did I misinterpret it?

Paras Chopra
December 13, 2011

@Rafael: yes, your understanding is correct.

Komposit
April 11, 2012

Thanks a lot, this was just what I was looking for!

Keep up the good work.

Juan
July 1, 2013

Hi,

Great article! Thanks for sharing. I have one question: what if I want to use metrics that are not represented by a binomial or normal distribution?

For instance, what happens if I want to compare control vs variation looking at the metric: visits/user?

Thanks,

J

dragonpanda
July 19, 2013

I have a question about the math that goes into finding the z-score. In the Excel sheet, you used the formula: =(control_p-variation_p)/SQRT(POWER(control_se,2)+POWER(variation_se,2)), which = 1.721671363
However, shouldn’t we use the difference between the two proportions (the conversion rates) to find the z-score and test whether the difference is not 0? That formula would involve calculating the pooled p (conversion rate), etc.

Also, to find out whether or not it’s significantly different, don’t you have to compute 1 − p or 2(1 − p) (for two tails) to get the p-value and see whether it is <= 0.05, 0.01, etc.?

Thanks!
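
A minimal sketch of the pooled two-proportion z-test with a two-tailed p-value, as described in the comment above (this is only an illustration of that approach, not necessarily what the spreadsheet computes):

    from math import erf, sqrt

    def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
        """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))   # pooled standard error
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # two-tailed
        return z, p_value

    z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
    print("z = %.2f, two-tailed p = %.4f" % (z, p_value))        # significant at the 0.05 level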

Levi
August 28, 2013

Great blog thanks for sharing.

What considerations should be taken into account when an A/B/n test has an uneven traffic split, for example an A/B/C/D test with a traffic split of 70% (existing site), 10%, 10%, 10% respectively?
