
What you really need to know about mathematics of A/B split testing

Recently, I published an A/B split testing case study where an eCommerce store reduced its bounce rate by 20%. Some readers were worried about the statistical significance of the results. Their main concern was that 125-150 visitors per variation is not enough to produce reliable results. This concern is a typical by-product of a superficial knowledge of the statistics that powers A/B (and multivariate) testing. I’m writing this post as an essential primer on the mathematics of testing, so that you never judge the reliability of a test result simply on the basis of the number of visitors.

What exactly goes behind A/B split testing?

Imagine your website as a black box containing balls of two colors (red and green) in unequal proportions. Every time a visitor arrives on your website, he takes a ball out of that box: if it is green, he makes a purchase; if it is red, he leaves the website. This way, essentially, that black box decides the conversion rate of your website.

A key point to note here is that you cannot look inside the box to count the number of balls of different colors in order to determine true conversion rate. You can only estimate the conversion rate based on different balls you see coming out of that box. Because conversion rate is an estimate (or a guess), you always have a range for it; never a single value. For example, mathematically, the way you describe a range is:

“Based on the information I have, 95% of the time the conversion rate of my website lies between 4.5% and 7%.”

As you would expect, the more visitors you get, the more balls you observe. Hence, your range gets narrower and your estimate approaches the true conversion rate.
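The narrowing of the range can be seen in a small simulation. This is only a sketch: the true rate of 6%, the number of repeated experiments, and the function name `observed_range` are all assumptions made for illustration, and in a real test the true rate is exactly what you cannot see.

```python
import random


def observed_range(true_rate, visitors, repeats=300, seed=42):
    """Repeat the experiment and return the middle 95% of the observed rates."""
    rng = random.Random(seed)
    rates = []
    for _ in range(repeats):
        # Each visitor draws one ball; it is green (a conversion)
        # with probability true_rate.
        conversions = sum(rng.random() < true_rate for _ in range(visitors))
        rates.append(conversions / visitors)
    rates.sort()
    # Drop the lowest 2.5% and highest 2.5% of the observed rates.
    low = rates[int(0.025 * repeats)]
    high = rates[int(0.975 * repeats) - 1]
    return low, high


if __name__ == "__main__":
    TRUE_RATE = 0.06  # hypothetical; hidden inside the black box in practice
    for visitors in (100, 1_000, 10_000):
        low, high = observed_range(TRUE_RATE, visitors)
        print(f"{visitors:>6} visitors: {low:.2%} to {high:.2%}")
```

Running it prints one range per visitor count; the ranges should tighten around the hidden rate as the number of visitors grows.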

The maths of A/B split testing

Mathematically, each visit is a trial with two possible outcomes: conversion or non-conversion. The number of conversions across many visits is therefore a binomial random variable. Let’s call the probability that a single visit converts p. Our job is to estimate the value of p, and for that we run n trials (that is, observe n visits to the website). After observing those n visits, we calculate what fraction of them resulted in a conversion. That fraction (which we represent from 0 to 1 instead of 0% to 100%) is the estimated conversion rate of your website.

Now imagine that you repeat this experiment multiple times. It is very likely that, due to chance, you will calculate a slightly different value of p every time. From all those (different) values of p, you get a range for the conversion rate (which is what we want for the next step of the analysis). To avoid running repeated experiments, statistics has a neat trick in its toolbox. There is a concept called standard error, which tells you how much deviation from the average conversion rate (p) can be expected if the experiment is repeated multiple times. The smaller the deviation, the more confident you can be about your estimate of the true conversion rate. For a given conversion rate (p) and number of trials (n), standard error is calculated as:

Standard Error (SE) = Square root of (p * (1-p) / n)
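
For readers wondering where this formula comes from, here is a brief sketch of the derivation. It assumes that visits are independent of each other and that every visit converts with the same probability p; the symbols X_i (the outcome of visit i) and \hat{p} (the observed conversion rate) are introduced here only for the derivation.

```latex
X_i \in \{0, 1\}, \qquad \Pr(X_i = 1) = p, \qquad \operatorname{Var}(X_i) = p(1-p)

\hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad
\operatorname{Var}(\hat{p}) = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{p(1-p)}{n}

\mathrm{SE} = \sqrt{\operatorname{Var}(\hat{p})} = \sqrt{\frac{p(1-p)}{n}}
```

In practice the true p is unknown, so the observed conversion rate is plugged in for p.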

Without going into much detail: to get the 95% range for the conversion rate, multiply the standard error by 2 (or 1.96, to be precise). In other words, you can say with 95% confidence that your true conversion rate lies within this range: p ± 2 * SE
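
The range calculation can be sketched in a few lines of Python. The function names and the sample figures (100 conversions out of 1000 visits) are assumptions chosen for illustration, not part of any particular tool.

```python
import math


def standard_error(p, n):
    """Standard error of a conversion rate p observed over n visits."""
    return math.sqrt(p * (1 - p) / n)


def conversion_range(conversions, visitors, z=1.96):
    """Return the (low, high) range for the conversion rate.

    z = 1.96 corresponds to a 95% range under the normal approximation.
    """
    p = conversions / visitors
    margin = z * standard_error(p, visitors)
    return p - margin, p + margin


if __name__ == "__main__":
    # Hypothetical example: 100 conversions out of 1000 visits.
    low, high = conversion_range(100, 1000)
    print(f"SE    = {standard_error(0.10, 1000):.5f}")
    print(f"range = {low:.1%} to {high:.1%}")
```

With these figures the standard error is about 0.0095 and the 95% range is roughly 8.1% to 11.9%.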

What does it have to do with reliability of results?

In addition to calculating a conversion-rate range for the website (the control), we also calculate a range for each variation in an A/B split test. Because we have already established (with 95% confidence) that the true conversion rate lies within each range, all we have to observe now is the overlap between the range of the control and the range of the variation. If there is no overlap, you can be confident that the variation is better (or worse, if the variation has the lower conversion rate) than the control. It is that simple.

As an example, suppose the control conversion rate has a range of 6.5% ± 1.5% and a variation has a range of 9% ± 1%. In this case the two ranges (5% to 8% and 8% to 10%) meet only at the 8% boundary, so there is effectively no overlap and you can be confident in the reliability of the results.
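
The overlap check itself is easy to automate. The sketch below uses hypothetical counts (100 and 150 conversions out of 1000 visits each) and made-up helper names; it treats ranges that merely touch at an endpoint as overlapping, which is a conservative assumption of this sketch.

```python
import math


def conversion_range(conversions, visitors, z=1.96):
    """Return the (low, high) range for the conversion rate (95% by default)."""
    p = conversions / visitors
    margin = z * math.sqrt(p * (1 - p) / visitors)
    return p - margin, p + margin


def ranges_overlap(range_a, range_b):
    """True if the two (low, high) ranges share at least one point."""
    return range_a[0] <= range_b[1] and range_b[0] <= range_a[1]


if __name__ == "__main__":
    control = conversion_range(100, 1000)    # hypothetical control data
    variation = conversion_range(150, 1000)  # hypothetical variation data
    print(f"control:   {control[0]:.1%} to {control[1]:.1%}")
    print(f"variation: {variation[0]:.1%} to {variation[1]:.1%}")
    print("overlap" if ranges_overlap(control, variation) else "no overlap")
```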

Do you call all that math simple?

Okay, not really simple, but it is definitely intuitive. To save yourself the trouble of doing all the math, either use a tool like Visual Website Optimizer, which automatically does all the number crunching for you, or, if you are running a test manually (such as for AdWords), use our free A/B split test significance calculator.

So, what is the take-home lesson here?

Always, always, always use an A/B split testing calculator to determine the significance of results before jumping to conclusions. Sometimes you may discount significant results as non-significant solely on the basis of the number of visitors (such as you may do for this case study). Sometimes you may think results are significant due to a large number of visitors when in fact they are not (such as here). You really want to avoid both scenarios, don’t you?
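
To see both scenarios in numbers, here is a minimal sketch of one standard significance check, a pooled two-proportion z-test. This is a swapped-in textbook technique used for illustration, not necessarily what any particular calculator does, and all visitor and conversion counts below are hypothetical.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-score for the difference between two conversion rates (pooled SE)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


def is_significant(z, critical=1.96):
    """Two-sided check at the 95% level under the normal approximation."""
    return abs(z) > critical


if __name__ == "__main__":
    # Few visitors, large difference: 20% vs 40% with 150 visitors each.
    z_small = two_proportion_z(30, 150, 60, 150)
    # Many visitors, tiny difference: 5.0% vs 5.1% with 10,000 visitors each.
    z_large = two_proportion_z(500, 10_000, 510, 10_000)
    print(f"150 visitors each:    z = {z_small:.2f}, significant: {is_significant(z_small)}")
    print(f"10,000 visitors each: z = {z_large:.2f}, significant: {is_significant(z_large)}")
```

In this made-up data the small test is significant (z is roughly 3.8) while the large one is not (z is roughly 0.3), which is why the visitor count alone tells you little.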

36 Comments »

  1. Great look at the reliability of A/B test results. When you get into quantification and accountability, many designers–the very people who need to be running A/B tests–get discouraged and never take the time to run tests.

    Comment by Brian Cray — January 26, 2010 @ 4:39 pm

  2. Can I confirm the maths for this formula with an example? Suppose the control web page has 1000 visitors, of which 100 convert (10% conversion rate), while the variation has 1000 visitors and 150 convert (15%). Would the respective SE be:

    Control:
    SQRT(0.1 * (1-0.1) / 1000)= 0.00949
    SE = 0.00949 * 1.96 = 0.0186
    thus 10% ± 1.9% = 8.1% to 11.9%

    Variation:
    SQRT(0.15 * (1-0.15) / 1000)= 0.01129
    SE = 0.01129 * 1.96 = 0.02213
    thus 15% ± 2.2% = 12.8% to 17.2%

    Thus, since there is no overlap, the variation results are reliable.

    Is this correct (or is another number used for n)?

    Comment by Duane — January 28, 2010 @ 11:13 am

  3. Yes, your calculations look fine to me.

    Comment by Paras Chopra — January 28, 2010 @ 11:19 am

  4. Thanks Paras.

    I think what was (and still is) confusing me is that when I tried to verify it using an online calculator (e.g. http://www.dimensionsintl.com/error_calculator.html) for 95% confidence with 1000 for population, 0.1 for proportion and 100 for sample size, it gave me double the ‘standard error’ of my calculations above.

    …so I suspect I am misunderstanding something either here or in using other online calculators.

    Comment by Duane — January 28, 2010 @ 12:13 pm

  5. @Duane. That standard error is double in other online calculators because it is +/-. I think they are probably reporting error around mean while in this article I give a range. It is a matter of reporting -x to +x v/s 2x

    Comment by Paras Chopra — January 28, 2010 @ 12:23 pm

  6. @Paras. Thanks – makes sense now. The fact that the difference was half/double gave me a suspicion it was something like that, but as I am currently sleep deprived, it wasn’t clicking :-)

    Thanks for a great post and an interesting service – I remember a few years back when services like the Visual Website Optimiser were so expensive, individuals and small companies couldn’t afford them. So nice to see that changing.

    Comment by Duane — January 28, 2010 @ 12:48 pm

  7. Thanks for this great explanation, really helps!

    Comment by Clint — January 30, 2010 @ 2:48 am

  8. (1) I think there is an error in the formula used by Duane for SE, the standard error. There is no 1.96 (the t-value for 95% confidence) in the SE formula.
    (2) SE is one standard deviation.
    (3) The range is typically reported as +/- t multiplied by SE.
    (4) The constant (critical t value) 1.96 is an approximation, assuming normality for n=30; the critical t value changes as n changes.
    (5) People use two-sided tests (are the two conversion rates different?, as here) versus one-sided tests (is the new change better than the control?). Using different types of tests will result in different t values. Duane probably sees the effects of different n values resulting in different t-values, and of different types of tests (one-sided vs. two-sided).
    (6) The normality approximation is reasonable when (a) p is small and (b) the number of conversions, not the number of trials, exceeds 30.
    Hope this helps,

    Comment by Mukul — February 16, 2010 @ 8:21 pm

  9. hi Paras,

    if I use the standard error formula given in this post, the numbers I get are not matching the standard error in your image.

    I have created an excel spreadsheet here. If there is something wrong with the formula please feel free to make changes: http://spreadsheets.google.com/ccc?key=0AlNACDtsQ-AzdFNzNHBaWHo4aktfUjRIcTJmek9VZXc&hl=en

    Comment by Inventov — March 2, 2010 @ 2:36 am

  10. for the comment above, by image I mean http://visualwebsiteoptimizer.com/split-testing-blog/wp-content/uploads/2010/01/result.png

    Comment by Inventov — March 2, 2010 @ 2:37 am

  11. Hi Inventov,

    Actually, you just calculated SE – remember you need to multiply it with 1.96 to get 95% range of conversion rate. In the image, we show 80% range which corresponds to z-score of 1.28.

    I have made modifications to your excel sheet and numbers do match now. If you make a great A/B testing spreadsheet, I’d love to share it here on this blog.

    -Paras

    Comment by Paras Chopra — March 2, 2010 @ 9:42 am

  12. Thanks. I’ve updated the file. Paras, can you update the column on chances to beat the original with your formula?

    Comment by Inventov — March 2, 2010 @ 9:53 am

  13. I’m curious to know what the math is for the “chance to beat original”. How does that get decided and is it really accurate?

    Comment by Clay — June 2, 2010 @ 3:42 am

  14. “Chance to beat original” simply measures the overlap between two distributions. If there is 1% overlap between the conversion rate distributions of the control and the variation, then there is a 99% chance of the variation beating the control.

    Comment by Paras Chopra — June 2, 2010 @ 3:02 pm
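
One common way to turn this overlap idea into a number is sketched below. It assumes both conversion rates are approximately normally distributed and independent; this is a textbook approximation chosen for illustration, not necessarily the calculation the tool performs, and the function name and sample counts are hypothetical.

```python
import math


def chance_to_beat(conv_control, n_control, conv_variation, n_variation):
    """Approximate probability that the variation's true rate exceeds the control's.

    Assumes a normal approximation and at least one arm with a rate
    strictly between 0 and 1 (otherwise the variance below is zero).
    """
    p_c = conv_control / n_control
    p_v = conv_variation / n_variation
    var_c = p_c * (1 - p_c) / n_control
    var_v = p_v * (1 - p_v) / n_variation
    z = (p_v - p_c) / math.sqrt(var_c + var_v)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


if __name__ == "__main__":
    # Hypothetical data: 100/1000 conversions for control, 150/1000 for variation.
    print(f"chance to beat original: {chance_to_beat(100, 1000, 150, 1000):.2%}")
```

When the two arms have identical data, the function returns exactly 50%, which matches the intuition that neither version is ahead.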

  21. Good stuff, thanks! I have an additional complication however. How do you define p and n when the conversion event may take place at some future point (i.e. not on the same day)?

    So let’s say you get 10,000 visitors to your site per day who register. Then, at some future point, they may decide to ‘convert’ (make a purchase for example – can only happen once) at any time between the registration date and a year from that date or more. Of those who will convert, most do so within the first 6 months, and then the conversions trail off. How do you set up this experiment?

    Say you expose two groups, Pick-A and Pick-B to two different landing pages and you want to determine the effect of the landing page on the ultimate conversion. So you create a “class” which you define as anyone who visits the landing pages for one month. At that point, the class is defined, but the test continues because they have not yet converted.

    My questions are, how do you define the conversion rate (do you average the total conversions over the exposure time of one month?), how do you define the trials (is one trial the first visit to the landing page so that a trial is a unique visitor?), and how long do you wait before you stop the test and decide that you have enough conversion data?

    Comment by John — July 8, 2010 @ 4:46 am

  22. Hi John,

    No matter how long your test is running, it won’t affect your conversions. If your visitor converts six months after first being included in the test, it will still count as a conversion (assuming you still have the test running). There are several calculators available on the Internet, including one on our site: http://visualwebsiteoptimizer.com/ab-split-test-duration/

    Using these calculators you can calculate how long to wait for the results before giving up.

    Comment by Paras Chopra — July 8, 2010 @ 11:45 am

  23. Thanks Paras!

    So just to clarify with an example, let’s say I get 10,000 visitors per day for thirty days, and so I have a total of 300K in my test population. Then, over the next 6-8 months, I get different conversions per month, but in the end I get a total of 3,000 conversions. Do I then use n=300K and p=1%? i.e. do I average the TOTAL conversions over the 30 days I created my population even though they take place on very different timelines?

    On a related note, are there rules of thumb about the proximity of conversion events to the page affected? To clarify, in my example, I am making a cosmetic change to a landing page. The nearest conversion event is registration – where they create an account on the landing page itself. That is a same-day event, and it makes a lot of sense that my changes in Pick-B might affect the conversion rate. However, if we now go out 6 months where the user has interacted with many different parts of my site, logging in and out, researching, etc. There are many exogenous factors that affect their purchase decision in that time that I have no influence over – life factors, income, age, competitors, etc. Is it really still valid to test to a conversion so far out based on the color of a button (or similar) far upstream?

    My hypothesis is that if there is enough separation between the two events – interaction with the landing page and conversion – that even if Pick-A and Pick-B were exactly the same, that I would still likely see a slight difference in conversions between the picks. Are there tests that just don’t make sense to run?

    Comment by John — July 8, 2010 @ 9:05 pm

  24. Hi John,

    This is interesting. I think ultimately it is up to the test creator to be aware of what his conversion goals are actually going to mean. Six months is too long a period; however, if your test is designed with such a goal in mind, then you could of course take it as a valid goal.

    Theoretically if your variations do not have any effect on the six-month goal, you should see no statistical significance in the difference between conversion rates (because visitors were randomly distributed).

    But you raise an interesting point that the time horizon must make an impact: perhaps, due to sheer chance, group A experienced better customer service than group B, and that is why they converted (and not because of the test variations). The more you lengthen the period, the greater the chance of such unknown variables affecting the groups differently.

    I don’t have a mathematical theory for this (yet), but it is a very interesting point for sure.

    -Paras

    Comment by Paras Chopra — July 8, 2010 @ 10:59 pm

  26. I think there is one basic flaw with the pure mathematical approach, or at least with this approach: it doesn’t take trending into account! If I see a test ‘graph’ that has a lot of ‘noise’ (both graphs cutting across each other), in other words, if one day one is winning and the next day the other variation is winning, and so on, despite using cumulative data, then I don’t trust the result. For a result to be truly trustworthy or significant, the ‘noise’ must have subsided and the trend must remain the same. In this way, I’d say there are a lot of folks calling tests ‘significant’ when in fact they are not. There is a lot of noise caused by time of day, day of week, holidays, news, etc., and this will muddy your results. I’d love to see a mathematical calculation that takes time/trending into account!

    Comment by Anne Stahl — August 20, 2011 @ 3:36 am

  27. @Anne: you make a good point, and it would be great to capture trending in a single mathematical number. However, ‘chance to beat original’ or ‘statistical significance’ talks about results in the overall context. With these metrics we want to understand how likely it is that the variation is performing better than the control, given a specific sample (over a number of days).

    What you are asking is a number that says how consistent is the performance. Those are two different things but nevertheless consistency can be important too.

    Comment by Paras Chopra — August 23, 2011 @ 7:58 pm

  30. Can you explain how you get to this formula, please?
    Standard Error (SE) = Square root of (p * (1-p) / n)

    I don’t understand how you can calculate the standard error without knowing anything about the variance.
    That would be really helpful, thank you!

    Comment by Andi — December 12, 2011 @ 1:51 pm

  31. @Andi: it is a binomial distribution, and for a binomial distribution the variance of a single trial is p * (1 – p)

    Comment by Paras Chopra — December 12, 2011 @ 1:58 pm

  32. How much overlap is allowed between the two distributions to be confident that version B is better?

    You said that if there is 1% overlap between conversion rate distribution of control and variation, then there is 99% chance of variation beating the control.

    What if there is a 5% overlap? In this case, is there a 95% chance of the variation beating the original?

    What about a 6% overlap?

    Thanks!

    Comment by Rafael — December 13, 2011 @ 4:27 am

  33. @Rafael: it depends on how important results are for an organization and how much risk (of being wrong) it is willing to accept. 99% chance to beat original is always better, but if stakes aren’t high some organizations are okay with 95% chance to beat original too.

    Comment by Paras Chopra — December 13, 2011 @ 1:16 pm

  34. Hey Paras, thank you for the answer!

    So the “chance to beat the control” can be measured just by measuring this overlap?

    For instance, does a 10% overlap mean a 90% chance to beat the control?

    And a 15% overlap -> 85% chance
    20% overlap -> 80% chance

    And so on, so forth. Is this the case, or did I misinterpret it?

    Comment by Rafael — December 13, 2011 @ 5:45 pm

  35. @Rafael: yes, your understanding is correct.

    Comment by Paras Chopra — December 13, 2011 @ 5:59 pm
