
Appsumo reveals its A/B testing secret: only 1 out of 8 tests produce results

Posted in News on February 15th, 2011

This is the second article in our series of interviews and guest posts on A/B testing and conversion rate optimization. In the first article, we interviewed Oli from Unbounce on landing page best practices.

Editor’s note: this guest post is written by Noah Kagan, founder of the web app deals website Appsumo. I have known Noah for quite some time, and he is the go-to person for any kind of marketing or product management challenge. You can follow him on Twitter @noahkagan. In the article below, Noah shares some of the A/B testing secrets and realities he discovered after running hundreds of tests on Appsumo.

Only 1 out of 8 A/B tests has driven significant change

AppSumo.com reaches around 5,000 visitors a day. A/B testing has given us some dramatic gains such as increasing our email conversion over 5x and doubling our purchase conversion rate.

However, I wanted to share some harsh realities about our testing experience. I hope this encourages you not to give up on testing and to get the most out of it. Here’s a data point that will most likely surprise you:

Only 1 out of 8 A/B tests has driven significant change.

That’s preposterous. Not just a great vocab word but a harsh reality. Here are a few of our tests that I was SURE would produce amazing results, only to be disappointed later.

A/B test #FAIL 1

Hypothesis: Title testing. We get a lot of traffic to our landing page, and a clearer message will significantly increase conversions.

Result: Inconclusive. We’ve tried over 8 versions, and so far not one has produced a significant improvement.

Why it failed: People don’t read. (Note: the real answer here is “I don’t know why it didn’t work out; that’s why I’m doing A/B testing.”)

Suggestion: We need more drastic changes to our page, like showing more info about our deals or adding pictures, to encourage a better conversion rate.

A/B test #FAIL 2

Hypothesis: Showing a tweet-for-a-discount prompt in a light-box vs. making someone click a button to tweet. We assumed that removing a click and putting the prompt (annoyingly) in front of someone’s face would encourage more tweets.

Result: 10% decrease with the light-box version.

Why it failed: ANNOYING. Totally agree. Also, it was premature, as people had no idea about it and weren’t interested in tweeting at that moment.

Suggestion: Better integrate people’s desire to share into our site design.

A/B test #FAIL 3

Hypothesis: A discount would encourage more people to give us their email on our landing page.

Result: Fail. Decreased conversion to email on our landing page.

Why it failed: An email address is a precious resource, and we are dealing with sophisticated users. Unless you are already familiar with our brand (a small audience), you aren’t super excited to trade your email for a percentage off.

Suggestion: Give away dollars off instead of a percentage off. Also, offer the discount alongside examples of deals so people can see what they could use it for.

Thoughts on failed A/B tests

All of these were a huge surprise and a disappointment for me.

How many times have you said, “This experience is 100x better; I can’t wait to see how much it beats the original version”?

A few days later you check your testing dashboard to see it actually LOSING.

Word of caution: beware of premature e-finalization. Don’t end tests before the data is finalized (i.e., statistically significant).
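A minimal sketch of what “statistically significant” means in practice: a two-proportion z-test comparing control and variation conversion rates. The visitor and conversion counts below are hypothetical, and a real testing tool computes this for you; the point is that a visible lift can still have a p-value too high to act on.

```python
# Two-proportion z-test for an A/B test (hypothetical numbers).
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z-score and two-sided p-value for the difference
    between two observed conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Variation B looks ~22% better, but the evidence is not yet conclusive:
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=150, n_b=2450)
print(f"z = {z:.2f}, p = {p:.3f}")  # p > 0.05: keep the test running
```

Stopping this test early because B “looks like a winner” is exactly the premature e-finalization trap.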

I learned the majority of my testing philosophy at SpeedDate, where literally every change is tested and measured. SO MANY times my tests initially blew the original version away, only for me to find out a few days later that a) the improvement wasn’t as amazing after all, or b) it actually lost.

How can you get the most out of your tests?

Some A/B testing tips based on my experience:

  • Weekly iterations. This is the most effective way I’ve found to do A/B testing.
    • Pick only 1 thing you want to improve. Let’s say it’s the purchase conversion rate for first-time visitors.
    • Get a benchmark of what that conversion rate is.
    • Run 1-3 tests per week to increase it.
    • Repeat every week until you hit an internal goal you’ve set for yourself.

    Most people test 80 different things instead of iterating on 1 priority over and over. Focusing simplifies your life.

  • Patience. Realize that getting results may take a few thousand visits or 2 weeks. Pick bigger changes to test so you aren’t waiting around for small improvements.
  • Persistence. Knowing that 7 out of 8 of your tests will produce insignificant improvements should comfort you that you aren’t doing it wrong. That’s just how it is. How badly do you want those improvements? Stick with it.
  • Focus on the big. I say this way too much but you still won’t listen. Some will, and they’ll see big results from this. If you have to wait 3-14 days for your A/B tests to finish, you’d rather see dramatic changes like -50% or +200% than a 1-2% change. This may depend on where you are in your business, but you likely aren’t Amazon, so 1% improvements won’t make you a few million dollars more.
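The “focus on the big” advice can be put in rough numbers. The sketch below uses Lehr’s rule of thumb (roughly 80% power at a 5% significance level) to approximate how many visitors per variation are needed to detect a given relative lift; the 3% baseline rate and the lifts are hypothetical, not Appsumo’s figures.

```python
# Rough sample-size sketch: required visitors per variation grow with
# the inverse square of the effect you want to detect (Lehr's rule).

def visitors_needed(baseline_rate, relative_lift):
    """Approximate visitors per variation to detect the given relative lift."""
    delta = baseline_rate * relative_lift           # absolute difference to detect
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    return int(16 * variance / delta ** 2)          # Lehr's approximation

baseline = 0.03  # hypothetical 3% purchase conversion rate
for lift in (0.02, 0.10, 0.50):
    n = visitors_needed(baseline, lift)
    print(f"{lift:>4.0%} lift -> ~{n:,} visitors per variation")
```

Under these assumptions, a 50% lift resolves in a couple of thousand visitors per variation, while a 1-2% lift needs on the order of a million: at a few thousand visitors a day, small-change tests simply never finish.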

If you like this article follow @appsumo for more details and check out Appsumo.com for fun deals.

Editor’s note: Hope you liked the guest post. It is true that many A/B tests produce insignificant results, and that’s precisely why you should be doing A/B testing all the time. For the next articles in this series, if you know someone I could interview, or you want to contribute a guest post yourself, please get in touch with me (paras@wingify.com).

Paras Chopra

CEO and Founder of Wingify by the day, startups, marketing and analytics enthusiast by the afternoon, and a nihilist philosopher/writer by the evening!



27 Comments
Matt
February 15, 2011

1) You’re not getting enough traffic to drive results. 5000 visitors a day is tiny.

2) A 10% change is meaningful, not a “Fail”.

Paras Chopra
February 15, 2011

@Matt: there is no rule of thumb for traffic. You can even get significant results at 100 visitors a day. It depends on a lot of factors.

10% change was in negative direction. It was a fail for sure.

yachris
February 15, 2011

I think it’s cool to see (A) that you tried so many things and (B) you really thought about why they did or didn’t work. I just ran some advertisements, and it was a nigh-total fail; but I learned a lot, and thought through some things that I wouldn’t have *had* to think through otherwise.

We’re taught about Edison and his 10,000 (or whatever) light-bulb failures until he found the one solution that worked, but most folks don’t realize the same idea can apply to their work.

Tal Raviv
February 15, 2011

Great post – thank you for being so open. Can you elaborate on “focus on the big”? It sounds like an important point – did you mean getting statistically significant experiments, or choosing major A/B changes rather than the ornamental?

Paras Chopra
February 15, 2011

@Tal: I will let Noah comment on this but according to my understanding, he meant doing big, bold changes in A/B tests rather than small changes like changing color of headline and stuff.

Chuck
February 15, 2011

Very informative. Can you provide any examples of A/B testing working positively for you?

Nate
February 15, 2011

Nice. I don’t think these are “failures” though – knowing that you were doing the right thing in the first place is certainly valuable information.

noah kagan
February 15, 2011

@Tal

@Matt, who left the first comment, said our 5,000 visitors a day was small, and he’s right.

The point is you want to go for the biggest wins, especially if you have small traffic amounts, since it’ll take longer to get definitive results.

Too many people are testing minor changes like button color to increase conversion when they only have 100 visitors a day. For example, if each buyer on your site is worth $10,000 and you have 10 visits a day, it’s way better ROI to focus on growth than on conversion or retention.

I tend to aim for the most drastic changes and then scale back from there. Here’s a great article from Seth Godin about how people are testing too much: http://sethgodin.typepad.com/seths_blog/2011/01/a-culture-of-testing.html

Good luck.

Lance Jones
February 16, 2011

Nice post (thank you, Paras and Noah) — love the sharing of what works and what doesn’t.

I wouldn’t be so quick to give up on some of your testing ideas, Noah. In my opinion, it was primarily the execution on validating the hypotheses that ‘failed’ here (sorry… I am a direct person).

People absolutely do read headlines. They’re a great opportunity for conversion optimization… but you must first have a strong grasp on the factors that influence conversion — like motivation, value proposition, anxiety, etc.

The headlines you tested above are merely tag lines — and tag lines have a low probability to increase conversion. You should go back to the drawing board and come up with some headlines that tap into the motivation of your site visitors. First learn about what motivates people to look for your (or similar) solutions… and then amplify that learning through great, clear copy in your headline test!

Lance

Paras Chopra
February 16, 2011

@Lance: thanks for your inputs! As we were discussing, I will be following up with you for an interview in this series. Will be excited to hear your point of view.

John Quarto-vonTivadar
February 18, 2011

I agree with Lance’s last comment. The purpose of testing is not to find out what works, but rather to find out what does NOT work. Noah’s tests reveal a rather large amount of information and insight for future testing. In fact, when a test “works” — and I use quotes to mean “does what we wanted it to do by supporting the hypothesis in some way” — we often learn *less*, because we over-interpret the success. As Lance also pointed out, it isn’t the headlines (or the pop-up) that are the problem; it’s the contextual basis under which they were presented. To paraphrase Bill S.: “The fault lies not in our tests, but in ourselves.” That is where you go to find the actual insight that leads to better tests: “What assumptions did I build into that test, and are they all valid?”, “If I were sitting across the table from this prospect, they would need X, Y, and Z at this point to continue — so is my test creating a roadblock to that (crappy headlines, premature pop-ups, etc.)?”

@Paras: There ARE rules of thumb for traffic: the more homogeneous the traffic, the smaller the variance you can expect between your sample of visitors and the population of visitors. If you had a site geared towards something specific — say, late-stage lung cancer patients — you wouldn’t need nearly as large a set of traffic to get meaningful results as with a broader spectrum of, say, eBay shoppers. That is not a trivial point to keep in mind, as it drives not only the size of your test samples but also the frequency of the tests and the overall testing schedule you keep.

So while 5,000 visitors per day is small when all one is doing is comparing the size of your set of visitors versus mine, the real measure is how segmented the audience is and whether there are large discrepancies in what each segment needs in order to proceed.

Paras Chopra
February 18, 2011

@John: thanks for your detailed comment. By rules of thumb I meant that you can’t throw around a figure like “one needs at least 1,000 visitors a day to get statistically significant results” without knowing what conversion goal is being measured and what kind of traffic is being sent to the test page.

Chris Rowett
February 18, 2011

I love that others share my frustration; it restores my confidence that I’m not just missing the point.

My last test revealed a very unexpected win. I took a very text heavy page that was supposed to be a pricing breakdown. I ran some de-cluttered versions and thought I better just have a completely sparse version with literally just the prices and no info.

Then I created a version with a clear call to action and a version with supporting information about next steps when you’ve chosen the right pricing model. I was really excited about the last one which seemed to be the solution to a lot of negative feedback my user testing had produced…

I’m sure you can imagine what happened. I’m still scratching my head as to why the completely sparse version won with a 30% uplift – back to the drawing board!

Lance Jones
February 18, 2011

@Chris, great example of a ‘head scratcher’. :-) It’s worth spending the time trying to figure it out — otherwise you’re taking a major leap of faith if you simply try to apply the same visual design and copy elements to other pages. If you have the traffic to support it, I recommend running a multivariate test to attempt to deconstruct the results… so that you can learn from them.

Lance

John Quarto-vonTivadar
February 18, 2011

@Chris: The first thing you need to do is to repeat the test. You have to convince yourself — and you can do this numerically — that the sample set of visitors in your test is representative of the population of your visitors as a whole. Or, more simply, did you just get a goofy mix of folks in the first test? One can’t really know this from just one test, though there are ways to sniff out some confidence levels.

I’d suggest repeating with less traffic since, at the end of the day, when you subject any visitors to a less optimized experience you’re costing yourself some money. What you’re looking to do is see whether the results differ, while costing yourself as little as possible and still getting meaningful results. It’s definitely a balancing act!

Further, back to the issue of “are there rules for the total amount of traffic for a test?” which was touched on earlier: if someone had, say, 5,000 visitors taking a test, I’d much rather see the results of 10 of the same test with 500 visitors each than one big test of 5,000. The challenge with low conversion rates is that you have to expose a larger number of people to the test to tease out insight into what are typically 1-2-3% conversion rates. This means the signal-to-noise ratio is rather low :( , but the same techniques as in polling (“Obama 51%, Generic Republican 49%”) are useful.

So the big challenge is first to “test your tests” by repeating them — because if you get a randomly skewed sample of visitors, it will completely throw off your interpretation of the test results. You’re aiming for Directionally Correct, not Metaphysical Certitude.

craig sullivan
February 19, 2011

Hi,

This article really kinda makes testing look dumb and randomly directed. I’m testing in quite a few countries (we operate in 35) and trying 8-100,000 odd variables per test. I do directed tests with inputs from clicktale, usability research, web analytics, previous tests, copywriters etc. etc.

In our case, only one experiment in the last 2 years has failed to give a positive result, and that was for one week only. A lot of this is down to test design but also because we use multi-variate testing.

The problem with running multiple A/B tests at different *time* or *traffic* mixtures, is that it might completely change the outcome. You could, in theory, test all those headlines at different times and find completely baffling results.

At least if you are doing multi-variate, you can then play with the variables and *how they interact*, at the same *time* with the *same traffic mix*.

I note the other comments about sample size and you’ve hit this with your early ‘predictions’ that were premature. You need to get very high confidence levels, especially if the results in the A/B test are close. They’re going to be close because you’ve done simple variables without huge changes. Ergo, you’ve made it harder to ‘see’ what will ‘push’ the conversion rate.

If you’re not quoting confidence levels and intervals, you’re not seeing how reliable the result might be – you need these figures as much as any lift figures. Also, if your business has a weekly or seasonal pattern, you need to test with one of these.

Last but not least, watch the traffic mix. If this changes, so will the results.

And remember to check the A/B results post the online funnel. What is the long term value of the customer across the lifetime?
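Editor’s note: Craig’s point about quoting confidence intervals, not just lift figures, can be sketched as a 95% Wilson score interval around an observed conversion rate. The counts below are hypothetical; most testing tools report an equivalent interval for you.

```python
# 95% Wilson score interval for a binomial proportion (hypothetical counts).
from math import sqrt

def wilson_interval(conversions, visitors, z=1.96):
    """Wilson score interval for an observed conversion rate (z=1.96 -> 95%)."""
    p = conversions / visitors
    denom = 1 + z**2 / visitors
    centre = (p + z**2 / (2 * visitors)) / denom
    half = z * sqrt(p * (1 - p) / visitors + z**2 / (4 * visitors**2)) / denom
    return centre - half, centre + half

lo, hi = wilson_interval(conversions=30, visitors=1000)
print(f"3.0% observed, 95% CI: {lo:.1%} to {hi:.1%}")
```

An observed 3.0% rate on 1,000 visitors is really “somewhere between roughly 2% and 4%”, which is why a small measured lift on low traffic can be pure noise.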

Paras Chopra
February 19, 2011

@Craig: great reply! Thanks for taking time to add extra value to this conversation.

[...] is expensive and often won’t lead to a relevant enough change. AppSumo recently found that only 1 out of 8 tests produce results. So what’s the point at slaving away, making minor tweaks, testing, and the ROI could [...]


[...] This is the 3rd article in the series of interviews and guest posts we are running on this blog regarding A/B testing and conversion rate optimization. In 1st article, we interviewed Oli from Unbounce on Landing Pages Best Practices. In 2nd article, Noah Kagan of Appsumo shared his A/B testing tips. [...]

[...] In 7 out of 8 cases, A/B tests do not produce conclusive results. By contrast, the very first user tests identify 50% of the problems, and 5 tests can uncover up to 80% of usability problems. [...]

Meta Brown
January 17, 2012

“People don’t read?” The fact that copy changes didn’t make a measurable difference in conversion doesn’t prove that. Maybe you just didn’t come up with motivating copy.

David
April 17, 2012

If you sell products that are easy to price shop by a brand or a model number, your conversion results may get skewed by any promotions that competitors may launch or stop during your test.

Deric
June 1, 2012

Following on the point from craig sullivan:
“In our case, only one experiment in the last 2 years has failed to give a positive result, and that was for one week only.”

The point about testing is not so much about just getting a positive or negative result; the end goal is testing your business hypothesis. At times it’s not just about “tweaking elements here and there” but about what the result indicates about other areas of your business that might need attention (i.e., is there an issue with your product structure, customer service, etc.?).

[...] Appsumo reveals its A/B testing secret: only 1 out of 8 tests produce results by Noah Kagen – this is a great case study referenced above. [...]


[...] (new variations produce either no change or perform poorer). Appsumo founder Noah Kagan has said this about their experience: Only 1 out of 8 A/B tests have driven significant [...]
