The Dark Side of A/B Testing - Avoid the Pitfalls
As product designers and marketers, we love the clarity that comes from A/B testing. We have an idea; we implement it; then we let the customers decide whether it’s good by running a controlled experiment. We split users into randomized groups and watch one treatment overtake the other on a statistically significant sample.
Thoroughly impressed by our own analytical rigor, we then scale up the winning treatment and move on. You guessed it: I’m about to poke holes in one of the most sacred practices in tech, A/B testing.
But let’s start by acknowledging why A/B testing is a good (and important) technique for your business. If you don’t do any A/B testing today, you’re clearly behind the curve: you’re making HIPPO-driven decisions (those driven by the Highest Paid Person’s Opinion). The problem is, even if you very much respect your HIPPO, they are human and, as such, biased in judgment.
If the only thing you go by is the HIPPO’s intuition, you’re blindly exploring the edge of the Grand Canyon. You can get lucky and not fall. Or, like the folks at MSN.com in the early 2000s, you can launch a massive redesign of the site driven entirely by the HIPPO. And watch all business metrics tank immediately afterwards… And not know what caused the fall. And not be able to roll back quickly.
But I know you, you’re better than this. You do A/B testing and you challenge your own assumptions – as well as the assumptions of your HIPPO. Wait, you’re not out of the woods either. Let’s explore a few pitfalls.
1. Temporal effects: what worked today may not work a month from now.
Each A/B test, by definition, has a duration. After a certain amount of time passes (which you, of course, determined up front by calculating the sample size needed for a statistically sound read), you make a call: option B is better than option A. You then scale up option B and move on with your life, onto the next test.
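If you want a quick sanity check on that sample-size math, the standard two-proportion power formula is enough for a ballpark figure. Here’s a minimal sketch in Python – the 5% baseline conversion rate, the half-point lift, and the daily traffic number are made-up assumptions, purely for illustration.

```python
# Rough sample-size estimate for a two-proportion A/B test.
# Assumed inputs (illustrative only): 5% baseline conversion,
# a hoped-for lift to 5.5%, alpha = 0.05, power = 0.80.
from scipy.stats import norm

def required_sample_size(p_a, p_b, alpha=0.05, power=0.80):
    """Users needed *per variant* to detect p_a -> p_b with a two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # critical value for the desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return int((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2) + 1

n_per_arm = required_sample_size(0.05, 0.055)
print(f"~{n_per_arm:,} users per variant")            # ~31k at these assumptions

daily_traffic_per_arm = 2_000                          # hypothetical traffic figure
print(f"~{n_per_arm / daily_traffic_per_arm:.0f} days at that traffic level")
```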
But what if user behavior was different during your test period? What if the novelty of option B is what made it successful – and a few months from now, that option will be ineffective? A concrete example from Grubhub: we make restaurant recommendations, and there’s a baseline algorithm A that we show on our site. When we roll out a challenger algorithm B, one reason the challenger can win is that it simply exposes new options. But will this lift be sustained? What if users just try the new restaurants recommended by algorithm B, and then stop paying attention to recommendations, just like they did with algorithm A?
There’s a flip side to this. In Facebook’s case, with the News Feed, any meaningful modification causes core metrics to tank – simply because the customer base is so used to the old version. So you’d reject EVERY test if you ended it after a week – and Facebook users produce more than enough data to reach a significant sample within a week! This, of course, would be a mistake, because you haven’t waited long enough for the users’ reaction to stabilize.
One possible reaction to this is “can’t I just run all my A/B tests forever?” That is, what if after a month, I scale up the winning option B to 95%, and keep the other option at 5% – and keep monitoring the metrics? This way, I’m capturing most of the business benefit of the winning option; and I can still react if the pesky temporal effect bites me. Yes, you can do that; you can even do an advanced version of this approach, a multi-armed bandit, in which your A/B testing system homes in on the best option automatically, continuously increasing the exposure of the winning variant.
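For the curious, here’s what the simplest form of that idea (an epsilon-greedy bandit) looks like – a rough sketch rather than a production traffic allocator; the 10% exploration rate and the conversion-rate bookkeeping are assumptions I’ve made for illustration.

```python
import random

# Minimal epsilon-greedy bandit: mostly serve the variant that's winning so far,
# but keep exploring the others a small fraction of the time.
class EpsilonGreedyBandit:
    def __init__(self, variants, epsilon=0.10):
        self.epsilon = epsilon                        # share of traffic reserved for exploration
        self.shown = {v: 0 for v in variants}         # impressions per variant
        self.converted = {v: 0 for v in variants}     # conversions per variant

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.shown))    # explore
        # Exploit: pick the variant with the best observed conversion rate.
        return max(self.shown, key=lambda v: self.converted[v] / max(self.shown[v], 1))

    def record(self, variant, converted):
        self.shown[variant] += 1
        self.converted[variant] += int(converted)

bandit = EpsilonGreedyBandit(["A", "B"])
# In production you'd call choose() per request and record() once the outcome is known.
```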
However, there’s one significant issue with this method: it pollutes your codebase – having to fork the logic makes the code hard to maintain. It also makes it very difficult to experience your product the way a customer would – this proliferation of user-experience forks creates nooks and crannies that you never test and only stumble upon later. Also known as bugs. So don’t keep these forks around for long, and don’t do this with every test.
One other possible defense is to rerun your tests occasionally. Confirm that winners are still winners, especially the most salient ones from many moons ago.
2. Interaction effects: great separately, terrible together
Imagine you’re working in a large organization that has multiple work streams for outbound customer communications – that is, emails and push notifications. One of your teams is working on a brand-new “abandoned cart” push notification. Another is working on a new email with product recommendations for customers. Both of these ideas are coded up and are being A/B tested at the same time. Each of them wins – each improves your business metrics. So you scale both. Then BOOM, the moment you scale both, your business metrics tank.
WHAT?!? How can that be?? We tested each of the options!! Well, this is happening because you’re over-messaging your customers. Neither idea crosses that threshold on its own, but together they do – and the annoyance (why are these guys pinging me so much?!?!) outweighs the positive effect.
You’ve just experienced another rub of A/B testing. There’s a built-in assumption in the framework that tests are independent – that they don’t affect each other. As the example above shows, this assumption can be false, and in ways that aren’t always this obvious.
To make sure this doesn’t happen, have someone be the responsible party for the full set of A/B tests that are going on. This person will be able to call out potential interaction effects; if you see one, just sequence the relevant tests instead of parallelizing them.
3. The pesky confidence level: the more tests you run, the higher the chance of error
If your organization culturally promotes the idea of experimentation, one “wrong” way it can manifest is by folks running a whole bunch of tiny tests. You know those: increase the font size by 1 point, swap the order of the two modules, change a couple words in the product description. Besides the fact that these changes will most likely not allow your organization to become the visionary of your industry (heh), there’s a poorly-understood statistics issue biting you here, too.
Every time you judge an A/B test and claim option B to be better than option A, you’re running a statistical calculation based on a t-test. Inside that calculation there’s a concept of a confidence level: the degree of certainty you’re comfortable with. Set it at 90%, and when there’s no real difference between the options, your A/B testing framework will still declare a winner 10% of the time – it’ll say that option B is better than option A when, in reality, it isn’t.
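To make that concrete, here’s roughly the kind of calculation a testing framework runs under the hood – a minimal sketch using scipy’s two-sample t-test on per-user conversion outcomes. The simulated data and the 90% confidence level are illustrative assumptions, not anyone’s real numbers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-user outcomes (1 = converted, 0 = didn't) for each variant.
conversions_a = rng.binomial(1, 0.050, size=10_000)   # control
conversions_b = rng.binomial(1, 0.055, size=10_000)   # challenger

# Two-sample t-test: is the difference in means larger than chance would explain?
t_stat, p_value = stats.ttest_ind(conversions_a, conversions_b, equal_var=False)

alpha = 0.10   # 90% confidence level -> 10% false-positive rate when there's no real difference
if p_value < alpha:
    print(f"B beats A (t = {t_stat:.2f}, p = {p_value:.3f}) -- but 1 in 10 such calls is noise")
else:
    print(f"No significant difference detected (p = {p_value:.3f})")
```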
Now, what happens if you run 20 tiny tests, each with a 10% probability of a false positive? Your chance of getting at least one winner by mistake is 1 – 0.9^20 – that is, about 88%. That’s right: out of those 20 tests, your A/B testing framework will likely show you one or two “fake” winners – possibly feeding the experimenting team a signal that there’s indeed gold there.
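The arithmetic behind that 88% fits on one screen – a quick back-of-the-envelope calculation, assuming the 20 tests are independent and none of the changes actually moves the metric:

```python
# 20 independent tests, each with a 10% false-positive rate,
# under the pessimistic assumption that none of the changes has any real effect.
alpha = 0.10
n_tests = 20

p_at_least_one_false_winner = 1 - (1 - alpha) ** n_tests
expected_false_winners = alpha * n_tests

print(f"Chance of at least one fake winner: {p_at_least_one_false_winner:.0%}")  # ~88%
print(f"Expected number of fake winners:    {expected_false_winners:.1f}")       # 2.0
```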
How do you avoid this issue? Have someone look at the list of tests. Disallow a zillion tiny modifications. Be extra-cautious if you’re testing version eight of a concept that just keeps failing.
We’ve explored some tactical issues you may discover when you adopt A/B testing as a philosophy for your marketing and product teams. These aren’t trivial faux pas that only amateurs succumb to… they’re surprisingly common. Make sure to inoculate your team against them.
This article was originally published on VentureBeat.