Computing and Mathematical Sciences Colloquium
A/B testing is a hallmark of Internet services: from e-commerce sites to social networks to marketplaces, nearly all online services use randomized experiments as a mechanism to make better business decisions. Such tests are generally analyzed using classical frequentist statistical measures: p-values and confidence intervals. Despite their ubiquity, these reported values are computed under the assumption that the experimenter will not continuously monitor their test---in other words, there should be no repeated "peeking" at the results that affects the decision of whether to continue the test. It is well known that repeated significance testing (and rejection or continuation on the basis of those tests) can lead to very high false positive probabilities---well in excess of the guarantee provided by a significance threshold in the fixed-horizon setting.
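To see how large this inflation can be, the following small simulation sketch (not part of the talk; the function name and all parameter choices are illustrative) runs A/A experiments with no true effect and compares the chance that a naive two-sample z-test declares significance at any of many interim looks against the chance that it does so only at the fixed horizon. With 100 looks at the 5% level, the any-look false positive rate typically comes out several times larger than the nominal 5%, while the fixed-horizon rate stays near 5%.

import numpy as np

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_experiments=2000, n_steps=100, batch=100):
    """Simulate A/A tests (true effect = 0) and report how often a naive
    two-sided z-test on the running difference in means crosses the 5%
    threshold at ANY interim look, versus only at the final horizon."""
    z_crit = 1.96  # two-sided critical value for alpha = 0.05
    any_look_rejections = 0
    final_look_rejections = 0
    for _ in range(n_experiments):
        # Accumulate standard-normal observations for each arm in batches.
        a = rng.normal(size=(n_steps, batch))
        b = rng.normal(size=(n_steps, batch))
        n = batch * np.arange(1, n_steps + 1)
        mean_a = np.cumsum(a.sum(axis=1)) / n
        mean_b = np.cumsum(b.sum(axis=1)) / n
        # z statistic of the difference in means after each batch (variance 2/n)
        z = (mean_a - mean_b) / np.sqrt(2.0 / n)
        if np.any(np.abs(z) > z_crit):
            any_look_rejections += 1
        if abs(z[-1]) > z_crit:
            final_look_rejections += 1
    return (any_look_rejections / n_experiments,
            final_look_rejections / n_experiments)

peek_rate, fixed_rate = peeking_false_positive_rate()
print(f"false positive rate, peeking after every batch: {peek_rate:.3f}")
print(f"false positive rate, fixed horizon only:        {fixed_rate:.3f}")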
This poses a serious dilemma for the A/B test practitioner. On the one hand, it is clear that classical hypothesis testing is inappropriate in a setting where the test will be continuously monitored; this suggests that perhaps the right practice is to force A/B testers to do power calculations properly before any experiment, commit to the experiment length ex ante, and not "peek" at the results early. On the other hand, one of the greatest benefits of advances in information technology, computational power, and visualization is precisely the fact that experimenters can watch experiments in progress, with greater granularity and insight over time than ever before.
In light of the preceding discussion, we take the point of view that it is the statistical model that should change, not the user. As designers of experimental methodology, we should treat continuous monitoring as a requirement in any testing platform that is developed. So we ask the question: if users will continuously monitor experiments, then what statistical methodology is appropriate for hypothesis testing, significance, and confidence intervals? We present recent work addressing this question. In particular, we present analogues of classical frequentist statistical measures that are valid even though users are continuously monitoring the results. We also extend our results to multiple hypothesis testing.
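The abstract does not spell out the construction, but one standard way in the sequential-testing literature to obtain a p-value that remains valid under continuous monitoring is the mixture sequential probability ratio test (mSPRT); the sketch below is only an illustration of that general idea, not necessarily the method presented in the talk. It assumes normally distributed observations with known variance sigma2 and a normal mixing distribution with variance tau2 over the alternative mean; the function name and parameters are hypothetical.

import numpy as np

def always_valid_p_values(x, sigma2=1.0, tau2=1.0):
    """Always-valid p-value sequence for H0: mean = 0, based on a mixture
    sequential probability ratio test with a N(0, tau2) mixing distribution
    over the alternative mean (normal data, known variance sigma2).

    The mixture likelihood ratio Lambda_n is a nonnegative martingale under
    H0, so p_n = min(1, 1 / max_{k<=n} Lambda_k) satisfies
    P(exists n: p_n <= alpha) <= alpha for every alpha, no matter how often
    or when the experimenter looks at the results."""
    x = np.asarray(x, dtype=float)
    n = np.arange(1, len(x) + 1)
    ybar = np.cumsum(x) / n
    v = sigma2 * (sigma2 + n * tau2)
    # Closed-form log of the normal-mixture likelihood ratio after n observations
    log_lambda = (0.5 * np.log(sigma2 / (sigma2 + n * tau2))
                  + (n ** 2) * tau2 * ybar ** 2 / (2.0 * v))
    # Running maximum makes the p-value sequence monotone: once significant, always significant.
    return np.minimum(1.0, np.exp(-np.maximum.accumulate(log_lambda)))

# Example: data generated under the null; the p-value sequence can be
# inspected after every observation without inflating the error rate.
rng = np.random.default_rng(1)
p = always_valid_p_values(rng.normal(size=5000))
print(p[-1], p.min())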
Joint work with Leo Pekelis and David Walsh. (This work was carried out with Optimizely, a leading A/B testing platform.)