Fantasy vs the Real World: Naive Bayesian AB Testing vs Proper Statistical Inference

This post is addressed to a certain camp of proponents and practitioners of A/B testing based on Bayesian statistical methods, who claim that outcome-based optional stopping, often called data peeking or data-driven stopping, has no effect on the statistics, and thus none on the inferences and conclusions drawn from the observed data. Others do recognize it as an issue and attempt to correct for it, with varying degrees of success, as far as one can deduce from their technical papers. I cover the major tools used in AB testing, particularly in conversion rate optimization and landing page optimization: Visual Website Optimizer (VWO), Optimizely, and Google Analytics Content Experiments.

To better illustrate the issue, below I provide a few quotes from the “optional stopping doesn’t matter” camp. I then make a short argument on why optional stopping matters for Bayesian methods and cover approaches in several tools, namely if and how they address the optional stopping issue. I do so to the extent I can analyse them as an outsider, namely based on the technical papers published by the providers and by other facts that can be gleaned from the tools themselves. The post ends with some quotes from prominent statisticians of all stripes and colors, as well as recommended readings on the topic for those who want to dig in deeper.

## Claims About Immunity from Peeking in A/B Tests

Proponents of the view that Bayesian A/B and MVT tests are immune to the peeking effect include statisticians and philosophers of science, as well as AB testing practitioners and gurus. Take for example Evan Miller, who says, in an otherwise very good post, the following:

> With Bayesian experiment design you can stop your experiment at any time and make perfectly valid inferences. Given the real-time nature of web experiments, Bayesian design seems like the way forward.

Similarly, Chris Stucchio, a data scientist for VWO, writes with regard to a Bayesian approach:

> This A/B testing procedure has two main advantages over the standard Students T-Test. The first is that unlike the Student T-Test, you can stop the test early if there is a clear winner or run it for longer if you need more samples.

Frank Portman, a data scientist from Uber, has this passage in the description of an R package for AB testing he recently published:

> Bayesian tests are also immune to ‘peeking’ and are thus valid whenever a test is stopped.

Deng, Liu & Chen from Microsoft state in their 2016 paper “Continuous Monitoring of AB Tests without Pain – Optional Stopping in Bayesian Testing”, among other things*:

> …the Bayesian posterior remains unbiased when a proper stopping rule is used.

with “proper” defined so as to include what is described below as “optional stopping”.

These are just some of the more recent and more notable examples and I’ve kept them true to context.

## Effect of Optional Stopping on Statistics and Inference

The effect of outcome-based optional stopping was noted as early as 1969 in a famous paper by Armitage et al. (1969)^{[1]} called “Repeated Significance Tests on Accumulating Data”. In it we find estimates of the increase in error probability to which this kind of stopping rule leads, expressed in multiples of the nominal error probability (usually reported as statistical significance). For example, peeking twice using an outcome-based stopping rule results in an actual error probability that is **more than twice the nominal one**. Peeking **5 times** results in an error probability **~3 times larger** than the nominal one, while **peeking 10 times makes it ~4 times larger**. The results are confirmed by both numerical integration and straightforward simulations.
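The inflation is easy to reproduce with a simulation. What follows is a minimal sketch under stated assumptions, not a re-creation of the calculations in Armitage et al.: the data are normal with known variance, there is no true effect, looks are equally spaced, the naive threshold is a two-sided 5% z-test, and "peeking k times" is taken to mean k interim looks plus one final analysis. The function name `peeking_error_rate` and all parameter values are illustrative.

```python
import math
import random


def peeking_error_rate(n_looks, n_sims=20000, z_crit=1.96, seed=1):
    """Estimate the chance that at least one look is 'significant' under no effect.

    Each look adds one equally sized batch of data. Under the null hypothesis
    the standardized batch sum is N(0, 1), so after k looks the cumulative
    z-score is (sum of k standard normal draws) / sqrt(k).
    """
    rng = random.Random(seed)
    false_positives = 0
    for _sim in range(n_sims):
        cumulative = 0.0
        for k in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(k)
            if abs(z) > z_crit:  # naive threshold crossed: stop and declare a winner
                false_positives += 1
                break
    return false_positives / n_sims


if __name__ == "__main__":
    nominal = 0.05
    for peeks in (0, 2, 5, 10):
        rate = peeking_error_rate(peeks + 1)  # interim peeks plus the final analysis
        print(f"{peeks:>2} peeks: actual error rate {rate:.3f} "
              f"({rate / nominal:.1f}x nominal)")
```

The exact multiples depend on how the looks are counted and spaced, but the pattern is the same in every variation: each additional outcome-based look adds to the actual error probability.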

An often-employed outcome-based stopping rule consists of checking an observed naive p-value, confidence interval, Bayes factor, posterior probability or similar statistic and deciding whether to stop or continue the test based on that observation. Armitage focused on p-values and I will do so as well, but the conclusions hold for any statistic, no matter the method used to derive it, as I’ll demonstrate. In terms of the effect on decision-making, using an outcome-based stopping rule without compensating for it usually means that a winner is declared when the data at hand provide far less evidence than they appear to.

I’ve already gone to significant lengths to explain the effect of optional stopping on all statistical measures of evidence through its effect on the sample space in my post “The Bane of AB Testing: Reaching Statistical Significance”, so you are advised to read it if you want a deeper understanding of the issue.

Briefly put, the reason why not accounting for the stopping rule leads to bias and to actual error rates much higher than the nominal ones is that doing so fails to include in the model crucial information about the procedure that generated the data. That information consists of the fact that optional stopping was attempted and failed until a given moment. This observed data, if properly taken into account, would shift the sample space significantly and result in very different conclusions, even if the same number of users and conversion rates were observed at the end of a test.

*This applies to frequentist and Bayesian methods alike.*

**There is no Bayesian practice that lets you close your eyes and pretend a crucial part of the observed data is simply not there.**

That’s simple to demonstrate: no matter how it is constructed, a posterior probability for a particular set of end results is by definition a direct product of the prior probability and the data input from a particular test. Ignoring a part of the data is not warranted under any reading of Bayes’ theorem.
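To see that a naive posterior is affected in the same way as a naive p-value, one can simulate A/A tests (no true difference) and stop as soon as the posterior probability that one arm beats the other crosses a threshold. The sketch below is an illustration under assumptions of my choosing, not any vendor's procedure: flat Beta(1, 1) priors, a normal approximation to the Beta posteriors, a 95% threshold, and hypothetical function names and traffic numbers.

```python
import math
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b):
    """Approximate P(rate_B > rate_A) under independent Beta(1, 1) priors.

    Uses a normal approximation to each Beta posterior, an assumption made
    to keep the sketch dependency-free.
    """
    def mean_var(conv, n):
        a, b = 1 + conv, 1 + n - conv
        return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))

    mean_a, var_a = mean_var(conv_a, n_a)
    mean_b, var_b = mean_var(conv_b, n_b)
    z = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def false_winner_rate(n_looks, users_per_arm=500, rate=0.2,
                      threshold=0.95, n_sims=2000, seed=7):
    """Share of A/A tests in which some look declares a winner."""
    rng = random.Random(seed)
    batch = users_per_arm // n_looks
    winners = 0
    for _sim in range(n_sims):
        conv_a = conv_b = n = 0
        for _look in range(n_looks):
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = prob_b_beats_a(conv_a, n, conv_b, n)
            if p > threshold or p < 1 - threshold:  # either arm "wins": stop
                winners += 1
                break
    return winners / n_sims


if __name__ == "__main__":
    print("1 look:  ", false_winner_rate(1))
    print("10 looks:", false_winner_rate(10))
```

Both runs use the same total number of users per arm. The only thing that changes is how often the posterior is consulted with the intent to stop, and that alone is enough to raise the share of tests that end with a declared winner.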

When viewed this way it becomes obvious we’ve changed the question from (in Bayesian terms):

what is the probability of H, given X_{0}, written P(H | X_{0})

to

what is the probability of H, given X_{0} and T_{0}, written P(H | X_{0}, T_{0})

where H is a hypothesis of interest, X_{0} is the observed data at the end of the test and T_{0} denotes a particular stopping time from a series of possible stopping times T. As a consequence of the stopping rule, T_{0} also signifies the fact that we have observed a series of n “non-results” (statistic was less than a certain extreme threshold, whatever the threshold or statistic) before finally stopping the test at T_{0}.

The first question is about the probability of a hypothesis conditional on observing an extreme statistic just at the end of the test; the second is about the probability of a hypothesis conditional on the **joint event** of observing a series of n “non-results” **and** of finally observing a given extreme statistic and stopping. The joint probability of two events happening can never exceed, and is typically smaller than, the marginal probability of just one of them happening (this holds for any n > 1), which means that if we peeked and failed to observe an extreme statistic there is less chance that H is actually true.

**Thus the extra data from peeking works against the probability that there is a real difference between the tested variants and in favor of there being no difference, or a smaller one, between them.**

To use a metaphor, say we flip a coin 10 times and fail to observe a heads, and then on the 11th flip we observe a heads. Does this series of observations in itself support a conclusion that the coin is fair more than a conclusion that the coin is biased? The answer is obviously the latter, since if the coin were fair it would have been highly unlikely for us to observe 10 tails in a row. If we take this example to the case of AB testing, each flip would correspond to an observation of the data with the intent to stop. It would mean that a high naive posterior (significant result) obtained after trying 10 times is much weaker evidence, or decision guidance, than an equally high naive posterior that was obtained from just a single try.
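The arithmetic behind the coin metaphor can be written out in a few lines. The biased coin used for comparison, one that lands heads 10% of the time, is a hypothetical choice made only for illustration.

```python
def sequence_likelihood(p_heads, n_tails=10):
    """Probability of n_tails tails followed by one heads, given P(heads) = p_heads."""
    return (1 - p_heads) ** n_tails * p_heads


fair = sequence_likelihood(0.5)    # fair coin
biased = sequence_likelihood(0.1)  # hypothetical coin that lands heads 10% of the time

print(f"fair coin:   {fair:.6f}")
print(f"biased coin: {biased:.6f}")
print(f"likelihood ratio (biased / fair): {biased / fair:.1f}")
```

Under these assumptions the observed sequence is roughly 70 times more likely for the biased coin than for the fair one, which is the sense in which the ten failed attempts count as evidence and cannot be discarded.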

**It should be clear at this point that to refuse to account for optional stopping is equal to refusing to use part of the data as evidence for or against a hypothesis.**

## How Do Different A/B Testing Tools Account for Optional Stopping?

Even though all three tools covered use Bayesian methods for A/B testing, the three key players, VWO, Optimizely and Google Analytics Content Experiments, each take a different approach to the issue.

**VWO’s** approach, as far as I understand it from their technical paper^{[12]}, doesn’t try to adjust error probabilities to account for optional stopping at all, and hence seems prone to underestimating the uncertainty in the reported “credible intervals”. I was not able to assess their approach fully, since their paper doesn’t really cover that topic. The most insight into how they deal with it comes from a comment by their chief data scientist Chris Stucchio, in which he admits that **optional stopping inflates the error rates**, but adds that their loss function is adjusted for optional stopping in an unspecified manner:

> Also, the point is well noted that peeking does affect the error rates, though the loss does remain below the threshold.

However, even though he also mentions that “If there is a cost to switching, then the ideal way to handle this is to build it directly into the loss function”, it remains **unclear** to me whether and how a user has any input into that loss function calculation, or even knows that VWO is trying to control it.

I’m willing to go out on a limb here and say that no loss function can account for the real-world complexity of decision making. While the loss function might be controlling the loss levels at predefined values, it only does so in the super narrow scope of the conversion data available to the tool, while lacking any context whatsoever. I’m also not sure how many of VWO’s users understand that the tool they are using does not actually provide error control, but loss function control, which can potentially be an issue.
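To make the distinction between error control and loss control concrete, here is a minimal sketch of what an expected-loss calculation can look like. It is not VWO's implementation, whose details I cannot verify: the flat Beta(1, 1) priors, the Monte Carlo estimate, the function name `expected_loss` and the conversion counts are all assumptions made for illustration.

```python
import random


def expected_loss(conv_a, n_a, conv_b, n_b, n_draws=20000, seed=3):
    """Monte Carlo estimate of the expected loss of choosing each variant.

    Posteriors are Beta(1 + conversions, 1 + non-conversions), i.e. flat
    priors (an assumption of this sketch). The loss of choosing a variant is
    the conversion rate given up if the other variant is actually better,
    averaged over the posterior.
    """
    rng = random.Random(seed)
    loss_a = loss_b = 0.0
    for _ in range(n_draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        loss_a += max(rate_b - rate_a, 0.0)  # given up by choosing A
        loss_b += max(rate_a - rate_b, 0.0)  # given up by choosing B
    return loss_a / n_draws, loss_b / n_draws


# Hypothetical data: 100/1000 conversions for A, 130/1000 for B.
loss_a, loss_b = expected_loss(100, 1000, 130, 1000)
print(f"expected loss if we choose A: {loss_a:.5f}")
print(f"expected loss if we choose B: {loss_b:.5f}")
```

A rule of the form "stop once the expected loss of the leading variant drops below a threshold" bounds how much conversion rate one expects to give up. It says nothing about how often a variant with no real advantage gets declared the winner, and that is the quantity error control is concerned with.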

Maybe the loss function works just fine if a test is for something super trivial, like the color of a button in the site’s header, and the result won’t be used for anything besides that. However, what if the test is a pilot for a new direction in the overall UX and design, and the testers are planning to commit to a lot of future efforts based on the directional result of this single test? I think in such a case the lack of control of error probabilities can lead to a wrong conclusion with significant underestimation of the actual uncertainty, leading to a whole series of efforts and future A/B tests that would not have happened otherwise.

**Optimizely’s** approach, on the other hand, does correct for optional stopping, but in what appears to be a sub-optimal way. From their technical paper^{[6]} we understand that they treat each time a test is loaded in their interface by a user as a “look” or “peek” and adjust their stats accordingly. They do so using the False Discovery Rate control method, proposed by Benjamini and Hochberg and improved by Benjamini and Yekutieli, with which readers of this blog and users of our statistical significance calculator should have been familiar for more than a year before Optimizely started using it.

While this procedure has very interesting applications, I don’t think applying it to optional stopping is a correct approach. Optional stopping, if viewed from the perspective Optimizely approaches it, is, I argue, a case for controlling Family-Wise Error Rates (FWER), not False Discovery Rates (FDR). FDR control is less stringent than FWER control, since FDR controls the expected proportion of type I errors among the rejected hypotheses, while FWER controls the probability of making at least one type I error. As Pekelis et al. themselves state about FDR: “It is defined as the expected proportion of false detections among all **detections** made” (emphasis ours). While FDR is more powerful, that comes at the cost of a less stringent type I error control compared to the FWER.

Since we are interested in the probability of erroneously stopping the test after any given data observation, we need control that is relevant to making a single error. It is not clear, at least to me, how FDR is able to provide such control, since as far as I understand **it offers no such guarantees**. In fact, as the number of looks or peeks increases, so does the number of errors permitted by the BH-FDR or BHY-FDR procedures and, as a consequence, so does the probability that we would stop when we should not. I would argue that **control of FWER is required instead.**
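The difference in stringency between the two kinds of control can be seen on a small set of p-values. The sketch below compares the Bonferroni correction (a simple FWER-controlling method) with the Benjamini-Hochberg step-up procedure (FDR control). It illustrates the two concepts only and is not Optimizely's sequential procedure; the p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """FWER control: reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """FDR control: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values, for illustration only.
p_values = [0.001, 0.012, 0.021, 0.034, 0.2]
print("Bonferroni rejections:", sum(bonferroni_reject(p_values)))
print("BH rejections:        ", sum(benjamini_hochberg_reject(p_values)))
```

On this input Bonferroni rejects one hypothesis while Benjamini-Hochberg rejects four. The extra rejections are where the additional power of FDR control comes from, and also where the weaker guarantee about making even a single error shows up.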

Even with the above criticism aside, I think the performance of the approach is sub-optimal in terms of the sample size required to test at a given significance level and power. That happens as a consequence of the fact that Optimizely apparently treats **each view of a test** in their interface as a data observation made with the intention to stop the test if a threshold is crossed.

“The act of viewing an A/B test must mean that it is a candidate in an experimenter’s selection procedure.”

One can immediately think of at least several cases where that won’t be true: one or more tabs are loaded when a browser is started or restarted; one or more tabs remain active for extended periods of time (in the background); the user is looking at the data, but just to make sure the test is technically OK (users are gathering at the expected rate, users are assigned to all variations, etc.); the user is looking at the data to compile a report, but not necessarily with intent to react to the data… I’m sure readers with A/B testing experience can think of more (and are welcome to share them in the comment section).

Due to the above, the optional stopping adjustment will be **less powerful than optimal**, leading to longer test times, all else being equal.

**Google’s Content Experiments** takes a different path than both Optimizely and VWO and uses a **multi-armed bandit approach** instead, as described in Scott (2010)^{[9]}. From what I gather from their technical paper and the limited information available in their help files, as well as from anecdotal tests showing a very large gap between a classic statistical significance calculation and their own reported “chance to outperform”, it seems they process data in batches as it accrues and **evaluations to stop or continue are made with each batch**. This introduces significant inefficiency, similar to Optimizely’s case, but greater. This inefficiency should be offset to at least some extent by the distribution of traffic between the arms, but I don’t have the details needed to assess to what extent exactly.
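For readers unfamiliar with bandits, the following sketch shows the general idea of shifting traffic toward the arm that looks better. It uses Thompson sampling with per-user updates, the simplest form of the randomized probability matching that Scott discusses. It is not Google's implementation, which, as noted, appears to work in batches; the conversion rates, sample size and function name are hypothetical.

```python
import random


def thompson_sampling(true_rates, n_users=5000, seed=11):
    """Assign users one at a time: draw a conversion rate from each arm's
    Beta posterior and show the arm with the highest draw."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_users):
        draws = [rng.betavariate(1 + s, 1 + f)
                 for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        if rng.random() < true_rates[arm]:  # simulated conversion
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


pulls = thompson_sampling([0.05, 0.10])  # hypothetical conversion rates
print("users per arm:", pulls)
```

Because the better arm ends up receiving most of the traffic, fewer users are exposed to the worse variant while the test runs. That is the sense in which the allocation can partly offset the inefficiency of evaluating the data at every batch.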

On a bit of a side note: Chris Stucchio from VWO offers some interesting advice **against** using bandits in A/B testing as a whole.

## Optional Stopping and Bayesian Inference – Quotes and Further Reading

In case you’re interested in digging deeper into the issue of optional stopping in a Bayesian context, there are multiple papers^{[3,4,5,8,11,13]} and articles^{[2,7,10]} detailing why not including the stopping rule in the analysis can have just as disastrous an effect on error-rate control (or Bayes factors) in a Bayesian design as it does in a frequentist one. As the famous statistical fraud debunker Uri Simonsohn (2014)^{[10]} puts it while discussing optional stopping:

> When a researcher p-hacks, she also Bayes-factor-hacks.

Below are a few more quotes from papers and articles that support and expand the above explanation. The reader is, of course, encouraged to read the full papers/posts.

Lindsey (1997) ^{[4]} gives examples of designs which result in the same reported likelihood (or nominal likelihood) for the final stopped experiment while in fact they have different underlying likelihood functions. This means that when an outcome-based stopping rule is not taken into account the Bayesian inference suffers from the issue of reporting significantly higher likelihood than the one actually warranted by the data.

This is further supported by Gelman, A., Carlin, J.B et al. (2003) ^{[3]} in their highly influential book where they state:

> A naïve student of Bayesian inference might claim that because all inference is conditional on the observed data, it makes no difference how those data were collected, […] the essential flaw in the argument is that a complete definition of “the observed data” should include information on how the observed values arose […].

Prominent Bayesian statistician Prof. Andrew Gelman explains how and when the stopping rule should be accounted for in Gelman (2014) ^{[2]}:

> …the stopping rule enters Bayesian data analysis in two places: inference and model checking:
>
> 1. For inference, the key is that the stopping rule is only ignorable if time is included in the model. To put it another way, treatment effects (or whatever it is that you are measuring) can vary over time, and that possibility should be allowed for in your model, if you’re using a data-dependent stopping rule. To put it yet another way, if you use a data-dependent stopping rule and do not allow for possible time trends in your outcome, then your analysis will not be robust to failures with that assumption.
>
> 2. For model checking, the key is that if you’re comparing observed data to hypothetical replications under the model (for example, using a p-value), these hypothetical replications depend on the design of your data collection. If you use a data-dependent stopping rule, this should be included in your data model, otherwise your p-value isn’t what it claims to be.

Stack Overflow data scientist David Robinson puts it a bit differently^{[7]}, noting the differences in what Bayesian inference offers as compared to the frequentist error-control type of inference we are most used to:

> Bayesian methods do not claim to control type I error rate. They instead set a goal about the expected loss. In a sense, this means we haven’t solved the problem of peeking described in “How Not to Run an A/B Test”, we’ve just decided to stop keeping score!

Mayo and Kruse (2001)^{[5]}, from the philosophy of science point of view, are even more critical of the Bayesian approach to inference as a whole and of optional stopping and its relation to the likelihood principle in particular:

> Embracing the LP is at odds with the goal of distinguishing the import of data on grounds of the error statistical characteristics of the procedure that generated them.

They take their argument to great lengths so it is a recommended read for anyone interested in foundational epistemological issues, but it goes a bit outside of the scope of this article.

* In their comparison via simulation of optional stopping vs fixed-horizon tests they observe a 3.33 times higher type I error probability and a 2.41 times higher FDR when using a stopping rule based on the Bayes factor, as well as an increase in post-hoc power, but they view it as positive that the Bayes factor remains equivalent in both cases. How it is a good property for a statistic not to react to a several-fold increase in both error measures employed is a mystery to me.

It’s also notable that in their “Conclusions and recommendations” section they do a very interesting “switch” where they suddenly prescribe that Bayesian optional stopping is not the preferred approach in what I believe are most of the practical A/B testing scenarios. Furthermore, they justify more lax requirements for tests using a long-run behavioristic rationale, claiming that if inference from a single test at hand is of interest, then strict (that is, frequentist) type I error control is preferred over the Bayesian FDR they describe in their paper. The latter is interesting, since it is quite the opposite of what frequentists are often unfairly accused of!

REFERENCES

[1] Armitage P., McPherson, C.K., Rowe, B.C. (1969) “Repeated Significance Tests on Accumulating Data”, *Journal of the Royal Statistical Society* 132:235-244

[2] Gelman, A. (2014) “Stopping Rules and Bayesian Analysis” http://andrewgelman.com/2014/02/13/stopping-rules-bayesian-analysis/

[3] Gelman, A., Carlin, J.B, Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2003) “Bayesian Data Analysis” (2nd edition), p.203

[4] Lindsey, J.K. (1997) “Stopping rules and the likelihood function”, *Journal of Statistical Planning and Inference* 59:167-177

[5] Mayo, D.G., Kruse M. (2001) “Principles of Inference and Their Consequences”, *Foundations of Bayesianism* (vol. 24 of the Applied Logic Series) pp 381-403

[6] Pekelis, L., Walsh, D., Johari, R. (2015) “The New Stats Engine” (Optimizely)

[7] Robinson, D. (2015) “Is Bayesian A/B Testing Immune to Peeking? Not Exactly” http://varianceexplained.org/r/bayesian-ab-testing/

[8] Sanborn, A.N., Hills, T.T. (2014) “The frequentist implications of optional stopping on Bayesian hypothesis tests”, *Psychonomic Bulletin & Review* 21, Issue 2, p283-300

[9] Scott, S.L. (2010) “A modern Bayesian look at the multi-armed bandit”, *Applied Stochastic Models in Business and Industry*, Issue 26, pp 639–658

[10] Simonsohn, U. (2014) “Posterior Hacking” http://datacolada.org/13

[11] Steel, D. (2003) “A Bayesian Way to Make Stopping Rules Matter”, *Erkenntnis* 58:213-227

[12] Stucchio, C. (2015) “Bayesian A/B Testing at VWO”

[13] Yu E.C., Sprenger A.M., Thomas R.P., and Dougherty M.R. (2014) “When Decision Heuristics and Science Collide”, *Psychonomic Bulletin & Review* 21, Issue 2, p268

## 2 Comments

perhaps you want to have a look at

“Why optional stopping is a problem for Bayesians”,

P. Grünwald and R. de Heide, arxiv 2017 ; see https://arxiv.org/abs/1708.08278

Title says it all!

Best wishes,

Peter

Hi Peter,

Thanks for sharing the paper. I skimmed through it and though I will need to re-read for a full understanding and grasping the finer points, I think this is one of the most comprehensive attempts to elucidate on the issue of optional stopping from a Bayesian perspective I’ve seen.

I checked the Rouder 2014 “Optional Stopping: No Problem for Bayesians” paper and I found things like: “Suppose a researcher is considering two hypotheses: a null with an effect size of 0 and an alternative hypothesis with an effect size of .4.” Such hypotheses immediately make me question the practical experience of the author, as this is not a real-world situation in any field I know. Maybe physics and engineering/manufacturing experiments have such well-defined problems (do they?), but we certainly do not deal with such simple hypotheses in online marketing & online controlled experiments, in medical testing, in psychology, in biology, etc. As you note in your paper, composite hypotheses immediately cause issues for Bayes Factors except in a special class of cases.

Best,

Georgi