In this article I’m revisiting* the topic of frequentist vs Bayesian inference, with a specific focus, as usual, on online A/B testing. The present discussion easily generalizes to any area where we need to measure uncertainty while using data to guide decision-making and/or business risk management.
In particular, I will discuss each of the following five claims often made by proponents of Bayesian methods and show their lack of merit:
- Bayesian statistics tell you what you want to know, frequentist ones do not – the p-value is difficult to understand, has no business value, is not what we’re looking for, etc.
- Frequentists don’t state their assumptions, Bayesians make the assumptions explicit
- Frequentist statistical tests require a fixed sample size and this makes them inefficient compared to Bayesian tests which allow you to test faster.
- Bayesian methods are immune to peeking at the data
- Bayesian inference leads to better communication of uncertainty than frequentist inference
Note that the discussion on the first argument takes up almost 50% of the article. Let’s dig into frequentist versus Bayesian inference.
1. Bayesian statistics tells you what you really want to know
while frequentist p-values, confidence intervals, etc. give you meaningless numbers. That would be an extreme form of this argument, but it is far from unheard of. It can be phrased in many ways, for example:
“Bayesian methods better correspond to what non-statisticians expect to see.”
“Goal is to maximize revenue, not learn the truth”
“Customers want to know P(Variation A > Variation B), not P(x > Δe | null hypothesis) ”
Chris Stucchio, data scientist behind VWO’s stats engine [1,2]
“Bayesian methods allow us to compute probabilities directly, to better answer the questions that marketers actually have (as opposed to providing p-values, which few people truly understand).”
“Experimenters want to know that results are right. They want to know how likely a variant’s results are to be best overall. And they want to know the magnitude of the results. P-values and hypothesis tests don’t actually tell you those things!”
Google Optimize Help Center [3]
The general idea behind the argument is that p-values and confidence intervals have no business value, are difficult to interpret, or at best – not what you’re looking for anyway. Various arguments are put forth explaining how posterior probabilities, Bayes factors, and/or credible intervals are what end users of A/B tests really want to see.
Linguistic confusion as an argument for Bayesian inference?
What is curious about the argument that Bayesian inference tells you what you really want to know is that most of the time it stems from linguistics. Proponents go around asking people what they want to get from a statistical estimate, and when they inevitably get answers like ‘the probability that B is better than A’, ‘the chance that B outperforms A’, or ‘the likelihood the outcome is not due to chance’, they interpret this as evidence in favor of Bayesian inference.
Since only inverse inference is capable of providing such answers, the argument seems to have merit at first glance.
However, that is only if we take these claims at face value, assuming the respondents use terms like ‘probability’, ‘chance’, and ‘likelihood’ in their technical sense. In reality, most of them have in mind the everyday meaning of these terms, in which all three are synonymous, rather than their technical definitions.
If one digs deeper and asks a second question to clarify whether the users actually want Bayesian inference in proper terms, something like:
- “Do you want to get the product of the prior probability and the likelihood function?”
- “Do you want the mixture of prior probabilities and data as an output?”
- “Do you want subjective beliefs mixed with the data to produce the output?” (if using informative priors)
- “Would you be comfortable presenting statistics in which prior information assumed to be highly certain is mixed in with the actual data?”
Then things really start to get interesting, since even a casual observer will realize that non-statisticians find inverse inference just as confusing as straight inference (frequentist statistics), if not more so. Questions like these start to pop up:
- What is a prior probability? What value does it bring?
- What is a likelihood function?
- What ‘prior’ probability? I have no prior data!
- How do I defend the choice of a prior probability?
- Is there a way to communicate just what the data says, without any of these mixtures?
All of which goes to show just how difficult inverse probability is. This is true in online A/B testing, in clinical trials, in quality control, in physics, and everywhere else.
A claim for the superiority of Bayesian inference cannot be based simply on the fact that a lot of people fail to use the proper technical language when expressing their needs. It can’t even rest on the fact that people don’t intuitively grasp the finer points of probability and frequentist inference. Instead it should be based on deeper probing for what experimenters want to know. Bayesian inference should then be put to the same type of scrutiny with questions which have a high probability of exposing misunderstandings similar to or worse than those arguably present for frequentist methods.
Pushing the argument for Bayesian inference further
If one pushes the Bayesian argument further, one may be faced with studies in which the respondents say they want to know the best course of action, or that they want to maximize profits, or something similar. This puts the question firmly in decision-theoretic territory – something in which neither Bayesian nor frequentist inference has a direct say.
There are rival decision-making theories developed on both the Bayesian and the frequentist side, with formal decision-making methods dating back to at least WWII [4]. I have developed and implemented one such framework for A/B testing in particular; it can be found in Analytics-Toolkit.com’s A/B testing ROI calculator (see more on risk/reward analysis in A/B testing plus further reading).
Frequentist statistical inference is guided by decision-making considerations only insofar as one needs to settle on an acceptable balance between type I and type II errors before the test. This does not affect the post-test statistical estimates of frequentist inference one iota. Bayesian arguments, on the other hand, often seem to spill into decision-making without setting clear boundaries between assessing different claims vis-a-vis the data and making decisions based on those assessments. In such cases I don’t think it’s fair to even refer to it as “Bayesian inference”.
For a deeper reading on the attempt by Bayesians to re-frame inference in decision-theoretic terms and the resulting confusions see Spanos (2017) [5].
What if frequentist and Bayesian methods produce the same answer?
There are multiple online Bayesian calculators and at least one major A/B testing software vendor applying a Bayesian statistical engine, all of which use so-called non-informative priors (a bit of a misnomer, but let’s not dig into this). In most cases the results from these tools coincide numerically with the results from a frequentist test on the same data. Say the Bayesian tool reports something like ‘96% probability that B is better than A’ while the frequentist tool produces a p-value of 0.04, which corresponds to a 96% confidence level.
In a situation like the above, which is far more common than some would like to admit, both methods will lead to the same inference and the level of uncertainty will be the same, even if the interpretation is different.
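To illustrate how close the two outputs can be, here is a minimal Python sketch using made-up conversion counts: the ‘probability that B beats A’ under flat Beta(1,1) priors versus a one-sided z-test on the same data (all numbers are hypothetical and purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical A/B test counts (illustrative only)
conv_a, n_a = 500, 10_000
conv_b, n_b = 580, 10_000

# Bayesian side: flat Beta(1,1) priors, P(B > A) estimated from posterior draws
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)
p_b_better = (post_b > post_a).mean()

# Frequentist side: one-sided z-test for the difference in proportions
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = stats.norm.sf(z)  # H0: p_b <= p_a

print(f"Bayesian P(B > A): {p_b_better:.4f}")
print(f"One-sided p-value: {p_value:.4f}  ->  'confidence' {1 - p_value:.4f}")
```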
What would a Bayesian say about this result? Does the absence of prior information turn the p-value into a proper posterior probability? Or are all these applications of Bayesian tests misguided for using a non-informative prior in the first place?
If it is the former, then why bother with the more computationally intensive Bayesian estimates? If it is the latter, one has to ask where the line between proper and improper application of Bayesian inference lies. There certainly is a well-established demarcation line for frequentist methods in this regard; I have not seen the same for Bayesian methods.
Why we need frequentist inference
Showing the deficiencies of arguments in favor of Bayesian methods goes a long way towards debunking them. Now I will briefly make a positive case for frequentist statistics.
Frequentist error-statistical methods provide us with an objective measure of uncertainty under a specified statistical model. In most online A/B tests we care about testing one-sided hypotheses and making claims in one direction only, hence the frequentist methods result in conservative worst-case measures.
Frequentist inference allows us to assess the input of the data separate from any non-essential, potentially subjective, and often non-testable assumptions. We can then use the data and its uncertainty measure to probe specific claims such as (in an A/B test):
- B is better than A
- B is X% better than A
- B is no different than A (to an extent X)
- B is worse than A
- etc.
Frequentist p-values, confidence intervals, and severity tell us how well-probed certain claims are with the data at hand. Bayesian posterior probabilities, Bayes factors, and credible intervals cannot do that. And they (usually) don’t claim to do so.
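To make the ‘probing specific claims’ idea concrete, here is a minimal sketch (hypothetical counts, absolute lift used for simplicity) of one-sided p-values computed against different null values, each corresponding to one of the claims listed above:

```python
import numpy as np
from scipy import stats

def one_sided_p(conv_a, n_a, conv_b, n_b, min_lift=0.0):
    """One-sided p-value for the claim 'the absolute lift of B over A exceeds min_lift'.
    (H0: p_b - p_a <= min_lift; unpooled standard error.)"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_b - p_a - min_lift) / se
    return stats.norm.sf(z)

# Hypothetical counts (illustrative only)
conv_a, n_a, conv_b, n_b = 500, 10_000, 580, 10_000

print("B is better than A:       p =", round(one_sided_p(conv_a, n_a, conv_b, n_b), 4))
print("B is >= 0.5pp better:     p =", round(one_sided_p(conv_a, n_a, conv_b, n_b, 0.005), 4))
```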
Frequentist statistical estimates can then be entered into any decision-making process that one finds suitable. The outcomes of the decision-making machinery with different hypothetical inputs based on business considerations (costs and benefits), information external to the A/B test at hand (prior tests, case studies, etc.), or both can be examined and decisions made accordingly. That is where the business value of frequentist inference becomes apparent. If we’ve used some kind of optimal procedure for choosing the sample size and significance threshold, then the decision following a frequentist test is straightforward and has immediate business value in allowing one to act on the situation at hand with an optimal amount of information (speed, promptness) and tolerance for uncertainty.
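As a rough sketch of the pre-test planning referred to above, the usual normal-approximation sample size calculation for a one-sided two-proportion test looks like the following (the baseline rate, minimum detectable effect, and error rates are illustrative assumptions; a risk/reward-optimal choice of sample size and threshold is a separate calculation not shown here):

```python
import math
from scipy import stats

def sample_size_per_group(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a one-sided two-proportion z-test
    (normal approximation)."""
    p1, p2 = p_baseline, p_baseline + mde_abs
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / mde_abs ** 2)

# e.g. 5% baseline conversion rate, minimum detectable absolute lift of 0.5pp
print(sample_size_per_group(0.05, 0.005))  # roughly 24,600 users per group
```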
Alternatively, post-hoc measures like severity could be used on their own (regardless of the procedure for selecting the sample size and significance threshold) as input for a decision-making apparatus.
Note that one is not constrained from using the results of a frequentist inference in any Bayesian decision-making system of their choosing. On the contrary, plugging Bayesian statistics from a given system into any other system (including other Bayesian systems) requires that the prior first be stripped out of the estimates, unless you yourself chose the prior and are fully committed to it. (How many Bayesian tools used in A/B testing allow you to set your own prior? How many allow you to even examine the prior they use?)
(Update 2023) The above argument was further extended and illustrated in Bayesian Probability and Nonsensical Bayesian Statistics in A/B Testing and in a way also in False Positive Risk in A/B Testing.
2. Frequentists don’t state their assumptions, Bayesians make the assumptions explicit
Too often one hears that both Bayesian and frequentist methods of inference make assumptions, but only the Bayesian ones are laid out for everyone to see and assess. Is that really the case?
The only assumption made explicit in a Bayesian setting is the prior distribution. And that is an assumption which is not involved in a frequentist test at all, so accusing frequentists of not stating their prior would be an utter blunder.
Here is an example of the argument that the prior distribution makes the assumptions explicit, in Stucchio’s words:
Bayesian methods make your assumptions very explicit. For example, consider the prior — many people criticize Bayesian methods because the prior is an arbitrary choice. However, frequentist methods also have arbitrary choices like these embedded. It’s just harder to tell because they are buried implicit in the middle of the math rather than the beginning.
Chris Stucchio, data scientist behind VWO’s stats engine [1]
Bayesian and frequentist inference share the same underlying assumptions, but Bayesians can also add assumptions on top. I fail to see how adding an assumption which is absent from frequentist inference makes Bayesian inference more transparent.
The main assumptions behind most frequentist models are those of the shape of the distribution, the independence of observations, and the homogeneity or heterogeneity of the effect across observations. These are all clearly stated for every frequentist statistical test, discussed widely in the statistical community, and the extent to which different tests are robust to violations of their assumptions has been studied extensively.
The same assumptions would be in place for all parametric Bayesian methods and so far I’ve not seen these assumptions being presented or communicated any differently than they are for frequentist tests. Namely, professional statisticians know all about them while end users are generally oblivious, often erring in application of both types of inference procedures as a result.
Non-parametric, or rather low-parametric, methods (a.k.a. robust statistics) are a separate matter shared by both approaches. Tests robust to various assumption violations certainly exist in frequentist inference, but they are avoided when the assumptions about the parameters can be tested and defended, since low-parametric methods usually result in less sharp inferences. This is why non-parametric tests are rarely employed in online A/B testing.
Furthermore, it is practitioners of frequentist inference (see the work of Aris Spanos for example) who have insisted that the assumptions of each test are themselves tested before an inference can be declared trustworthy. Funnily enough, Bayesians turn to frequentist significance tests when they inevitably face the need to test the assumptions behind their models.
So, not only do frequentist tests come with explicit assumptions, but frequentist inference also provides the tests for checking these assumptions vis-a-vis the data: a whole host of tests for normality, goodness-of-fit tests of which the sample ratio mismatch (SRM) check is an example application, and many others.
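As a concrete example of such an assumption check, here is a minimal sample ratio mismatch (SRM) check via a chi-square goodness-of-fit test (the traffic counts are hypothetical):

```python
from scipy import stats

# Intended 50/50 split; observed counts per arm (hypothetical)
observed = [10_300, 9_700]
expected = [sum(observed) / 2, sum(observed) / 2]

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"SRM check: chi2 = {chi2:.2f}, p = {p_value:.6f}")
# Here p is far below any sensible threshold, flagging a sample ratio
# mismatch - the randomization assumption behind the test is suspect.
```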
3. Frequentist statistical tests require a fixed sample size
In the CRO community and perhaps other disciplines the word is that frequentist statistical tests require a fixed, predetermined sample size, otherwise they are invalid:
“A frequentist approach would require you to stand still and listen to it [the mobile phone] ring, hoping that you can tell with enough certainty from where you’re standing (without moving!) which room it’s in. And, by the way, you wouldn’t be allowed to use that knowledge about where you usually leave your phone.”
Google Optimize Help Center [3]
Another common misconception stems directly from the above fixed-horizon myth – that frequentist tests are inefficient since, per the above citation, they require us to sit with our hands under our bums while the world whizzes by.
It is fascinating that in 2020 there is still a refusal to acknowledge that frequentist inference consists of something more than the simple fixed-sample t or z-test. That is some 75 years after Wald [6] invented and documented the first frequentist sequential test (considered such an important piece of intellectual property that it was classified during the war), and after sequential tests have been the standard in disciplines like medical trials for decades, with their prevalence only spreading to other settings where they make sense. It is honestly beyond me how this could be the stance of a team which is part of a company otherwise considered to be at the forefront of online experimentation.
To not drag this out longer than necessary: frequentist inference includes tests without a fixed predetermined duration. They have a maximum sample size (informed by the required balance of type I and type II errors), but the actual sample size varies from case to case depending on the observed outcomes. Sequential tests, as they are called, are both commonly used and widely accepted.
These come in two general varieties. The first are Sequential Designs, in which the allocation between groups, the number of variants, and a few other parameters are fixed throughout the duration of the test, while one can vary the number and timing of interim analyses and stop with a valid frequentist inference once a decision boundary is crossed. An example of such a test is the AGILE statistical method for conducting online A/B tests, proposed by me and implemented in publicly available software.
The second type forms the family of frequentist Adaptive Sequential Designs. In an ASD one can vary the allocation ratio, the number of arms, and a few other key parameters on top of the flexibility provided by Sequential Designs. Several works point to ASDs being slightly inferior or, at best, equal to the simpler Sequential Designs mentioned above, so thus far I have not given them further consideration.
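To make the sequential idea concrete, here is a minimal sketch of Wald’s original SPRT [6] for a Bernoulli metric on simulated data. Note that this is the historical ancestor of the group sequential designs discussed above, not the AGILE method itself, and the parameter values are purely illustrative:

```python
import numpy as np

def sprt_bernoulli(data, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: p = p0 vs H1: p = p1 (> p0) on a stream of 0/1 outcomes.
    Returns the decision and the number of observations used."""
    upper = np.log((1 - beta) / alpha)   # accept H1 when crossed
    lower = np.log(beta / (1 - alpha))   # accept H0 when crossed
    llr = 0.0
    for i, x in enumerate(data, start=1):
        llr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", i
        if llr <= lower:
            return "accept H0", i
    return "no decision", len(data)

rng = np.random.default_rng(7)
stream = rng.binomial(1, 0.06, size=20_000)   # simulated conversions at a true 6% rate
print(sprt_bernoulli(stream, p0=0.05, p1=0.06))
```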
4. Bayesian methods are immune to peeking at the data
‘Peeking’ at data, a.k.a. optional stopping, definitely violates the assumptions of the simplest frequentist tests (unaccounted peeking with intent to stop) and makes them inapplicable. Multiple looks at the data with intent to act require a proper sequential design in order to preserve the error guarantees crucial for a valid error-statistical inference.
According to some, Bayesian inference miraculously avoids this complication and is in fact immune to peeking / optional stopping. These Bayesians are all about updating beliefs with data, so whether you update your posterior after observing every user or update it once at a predetermined point in time is all the same to them. Updating your posterior and using it as the next prior in the application of Bayes’ theorem seemingly requires no adjustment to how the inference works. Data is data, and it doesn’t matter what procedure was used to produce it – according to these same Bayesians, who usually belong to a crowd that conflates (or mistakes?) inference with decision-making.
Luckily, there are many sound voices in the Bayesian camp who recognize that “data” only makes sense under a statistical model for how it was generated, and that if a model fails to include a key part of the way the numbers were acquired, it is not applicable: the assumptions behind the model do not correspond to the reality of how the experiment was conducted, so any output it produces is inapplicable as well. If proper Bayesian inference is what one is after, then peeking matters just the same.
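A quick simulation makes the point regardless of labels: in an A/A setting, stopping the first time an unadjusted statistic crosses a fixed cutoff produces “winners” well above the nominal rate, and relabeling the same numerical output as a posterior (as with the flat-prior tools from point 1) does not change the numbers. A minimal sketch with assumed parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def aa_test_with_peeking(n_max=10_000, looks=10, alpha=0.05, p=0.05):
    """One simulated A/A test with `looks` equally spaced interim analyses
    and naive (unadjusted) stopping as soon as z exceeds the one-sided cutoff."""
    z_crit = stats.norm.ppf(1 - alpha)
    a = rng.binomial(1, p, n_max)
    b = rng.binomial(1, p, n_max)
    for n in np.linspace(n_max // looks, n_max, looks, dtype=int):
        p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        if se > 0 and (b[:n].mean() - a[:n].mean()) / se > z_crit:
            return True   # a "winner" declared despite no true difference
    return False

fp_rate = np.mean([aa_test_with_peeking() for _ in range(2_000)])
print(f"False positive rate with unadjusted peeking: {fp_rate:.3f} (nominal 0.05)")
```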
If the above short rebuttal is not satisfactory for you, I’ve expanded on this issue before with ample citations in “Bayesian AB Testing is Not Immune to Optional Stopping Issues”.
5. Bayesian inference leads to better communication of uncertainty than frequentist inference
This argument really only makes sense if you accept argument #1 as presented above – that Bayesian inference tells you what you really want to know. Otherwise both schools of thought have very similar tools for conveying the results of a statistical test and the uncertainty associated with any estimates obtained. Frequentists have point estimates, p-values, confidence intervals, p-value curves, confidence curves, severity curves, and so on. Bayesians have point estimates, credible intervals, Bayes factors, and posterior distributions which pretty much fill in for the aforementioned curves.
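For example, a p-value (confidence) curve for the absolute lift can be traced simply by computing the one-sided p-value across a grid of hypothesized lifts; a minimal sketch with hypothetical counts:

```python
import numpy as np
from scipy import stats

# Hypothetical counts (illustrative only)
conv_a, n_a, conv_b, n_b = 500, 10_000, 580, 10_000
p_a, p_b = conv_a / n_a, conv_b / n_b
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# p-value curve: one-sided p-value for H0 'true lift <= delta' over a grid of deltas
deltas = np.linspace(-0.005, 0.02, 26)
p_curve = stats.norm.sf((p_b - p_a - deltas) / se)

for d, p in zip(deltas[::5], p_curve[::5]):
    print(f"H0: lift <= {d:+.4f}   p = {p:.3f}")
```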
As a matter of fact, in a recent informal study on The Perils of Poor Data Visualization in CRO & A/B Testing I found that, at least among online A/B testing calculators, Bayesian tools were more often tempted to present potentially misleading (or at best irrelevant) probability curves. I do not know if that is the case in other disciplines.
I think it is immediately clear that under proper usage both frequentist and Bayesian inference have comparable tools for presenting uncertainty. The difference is that one presents an error-statistical measure of uncertainty while the other presents an uncertainty measure of a very different kind. Hence, any claim for the superiority of one method over the other must come from a claim of superiority on point 1. Since I do claim superiority of frequentist inference over Bayesian inference on point 1, I would also claim that the frequentist tools for presenting uncertainty are better. If you disagree with me there, then you’d go for the reverse here.
Summary
With that I’ll conclude this examination of the frequentist vs Bayesian debate. I believe that point #1 is where most of the debate stems from, hence I gave it the most space. I believe the mixing of inference and decision-making to be the main culprit behind the misguided claims for the superiority of Bayesian methods. If you want to explore a frequentist view on the interplay of the two disciplines I believe you’ll find my recent book ‘Statistical Methods in Online A/B Testing’ useful. It uses the actual business case of applying statistics in online A/B testing with a focus on e-commerce.
To sum up the other four arguments for Bayesian inference discussed above:
- Point #5 seems to depend entirely on the acceptance of point #1.
- Point #2 remains confusing to me, as it is actually a point in favor of frequentist methods, which involve fewer assumptions, all of which are testable against the test data.
- Point #3 is a clear-cut case of misrepresentation of frequentist inference and the statistical repertoire at its disposal.
- Point #4 has avid opponents in the Bayesian camp itself, and falls apart from an error statistical perspective regardless of one’s preferences.
See any flaws in my arguments? Have counter-arguments? Or even just an example of a Bayesian tool which addresses multiple data evaluations while retaining probing capabilities? Are there solid arguments for Bayesian inference not discussed here? Looking forward to your comments.
* for the first installment see: “5 Reasons to Go Bayesian in AB Testing – Debunked” . For (sort of) a second installment see “The Google Optimize Statistical Engine and Approach”.
References
1 Stucchio C. (2020) “A conversion conversation with Chris Stucchio” [available online at https://medium.com/experiment-nation/a-conversion-conversation-with-chris-stucchio-596cdbd54494]
2 Stucchio C. (2015) “Bayesian A/B Testing at VWO”, p.21 [available online at https://www.chrisstucchio.com/pubs/slides/gilt_bayesian_ab_2015/slides.html#21]
3 Google (2020) “General Methodology” under Methodology in the Optimize Help Center [available online at https://support.google.com/optimize/answer/7405543?hl=en&ref_topic=9127922]
4 Wald, A. (1939) “Contributions to the Theory of Statistical Estimation and Testing Hypotheses.” The Annals of Mathematical Statistics, 10(4), p.299–326 doi:10.1214/aoms/1177732144
5 Spanos, A. (2017) “Why the Decision‐Theoretic Perspective Misrepresents Frequentist Inference: Revisiting Stein’s Paradox and Admissibility”, Advances in Statistical Methodologies and Their Application to Real Problems, edited by Tsukasa Hokimoto, published by IntechOpen doi:10.5772/65720
6 Wald, A. (1945) “Sequential Tests of Statistical Hypotheses” The Annals of Mathematical Statistics, 16(2), p.117–186 doi:10.1214/aoms/1177731118