The Google Optimize Statistical Engine and Approach

Updated Sep 17, 2018: Minor spelling and language corrections, updates related to role of randomization and external validity / generalizability.

Google Optimize is the latest attempt from Google to deliver an A/B testing product. Previously we had “Google Website Optimizer”, then “Content Experiments” within Google Analytics, and now we have the latest iteration: Google Optimize. While working on the integration of our A/B Testing Calculator with Google Optimize I was curious to see what this newest product had to offer in terms of statistics. I knew that it would likely be using Bayesian inference, similar to its predecessor, but I was surprised by how little is shared about the statistical method: how statistical significance or the corresponding Bayes factors / posterior probabilities, called “probability to beat baseline” and “probability to be best” in the user interface, are calculated. It is essentially a black box that you are expected not only to trust, for some reason, but also to guess the inner workings of, if you are to use it in practice.

What was even more surprising was the amount of bashing of frequentist methods, including misinterpretation / misinformation about p-values, null-hypothesis statistical tests, and frequentist inference as a whole. If you’ve read this blog for a while, you know I don’t shy away from responding to and debunking such material, e.g. 5 Reasons to Go Bayesian in AB Testing – Debunked and Bayesian AB Testing is Not Immune to Optional Stopping Issues. In this post I will poke at the statistical approach used in Google Optimize using the scant information provided, and also address some of the criticisms of frequentist approaches to A/B testing, which comprise a significant part of their documentation.

In this article I will use citations from the Google Optimize “Statistics and methodology FAQ” pages, as they exist at the time of writing (Apr 9, 2018; link updated Sep 9, 2018).

What statistics are used in Google Optimize?

“…we use Bayesian inference to generate our statistics” is about as much as you can learn from the FAQ. No particular methods, mathematical formulas, assumptions about distributions or “non-parametric” approaches, simulations, proof of concept, or any other kind of deeper explanation or justification is provided for what is presented.

In practice, no one outside a select few Google Optimize engineers can tell you what is under the hood. While this can be problematic for certain more advanced frequentist approaches, it is even worse for a Bayesian one, since it becomes practically impossible to untangle the statistics from the assumptions made by the engineers, and hence – to really understand what the data has to tell us.

I have elaborated on that in addressing claim #2 in 5 Reasons to Go Bayesian in AB Testing – Debunked, but I will briefly restate why. It is because frequentist inference works with minimal assumptions and tells us the worst-case scenario probability of observing a result as extreme as, or more extreme than, the one we observed during an A/B test:

P(Var A > Var B | X0), or in some cases P(Var A = μ > Var B | X0)

where X0 is some observed value of a test statistic X and μ is a specific average value, Bayesian inference, on the other hand, tells us:

“Given that I have some prior knowledge or belief about my variant and control, and that I have observed data X0 following a given procedure, how should the data X0 change my knowledge or belief about my variant and control?”.

Expressed using the original notation it looks like so:

P(x > Δe | null hypothesis) · P(null hypothesis) / P(x > Δe)

or in a more verbal expression:

P(Observed Difference between A & B > minimum difference of interest | Variation A = Variation B) · P(Variation A = Variation B) / P(Observed Difference between A & B > minimum difference of interest)

Thus, in a Bayesian approach we have prior beliefs, and sometimes utilities, mixed with the data, causing all kinds of issues even when the belief is about lack of knowledge (“I know nothing”). Expressing “lack of knowledge” as a probability is not as trivial as it sounds. Put simply, with a black box such as the current Optimize system, you do not know what prior knowledge, beliefs, and assumptions are baked into your statistics. This issue is not limited to Google Optimize, but is present in other systems in which a Bayesian statistical engine is offered without enough information to understand the assumptions that went into it.
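To make the role of the prior concrete, here is a minimal sketch (in Python, with made-up numbers, and in no way a reconstruction of Optimize’s actual engine) of how a “probability to beat baseline” type of figure is typically computed for conversion rates with a Beta-Binomial model. The prior parameters are explicit inputs, and changing them changes the reported probability for the exact same data, which is precisely the information a black-box implementation withholds.

```python
import numpy as np

rng = np.random.default_rng(42)

def prob_to_beat_baseline(conv_a, n_a, conv_b, n_b,
                          prior_alpha=1.0, prior_beta=1.0, draws=200_000):
    """Monte Carlo estimate of P(rate_B > rate_A | data, prior) under
    independent Beta-Binomial models for each arm. The prior parameters
    are an assumption of the analyst, not a property of the data."""
    post_a = rng.beta(prior_alpha + conv_a, prior_beta + n_a - conv_a, draws)
    post_b = rng.beta(prior_alpha + conv_b, prior_beta + n_b - conv_b, draws)
    return float(np.mean(post_b > post_a))

# Same observed data, two different priors, two different "probabilities":
print(prob_to_beat_baseline(100, 2000, 125, 2000))  # flat Beta(1,1) prior, ~0.96
print(prob_to_beat_baseline(100, 2000, 125, 2000,
                            prior_alpha=50, prior_beta=950))  # prior centered at 5%, ~0.92
```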

The Optimize FAQ states that “Bayesian inference is an advanced method of statistical analysis…”, but the fact that it is mathematically complicated does not make it advanced. Bayesian inference in fact predated modern statistical inference by over 150 years, if we count from the earliest attempts, so it is not really new or newer. It was used to varying extents in the 19th and early 20th century, and its inadequacies in giving error estimates for certain inference situations, namely experiments and trials, were what prompted Sir Ronald Fisher to develop Randomized Controlled Trials and frequentist inference with p-values and statistical significance thresholds, later improved upon by Neyman and Pearson, and others. The latter remains the preferred method for inferring truths about the state of the world.

The documentation states: “With respect to winners and conversion rates, we do our best to use uninformative priors — priors with as little influence on the experiment results as possible.” There is no such thing as an uninformative prior, but there are minimally informative priors, and I suppose this is what is referred to here. But if Google Optimize were to use minimally informative priors, then the results should match those of an equivalent frequentist method (same hypothesis, error control, etc.), rendering all claims of superiority null and void. The only gains in efficiency can come through adding more assumptions in the form of informative priors.

The use of uninformative priors would also render any claim of easier interpretation false – if the math is the same and the result is the same, how is adding more complexity to the interpretation any better? Have you tried explaining the so-called “non-informative” priors, over which there is no consensus among Bayesians, to a non-statistician?
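As a quick numerical check of the above point, here is a sketch (again with made-up numbers) showing that with flat Beta(1,1) priors and a sample size typical for A/B tests, the Bayesian “probability to beat baseline” lands almost exactly on one minus the one-sided p-value from a plain two-proportion z-test, so nothing is gained except an extra layer of machinery to explain.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

conv_a, n_a = 100, 2000   # control: 5.00% observed conversion rate
conv_b, n_b = 125, 2000   # variant: 6.25% observed conversion rate

# Bayesian posterior probability that B beats A under flat Beta(1,1) priors
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, 500_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, 500_000)
p_b_beats_a = np.mean(post_b > post_a)

# Frequentist one-sided two-proportion z-test (unpooled standard error)
p_a, p_b = conv_a / n_a, conv_b / n_b
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
one_sided_p = 1 - norm.cdf((p_b - p_a) / se)

print(f"Bayesian P(B > A | flat priors): {p_b_beats_a:.4f}")
print(f"1 - one-sided p-value:           {1 - one_sided_p:.4f}")  # nearly identical here
```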

What Bayesian models are used in a Google Optimize A/B test?

In the FAQ, we read “With Bayesian inference, however, we’re able to use different models that adapt to each and every experiment”. Sounds great, in theory, with the “minor” caveat that you don’t really know which model is used for which experiment. Furthermore: how easy are those models to understand? Who has time to study all these models and learn to interpret their output? If you think p-values and NHST are hard, I don’t think you’d appreciate the complexity of these things, even if you had the necessary information before you.

The Optimize help documentation then lists three models “we’ve used”: hierarchical, contextual, and restless models, but at least for me it is not clear whether these are still in use in the engine, whether they are used alone or in combination, and what their assumptions are, aside from a one-sentence description. From that sentence and the extended example for the hierarchical model I can conclude that results from the hierarchical model are likely to be heavily biased by early effects, no matter what the true reason for those effects is – “newness” or something entirely different – but that is about all one can do with so little information.

“We do use some priors that are more informative, such as with our hierarchical models. In those models, we use priors that help experiments with really consistent performance to find results more quickly.” – from this sentence it becomes clear where at least some of the broad efficiency claims made by Optimize are coming from: from adding more assumptions. There is no magic way to torture the data into telling more than is in it, and retain the same level of accuracy and objectivity.

While the assumption that the cause of early beneficial effects is some kind of a “newness” effect can result in faster testing and better prediction when it is correct, one has to ask what happens when that assumption is incorrect. After all, straight up assuming your variant is better than the control is the fastest and cheapest way to test one can devise: one can just skip the A/B test altogether given a strong enough assumption.

Debunking the Google Optimize FAQ

There are many, many claims for the superiority of Bayesian methods and the inferiority of frequentist methods in the Google Optimize documentation. I will briefly address 12 of them.

Frequentist vs Bayesian A/B testing - Google Optimize

Claim #1: “We can state the probability of any one variant to be the best overall, without the multiple testing problems associated with hypothesis-testing approaches.”

Response: So can any properly-powered frequentist method (e.g. you can choose to use “Conjunctive power” when designing an AGILE A/B test), but the cost of such a decision, in terms of how long a test will need to run, weighed against the expected benefits, speaks enormously in favor of just looking for what is superior to the baseline, as the sketch below illustrates. A much less costly follow-up test can then be done between the top variants, if there is significant justification for that.
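As a back-of-the-envelope illustration of that cost, consider requiring power to detect a true improvement in every one of k comparisons (“conjunctive”) versus in at least one of them (“disjunctive”). The sketch below uses a crude independence approximation and a standard two-proportion sample size formula; it is not the AGILE calculation itself, just an indication of the order of magnitude involved.

```python
from scipy.stats import norm

def per_comparison_n(p_base, lift, alpha, power):
    """Approximate per-arm sample size for a one-sided two-proportion z-test."""
    p_var = p_base * (1 + lift)
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return z ** 2 * variance / (p_var - p_base) ** 2

k, target = 4, 0.80   # 4 variants vs. control, 80% overall power desired

# Per-comparison power needed, assuming (roughly) independent comparisons
disjunctive = 1 - (1 - target) ** (1 / k)   # detect at least one improvement
conjunctive = target ** (1 / k)             # detect every improvement at once

for label, power in [("disjunctive", disjunctive),
                     ("single comparison", target),
                     ("conjunctive", conjunctive)]:
    n = per_comparison_n(p_base=0.05, lift=0.10, alpha=0.05, power=power)
    print(f"{label:18s} needs per-comparison power {power:.2f} -> ~{n:,.0f} users per arm")
```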

Claim #2: “Bayesian methods allow us to compute probabilities directly, to better answer the questions that marketers actually have (as opposed to providing p-values, which few people truly understand).”

This claim is repeated as an answer to “Aren’t frequentist methods good enough?” in this way: “But most frequentist methods provide p-values and confidence intervals that ignore any other effects, and don’t directly answer questions that testers are really asking.”.

Response: That’s 2 for the price of 1! First, Bayesian methods do not “better answer the questions that marketers actually have” – in fact they answer questions no marketer can even dream of, unless he is also a trained statistician, in which case why would he want to use a black box? This point is very well explained while responding to claim #1 in 5 Reasons to Go Bayesian in AB Testing – Debunked.

Second, if few people truly understand p-values and NHST, which are such simple concepts with minimal assumptions, then how do you expect people to understand the more complex Bayesian inference, of which probabilities much like the p-value are only a building block? Sadly, this is a trite remark that continues to make zero sense to me. If you want to learn about p-values, read my guide to understanding p-values and statistical significance in A/B testing. Frequentist methods do not ignore “other effects”, as explained in the response to claim #4 below.

This point is further argued in the documentation with statements like “P-values are difficult to explain intuitively.” and “Even in combination with additional data, p-values can easily be misinterpreted.”, which only rubs more salt into the Bayesian interpretation wound. If p-values are so hard, how can you expect someone to understand the output of the complex Bayesian machinery? A nice illustration of the difficulty can be found in Prof. Mayo’s post on interpreting Bayesian statistics here.

Claim #3: “With traditional testing methods, numerous assumptions are made that treat experiment results with a “one-size-fits-all” approach.”

Response: This sentence contradicts itself. More assumptions equal a more customized approach, not less. While one can argue that traditional testing methods, by which I assume something like N-P NHST is meant, are a “one-size-fits-all” approach, that is only so if you look at the simplest methods, which have very few assumptions. They offer a trade-off in that they are simple to use, understand, and communicate, and are objective in the sense of making few assumptions that need to be justified. However, nothing says that an A/B testing practitioner should limit themselves to those methods, or that they cannot insert more assumptions into the statistical evaluation of any particular A/B test. There is a cost, though: the statistics become harder to calculate, interpret, and communicate to stakeholders.

Claim #4: “In the real world, users don’t always see a variant just once and then convert. Some users see a variant multiple times, others just once. Some users visit on sale days, some visit other days. Some have interacted with your digital properties for years, others are new. Our models capture factors like these that influence your test results, whereas traditional approaches ignore them.”

Response: The existence of unlimited sources of variation is well understood and I’m not certain what is implied here. A lot of the contribution of R. Fisher and frequentists as a whole is to not ignore those things, starting from his “Design of Experiments” book and onward. We use randomization in order to be able to model such unknown effects. It also distributes the variation caused by all these factors about evenly, on average, among the control and variants, which is especially true with sample sizes such as those often encountered in A/B testing (see the simulation below). If instead Optimize implies the use of some kind of automatic blocking (as in “blocked RCT”) on behalf of their users, then frequentist statisticians are well aware of blocked RCTs, which can improve efficiency by balancing on known factors, but as far as I understand they are mostly beneficial in experiments with small sample sizes on things that have well-understood factors – neither of which applies to your typical A/B test. Similarly, they are mostly applicable when the test population is fixed in advance, which is not the case in A/B testing.
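A quick simulation of the point about randomization, using a made-up “sale day” factor that the experimenter never records: at sample sizes typical for A/B tests, simple random assignment splits such a factor nearly evenly between the arms without anyone modeling it.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000                             # visitors entering the test

sale_day = rng.random(n) < 0.30        # unobserved factor affecting ~30% of visitors
arm = rng.integers(0, 2, n)            # simple randomization: 0 = control, 1 = variant

for a in (0, 1):
    share = sale_day[arm == a].mean()
    print(f"arm {a}: {np.sum(arm == a):>6} visitors, {share:.2%} exposed to the sale day")
# Both arms end up with roughly 30% sale-day visitors, so the factor cancels out
# of the comparison on average, even though it was never measured.
```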

In the FAQ, the above quote is followed by a couple of unsupported claims that are mainly of the form: we make a lot of assumptions, so your tests can complete faster. As said above: the fastest test is to just assume whatever you prefer and skip testing entirely…

Claim #5: “Experimenters want to know that results are right. They want to know how likely a variant’s results are to be best overall. And they want to know the magnitude of the results. P-values and hypothesis tests don’t actually tell you those things!”

Response: Only people who have never heard of uncertainty of measurement and uncertainty of prediction want to know that they are “right”. Experienced people want to measure / estimate the uncertainty. People new to A/B testing might want to know which variant is best, but experienced conversion rate optimization practitioners understand cost vs. benefit, risk vs. reward, and ROI calculations, so they know that in most cases they would be happy to find a variant which is better than the control in a timely manner, and only then, maybe, worry whether Variant C, which was a little bit worse than the winner during the A/B test, might perform better in reality.

As for “P-values and hypothesis tests don’t actually tell you those things!” – neither does Bayesian statistical inference! (I can do exclamation marks, too!). Check out the responses to claim #1 and claim #2 in the 5 Reasons to Go Bayesian in AB Testing – Debunked post for more.

Claim #6: “Experimenters like to look at tests frequently, resulting in the “multiple look” problem. Acting on early data in a frequentist framework can result in incorrect decisions.”

Response: Surely they do, since in many cases it is ROI-positive to evaluate the data as it gathers. This is not a problem if you know your tools and don’t commit errors about which you are warned literally everywhere. For example, with an AGILE A/B test you can control for the error introduced by multiple looks at the data.

However, I am yet to see a proper, widely accepted solution to this problem in a Bayesian testing framework, which, of course, assumes that Bayesians stop arguing that their methods are not affected by optional stopping (see Bayesian Inference is Not Immune to Optional Stopping Issues for more). If you have such an example, please post it in the comment section. The part of the Google Optimize FAQ that says “Because we use models that are designed to take into account changes in your results over time, it’s always OK to look at the results. Our probabilities are continually refined as we gather more data.” does not count. I can only hope they are not saying Optimize is immune to the optional stopping issue because it doesn’t alter their Bayes factors…
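For readers who have not seen the “multiple look” problem demonstrated, here is a minimal simulation with made-up parameters: a series of A/A tests (no true difference) evaluated with an uncorrected one-sided z-test after every batch of data, stopping at the first nominally significant result. The realized false positive rate comes out well above the nominal 5%, which is exactly the error that alpha-spending-based sequential methods such as AGILE are designed to control.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def peeking_false_positive_rate(n_sims=5_000, looks=10, n_per_look=500,
                                base_rate=0.05, alpha=0.05):
    """Share of A/A tests declared 'significant' when an uncorrected
    one-sided two-proportion z-test is applied at every interim look."""
    z_crit = norm.ppf(1 - alpha)
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        for _ in range(looks):
            conv_a += rng.binomial(n_per_look, base_rate)
            conv_b += rng.binomial(n_per_look, base_rate)
            n += n_per_look
            p_a, p_b = conv_a / n, conv_b / n
            se = np.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
            if se > 0 and (p_b - p_a) / se > z_crit:
                false_positives += 1
                break
    return false_positives / n_sims

print(f"Nominal alpha 0.05, realized with 10 uncorrected looks: "
      f"{peeking_false_positive_rate():.3f}")   # comes out well above the nominal 0.05
```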

Claim #7: “Off-the-shelf approaches to testing assume that results aren’t affected by time, even though most experiments change as users react to new content or change behavior throughout the course of an experiment. Consequently, many experimenters find that test results don’t hold over time, even after finding a confident winner. In addition, cyclical behavior, such as differences between a weekday and weekend, often affect results, and ignoring those cycles can lead to incorrect conclusions.”

Response: This is generally an issue of designing experiments so that they address known factors that might affect external validity / generalizability, which is achieved by acquiring more representative samples. However, these factors are by definition external to the experiment and cannot be captured within it. For some of them, namely novelty, learning, and reinforcement effects, Google themselves describe a possible solution in their very nice 2015 paper “Focusing on the Long-term – it’s Good for Users and Business” by Hohnhold, O’Brien & Tang. However, the approach described in it does not appear to be used in Google Optimize. Furthermore, it is only applicable to experiments that do not alter UX flows and the front end as a whole, e.g. testing ranking algorithms. Has Google invented something revolutionary without sharing even the fact of the discovery with us?

I’d argue that the reason most findings do not hold over time, “even after finding a confident winner”, is poor statistical practices, issues with test analysis, issues with testing tools and data integrity, etc., and only as a last reason would I put changes in user behavior over time, which are part of the generalizability issues surrounding A/B tests. The latter, however, cannot be addressed by the test itself, by definition. As long as these issues persist, illusory results that do not hold should not be a surprise to anyone.

Short-term cyclical behavior of different kinds can be factored in by using 7-day cycles when planning A/B tests and analyzing data, which is recommended by me and many others in the field. Long-term cyclical behavior is harder to account for on a per-test basis, a difficulty to which Google Optimize is no stranger.
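The first recommendation is trivial to operationalize when planning a test; a small sketch, assuming you already know the required sample size and your average daily traffic:

```python
import math

def duration_in_full_weeks(required_sample_size, daily_visitors):
    """Round the planned test duration up to whole 7-day cycles so each
    day of the week is represented equally in the sample."""
    days_needed = math.ceil(required_sample_size / daily_visitors)
    return 7 * math.ceil(days_needed / 7)

# e.g. 48,000 required users at ~3,100 visitors/day -> 16 days, rounded up to 21
print(duration_in_full_weeks(48_000, 3_100))
```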

Claim #8: “A frequentist approach would require you to stand still and listen to it [the mobile phone] ring, hoping that you can tell with enough certainty from where you’re standing (without moving!) which room it’s in. And, by the way, you wouldn’t be allowed to use that knowledge about where you usually leave your phone.”

Response: This is part of an attempt in the Google Optimize FAQ to explain the difference between Bayesian and frequentist inference. It is such a ridiculous strawman, and it has been responded to so many times over the decades, including recently by Prof. Mayo in her wonderful 2010 work with Prof. Spanos, “Error Statistics”, that I won’t spend any more time on it.

Claim #9: “Because Bayesian methods directly compute the relative performance of all variants together, and not just pairwise comparisons of variants, experimenters don’t have to perform multiple comparisons of variants to understand how each is likely to perform. Additionally, Bayesian methods don’t require advanced statistical corrections when looking at different slices of the data.”

Response: I can only hope this is not a claim in the spirit of “Bayesian inference is immune to optional stopping”, transferred to multivariate testing and segmentation. Prof. Gelman, if I am not mistaken, has written on why these are an issue for Bayesian inference too. I am not entirely sure what the paragraph means, though; it could be just my limited knowledge of Bayesian methods, or it could be the lack of detail in the FAQ. Let me know what you think in the comments.

Claim #10: “So, we’re often faster when your data is consistent, particularly in low-volume environments, and more accurate when it’s not.”

Response: As said already: pile up the assumptions and you can get rid of testing altogether. Why would you assume that more consistent data during the test, or early in the test, is informative of its future behavior? Why assume the opposite if it is not? Has Google Optimize found a way to extract more information than is maximally contained in a set of data? As far as I’m aware, there is a good reason why N-P tests are called Uniformly Most Powerful.

Claim #11: “Multivariate testing: Optimize’s approach can learn about both the performance of combinations against each other, and about the performance of a variant across various combinations. As a result, we can run all combinations, but find results much more quickly than an equivalent A/B test.”

Response: How can one compare an “A/B test” with an MVT test? Maybe several A/B tests? And why, when an MVT is handled just fine in a frequentist A/B testing tool? How is the Google Optimize approach better than Dunnett’s adjustment and the corresponding sample size calculations for a frequentist MVT (see the sketch below)? No simulation, no expected efficiency gains, no formulas – nothing is provided to back that claim up. It is not even clear what it is being compared to. Furthermore, if MVT is to be understood as a factorial design experiment, this is never a better option compared to a classical MVT (A/B/n), since in A/B testing we rarely care solely about the effect of a single factor. If we did, then we could actually run much faster tests by making them factorial, but then, why not just run multiple concurrent A/B or MVT tests, which is much more flexible and efficient?
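For comparison purposes, here is what such an adjusted frequentist many-to-one analysis can look like. This sketch uses the simpler Šidák correction on one-sided two-proportion z-tests instead of the exact Dunnett procedure (which additionally exploits the correlation induced by the shared control), and all the numbers are made up:

```python
import numpy as np
from scipy.stats import norm

def many_to_one_z_tests(control, variants, alpha=0.05):
    """One-sided two-proportion z-tests of each variant vs. a shared control,
    with a Sidak adjustment of the per-comparison significance level.
    `control` and each entry of `variants` are (conversions, visitors) tuples."""
    alpha_adj = 1 - (1 - alpha) ** (1 / len(variants))
    c_conv, c_n = control
    p_c = c_conv / c_n
    results = []
    for v_conv, v_n in variants:
        p_v = v_conv / v_n
        se = np.sqrt(p_c * (1 - p_c) / c_n + p_v * (1 - p_v) / v_n)
        p_value = 1 - norm.cdf((p_v - p_c) / se)
        results.append((p_v - p_c, p_value, p_value < alpha_adj))
    return alpha_adj, results

alpha_adj, res = many_to_one_z_tests(control=(500, 10_000),
                                     variants=[(545, 10_000), (600, 10_000), (510, 10_000)])
print(f"Sidak-adjusted per-comparison alpha: {alpha_adj:.4f}")
for lift, p, significant in res:
    print(f"absolute lift {lift:+.4f}   p = {p:.4f}   significant: {significant}")
```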

Claim #12: “The definition of p-value is: the probability of getting results at least as extreme as the ones you observed, given that the null hypothesis is correct, where the null hypothesis in A/B testing is that the variant and the control are the same.”

Response: The first part is obviously correct, but the second part is obviously false in all practical scenarios. There is literally no practical problem in which the null hypothesis is that the variant and the control are the same, as I argue in One-tailed vs Two-tailed Tests of Significance in A/B Testing. In a superiority A/B test the null is that the variant is worse than or equal in performance to the control, while the alternative is that the variant is better (a composite, not a point hypothesis). A superiority margin can also be established if a minimum required improvement is needed to justify the intervention (I call these “strong-superiority” tests). In a non-inferiority case or a reverse-risk case these hypotheses change accordingly.
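To make the above hypotheses concrete, here is a minimal sketch of a one-sided superiority test on conversion rates, with an optional superiority margin; the numbers are illustrative only:

```python
import numpy as np
from scipy.stats import norm

def one_sided_superiority_p(conv_c, n_c, conv_v, n_v, margin=0.0):
    """H0: p_variant - p_control <= margin  vs.  H1: p_variant - p_control > margin.
    With margin = 0 this is a plain superiority test; a positive margin demands a
    minimum absolute improvement before the null can be rejected."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = np.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    return 1 - norm.cdf((p_v - p_c - margin) / se)   # one-sided p-value

print(one_sided_superiority_p(500, 10_000, 560, 10_000))                # null: no improvement at all
print(one_sided_superiority_p(500, 10_000, 560, 10_000, margin=0.004))  # null: improvement below 0.4 pp
```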

While there might be vendors who offer two-tailed tests in which the null hypothesis is that the variant and the control are the same, I argue they are wrong in doing so, and that no frequentist statistician worth his salt uses two-tailed tests as anything other than a basic educational device or, at best, a complementary statistic. My full argument for this is presented in a series of articles published on the Project OneSided site.

Some other gems from the documentation

“Unlike frequentist approaches, Bayesian inference doesn’t need a minimum sample.”, yet, strangely, there is a 2-week minimum duration for tests in Google Optimize. Hm… And yes, I know they justify it by cyclic behavior, but if the fix is to set a 2-week minimum in Google Optimize, then why such a fix would be out of reach for frequentists, as claimed above, is a mystery to me. Whatever external validity concerns one has will be addressed in a similar fashion no matter what the statistical machinery is.

“Tools that allow for predicting experiment length assume that there’s no variability or dependence on time. This is seldom true in real experiments.” – if this references external validity, then there are some precautions in some existing tools, such as our A/B testing calculator; however, it is ultimately not something that can be fully addressed by a statistical tool. By definition such a tool can only ensure internal validity and cannot ensure external validity. The tool can provide guidance (e.g. test for full weeks or other periods of importance for generalizability) and also methods for balancing the need to acquire a more representative sample against the need to end the test quickly (certain alpha spending functions, an example of which is sketched below), but it is the practitioner who, using outside information, needs to design the test in a way that increases the probability of obtaining a representative sample. A practitioner can also employ visual examination of the trend of the data to detect within-experiment time trends, while external information will be needed to decide whether these are to be taken into account in the final judgement of the generalizability of the test results.
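One concrete example of such an alpha spending function is the O’Brien-Fleming-like spending function of Lan & DeMets, which spends very little of the error budget at early looks and keeps most of it for the planned end of the test, so that early stopping is possible without sacrificing the more representative full-length sample. This is the generic textbook form, not the specific boundaries used by any particular tool:

```python
import numpy as np
from scipy.stats import norm

def obrien_fleming_spending(t, alpha=0.05):
    """Lan-DeMets O'Brien-Fleming-like spending function: cumulative alpha
    allowed to be 'spent' by information fraction t (0 < t <= 1)."""
    t = np.asarray(t, dtype=float)
    return 2.0 - 2.0 * norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t))

# Five equally spaced looks, e.g. at the end of each full week of a 5-week test
looks = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
cumulative = obrien_fleming_spending(looks)
spent_now = np.diff(np.concatenate(([0.0], cumulative)))

for t, c, s in zip(looks, cumulative, spent_now):
    print(f"look at t = {t:.1f}: cumulative alpha {c:.4f}, spent at this look {s:.4f}")
# Almost the entire 0.05 is preserved for the later, more representative looks.
```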

“with low traffic, you can still find actionable results” – same with frequentist methods, but your A/B testing ROI might not be great, or it might even be negative, prompting you to just implement without testing.

“Why does Optimize show a range of numbers for “improvement”? Most tools don’t do that.” – I literally can’t think of a major tool that doesn’t show confidence intervals or some kind of Bayesian credible interval. Can you?

Concluding remarks

Despite what you might come away thinking after reading the above, I don’t mind Bayesian approaches to testing in general. I’m just yet to see a good one that a person other than the engineers who built it can use properly. I would be grateful if someone can point me to a proper Bayesian solution for modeling sequential monitoring of the data (Wald’s SPRT is frequentist), since it is one of the major hurdles in front of all the solutions I have reviewed so far. I’d be interested if someone can point me to a Bayesian approach which is more efficient and equally or more precise than the fully frequentist AGILE method (numbers vs. fixed-size tests are available in the white paper), or an equivalent approach. I would be happy to see documentation for a good Bayesian A/B testing tool in which the statistics are so clearly and easily explained that I can understand them well enough to be able to use them.

What I do mind is when the methods foundational to the growth of many 20th-century sciences are not just questioned and smeared, but also straw-manned; when only the simplest of the simplest approaches are considered; when trite critiques, to which responses have been given hundreds of times over the decades, are presented as fact; and so on.

I do mind when claims of superiority, improved efficiency, and more accurate results, in which the Google Optimize documentation abounds as is evident above, are presented without any kind of explanation, justification, or exact or even approximate comparisons to other established methods (no numbers are given to quantify the broad claims), and without a single simulation result. I understand that Google Optimize is not open source, so I do not expect to see all its code anytime soon, but there are ways to communicate and prove your competitive advantage other than posting your code.

I do not think it is justified to present a technology without any hint of its shortcomings and the trade-offs involved in its usage. You would not want to lead with those, obviously, but here we are discussing the technical FAQ, not a front page or an ad banner.

I strongly believe doing the above is doing a disservice to the larger UX, conversion optimization and A/B testing community. I see no good reason for it.

About the author

Georgi Georgiev

Managing owner of Web Focus and creator of Analytics-toolkit.com, Georgi has over eighteen years of experience in online marketing, web analytics, statistics, and design of business experiments for hundreds of websites.

He is the author of the book "Statistical Methods in Online A/B Testing" and of white papers on the statistical analysis of A/B tests, as well as a lecturer at dozens of conferences, seminars, and courses, including as a Google Regional Trainer.



Discussion (4 comments)

  1. Michael Hayes – September 9, 2018

    Thanks for the interesting and thought provoking post.

    I’m not a statistician, just a GA user, and I don’t even do much AB testing – however I believe one advantage of Optimize using its multi-armed bandit model (assuming it’s the same as Google Experiments) is that it can send more traffic to what currently looks like the winning version, to see if it continues to perform well.

    So the advantage here would be that a better performing version would win more quickly and you would get more conversions while the experiment is running than if you stuck to a 50-50 split.

    Just wondering what your opinion on this is? I guess they might be the reasons for claims #10 and #11. (I haven’t read the original FAQ so missing a bit of context, looks like it has moved and I haven’t tried to search for it).

    Looks like some great material on this blog, about to read some more pages 🙂

    • Georgi Georgiev – September 9, 2018

      Hi Michael,

      Unfortunately, this is not an advantage. While some, e.g. Jennison & Turnbull 2009 “From Group Sequential to Adaptive Designs” are moderately opposed: “It is our view that the benefits of such [data-dependent] modifications are small compared to the complexity of these designs.” others like Lee, Chen & Yin 2012 “Worth adapting? Revisiting the usefulness of outcome-adaptive randomization” and Tsiatis & Mehta 2003 “On the inefficiency of the adaptive design for monitoring clinical trials” have proven that such designs are not more efficient compared to equivalent sequential testing designs, e.g. ones using alpha-spending functions as in the AGILE A/B testing method I’m pushing for.

      Stucchio in his 2015 “Don’t use Bandit Algorithms – they probably won’t work for you” points to further inherent disadvantages of using bandit algos in online A/B testing, to do with the assumptions behind the algorithms, the most problematic for me being #3: that one needs to be able to observe the outcome of an “arm pull” before selecting the probability distribution for the next “arm pull”, which is utterly non-applicable in most online scenarios. I can see these algos working for things like automatically optimizing the CTR of an ad, but not for A/B testing in general.

      There are other disadvantages as well, like not being able to control the sample size in a way that results in more representative samples, and probably others I’m not aware of at the moment.

      Thanks for pointing out that the FAQ link is no longer working. I guess Google needs to hire better SEO’s. I’ve updated the link to the new location.

      Best,
      Georgi

  2. Simon – December 19, 2018

    Hi Georgi,

    I’m glad I came across your post, very insightful.

    I’m new to CRO and am trying to introduce a testing culture in our company. Do you think Google Optimize could be a good starting point given the company is currently not interested in investing in optimisation tools or specialists?

    Thanks,