What is the best design for a statistical test with sequential evaluation of the data at multiple points in time? This is a question that anyone who has realized that unaccounted-for peeking with intent to stop is the bane of A/B testing eventually comes to ask. So how does one go about answering it?
This article presents the main available options, namely fully sequential and group sequential tests, along with their strengths and weaknesses, and attempts to outline which test to use in which scenario. It draws heavily on Chapter 10 (“Sequential testing: continuous monitoring of data”) of my book “Statistical Methods in Online A/B Testing” as well as my 2017 white paper “Efficient A/B Testing in Conversion Rate Optimization: The AGILE Statistical Method”, but also contains insights not shared before, including commentary on some potentially misleading claims by A/B testing software vendors.
The article starts with a bit of history to show the evolution of sequential testing, which will help put the rest of the article into context. It ends with a brief note on the related concept of tests of power one. Online A/B testing remains the focus throughout.
The full outline:
- Early development of sequential statistical tests
- The rise of group sequential tests
- When to use fully sequential tests
- When to use group sequential tests
- A word on tests of power one
Early development of sequential statistical tests
While there were some applications of sequential testing procedures before that, the first significant development in modern sequential testing happened during the Second World War. In 1943 Abraham Wald, then working as part of the Statistical Research Group at Columbia University, developed the Sequential Probability Ratio Test (SPRT) [1]. The SPRT aimed to reduce the cost of assuring the quality of the different equipment, vehicles, ammunition, etc., produced during the war effort. Since a lot of industrial testing is destructive for the tested item and takes a fair bit of effort, the goal was to minimize the number of items inspected in order to either clear a batch of items or declare it unfit.
The SPRT is quite successful in that it reduces the average sample size required by 40-50% compared to a fixed sample test with an equivalent design in terms of type I and type II error control.
The cost of this success on average is an increase in the maximum sample size a sequential test may need to reach a conclusion, meaning that it suffers from decreased statistical power at any given sample size compared to an equivalent fixed horizon test.
Sequential tests achieve significantly better average test efficiency at the cost of a decrease in statistical power.
In the case of a classic SPRT the maximum sample size is unrestrained. Testing can in fact continue as long as the statistic remains in the indecision region between the two stopping bounds.
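To make the two stopping bounds and the indecision region concrete, here is a minimal sketch of a classic SPRT for a binary outcome. It assumes simple hypotheses (a rate of `p0` versus a rate of `p1`) and Wald's approximate bounds, log((1 - beta) / alpha) and log(beta / (1 - alpha)). The function name `sprt_bernoulli` and all parameter values are illustrative choices of mine rather than anything prescribed in the text.

```python
import math
import random

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Classic (untruncated) SPRT for H0: p = p0 versus H1: p = p1.

    Returns (decision, n). The decision is 'accept_h1', 'accept_h0',
    or 'continue' if the data ran out inside the indecision region.
    """
    upper = math.log((1 - beta) / alpha)   # crossing it rejects H0
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0
    llr = 0.0                              # cumulative log-likelihood ratio
    n = 0
    for n, success in enumerate(observations, start=1):
        # Contribution of a single Bernoulli outcome to the log-likelihood ratio
        if success:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        # The test is evaluated after every single observation
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n

# Hypothetical stream of inspected items with a true defect rate of 20%
rng = random.Random(1)
stream = (rng.random() < 0.20 for _ in range(100_000))
print(sprt_bernoulli(stream, p0=0.10, p1=0.20))
```

Note that nothing in the loop caps the number of observations: as long as the statistic stays between `lower` and `upper`, the test keeps asking for more data, which is exactly the unrestrained maximum sample size described above.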
Important characteristics of the scenarios to which these early tests were applied include the following:
- outcomes can be observed and acted on immediately (e.g. a shell pierces a barrier or fails to do so)
- test conclusions generalize easily (there are minimal external validity concerns)
- having a required time / sample size by which to conclude the test was of secondary concern
- decision-making is of primary importance, arriving at reliable estimates is of little apparent concern
While there certainly were concerns regarding the duration and expenditure on any given test, these early developments could only address them to some extent. Wald himself presents ‘truncated sequential tests’ to handle the need to stop with a definite sample size in his 1947 book “Sequential Analysis”. However, these are quite crude and he offers only a limited set of upper bounds at given thresholds for type I and type II errors (pp. 61-65).
Regarding estimation, Wald offers just a rough sketch in terms of confidence intervals. There seems to be no mention of the bias introduced to point estimates (maximum likelihood estimation).
In 1957 P. Armitage developed so-called “restricted sequential tests” to address the need for sequential tests with a specified maximum sample size while also having rigorous error control [2]. These tests, also known as triangular sequential probability ratio tests, provide proper error control, unlike the truncated tests of Wald.
Classic SPRT and restricted sequential tests can both be described as “fully sequential” in that data is evaluated as it gathers, and decisions to stop or proceed are made after each observation.
The costs of running fully sequential tests include poor generalizability in case of threats to external validity, a penalty to statistical power, and issues handling lagged effects in early evaluations (or the requirement for complicated mechanisms to handle them).
The rise of group sequential tests
With an idea of the scenarios that brought sequential tests to life, one can turn to group sequential tests. As the sequential literature grew, more and more applications for sequential evaluation of data were found, and so developments were rapid in the years following the work of Armitage. First, tests with evaluations at fixed analysis times were introduced by Pocock [3], Haybittle [4], Peto et al. [5], and O’Brien and Fleming [6], starting in the early 1970s. These were extended and made more flexible by the alpha-spending approach of Lan and DeMets [7]. A significant literature on estimation following group sequential trials followed and continues to grow to this day.
What motivated these developments was the need to address shortcomings of the fully sequential tests:
- outcomes were often impossible to observe and act on in real time due to lagged effects or logistical reasons
- applications in trials involving humans meant dealing with significant concerns about threats to external validity
- having a definite time / sample size to complete the test and come to a conclusion is of major concern in both business and scientific applications where deadlines abound and huge uncertainty regarding timeframes can rarely be tolerated
- arriving at more reliable and more efficient estimates is of significant concern (although still secondary to performing efficient hypothesis tests)
Just a quick comparison between this list and the characteristics of the scenarios that gave birth to modern fully sequential tests should make it clear what group sequential analyses offer to most online A/B testing scenarios.
The benefits of group sequential tests compared to their fully sequential precursors are most obvious for tests that stop really early, or really late, but apply regardless. In no particular order:
- group sequential tests (GSPs) better accommodate lagged effects due purely to the fact that there is more time to differentiate between a treatment which performs better over relevant periods of time versus a treatment which reduces the lag to the desired action, potentially at the cost of hurting overall performance.
- GSPs better align with cases where acting on the data is only possible at some interval, either due to lagged effects or logistical reasons. The result is increased statistical power compared to evaluating the data regardless of when it can be acted on.
- GSPs alleviate external validity issues through the shape of the alpha-spending function (more stringent early on) and the spacing of analyses to compensate for short-term variability. The latter is typically done by performing analyses weekly, but even daily analyses would have significant benefits on generalizability compared to fully sequential evaluation.
- GSPs allow the imposing of a hard stop with a conclusion in either direction. If you know you have at most four weeks to make a certain call, a GSP will accommodate that. Among fully sequential tests, restricted ones allow for that; the others do not. If a testing tool never asks when you want or need to have an outcome from a test, then it is using the classic SPRT or a variant of it.
- group sequential tests have higher statistical power (sensitivity) to true effects of any particular size.
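To illustrate what an alpha-spending function that is "more stringent early on" looks like, the sketch below evaluates two commonly used Lan-DeMets spending functions at four weekly analyses: one approximating O'Brien-Fleming bounds and one approximating Pocock bounds. The function names, the overall alpha of 0.05, and the four equally spaced analyses are assumptions made for illustration only. This is not a description of any particular tool's implementation.

```python
import math
from statistics import NormalDist

_norm = NormalDist()

def obf_like_spend(t, alpha):
    """O'Brien-Fleming-type spending: cumulative alpha spent at information fraction t."""
    if t <= 0:
        return 0.0
    z = _norm.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _norm.cdf(z / math.sqrt(t)))

def pocock_like_spend(t, alpha):
    """Pocock-type spending: cumulative alpha spent at information fraction t."""
    return alpha * math.log(1 + (math.e - 1) * t)

alpha = 0.05     # assumed overall type I error budget
analyses = 4     # assumed: one analysis per week for four weeks
for week in range(1, analyses + 1):
    t = week / analyses  # fraction of the maximum sample size observed so far
    print(f"week {week}: OBF-like {obf_like_spend(t, alpha):.5f}, "
          f"Pocock-like {pocock_like_spend(t, alpha):.5f}")
```

Both functions spend the full alpha by the final analysis, but the O'Brien-Fleming-type function releases very little of it in the first week. That makes an early stop on a short, possibly unrepresentative stretch of data much harder to trigger.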
Given that most statistical applications in A/B tests in conversion rate optimization share the same motivation which led to the rise of group sequential tests, it should be no surprise that group sequential methods would be preferred over their fully sequential precursors. This is a choice I made back when I first started working on the Analytics Toolkit statistical engine, despite the significantly more complicated implementation that advanced GSP methods entailed compared to the simpler SPRT.
Group sequential tests align best with most A/B testing scenarios.
However, this is not to say that there is no place for fully sequential tests in A/B testing, in CRO or beyond. Below is a brief review of scenarios where one type of test is preferable to the other.
When to use fully sequential tests
Fully sequential tests are likely the best choice for evaluating guardrail metrics in online A/B testing. These are typically metrics with immediate feedback (errors in generating / displaying a page or element, page load time, etc.) and in such scenarios external validity issues are of little to no concern. Fully sequential tests offer the best average efficiency and will, most of the time, shut down a test very quickly if needed. Since acting on the outcome of the test happens automatically, one benefits fully from their efficiency. Statistical power is at best a secondary concern in these scenarios, so the poor performance of fully sequential tests on that front has little relevance. Estimation accuracy is likewise of little concern.
Fully sequential tests can also be used when the nature of the A/B test is such that it can be left running for months on end if necessary without the need for a timely conclusion. External validity should also be of no particular concern, as well as power. Outside of asserting quality control on a perpetual basis, and automatically taking action on a test outcome, I cannot think of other such applications based on my admittedly limited experience.
If there is no ability to automatically act on the outcome of a fully sequential test, its efficiency bonus over a group sequential test is lost while the cost of poorer power relative to a GSP is still incurred. Therefore, if the action following a test outcome is not automated, a fully sequential test should not be applied in that scenario due to the loss of utility and the power penalty.
When to use group sequential tests
Group sequential tests are called for when there are concerns about external validity issues affecting small samples / short tests in terms of time, which is typically the case. All three main types of threats to generalizability (time-related, population change, and learning effects) are better handled in a group sequential evaluation of A/B tests than in a fully sequential one.
Group sequential tests are also a good choice when deadlines need to be adhered to and decisions cannot be postponed indefinitely, although restricted fully sequential tests remain an option if this is the only concern and none of the drawbacks of fully sequential tests are relevant.
If one can’t act on each observation, then group sequential tests should be preferred for their higher power compared to fully sequential ones (with equivalent efficiency). Unless the action following an A/B test can be fully automated, it does not make sense to use a fully sequential test whose main purpose is to allow acting in an immediate and/or automatic fashion. For example, deciding to check on and act on a fully sequential A/B test once a week means all the rest of the sequential evaluations are a pure loss of power with no benefit to average efficiency.
Overall, group sequential A/B tests offer the benefits of sequential testing such as more efficient testing on average and the ability to act on data as it gathers, while minimizing the loss of statistical power and suffering only to an extent in terms of bias and variance of the statistical estimates compared to fixed sample tests. Therefore, when one seeks to balance statistical power and testing efficiency, especially in the face of external validity concerns and inability to act immediately after each observation, group sequential tests should be the choice.
A word on tests of power one
While not intrinsically related to the discussion about fully sequential versus group sequential tests, claims about sequential tests having a power of one (100% power) are becoming more common when talking about sequential statistics.
Wald established as early as 1947 that a sequential test will eventually terminate with probability one (Section A.1 of the Appendix in “Sequential Analysis”). However, speaking of “tests of power one” does not make any practical sense despite attempts by vendors and evangelists to claim otherwise. Given that tests typically last anywhere from a few days or weeks to a couple of months, and that there is typically a set deadline by which to make a decision either way, speaking of “tests of power one” seems highly misleading.
The reason is that such a claim leaves crucial context out. Namely, it fails to compare the sequential test to alternatives such as fixed sample tests. This is important because a fixed sample size test with a given significance threshold and minimum effect of interest has statistical power greater than that of any equivalent sequential test with the same number of users. Furthermore, any well-designed A/B test, regardless of type, will have power less than one. So even though claims for tests with power one may theoretically hold, they lack any practical substance.
Claims for tests with power one lack context and are practically irrelevant.
For example, if a website is used by 30,000 users per week and a test can run for a maximum of eight weeks, the maximum available sample size is 30,000 x 8 = 240,000. With alpha = 0.01, a baseline of 0.05, and a minimum effect of interest of 0.003 (a 6% relative lift), a fixed sample size test has statistical power of roughly 84%. A sequential test with weekly evaluation of the outcomes (eight analyses) and a maximum sample size of 240,000 has a power of just under 80% for the same minimum effect of interest. This translates to an increase in the type II error rate from 16% to 20%+, or a 25% relative increase. A fully sequential test of any kind will have even less power with 240,000 users.
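The fixed sample figure can be checked with a short calculation. The sketch below assumes a one-sided two-proportion z-test with the 240,000 users split equally between a control and a single variant, and uses the unpooled normal approximation. None of these details are stated in the example, so treat them as my assumptions. The function name `fixed_sample_power` is likewise illustrative.

```python
import math
from statistics import NormalDist

def fixed_sample_power(baseline, abs_lift, total_users, alpha):
    """Approximate power of a one-sided two-proportion z-test (equal allocation)."""
    norm = NormalDist()
    n_per_arm = total_users / 2                 # assumed: control plus one variant
    p1, p2 = baseline, baseline + abs_lift
    # Standard error of the difference in proportions under the alternative
    se = math.sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z_crit = norm.inv_cdf(1 - alpha)            # one-sided critical value
    return norm.cdf(abs_lift / se - z_crit)

max_users = 30_000 * 8                          # eight weeks of traffic
power = fixed_sample_power(baseline=0.05, abs_lift=0.003,
                           total_users=max_users, alpha=0.01)
print(f"relative lift: {0.003 / 0.05:.0%}")
print(f"fixed-sample power: {power:.3f}")

# Relative increase in the type II error rate when power drops from 84% to 80%
print(f"relative type II increase: {(0.20 - 0.16) / 0.16:.0%}")
```

Under these assumptions the computed power comes out at about 0.84, in line with the figure quoted above. A two-sided test or a different allocation would give a different number.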
None of the statistical tests described above has a power of one. For sequential tests, the more data evaluations there are, the less powerful the test, all else being equal. The same can be observed at any practical stopping time and equivalent sample size, exposing ‘power of one’ claims as misleading.
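The effect of adding more evaluations can be illustrated with a small Monte Carlo sketch. It uses a deliberately simple design that is not the one behind the figures above: equally spaced looks, a single constant (Pocock-style) efficacy bound calibrated by simulation so that the overall one-sided type I error is about 0.01, and no futility bound. The drift of 3.325 is my own approximation of the expected z-score for the example's effect at 240,000 users split equally between two variants. All names and settings are illustrative.

```python
import math
import random

def max_z_over_looks(rng, looks, drift):
    """Largest z-score observed across equally spaced looks.

    drift is the expected z-score at the final look (0 under the null).
    """
    step_mean = drift / math.sqrt(looks)
    total, best = 0.0, -math.inf
    for k in range(1, looks + 1):
        total += rng.gauss(step_mean, 1.0)       # standardized data from one more group
        best = max(best, total / math.sqrt(k))   # z-score at look k
    return best

def power_with_constant_bound(looks, drift, alpha, sims=20_000, seed=7):
    """Simulated power when every look uses the same critical value."""
    rng = random.Random(seed)
    # Calibrate the bound so that roughly alpha of null runs ever cross it
    null_max = sorted(max_z_over_looks(rng, looks, 0.0) for _ in range(sims))
    bound = null_max[int((1 - alpha) * sims)]
    # Share of runs under the alternative that cross the bound at any look
    hits = sum(max_z_over_looks(rng, looks, drift) >= bound for _ in range(sims))
    return hits / sims

for looks in (1, 2, 8):
    estimate = power_with_constant_bound(looks, drift=3.325, alpha=0.01)
    print(f"{looks} look(s): simulated power {estimate:.3f}")
```

With the maximum sample size held fixed, the simulated power should fall as the number of looks grows, since the bound has to rise to keep the overall type I error under control. More refined group sequential designs lose far less power than this crude constant bound, but the direction of the effect is the same.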
References:
1 Wald, A. 1945. “Sequential Tests of Statistical Hypotheses.” The Annals of Mathematical Statistics 16: 117-186. doi:10.1214/aoms/1177731118.
2 Armitage, P. 1957. “Restricted Sequential Procedures.” Biometrika 44: 9-26. doi:10.2307/2333237.
3 Pocock, S.J. 1977. “Group sequential methods in the design and analysis of clinical trials.” Biometrika 64: 191–199. doi:10.2307/2335684.
4 Haybittle, J.L. 1971. “Repeated assessment of results in clinical trials of cancer treatment.” The British Journal of Radiology 44: 793-797. doi:10.1259/0007-1285-44-526-793.
5 Peto, R., M.C. Pike, P. Armitage, N.E. Breslow, D.R. Cox, S.V. Howard, N. Mantel, K. McPherson, J. Peto, and P.G. Smith. 1976. “Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design.” British Journal of Cancer 34 (6): 585–612. doi:10.1038/bjc.1976.220.
6 O’Brien, P.C., and T.R. Fleming. 1979. “A Multiple Testing Procedure for Clinical Trials.” Biometrics 35: 549–556. doi:10.2307/2530245.
7 Lan, K.K.G., and D.L. DeMets. 1983. “Discrete Sequential Boundaries for Clinical Trials.” Biometrika 70: 659-663. doi:10.2307/2336502.