Improve your A/B tests with 9 lessons from the COVID-19 vaccine trials

A full year has passed since results from the first clinical trials of COVID-19 vaccines became available. Vast amounts of follow-up observational data are now in circulation, which enables some conclusions about the strengths and weaknesses of those trials. Since modern medical trials are among the most rigorous and scrutinized experiments, they can serve as a case study for online experiments, with which they share more similarities than may be apparent.

For many years I have promoted and implemented statistical methods in online A/B testing that closely match the ones used in most COVID-19 trials. Here I offer my take on the lessons these trials hold for practitioners of online experimentation. The article focuses on the design of the experiments and the statistics involved and makes no explicit or implicit claims of medical or epidemiological expertise on my part.

The focus is on the Pfizer-BioNTech and Moderna trials in particular, with one offering a frequentist perspective and the other a somewhat Bayesian one. Hopefully, these lessons will help you improve the rigor of your own A/B tests.

Quick navigation, should you want to revisit a point:

  1. Prevent observer bias through blinding
  2. Test efficiently with robust results through sequential monitoring
  3. Choose an appropriate significance threshold
  4. Determine the required sample size
  5. Monitor multiple metrics
  6. Report several statistical estimates
  7. Maximize representativeness of the sample
  8. Changes over time can be devastating for the validity of even the most rigorous result
  9. Associate clear actions with each possible outcome

On to lesson #1:

Prevent observer bias through blinding

Both trials offer a good example on this point. Take Moderna’s research protocol title [2]:

“A Phase 3, Randomized, Stratified, Observer-Blind, Placebo-Controlled Study to Evaluate the Efficacy, Safety, and Immunogenicity of mRNA-1273 SARS-CoV-2 Vaccine in Adults Aged 18 Years and Older”

(emphasis on “Observer-Blind” mine)

Right from the get-go, an important property of most clinical trials becomes apparent: the removal of observer-induced biases in research. Observer-blind means that neither the results as a whole nor fractions of the data will be available to most personnel involved in the experiment. This leaves a select few, namely those who make sure the treatment is applied as prescribed, as the only ones with access to actual data on which person is assigned to which treatment. These people are removed from the decision-making process so that their knowledge does not influence it.

The members of the DSMB (Data and Safety Monitoring Board) are the only people who can see actual (unblinded) outcomes at prespecified points in time. Only they can make decisions based on that data. Ideally, members of a DSMB are not operationally involved and have minimal stakes in a trial’s outcome.

How can the blinding principle be implemented in A/B testing? One application of the blind observer approach would be to make sure only the lead A/B testing practitioner can view the actual data in full. Everyone else gets to see a blinded version, and ideally only at prespecified points in time, until the experiment is concluded one way or another.

For example, test results may be presented with neutral test group names such as “Variant A”, “Variant B”, etc. The control group is also called a “variant” in the blinded version of the data. Doing this significantly reduces the chance of anyone in the decision-making chain playing favorites or interfering with the execution of the test, as long as it does not deviate significantly from the pre-agreed parameters.
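As a minimal sketch of such blinding, assuming test results are held in a simple dictionary keyed by the real group names, the snippet below swaps those names for neutral labels before a report is shared. The function names, the example group names, and the counts are all hypothetical and only serve the illustration.

```python
import random

def blind_labels(group_names, seed):
    """Map real group names to neutral labels in a shuffled order."""
    rng = random.Random(seed)  # the seed stays with the lead practitioner only
    shuffled = list(group_names)
    rng.shuffle(shuffled)
    labels = [f"Variant {chr(ord('A') + i)}" for i in range(len(shuffled))]
    return dict(zip(shuffled, labels))

def blind_report(results, mapping):
    """Return results keyed by neutral label, sorted so order reveals nothing."""
    return dict(sorted((mapping[name], stats) for name, stats in results.items()))

# Hypothetical unblinded data, visible to the lead practitioner only
results = {
    "control": {"users": 10000, "conversions": 500},
    "new_checkout": {"users": 10000, "conversions": 540},
}
mapping = blind_labels(results.keys(), seed=20211118)
blinded = blind_report(results, mapping)
print(blinded)
```

Because the labels are assigned after shuffling, the control group is not guaranteed to be “Variant A”, so the label itself gives nothing away.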

It is not rare to hear of executives wanting to cut a test short, without abiding by the agreed-upon monitoring schedule, because of perceived efficacy, a perceived lack of it, or futility. While blinding will not counter such practices completely, it should at least prevent biasing the outcome in favor of or against a particular A/B test variant.

Test efficiently with robust results through sequential monitoring

In both clinical trials the test monitoring times are largely predetermined either based on time of evaluation (number of days since start for Pfizer) or percentage of the total target cases (for Moderna). Statistical stopping bounds are constructed to ensure statistical error guarantees hold despite the multiple possible data evaluations.

Here is what that looks like in the Moderna trial protocol [2]:

“The overall Type I error rate for the primary endpoint at the IAs and the primary analysis is strictly controlled at 2.5% (1-sided) based on the Lan-DeMets O’Brien-Fleming approximation spending.”

And further:

“There are 2 planned IAs at 35% and 70% of total target cases across the 2 treatment groups. The primary objective of the IAs is for early detection of reliable evidence that VE is above 30%. The Lan-DeMets O’Brien-Fleming approximation spending function is used for calculating efficacy bounds and to preserve the (1-sided) 0.025 false positive error rate over the IAs and the primary analysis (when the target number of cases have been observed), relative to the hypothesis:” where IAs stands for “Interim Analyses”.

It is refreshing to see that even the so-called Bayesian analysis of the Pfizer-BioNTech trial included in its design measures against the inflation of error probabilities due to interim analyses [3]:

“The final analysis uses a success boundary of 98.6% for probability of vaccine efficacy greater than 30% to compensate for the interim analysis and to control the overall type 1 error rate at 2.5%. Moreover, primary and secondary efficacy end points are evaluated sequentially to control the familywise type 1 error rate at 2.5%.”

Many proponents of Bayesian approaches in online A/B testing deny that such error control is appropriate or necessary, and even tout as a huge benefit that it supposedly does not apply to Bayesian analyses.

One lesson here is that when there are costs associated with delaying the distribution of a beneficial intervention or with continuing a futile trial, an experimenter can turn to sequential monitoring of the test data in order to make decisions “early” (as opposed to a fixed-sample design). Such costs are present in practically every A/B test.

Another insight is that if one is going to evaluate and act on results as the data gathers, an appropriate statistical design is needed, along with adjustments to the resulting statistical estimates to account for the interim looks. Otherwise “peeking” occurs, error rates become inflated, and statistical guarantees can be severely compromised.
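To make the cost of peeking concrete, here is a small simulation sketch. It assumes an A/A test (no true effect) with normally distributed data of known variance, which is a simplification, and compares a single final analysis with ten unadjusted looks at a one-sided 0.025 threshold. The function name and all parameters are illustrative.

```python
import random
from statistics import NormalDist

def false_positive_rate(n_looks, n_per_look, n_sims, alpha=0.025, seed=1):
    """Share of simulated A/A tests declared 'winners' when checked at every look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # fixed-sample one-sided bound
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # Sum of n_per_look standard-normal observations (no true effect)
            total += rng.gauss(0.0, n_per_look ** 0.5)
            n += n_per_look
            if total / n ** 0.5 > z_crit:  # naive, unadjusted test at each peek
                hits += 1
                break
    return hits / n_sims

single = false_positive_rate(n_looks=1, n_per_look=1000, n_sims=20000)
peeking = false_positive_rate(n_looks=10, n_per_look=100, n_sims=20000)
print(f"one analysis: {single:.3f}, ten unadjusted looks: {peeking:.3f}")
```

With one analysis the simulated false positive rate stays close to the nominal 2.5%, while with ten unadjusted looks it ends up well above it, even though the total sample size is the same.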

I’ve championed error-spending approaches and the Lan-DeMets family of spending functions. Lan-DeMets alpha- and beta-spending boundaries are what Analytics-Toolkit.com has implemented since it started offering sequentially monitored A/B tests.

An important feature of the error-spending approach is that the timing of the analysis can be somewhat flexible with regard to the sample size achieved at each analysis, and the number of analyses can also be adjusted if interim results or other considerations require it. This is almost a necessity for most online tests where both weekly traffic levels and event rates fluctuate due to various types of shocks and seasonality.
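The flexibility comes from the spending function itself, which only needs the information fraction reached at each look. Below is a sketch of the Lan-DeMets O’Brien-Fleming-type alpha-spending function in its commonly cited form, evaluated at the 35%, 70%, and 100% fractions mentioned in the Moderna protocol. It yields the cumulative alpha spent, not the efficacy boundaries themselves; converting spent alpha into z-score boundaries requires numerical integration that is not shown here. The function name is my own.

```python
from statistics import NormalDist

def obf_alpha_spent(info_fraction, alpha=0.025):
    """Cumulative one-sided alpha spent at an information fraction in (0, 1]."""
    if not 0 < info_fraction <= 1:
        raise ValueError("info_fraction must be in (0, 1]")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / info_fraction ** 0.5))

looks = [0.35, 0.70, 1.00]  # fractions of total target cases or sample size
cumulative = [obf_alpha_spent(t) for t in looks]
incremental = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
for t, c, inc in zip(looks, cumulative, incremental):
    print(f"fraction {t:.2f}: cumulative alpha {c:.5f}, spent at this look {inc:.5f}")
```

Very little alpha is spent at the early look and most of it is preserved for the final analysis. If an analysis happens at, say, 40% instead of 35% of the planned information, the same function is simply evaluated at 0.40.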

Monitoring of the test’s data as it gathers, even with proper stopping bounds, introduces conditional and unconditional biases in naïve estimates computed after stopping a test. Therefore, an indispensable part of using error-spending is the implementation of adjustments needed to achieve unconditionally and conditionally bias-reduced estimates such as effect size estimates, p-values, and confidence intervals.

Choose an appropriate significance threshold

A significance threshold is the minimum evidential threshold which must be met by a statistical test in order to draw a conclusion about the tested hypotheses and take subsequent action with a reasonable level of uncertainty.

In a highly regulated industry where meeting the minimum regulatory standard is typically how one secures the most value for shareholders, it is not surprising to see that the regulatory minimums are applied as de-facto standards. In the case of FDA-monitored trials, that minimum is stated as a significance threshold for a two-sided test of 0.05 (equivalent to a 95% confidence level).

Since the outcome is directional both trial protocols state their error guarantees as one-sided alphas [1-4]. At 0.025 they match the minimum requirement of the FDA of 0.05 two-sided exactly.

One lesson here is that there is nothing extraordinary or unusual about using one-sided tests. Using a one-tailed test, and so a one-sided hypothesis, does not mean that the statistical test will only work if the true effect is in one direction. In fact, it is necessary to frame the error control in terms of a one-sided significance threshold if the resulting inference will be directional, such as “The vaccine confers protection against infection and poor health outcomes”. It is also much more efficient in terms of sample size and statistical power.

An example of a directional claim in A/B testing is “User experience A results in better business outcomes than the current one”. One-sided p-values and one-sided confidence intervals need to be reported alongside such a claim.
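The relationship between one-sided and two-sided thresholds can be checked with a few lines, assuming a test statistic that follows the standard normal distribution. The variable and function names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Same critical value: one-sided 0.025 versus two-sided 0.05
z_one_sided_025 = std_normal.inv_cdf(1 - 0.025)
z_two_sided_05 = std_normal.inv_cdf(1 - 0.05 / 2)

# Same nominal alpha of 0.05: the one-sided bound is easier to reach
z_one_sided_05 = std_normal.inv_cdf(1 - 0.05)

def one_sided_p(z):
    """P-value for H0: effect <= 0, given an observed z-score."""
    return 1 - std_normal.cdf(z)

def two_sided_p(z):
    """P-value for H0: effect == 0, given an observed z-score."""
    return 2 * (1 - std_normal.cdf(abs(z)))

print(z_one_sided_025, z_two_sided_05, z_one_sided_05)
print(one_sided_p(2.0), two_sided_p(2.0))
```

The first two critical values coincide, which is why a one-sided 0.025 matches a two-sided 0.05 requirement. The efficiency gain appears when the same nominal alpha is applied to a directional claim: the one-sided critical value is lower, so less data is needed for the same power.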

Back to the clinical studies. On the one hand, one can criticize the choice of a 0.025 threshold as being too lax on the basis that the financial and health consequences of a false positive are too great to be satisfied by such a criterion. On the other hand, it is also possible to make the case that if a more stringent threshold were employed it would reduce the power of the test (more on this in the next section), or that it would require a larger sample size, which could potentially delay the release of a benefit to a large number of target individuals or expose a greater number of individuals to risk during the test if done in a short time span.

A proper discussion would require a dive into the complex trade-offs faced by both vaccine manufacturers and regulatory agencies and is well outside the bounds of this article.

A lesson can still be drawn from this and applied to business experiments. Since in most A/B testing scenarios one can calculate decent estimates of most of the relevant costs and rewards, incorporating this information through a risk-reward analysis can result in an ROI-optimal online experiment. This means that its sample size and significance threshold would achieve an optimal balance of risks and rewards, regardless of outcome.

Determine the required sample size

One way to think of the sample size is as a function of the statistical power, the minimum effect of interest, and a given significance threshold. The power level used in both trials was 90%. However, the Moderna trial targeted a 60% efficacy rate whereas Pfizer-BioNTech targeted 30%. The difference in the minimum effect of interest resulted in a significantly higher sample size requirement for the latter versus the former (~43,548 vs 30,420).

A lesson for A/B testing is that testing faster, all else being equal, invariably means a higher probability of missing out on meaningful actual lifts of a smaller magnitude. In a field where typical observed effect sizes (not to mention actual ones) are in the range of single-digit percentages, and sometimes lower, having less than 90% power to detect meaningful effects can be devastating for an A/B testing program. It would probably result in way too many false negatives. Those are missed opportunities to improve the business as well as wasted efforts in development and testing.
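To see how strongly the minimum effect of interest drives the sample size, here is a sketch using the standard normal-approximation formula for a fixed-sample, one-sided comparison of two conversion rates. The baseline rate and lifts are made-up inputs, the function name is hypothetical, and a sequential design would require its own calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(baseline_rate, relative_lift, alpha=0.025, power=0.90):
    """Approximate users per group for a one-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical 5% baseline conversion rate
n_5pct = sample_size_per_group(0.05, 0.05)   # detect a 5% relative lift
n_10pct = sample_size_per_group(0.05, 0.10)  # detect a 10% relative lift
n_5pct_80 = sample_size_per_group(0.05, 0.05, power=0.80)
print(n_5pct, n_10pct, n_5pct_80)
```

Halving the minimum effect of interest roughly quadruples the required sample size, while lowering the power from 90% to 80% shortens the test at the price of more missed true effects.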

The question of what sample size or test duration to set for a test is inextricably linked to the choice of the statistical significance threshold. Since the definition of “meaningful” depends on the sample size / test duration, a feedback loop occurs in the determination of appropriate test duration and significance threshold. Using an approach for determining ROI-optimal parameters based on business metrics is the only solution which deals with the problem head on.

Monitor multiple metrics

Both vaccine experiments list a number of metrics (endpoints) with a leading efficacy endpoint and secondary efficacy and safety endpoints. This is good practice as it allows the same experiment to produce a set of useful data instead of just a single comparison.

While powering the test for all endpoints is often not done, it is still a good practice to monitor the performance based on several key metrics.

In A/B testing this includes monitoring so-called guardrail metrics, but also secondary metrics. A negative outcome on certain secondary KPIs might nullify the positive effect on a primary metric. Typically, the action following an A/B test will be informed first and foremost by the outcome on the primary metric, which might itself be a composite of several different metrics. However, what happens when metrics have different expected effect windows? Consider, for example, an email campaign with a good average return in terms of sales revenue per sent email (a short-term benefit) which also irritates other email subscribers and pushes them to unsubscribe. Depending on the rate of unsubscribes, it may end up being a long-term loss.

In addition to the above, secondary metrics can often provide further insight into what is driving the main result.

Report several statistical estimates

In the reports following both the BNT162b2 trial of Pfizer-BioNTech and the mRNA-1273 trial of Moderna, multiple statistical estimates for the primary and secondary endpoints are shared. For example [1]:

“vaccine efficacy was 94.1% (95% CI, 89.3 to 96.8%; P<0.001)”

for mRNA-1273, and for BNT162b2 [3]:

“Among participants with and those without evidence of prior SARS CoV-2 infection, 9 cases of Covid-19 at least 7 days after the second dose were observed among vaccine recipients and 169 among placebo recipients, corresponding to 94.6% vaccine efficacy (95% CI, 89.9 to 97.3).”

Some A/B testing practitioners focus exclusively on a single estimate – be it the effect size (a.k.a. percentage lift), a p-value or confidence level, or a confidence interval. While each has its merits on its own, a lesson that can be taken from the COVID-19 vaccine trials is that many estimates work best when used in parallel.
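As an illustration of reporting several estimates side by side, the sketch below computes a relative lift, a one-sided p-value, and a one-sided lower confidence bound for a simple fixed-sample A/B test on conversion rates. It relies on a normal approximation and made-up counts, and the function name is hypothetical. For a sequentially monitored test these naive estimates would need the adjustments discussed in the section on sequential monitoring.

```python
from math import sqrt
from statistics import NormalDist

def summarize(conv_a, n_a, conv_b, n_b, alpha=0.025):
    """Relative lift, one-sided p-value and one-sided lower bound for B vs A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the test of H0: p_b <= p_a
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    p_value = 1 - NormalDist().cdf(diff / se_pool)
    # Unpooled standard error for the bound on the absolute difference
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    lower = diff - NormalDist().inv_cdf(1 - alpha) * se
    return {"relative_lift": diff / p_a, "p_value": p_value,
            "abs_diff_lower_bound": lower, "h0": "p_b <= p_a"}

# Hypothetical counts: control A versus variant B
report = summarize(conv_a=500, n_a=10000, conv_b=570, n_b=10000)
print(report)
```

Reporting the null hypothesis together with the p-value and the interval bound leaves less room for misreading any single number.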

A few remarks are necessary on the reporting in the two medical trials. Moderna should have reported the exact p-value and not just “P<0.001” as it would have conveyed more information. Despite committing to type I error control in their statistical design, Pfizer regrettably chose to omit reporting the most-informative measure related to it (the observed p-value). However, it can be inferred from the CI that it will likely be close to that of Moderna.

In my opinion both should have reported the null hypothesis against which the p-value was calculated to help avoid possible misinterpretations, e.g. “(P = 0.0006; H0: VE <= 30%)”, based on “For analysis of the primary end point, the trial was designed for the null hypothesis that the efficacy of the mRNA-1273 vaccine is 30% or less.” which, to Moderna’s credit, was stated separately in the Statistical Analysis section.
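To show what a p-value against H0: VE <= 30% involves, here is a rough sketch using the 9 and 169 case counts from the BNT162b2 quote above. It assumes 1:1 randomization and equal follow-up time in both arms, so that under a given VE the share of cases falling in the vaccine arm is (1 - VE) / (2 - VE). This conditional binomial test is my own simplification for illustration and not necessarily the method used in either trial.

```python
from math import comb

def p_value_ve(cases_vaccine, cases_placebo, ve_null=0.30):
    """Exact one-sided p-value for H0: VE <= ve_null, conditional on total cases."""
    n = cases_vaccine + cases_placebo
    # Under VE = ve_null and equal exposure, share of cases in the vaccine arm
    theta = (1 - ve_null) / (2 - ve_null)
    # Probability of seeing this few (or fewer) cases in the vaccine arm
    return sum(comb(n, k) * theta ** k * (1 - theta) ** (n - k)
               for k in range(cases_vaccine + 1))

ve_point = 1 - 9 / 169      # crude point estimate under equal exposure
p = p_value_ve(9, 169)      # against H0: VE <= 30%
print(f"VE estimate: {ve_point:.3f}, p-value vs H0 (VE <= 30%): {p:.2e}")
```

The crude point estimate lands close to the reported efficacy, and the p-value is many orders of magnitude below 0.001, which is exactly the kind of information that a bare “P<0.001” hides.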

Maximize representativeness of the sample

Both the Pfizer and Moderna experiments have careful inclusion criteria which aim to cover the population most likely to experience the treatment in the post-experimental phase, thus ensuring the representativeness of the results.

Pfizer recruited participants in several different countries meaning the outcome should translate more easily across borders. Moderna’s trial was limited to the US (as far as I can determine) meaning that potential external validity issues could be raised.

Moderna also has a stratification criterion which ensures the experiment will include a certain proportion of key demographics at which the vaccines were targeted (by age and risk status). This can be warranted when there is a need to analyze results and take action for specific subpopulations which may otherwise end up with a sample size which results in estimates with high uncertainty.

The time intervals between observations were sufficiently large to allow short-term effects to demonstrate themselves, even if a trial was stopped early. Longer-term negative outcomes would be monitored even in case the trial was stopped at an interim analysis to account for possible lingering issues.

In online A/B testing representativeness can be improved by allowing for the smoothing of time-related biases and the suppression of any learning effects with time. This happens by spacing out analyses in a way which avoids some time-related biases, and by running tests for longer periods. For example, a possible day of week effect can be avoided with analyses on a weekly basis, or larger intervals, whereas running tests for longer is the main way to address learning effects of both kinds.
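A simple way to enforce this is to schedule analyses only after whole weeks of data, so that every analysis covers each day of the week equally often. The sketch below assumes a hypothetical start date and a four-week maximum duration. The function name is illustrative.

```python
from datetime import date, timedelta

def weekly_analysis_dates(start, n_weeks):
    """Dates on which each interim or final analysis covers whole weeks only."""
    return [start + timedelta(weeks=k) for k in range(1, n_weeks + 1)]

# Hypothetical test starting on a Monday with at most four weekly analyses
schedule = weekly_analysis_dates(date(2022, 1, 10), n_weeks=4)
for analysis_date in schedule:
    print(analysis_date.isoformat())
```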

In general, it would be a good idea to shy away from statistical plans which may end a test in just a few days in the face of an overly large observed effect size.

Regarding geographical and cultural validity issues, I would advise against running a test in a single country and then interpreting the results as holding worldwide, be they positive or negative. If markets with different key characteristics are of business interest, an experiment should ideally be conducted in each one before implementing a change there.

Changes over time can be devastating for the validity of even the most rigorous result

This lesson from the COVID-19 trials is closely related to the external validity of test outcomes discussed above with a slight but important twist. Even if we assume a flawless test design and execution, and no generalizability issues to speak of at the time of conducting the experiment, such issues may arise at any point in the future.

While I cannot say whether that assumption holds for the vaccine trials, it is becoming apparent through observational studies and further experimental research that changes such as virus mutations, virus propagation rates, health policy, and others have resulted in a difference between the expected and observed efficacy of the vaccines in terms of preventing disease and possibly even in the prevention of severe disease and lethal outcomes. Simply put, applying results from one environment to another is a risky business even with very reliable experiments.

Sadly, similar threats exist for the outcomes of even the best A/B tests.

Such threats stem from the fact that human behavior can and does change over time, sometimes abruptly and significantly so. In online A/B testing the risks of such changes come from the overall technological environment, competitor actions, shifts in design and UX paradigms, shifting economic conditions, changing laws and regulations, etc.

Associate clear actions with each possible outcome

Finally, and often of utmost importance, each experiment should have a well-defined set of actions following its possible outcomes. This is the case for both the Pfizer and Moderna trials, and for most large modern clinical trials. While there may be debates with regard to the different aspects of an experiment’s execution and how a certain piece of data is to be interpreted, once these are resolved the road from there to specific action is pretty much determined before a trial even gets approved.

In business testing it is not rare to hear of a well-conducted experiment with highly convincing results which are tossed away by a single HiPPO (highest paid person’s opinion). To avoid such situations as much as possible, one should get pre-approval of different courses of action such as “implement the copy change” if the test has passed certain checks and met its thresholds.
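One lightweight way to record such pre-approval is to write the outcome-to-action mapping down before the test starts. The sketch below is purely illustrative: the outcome names, the actions other than the copy change example, and the function name are all hypothetical placeholders for whatever a team agrees on.

```python
# Agreed and signed off before the test begins (hypothetical entries)
PREAPPROVED_ACTIONS = {
    "efficacy": "implement the copy change",
    "futility": "keep the current copy",
    "inconclusive": "keep the current copy",
}

def decide(outcome, checks_passed):
    """Return the pre-approved action for a test outcome."""
    if not checks_passed:
        # Failed quality checks mean the result should not drive a decision yet
        return "investigate data quality before acting"
    return PREAPPROVED_ACTIONS[outcome]

print(decide("efficacy", checks_passed=True))
```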

It should be noted that this does not mean that “the numbers should be followed blindly”, but that for an experiment to be conducted, all tested variants should be considered good candidates for implementation when viewed holistically before the test even begins. If they are not, then they have no place being tested in the first place.

Conclusion

This concludes our brief excursion into the land of medical research. While not an exhaustive examination of the design of the vaccine trials, the points raised will hopefully help you level up your A/B testing.

References

1 Baden et al. (2020) “Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine”, New England Journal of Medicine 384(5):403-416, doi: 10.1056/NEJMoa2035389
2 Moderna Protocol mRNA-1273-P301, “A Phase 3, Randomized, Stratified, Observer-Blind, Placebo-Controlled Study to Evaluate the Efficacy, Safety, and Immunogenicity of mRNA-1273 SARS-CoV-2 Vaccine in Adults Aged 18 Years and Older”
3 Polack et al. (2020) “Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine”, New England Journal of Medicine 383:2603-2615, doi: 10.1056/NEJMoa2034577
4 Pfizer / BioNTech Protocol C4591001, “A Phase 1/2/3, Placebo-Controlled, Randomized, Observer-Blind, Dose-Finding Study to Evaluate the Safety, Tolerability, Immunogenicity, and Efficacy of SARS-COV-2 RNA Vaccine Candidates Against COVID-19 in Healthy Individuals”

About the author

Georgi Georgiev

Managing owner of Web Focus and creator of Analytics-Toolkit.com, Georgi has seventeen years of experience in online marketing, web analytics, statistics, and design of business experiments for hundreds of websites.

He is the author of the book "Statistical Methods in Online A/B Testing" and white papers on statistical analysis of A/B tests, as well as a lecturer at dozens of conferences, seminars, and courses, including as a Google Regional Trainer.
