How to Run Shorter A/B Tests?

Running shorter tests is key to improving the efficiency of experimentation: it translates into smaller direct losses from exposing users to inferior experiences and less unrealized revenue from the late implementation of superior ones.

Despite this, many practitioners have yet to start conducting tests at the frontier of efficiency. This article presents ways to shorten the duration of A/B tests by up to 40%, depending on your starting point, so by the end of it you may have discovered how to run tests nearly twice as efficiently as before.

Suppose you are about to conduct an A/B test that would require 50,000 users per variant for a total of 100,000 users, and that it would take five weeks to run with the current traffic to the pages affected by the tested change. How does one run such a test in just three weeks and expose just 60,000 users in total, thereby getting closer to the frontier of efficiency?

Use one-sided tests

The first thing one should do is align the tested substantive claim with the statistical hypothesis. By doing so one would discover that a one-sided test is appropriate for most, if not all, of their A/B tests. Incorrectly running experiments with two-sided tests means taking 20%-60% longer than necessary, with the average being ~27%. Such tests also produce inaccurate uncertainty estimates, including p-values twice as large as they should be and much wider confidence intervals.

While I love pointing out that there is no free lunch in statistics, this doesn't apply in cases where one addresses an error in their approach. Switching to the right kind of test in this case results in efficiency gains of between 15% and 40% (~21% being the expected value) without any downsides. This means that a test which would have required 100,000 users to run now only requires about 79,000.

By simply aligning your statistical hypothesis with the substantive claim being tested, you can gain from 15% to 40% efficiency in your typical A/B test.
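You can check these figures yourself. The minimal sketch below assumes a z-test on a single primary metric planned for 80% power at a 5% significance threshold (illustrative values, not a recommendation): the required sample size is proportional to the square of the sum of the relevant normal quantiles, and the only difference between the two designs is whether the quantile is taken at α or at α/2.

```python
# Relative sample size of a one-sided vs a two-sided z-test.
# Assumes 80% power and a 5% significance threshold, purely for illustration.
from scipy.stats import norm

alpha, power = 0.05, 0.80
two_sided = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2
one_sided = (norm.ppf(1 - alpha) + norm.ppf(power)) ** 2

saving = 1 - one_sided / two_sided
print(f"one-sided test needs {saving:.1%} fewer users")          # ~21.2%
print(f"100,000 users becomes ~{100_000 * (1 - saving):,.0f}")   # ~78,800
```

The ~21% figure is the expected saving at 80% power; at other power levels the saving shifts within the 15%-40% range quoted above.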

Why is anyone still using two-sided tests if that's the case, then? There are many reasons, two of which are of primary importance in my view:

  • many tool vendors offer only two-sided tests or offer them as the default choice
  • the apparent paradox of one-sided testing vs two-sided testing

The seeming paradox spawns all kinds of misconceptions, such as that one-sided tests produce more false positives / more type I errors, that they somehow bias the results, that they lack rigor, that they are less stringent, etc., when compared to their two-sided counterparts.

The paradox is explained and exposed as non-existing in “The paradox of one-sided vs. two-sided tests of significance” while a more accessible debunking of the myths and misconceptions in the context of A/B testing can be found in “One-tailed vs Two-tailed Tests of Significance in A/B Testing” and Chapter 5 of “Statistical Methods in Online A/B Testing”.

The pertinent question is not why one should use one-sided tests, but rather why this isn't already the norm, given that a two-sided test is rarely, if ever, what the business question behind an A/B test translates to in terms of a statistical hypothesis test.

If you’re already using one-sided tests (e.g. because you are using Analytics-Toolkit.com for your statistical analyses), then I apologize for disappointing you with this section, in a way.

Adopt sequential testing

Sequential testing is the practice of evaluating test data as it accumulates, according to some schedule, with the intent to stop if a certain decision boundary is crossed. There are various approaches that can be adopted, with most methods falling into one of two broad categories: group sequential or fully sequential. See this article for a comparison of the two approaches.

For the primary metric of most online A/B tests one would want to use a group sequential approach for reasons of efficiency and external validity. If using an approach similar to AGILE, in which there are both an efficacy and a futility boundary, the expected improvement in test duration is around 26%, as shown by a real-world sample of 1,001 tests. The theoretical improvement ranges from 20% to 80%, but where in that range a given test falls depends entirely on the actual true effects of the changes being tested, which in turn depend on the practitioner and the types of tests conducted.

Sequential testing makes a smart trade-off between statistical power and average sample size required to reach a decision. Average real-world gains are 26% while theoretical gains are in the range 20-80%!

It should be noted that the efficiency of fully sequential methods such as SPRT or mixture SPRT (as in Always Valid P-values) is significantly worse than that of group sequential methods, and the expected efficiency gains there are much smaller, even under the most favorable conditions (namely, that a true positive effect exists). See "Comparison of the statistical power of sequential tests: SPRT, AGILE, and Always Valid Inference" for the full account.

Reducing the 79,000 users in our hypothetical experiment by a further 26%, we arrive at an expected sample size of just 58,460 users in total. To run the test for three full weeks and maintain better external validity, one can round this number up to 60,000.

The reduction from sequential testing is not of the same kind as the one from the one-sided test, since it reduces the expected sample size at the cost of somewhat reduced statistical power. This, however, is a real gain in average efficiency and results in a beneficial trade-off in terms of balancing risk and reward, even though it is realized over a series of A/B tests instead of being a fixed effect for each individual test.
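To see the nature of this trade-off, here is a minimal Monte Carlo sketch. It is not the AGILE design itself: it monitors a one-sided z-test at five equally spaced looks against a constant (Pocock-style) efficacy boundary calibrated by simulation, with no futility boundary, and the five looks, 80% planned power, and effect size are all illustrative assumptions.

```python
# Monte Carlo sketch of interim monitoring with an efficacy boundary.
# Not the AGILE method: a constant boundary at 5 equally spaced looks,
# calibrated by simulation to a ~5% one-sided type I error rate.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
LOOKS, ALPHA, POWER = 5, 0.05, 0.80      # illustrative design parameters

def monitor(delta, n_max, crit, sims=50_000):
    """Return (rejection rate, mean fraction of n_max actually used)."""
    step = n_max / LOOKS
    incr = rng.normal(delta * step, np.sqrt(step), size=(sims, LOOKS))
    s = np.cumsum(incr, axis=1)                     # running sum statistic
    n_k = step * np.arange(1, LOOKS + 1)            # cumulative sample sizes
    z = s / np.sqrt(n_k)                            # z-statistic at each look
    hit = z > crit                                  # efficacy boundary crossed?
    stop = np.where(hit.any(axis=1), hit.argmax(axis=1), LOOKS - 1)
    return hit.any(axis=1).mean(), (n_k[stop] / n_max).mean()

# Fixed-sample benchmark: sample size giving 80% power at 5% (one-sided).
delta = 0.02                                        # assumed standardized effect
n_fixed = (norm.ppf(1 - ALPHA) + norm.ppf(POWER)) ** 2 / delta ** 2

# Calibrate the constant boundary so the null rejection rate is ~ALPHA.
crit = min(np.linspace(1.8, 2.6, 41),
           key=lambda c: abs(monitor(0.0, n_fixed, c)[0] - ALPHA))

power, frac = monitor(delta, n_fixed, crit)
print(f"boundary: z > {crit:.2f} at each look")
print(f"power: {POWER:.0%} fixed-sample vs {power:.1%} sequential")
print(f"expected sample size: {frac:.0%} of the fixed-sample requirement")
```

Under the null (delta = 0.0) the procedure rarely stops early, so nearly the full sample is used; under the assumed true effect it stops early often enough to reduce the expected sample size substantially, at the cost of some statistical power relative to the fixed-sample test, which is exactly the trade-off described above.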


Is there something else that can be done to shorten the duration of our A/B test and reduce the required sample size?

Choose your uncertainty threshold wisely

The goal of each A/B test is to gain information that assists decision-making with a minimum ratio of probable risk to probable reward. An optimal confidence / significance threshold can be determined for each test, or for sets of similar tests, through risk-reward analysis. In many cases, however, practitioners resort to conventions instead. This leaves significant room for improvement in cases where the optimal threshold allows for greater uncertainty than the conventional 95% confidence threshold (0.05 p-value threshold).

Make an uncertainty threshold choice informed by a risk/reward optimization algorithm. Using optimal parameters can lead to substantial reductions of A/B test sample size requirements.

What reduction in test duration can be expected from a risk-reward optimization approach to test design is unknown: there is no empirical data available and way too many possible scenarios to reasonably simulate. However, going from a 95% confidence threshold to a 90% confidence threshold leads to an over 27% smaller sample size. The reduction can be greater still if 85% or 80% is the suitable level of uncertainty instead. This should give you an idea of the possible efficiency gains from informed choices of test parameters.
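The 27% figure can be verified with the same back-of-the-envelope model as before, assuming a one-sided z-test planned for 80% power (illustrative values): the required sample size scales with the square of the sum of the normal quantiles for the confidence level and the power.

```python
# Relative sample size requirement at different confidence thresholds,
# assuming a one-sided z-test planned for 80% power (illustrative).
from scipy.stats import norm

def n_scale(alpha, power=0.80):
    """Sample size up to a constant factor: (z_alpha + z_beta)^2."""
    return (norm.ppf(1 - alpha) + norm.ppf(power)) ** 2

for conf in (0.95, 0.90, 0.85, 0.80):
    reduction = 1 - n_scale(1 - conf) / n_scale(0.05)
    print(f"{conf:.0%} confidence: {reduction:5.1%} smaller than at 95%")
```

The threshold should still come out of a risk/reward analysis; the snippet only shows how steeply the sample size requirement falls as the acceptable uncertainty increases.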


As with substituting a one-sided test for a two-sided one, the efficiency gains in this case are simply the outcome of a proper procedure for designing tests, not the end goal. The shorter test duration is a side effect of performing a test with optimal business return.

Use pre-test data for variance reduction

The article has already delivered on its promise to increase your test efficiency by 40% by getting from a sample size of 100,000 to an expected sample size of 60,000, presuming that you haven't already been using one-sided sequential tests such as the ones offered by Analytics Toolkit. It may have even gotten you above 50% increased efficiency if a smarter choice of test parameters resulted in a lower confidence threshold.

There is one more way to increase the efficiency of your tests. However, it is only applicable under specific circumstances and requires specific types of tests, hence I consider it a bonus.

The method in question, CUPED, only applies when there is reliable, persistent identification of all or the majority of the users enrolled in a test, and when pre-test data on the outcome of interest is available, since it is essentially a regression involving these pre-test outcomes. It is also not applicable in situations where user behavior before the test does not correlate with user behavior during the test, as measured by the primary KPI (which is often the case in e-commerce!). Such requirements disqualify most acquisition-side tests and a number of others, such as tests with retention (a.k.a. survival) metrics; however, it is a method that can be applied in a range of other scenarios.

The mileage in terms of shortening test durations varies from metric to metric and depends on the chosen lookback period. The higher the correlation between pre-test outcomes and outcomes during the A/B test, the greater the variance reduction will be. Since I have not found a solid reference to cite for the expected efficiency gain from adopting CUPED, I will refrain from giving a concrete number.
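As a rough guide, with a single pre-test covariate the variance reduction from CUPED equals the square of the correlation between pre-test and in-test outcomes, and the sample size needed for the same power shrinks by roughly the same factor. Below is a minimal sketch on simulated data; the correlation of 0.6 and all variable names are illustrative assumptions, and in practice x would be each user's pre-test value of the primary metric over the chosen lookback period.

```python
# Minimal sketch of a CUPED-style adjustment on simulated data.
# The pre/post correlation (0.6) and all names are illustrative.
import numpy as np

rng = np.random.default_rng(42)
n, rho = 10_000, 0.6

# Simulated pre-test (x) and in-test (y) outcomes for one test group.
x = rng.normal(0, 1, n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)

theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # regression slope
y_cuped = y - theta * (x - x.mean())                    # adjusted outcome

reduction = 1 - np.var(y_cuped, ddof=1) / np.var(y, ddof=1)
print(f"variance reduction ≈ {reduction:.1%} (theory: rho^2 = {rho**2:.0%})")
```

In a two-variant test the same adjustment is applied to both groups with a shared theta, and since the adjusted outcomes have roughly (1 − ρ²) times the variance, the number of users needed for the same power shrinks by about the same factor, around 36% in this illustration.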

Conclusion

The article shows how proper statistical design in terms of the type of hypothesis used (one-sided instead of two-sided), combined with the use of advanced statistical methods (group sequential testing), can lead to a 40% increase in the efficiency, and therefore the speed, of A/B testing on average. A hypothetical two-sided fixed-sample test requiring 100,000 users would be expected to require only about 60,000 users when replaced with an appropriate one-sided sequential test.

An additional possibility to unlock shorter test durations is through an informed choice of test parameters such as the confidence / significance threshold. Depending on the circumstances, it can lead to significantly decreased sample size requirement and therefore test duration.

Finally, a regression including pre-test outcomes can be used given that pre-experiment data is available. While only suitable for certain types of businesses and certain types of A/B tests, it can result in shorter tests when applicable.

About the author

Georgi Georgiev

Managing owner of Web Focus and creator of Analytics-toolkit.com, Georgi has over eighteen years of experience in online marketing, web analytics, statistics, and design of business experiments for hundreds of websites.

He is the author of the book "Statistical Methods in Online A/B Testing" and white papers on statistical analysis of A/B tests, as well as a lecturer at dozens of conferences, seminars, and courses, including as a Google Regional Trainer.

