This question pops up often in more “advanced” discussion forums and blogs on Conversion Optimization, and usually one (or several) of the following pieces of advice is given:
- Do an A/A test first in order to test your split testing framework. If the difference between the two arms is statistically significant at the decided level, then your framework is broken.
- Do an A/A/B or A/A/B/B test and throw out the result if the two A treatments or the two B treatments show differences that are statistically significant at the decided level.
- Do many A/A tests (say 100) and if more tests than expected show statistically significant differences, then your framework is broken / statistical significance is useless / CRO doesn’t work, etc.
The above are not direct quotes, but summaries of what I’ve read in various places. Given that some of them included advice such as “If the tool tells you that one variation in an A/A Test is significantly better than the other, well, it’s time to change.”, it’s easy to see why I began writing this post. If it’s so easy to test a split testing platform, then such advice should be spread more widely. If, on the other hand, it’s unsupported, it should be criticized and debunked.
It is only natural that professionals and clients of CRO question the accuracy of the results of split testing tools and of conversion optimization work. I think there are good reasons for such questions, given the relatively widespread misunderstanding of split testing, statistical significance and hypothesis testing in general among CRO and online marketing professionals, let alone laypeople. However, I believe doing any of the above variations of A/A tests will only waste your money, time and effort, and will say little to nothing about the accuracy of your CRO tools and activities.
What is an A/A test?
An A/A test is simply a test where both pages/ad copy/creatives/etc. are exactly the same. You split your traffic randomly, show both groups the same page, and then register the reported conversion rates, click-through rates and the associated statistics for each group, in the hope of gaining some knowledge.
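To make that concrete, here is a minimal sketch in Python of what an A/A test amounts to: two random halves of the traffic, both drawn from the same underlying conversion rate. The 2% rate and sample sizes are made up for illustration.

```python
# Minimal A/A test sketch: both arms are samples from the SAME page,
# i.e. the same underlying conversion rate. Rate and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(42)
true_rate = 0.02           # the page's (unknown) true conversion rate
visitors_per_arm = 10_000  # traffic split randomly 50/50

conversions_a1 = rng.binomial(visitors_per_arm, true_rate)
conversions_a2 = rng.binomial(visitors_per_arm, true_rate)

# The two observed rates will almost never match exactly,
# even though the "treatments" are identical.
print(conversions_a1 / visitors_per_arm, conversions_a2 / visitors_per_arm)
```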
Let’s examine what A/A tests actually accomplish, in order to see if they have any value or not.
The simple A/A test
Doing a single A/A test ultimately can’t tell you anything you didn’t know before starting it. The logic is: if these two samples from the (presumably) same population show a statistically significant difference, then the testing software is broken and can’t be trusted. The flaw here is a lack of understanding of sampling theory and error statistics.
Sampling theory takes into account all possible outcomes of a test. So if you do an A/B test, the test statistic that you’ll be looking at (statistical significance, for example) takes into account all possible A/B test outcomes that could have occurred (but didn’t). Statistical significance tells us how likely we would be to see the observed difference in performance between A and B, under the assumption that A is equal to B (statistical significance in our calculator, and in most online calculators, is reported as 1-alpha, so it measures how unlikely it is to see the observed difference).
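As a rough illustration of how such a significance figure comes about (a standard two-proportion z-test reported as 1 - p, not the exact formula any particular tool uses), here is a sketch with purely illustrative numbers:

```python
# Sketch of a "significance" number as many tools report it: 1 - p for a
# one-sided two-proportion z-test. Not any specific vendor's formula.
from math import sqrt
from scipy.stats import norm

def reported_significance(conv_a, n_a, conv_b, n_b):
    """1 - p for 'B beats A', under the null hypothesis that A equals B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_b - p_a) / se
    return 1 - norm.sf(z)

# 2.0% vs 2.4% observed conversion rate on 10,000 visitors each
print(reported_significance(200, 10_000, 240, 10_000))  # ~0.97
```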
Having run one A/A test that shows an X percentage point difference at 99% statistical significance tells us nothing about the reliability of the testing tool. Even if everything is done properly, e.g. the sample size is fixed in advance (or adjusted for sequential testing), the result means nothing about our testing suite/framework. We could simply have observed a rare event, one that has just a 1% chance of happening when A is equal to A (as it is), but it is still possible for it to happen. Analogously, it’s very unlikely for any given person to win the lottery, but we don’t call the lottery rigged when someone wins.
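A quick simulation (mine, with made-up numbers, not from any tool) makes the point: with a perfectly healthy setup, roughly 1 in 100 A/A tests will still clear a 99% significance threshold purely by chance.

```python
# Simulate many independent A/A tests under a correctly working setup and
# count how often the two-sided p-value drops below 0.01 ("99% significant").
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, rate, trials = 10_000, 0.02, 20_000

c1 = rng.binomial(n, rate, size=trials)   # conversions in arm A1
c2 = rng.binomial(n, rate, size=trials)   # conversions in arm A2 (same page)

p_pool = (c1 + c2) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = np.abs(c1 - c2) / n / se
p_values = 2 * norm.sf(z)

print((p_values < 0.01).mean())   # ~0.01: about 1 in 100 "broken-looking" A/A tests
```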
So, as we can see, running a single A/A test is a waste of your money. Let’s see what happens when we are
Running an A/A/B or A/A/B/B test
The idea here is that the duplicated A or B treatments somehow provide a measure of the accuracy of an A/B split test. If the difference between A and A, or between B and B, is statistically significant, then we consider the test flawed and discard its results.
The problem here is exactly the same as in the A/A scenario above. There are errors/fluctuations in the data which give rise to uncertainty. This is well accounted for by the statistical significance number, as the sketch after this list illustrates:
- The more fluctuation there is in the sample data, the less significant* the result.
- The smaller the data set, the less significant* the result.
- The smaller the observed difference between the treatment samples, the less significant* the result.
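Here is a short sketch of these relationships, using the same kind of two-proportion calculation as above (two-sided this time; the numbers are made up):

```python
# How the reported "significance" (1 - p, two-sided here) moves with
# sample size and with the observed difference. Purely illustrative numbers.
from math import sqrt
from scipy.stats import norm

def reported_significance(p_a, p_b, n_per_arm):
    p_pool = (p_a + p_b) / 2
    se = sqrt(p_pool * (1 - p_pool) * (2 / n_per_arm))
    z = abs(p_b - p_a) / se
    return 1 - 2 * norm.sf(z)

print(reported_significance(0.020, 0.024, 2_000))    # small sample        -> ~0.61
print(reported_significance(0.020, 0.024, 20_000))   # 10x larger sample   -> ~0.99
print(reported_significance(0.020, 0.021, 20_000))   # smaller difference  -> ~0.52
```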
This test runs into other issues, though. It greatly increases the chance of getting a false positive, because of the multiple comparisons problem. The more A/A/A/A/A or B/B/B/B/B variations you test, the more likely it is for one of them to register a statistically significant difference at your preferred level (95%, 99%, whatever). The same holds for A/B/C/D/E/F tests, as I have explained here. So not only does it fail to give us any information about the accuracy of our split testing procedures, it is also highly likely to fool us into discarding perfectly fine results and wasting even more money on tests that generate no new insights.
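A rough simulation of the multiple comparisons effect follows; the pairwise-versus-first-arm scheme and all numbers are my own illustrative assumptions, not how any particular tool compares variations.

```python
# Five identical arms, each extra arm compared against the first at an
# unadjusted 95% level. The chance that AT LEAST ONE comparison looks
# significant is far above 5%.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, rate, arms, trials = 10_000, 0.02, 5, 5_000
false_alarms = 0

for _ in range(trials):
    conv = rng.binomial(n, rate, size=arms)      # identical "treatments"
    control = conv[0]
    significant = False
    for c in conv[1:]:
        p_pool = (control + c) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
        z = abs(c - control) / n / se
        if 2 * norm.sf(z) < 0.05:                # unadjusted threshold
            significant = True
    false_alarms += significant

print(false_alarms / trials)   # well above 0.05 without any adjustment
```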
Now, the most interesting case:
Running multiple A/A tests
Now this is something that I think has merit for testing the assumptions behind split testing for conversion optimization. For example, if you run 1,000 consecutive A/A tests with large audiences, you do everything by the book (multiple comparison adjustments, proper stopping rules, etc.) and you get statistically significant results much more often than expected, then maybe your A/B testing framework is broken? Maybe the samples aren’t properly randomized? Maybe the two treatments aren’t independent? Even if you only need 500, or even 100, A/A tests to observe statistically significant deviations from the expected results, it’s still a huge waste of money, simply because impressions, clicks and visitors aren’t free, not to mention what you could have been doing with that traffic.
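If you did go down this road, the sensible check would be something like the following: compare the count of significant A/A results against what your chosen alpha predicts. The observed count of 87 below is a hypothetical example, not real data.

```python
# Check whether the number of "significant" A/A results out of many tests is
# itself compatible with the chosen alpha, via an exact binomial test.
from scipy.stats import binomtest

alpha = 0.05
n_tests = 1_000
observed_significant = 87    # hypothetical; expected around 50 if everything works

result = binomtest(observed_significant, n_tests, alpha)
print(result.pvalue)  # a tiny p-value here points at randomization, independence
                      # or stopping-rule problems worth investigating
```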
Using time-tested split testing software and doing your statistical work right seem to be much better investments of anyone’s money, time and effort.
Verdict and Recommendations
If you really have resources to waste, go ahead and do many, many A/A tests or one big A/A/A/A/A/A/… test, account for multiple comparisons and use proper stopping rules. If the number of tests showing statistically significant differences itself diverges significantly from what your significance level predicts, then you might have something to worry about. In all other cases, just invest these same resources in proper A/B or A/B/N tests. If you simply want to be more confident in the conclusions of your tests, just increase the required statistical significance level.
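For a sense of what “just increase the required statistical significance level” costs, here is a standard two-proportion sample size approximation; the baseline rate, lift and power are illustrative, not a recommendation of specific numbers.

```python
# Approximate visitors needed per arm to detect a 2.0% -> 2.4% lift at 80% power,
# using the usual normal-approximation formula. Numbers are illustrative.
from scipy.stats import norm

def n_per_arm(p1, p2, alpha, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

print(round(n_per_arm(0.020, 0.024, alpha=0.05)))  # ~21,000 per arm at 95% significance
print(round(n_per_arm(0.020, 0.024, alpha=0.01)))  # ~31,000 per arm at 99% significance
```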
Hope that makes sense. Let me know your thoughts.
* In all these cases I speak of significance levels as reported by most split testing software and calculators, not the statistical α, which is the complement. In short, “statistical significance” here is equal to 1-α.