I’ll start this post right with the punch line:
Internet marketing in its current state is dominated by voodoo magic.
There, I’ve said it, and I’ve said it in bold italic so you know it must be true. Internet marketing is not as data-driven as most of us have been led to believe, it is not as fact-based as you might think and… most specialists are not aware of that fact. Unfortunately, this is not a joke – I can prove it and I intend to do just that in this post. I will also demonstrate how this is hurting the results of everyone engaged in online marketing.
Aren’t Marketers Using Hard Data to Arrive at Conclusions?
In reality – they mostly aren’t. And that’s not just my own experience over the 10 years I’ve worked as an SEO, SEM and web analytics specialist. A 2012 CEB study cited in the Harvard Business Review polled nearly 800 marketers at Fortune 1000 companies and found the vast majority of marketers still rely too much on intuition. And, to add insult to injury – the few who do use data aggressively for the most part do it badly.
Here are some key findings of the study:
Most online marketers rely too much on gut – on average, marketers depend on data for just 11% of all customer-related decisions. When asked about the information marketers used to make a recent decision, they said that more than half of the information came from their previous experience or their intuition about customers. They put data last on their list — trailing conversations with managers and colleagues, expert advice and one-off customer interactions.
A majority of online marketers struggle with statistics – when the marketers’ statistical aptitude was tested with five questions ranging from basic to intermediate, almost half (44%) got four or more questions wrong and a mere 6% got all five right. So it didn’t surprise us that just 5% of marketers own a statistics textbook.
The conclusions of this study, which can be rephrased as “Marketers are not much different from shamans”, should definitely attract some attention in the 21st century, right? I thought such news should have made waves in the industry and in management circles. Sadly, I was wrong – it barely got any traction. The most horrifying thing is that most of the people commenting on this and similar topics call for some kind of “unification” between gut feeling/intuition and data. This can mean only one thing – they have no conceptual understanding of what a fact is and how it is established.
Is it really THAT bad?
Unfortunately, yes, and probably even worse than you think.
Try to remember the last time you saw the very basic concept of “statistical significance” mentioned in an internet marketing (SEO, SEM, conversion optimization, etc.) case study, whitepaper, blog article or forum post. How many web analytics tools, dashboards or widgets give you such data? Now what about a concept no less crucial than statistical significance – the “statistical power” of a test? What about the “peeking/curious analyst” problem? Do “multiple significance testing problem” or “multiple comparisons problem” ring any bells?
If you’ve never heard those terms you might want to brace yourself for what’s to come, since they are all affecting your decision making processes virtually all the time, no matter if you are doing SEO, SEM, conversion optimization or web analytics in general.
But before we dive into the deep, we need to take a step back and remind ourselves of some basic concepts.
How Do We Gain True Knowledge About The World?
The method all knowledge is based upon is called inference. We observe a sample of a given phenomenon and, based on our observations, we infer facts about the whole it represents. A simple example: we observe that all objects, if lifted above ground and dropped, fall in a straight line, accelerating at a constant rate – we correctly infer gravity. On the other hand, we might also observe several tall red-haired drunken men who steal and murder and wrongfully infer that all red-haired men steal and murder.
But how do we avoid making such invalid inferences?
The answer is proper hypothesis test design and proper use of statistics. An example of proper test design would be to randomize our sample and to make it as representative of the whole population as possible. In the example above that would mean including randomly selected red-haired men. This will increase the chance of including sober guys and short guys as well, thus making our conclusion about all red-haired guys much better aligned with reality. If we are more advanced, we would also want to keep the proportions of important characteristics in our observed sample as close to the proportions in the whole population as possible as well.
An example of proper use of statistics would be to calculate the statistical significance of our observed data against a baseline. Statistical significance (often encountered as “p-value”) does not eliminate the uncertainty of our inferences, rather it quantifies it. Thus it allows us to express degrees of certainty that the observed results are not due to chance alone (like the red-haired guys observation). Then it’s similar to a poker game: if you have a better certainty that you have the upper hand, you are happy to bet more on it.
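To make the idea of quantifying uncertainty concrete, here is a minimal sketch of how one could compute the statistical significance of observed conversion data against a baseline, using the standard pooled two-proportion z-test with a normal approximation. The conversion counts are hypothetical:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def one_sided_p_value(conv_a, n_a, conv_b, n_b):
    """One-sided p-value that variant B's conversion rate exceeds
    baseline A's, via the pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 1 - norm_cdf(z)

# Hypothetical data: 200/10,000 baseline vs 260/10,000 variant conversions
p = one_sided_p_value(200, 10_000, 260, 10_000)
print(f"p-value: {p:.4f}")
```

A p-value around 0.002 here corresponds to well over 99% significance in the popular usage of the term – strong evidence that the observed difference is not due to chance alone.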
Why Should Online Marketing be Based on Science?
What needs to be understood is that in online marketing we attempt to arrive at a truth statement, usually both explanatory and predictive in nature. Then, based on it, we make recommendations and execute specific actions to improve business results. What many people without a scientific or philosophical background fail to understand is that making a truth claim puts them immediately in the realm of science. This means “rational empiricism” and the scientific method is the tool of choice.
The dictionary definition of the scientific method is that it is “a body of techniques for investigating phenomena, acquiring new knowledge, or correcting and integrating previous knowledge”. Put in plain English, it is currently the only proven approach which allows us to distinguish true facts from tricks of the senses and the mind. Integral parts of this approach to objective reality are: solid statistics - for establishing empirical facts, and solid logic – for drawing conclusions and making connections between those facts. The inference example above was a simple demonstration of the scientific method in action.
On the contrary, if you are not using the scientific approach you are in fact engaging in magic rituals. Even when you have the best intentions and a completely unbiased approach, lacking a solid understanding of statistics will often make you chase statistical “ghosts” or walk miles towards statistical “mirages”. You have a very good chance of spending a lot of money and effort thinking you are making progress while your results are not real.
In the paragraphs to follow I will examine 3 Basic Statistical Concepts every marketer should be intimately familiar with in order to avoid the above pitfall. These are: Statistical significance, Statistical power, and the Multiple comparisons problem.
Statistical Significance

As described above, statistical significance is a major quantifier in null-hypothesis statistical testing (THE tool for empirical data analysis and inference of any kind). Most often in marketing we want to test if one or more variations are performing better than a control (ad copy, keyword, landing page, etc.). The default (or null) hypothesis is that what we are doing would not improve things (a one-sided hypothesis). Thus in an experiment we seek to find evidence that we can reject this null hypothesis.
When we see improvement in a parameter we are measuring (CTR, CPC, conversion rate, bounce rate, etc.) we can’t simply state that the variant in question is better. This is because our observations are on a limited sample of all of the people who we are interested in. Inferring something for the whole based on a sample is inevitably prone to error. This brings the need to have an idea of how big the chance of error is in our data.
This is where statistical significance comes into play. A given level of statistical significance expresses how unlikely a difference as big as the one we observe (or bigger) would be if there were no actual difference. To take a textbook example – having a statistically significant result at the 95% level means that, under the hypothesis that there is no difference, there is only a 1 in 20 likelihood that an improvement as big as, or bigger than, the observed one could have been encountered by chance alone.
The bigger the magnitude of the observed difference, the less data is needed to confirm in a statistically significant way that there is an improvement. The more data we have, the easier it is to detect smaller improvements with a good level of significance.
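This relationship between sample size and detectable effect is easy to demonstrate. In the sketch below (same normal-approximation z-test, hypothetical numbers), the same observed 10% relative lift that is far from significant at 5,000 visitors per variant becomes clearly significant at 50,000:

```python
from math import erf, sqrt

def one_sided_p(p_base, p_var, n):
    """One-sided two-proportion z-test p-value at equal group sizes n
    (normal approximation, pooled standard error)."""
    pooled = (p_base + p_var) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (p_var - p_base) / se
    return 0.5 * (1 - erf(z / sqrt(2)))  # = 1 - standard normal CDF(z)

# The same observed 10% relative lift (2.0% -> 2.2%) at growing sample sizes:
for n in (5_000, 50_000, 500_000):
    print(f"n = {n:>7}: p = {one_sided_p(0.02, 0.022, n):.4f}")
```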
Here is a more graphical illustration of the whole concept:
I wanted to get a sense of how much the industry cares about this most important statistical concept, so I did some digging around.
Industry Coverage on Statistical Significance
Avinash Kaushik is one of the first to write about the importance of statistical significance in web analytics back in 2006. LunaMetrics, a major source for web analytics know-how, has only a couple of mentions and just one short recent article devoted to the topic. Even the Google Analytics blog itself has only three mentions of the term in all its years of existence. Their leading evangelist Justin Cutroni has just 1 (one) mention of the term on his blog. On the bright side: KISSMetrics (great innovators in web analysis) have announced just several days ago that they are implementing statistical significance into their reports.
In the SEO corner we have SearchEngineWatch, a major source on SEO and PPC with only 20 articles that mention statistical significance in the past 7 years. SearchEngineLand – another very prominent source has only a couple of articles mentioning significance, the best of which dates from 2010.
The conversion optimization guild currently has the best grip on the topic: VisualWebsiteOptimizer, Optimizely, as well as Google’s Website Optimizer have always had statistical significance implemented in their software and have also written about it.
And… that’s pretty much all there is amongst the more prominent sources – sporadic or shallow mentions with a few exceptions. Seems like this topic received very little attention from most of the industry leaders. But why is that bad and what can neglect of statistical significance cause? Let’s examine the most common mistakes made with regards to statistical significance.
Common Mistakes with Statistical Significance
The first and most blatant error is to fail to check for statistical significance at all. Most people fail at this point. One example would be to not wait until a controlled A/B test has achieved a decent significance level (as frequently observed, according to an industry source). Other examples would be failing to check for significance when testing ad copy performance, or when looking at keyword conversion rates/click-through rates in both paid search and SEO. Needless to say, this leaves us with no objective estimate of the validity of the data we operate with.
Such cases are even present in literature and on very prominent blogs. This example is from a recent post on the VisualWebsiteOptimizer blog which also claims to cite a bestseller book on testing:
Do you see the problems?
First, the numbers in the table do not correspond to the percentages. Let’s let this one slip and fix the numbers for them. Still, the differences in conversion rates between the scenarios and the baseline are nowhere near being statistically significant. This means that the registered difference between 2% and 2.2% is most likely non-existent, given the sample sizes. They are indistinguishable. As you can see, this is no small error, but a glaring failure in an attempt to use the simplest of data.
Failure to measure statistical significance means that one can invest resources in literally ruining one’s own business, without even realizing it! From a business perspective statistical significance can be viewed as a measure of the risk involved in making a decision. The more statistically significant the result, the smaller the risk. So, the bigger the impact of the decision, the more you should rely on statistics and the higher the required level of significance should be.
The second error is to report a statistically significant result but to fail to understand that statistical significance and practical significance (the magnitude of the result) are completely different things. Statistical significance only tells us the certainty that there is a difference; it tells us nothing about how big it is! A statistically significant result can have an actual improvement of 0.01% (even if our sample shows a 50% lift).
Here is a quick example:
The achieved statistical significance is very high, but notice the blue numbers in the “% Change” column. They tell us that even though the improvement detected in our test is 60%, the actual improvement we can predict with 95% confidence, given the current data, is anywhere between 15% and 105%!
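For illustration, here is a rough sketch of how such an interval can be computed. It uses a simple normal-approximation interval for the difference in proportions, rescaled to relative lift, and the conversion counts are hypothetical (not the ones from the screenshot above):

```python
from math import sqrt

def lift_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Crude 95% interval for the relative lift of B over A: a normal
    approximation for the difference in proportions, rescaled by the
    baseline rate (ignores the uncertainty in the baseline itself)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return (diff - z * se) / p_a * 100, (diff + z * se) / p_a * 100

# Hypothetical data: 100/5,000 baseline vs 160/5,000 variant (a 60% observed lift)
low, high = lift_interval(100, 5_000, 160, 5_000)
print(f"95% interval for the lift: {low:.0f}% to {high:.0f}%")
```

Even with a highly significant result, the plausible range for the true lift remains very wide – which is exactly why the point estimate alone should not drive a cost/benefit analysis.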
An infamous case of such neglect is a highly cited SEOMoz study on SERP ranking factors. The study claimed significant results were obtained that showed how different SEO factors correlated with SERP ranking. However, it failed to disclose that even though the results were statistically significant, they were in practice negligible in magnitude and didn’t support the claims of the alleged signals’ correlation with SERP rankings. There were no real findings in this study; the level of correlation was minuscule. Detailed discussions of the issue can be found on Dr. Garcia’s blog here and here, as well as on Sean Golliher’s blog for computer science and AI here.
This type of false conclusion leads to overestimation of the achieved results and thus of the potential benefits. A cost/benefit analysis flawed in such a way can have terrible consequences.
The third error is to interpret a statistically insignificant result as support for a lack of difference/lack of effect. This is where I need to introduce the next concept.

Statistical Power
Statistical power is related to the number of visits or impressions (sample size) you draw conclusions from, and to the minimum improvement you care to detect. On one hand, power defines the probability of detecting an effect of practical importance: the higher it is, the better the chance. On the other hand, the power of a test can have a great effect on statistical significance calculations, to the point of rendering them worthless.
There are two ways to get power wrong: to conduct an underpowered or an overpowered test. A test is underpowered if it has an inadequate sample size to detect a meaningful improvement at the desired significance level. A test is overpowered if its predetermined sample size is unnecessarily large for the expected results.
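As a rough illustration of what “underpowered” means in numbers, here is a sketch that approximates the power of a one-sided two-proportion z-test via the normal approximation. The 2% baseline and the traffic figures are hypothetical:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def approx_power(p_base, p_var, n, z_alpha=1.645):
    """Approximate power of a one-sided two-proportion z-test with
    n visitors per group, at the 95% significance level."""
    p_bar = (p_base + p_var) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n)
    se_alt = sqrt((p_base * (1 - p_base) + p_var * (1 - p_var)) / n)
    return 1 - norm_cdf((z_alpha * se_null - (p_var - p_base)) / se_alt)

# Power to detect a 10% relative lift on a 2% baseline conversion rate:
for n in (5_000, 20_000, 80_000):
    print(f"n = {n:>6} per group: power = {approx_power(0.02, 0.022, n):.0%}")
```

With only 5,000 visitors per group the test has under 20% power to detect a 10% relative lift – it will miss a real improvement of that size most of the time.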
Statistical Power Mistake #1
The first thing that can happen with an underpowered test is that we might fail to detect a practically significant change when there actually is one. Using such a test one would conclude that the tested changes had no effect, when in fact they did – a false negative. This common mistake wastes a lot of budget on tests that never receive enough traffic to have a proper chance to demonstrate the desired effect. What use is running test after test if they have inadequate sample sizes to detect meaningful improvements?
In an infamous use of underpowered tests in the 1970s, car drivers were allowed to turn right on a red light after several pilot tests showed “no increase in accidents”. The problem was that the tests had sample sizes too small to detect any meaningful effect. Later, better-powered studies with much larger sample sizes found effects and estimated that this decision in fact claimed many lives, not to mention injuries and property damage.
Statistical Power Mistake #2
The second thing that might happen with an underpowered test is that we increase the chance of detecting a statistically significant (and often practically significant) result where there is none. The probability of committing such an error is correctly described by the significance level only when the sample size for each group is fixed in advance. Fixing the sample size is done on the basis of the desired significance level and the desired power. Checking for statistical significance only after a test has gathered the required amount of data shields you from such an error.
Failing to do that and stopping a test prematurely (before the required number of impressions/visits, etc.) greatly increases the likelihood of a false positive. This is most commonly seen when one checks the data continuously and stops the test after seeing a statistically significant result. Doing this can easily lead to a 100% chance of detecting a statistically significant result, rendering the statistical significance values you see useless. Hurrying to make a call and stopping a test as soon as a significant result is present is counterproductive.
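The inflation of false positives from continuous peeking is easy to demonstrate by simulation. The sketch below runs A/A tests – both “variants” are identical, so any significant result is by definition a false positive – and compares checking daily against testing once at a predetermined horizon. All parameters (conversion rate, traffic, duration) are made up for illustration:

```python
import random
from math import erf, sqrt

random.seed(1)  # fixed seed so the simulation is reproducible

def significant(conv_a, conv_b, n, z_alpha=1.645):
    """One-sided pooled z-test at the 95% level, equal group sizes n."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return False
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return (conv_b - conv_a) / n / se > z_alpha

def run_aa_test(p=0.05, daily=100, days=30, peek=True):
    """Simulate an A/A test (no real difference). With peek=True we stop
    at the first 'significant' day; otherwise we test once at the end."""
    conv_a = conv_b = n = 0
    for _ in range(days):
        conv_a += sum(random.random() < p for _ in range(daily))
        conv_b += sum(random.random() < p for _ in range(daily))
        n += daily
        if peek and significant(conv_a, conv_b, n):
            return True  # false positive: we'd have declared a winner
    return significant(conv_a, conv_b, n)

trials = 300
fp_peek = sum(run_aa_test(peek=True) for _ in range(trials)) / trials
fp_fixed = sum(run_aa_test(peek=False) for _ in range(trials)) / trials
print(f"false positive rate with daily peeking: {fp_peek:.0%}")
print(f"false positive rate with a fixed horizon: {fp_fixed:.0%}")
```

The fixed-horizon test produces false positives at roughly the nominal 5% rate, while the peeking strategy “finds” winners several times more often – even though there is nothing to find.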
Here is an actual example from CrazyEgg’s efforts in A/B testing:
“When I started running A/B tests on Crazy Egg three years ago, I ran around seven tests in five months. The A/B testing software showed that we boosted our conversions by 41%. But we didn’t see any increase in our revenue. Why? Because we didn’t run each test long enough…”
Reading further in Neil Patel’s blog post reveals that the issue he was experiencing was exactly the “premature stopping based on significance” issue above (his reasoning about how to avoid it isn’t correct, though). If we fall prey to the above error we might run one “successful” test after another and pat ourselves on the back while at the same time achieving no improvements, or worse – harming the business. I think many of us have been there and have implemented a “successful, statistically significant” test variant only to find out a few weeks or months later that the effect is not what we estimated during the test.
As Prof. Allen Downey puts it in this excellent post: “A/B testing with repeated tests is totally legitimate, if your long-run ambition is to be wrong 100% of the time.”. NOT fun.
Here is a more detailed illustration for the second mistake with statistical power:
If we had the conversion rate data per variant as above, from the first 16 days of an experiment, and we had 95.9% statistical significance for the difference between the baseline and variant B at the end of day 16, should we declare B the winner and stop the test?
Here is the full data for the test – 61 days (the horizontal scale). On the left vertical scale you see the conversion rate percentages. The orange line is the statistical significance for variant B against the baseline (cumulative, right vertical scale). The two grey horizontal lines represent the 90% and 95% levels respectively.
If you choose the 95% level and you only wait for significance, without waiting to achieve the preset sample size, you would wrongly select variant B if you viewed the test results between day 16 and day 22. For a 90% significance level you could have identified a false positive as early as days 3 and 4, as well as between day 11 and day 23. In the above example our test fails to produce a winner by day 61, since all variants regress back to the norm (bonus points if you noticed that the first two graphs are not zero-based).
This example demonstrates why it’s important to wait until a test has reached a sample size where it has enough power. Since the issue here is “peeking” at the data and actually doing a statistical significance test on each day, this is also a special case of the “multiple significance testing” problem that I discuss in detail below.
Unfortunately, such mistakes have been and likely are still encouraged by some A/B testing tools, so I would like to end the part on optional stopping with this quote from Evan Miller:
If you write A/B testing software: Don’t report significance levels until an experiment is over, and stop using significance levels to decide whether an experiment should stop or continue. Instead of reporting significance of ongoing experiments, report how large of an effect can be detected given the current sample size. Painful as it sounds, you may even consider excluding the “current estimate” of the treatment effect until the experiment is over. If that information is used to stop experiments, then your reported significance levels are garbage.
Statistical Power Mistake #3
Finally, in case we overpower a study, we are most likely wasting money and other resources, since we’ve already proven our point beyond the reasonable doubt we are willing to apply. I won’t go into further detail here and will leave this for another post.
Industry Coverage on Statistical Power
Given all that I expected statistical power to be discussed at least as much as statistical significance. However, the mentions of statistical power are virtually non-existent in the most prominent blogs and sites that deal with internet marketing. Evan Miller was one of the first to cause some waves in the conversion optimization community with his 2010 post How Not to run an A/B test which tackles the issue of statistical power.
VisualWebsiteOptimizer is the other prominent source on this topic. They have quite a few mentions on their site and blog, with this post deserving a special mention. On the downside – they have a huge number of case studies featured on their site with only half of them containing information about the statistical significance of the result. I didn’t see statistical power being mentioned in any of them.
Optimizely only mentions statistical power in an obscure location on a long page about “chance to beat original” in their help files. This post by Noah from 7signals – determining sample size for A/B tests – is another decent one I was able to find.
A good whitepaper by Qubit titled Most winning A/B test results are illusory (pdf) inspired the guys at KISSmetrics. They did a decent article on the topic, available here. However, when announcing the release of statistical significance reporting in their tool a few days later, they failed to mention “statistical power” in that post.
Not only is “statistical power” a foreign concept to the industry, but even in articles that attempt to explain statistical significance (e.g.: ones titled “A Marketer’s Guide to Understanding Statistical Significance” and the like) power is commonly misunderstood and the lack of significance is said to mean lack of difference (mistake #1 above).
On Using Online Sample Size Calculators
Be careful about the following common peculiarities:
First, most, if not all, of them calculate the sample size for a two-tailed hypothesis. Thus they overestimate the sample size required to detect the desired effect when we are only looking for improvements and are not concerned about the precision of results that are worse than the baseline. Following their recommendations would result in wasted budgets for generating traffic/impressions.
Second, some only calculate based on a 50% baseline, which is useless for online marketing where we have varying baseline rates to work with. Finally, some others (e.g. CardinalPath’s free calculator) are reporting at a fixed power of 0.5 or 50%. Hence the chance to miss a significant result at the desired level of improvement is 1 in 2, too high for most practical applications.
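The overestimation from a two-tailed calculation is easy to quantify. The sketch below compares the required per-group sample sizes at 95% significance and 80% power for a hypothetical 2.0% → 2.4% improvement, using the standard normal-approximation formula for two proportions:

```python
from math import sqrt

def n_per_group(p1, p2, z_alpha, z_power=0.842):
    """Per-group sample size to detect a change from p1 to p2 at 80%
    power (standard normal-approximation formula for two proportions)."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p2 - p1) ** 2

# 2.0% -> 2.4%: one-tailed (z = 1.645) vs two-tailed (z = 1.96) at the 95% level
one_tailed = n_per_group(0.02, 0.024, 1.645)
two_tailed = n_per_group(0.02, 0.024, 1.96)
print(f"one-tailed: {one_tailed:,.0f}  two-tailed: {two_tailed:,.0f} "
      f"({two_tailed / one_tailed - 1:.0%} more)")
```

For the same one-sided question, the two-tailed calculation demands roughly a quarter more traffic – a real budget difference at these conversion rates.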
In Analytics-Toolkit.com‘s calculator we’ve attempted to present a flexible solution that should fit most people’s needs properly.
Now that we have our significance and power right, let’s tackle the toughest of the three:
The Multiple Comparisons Problem
The multiple comparisons problem arises each time you study the results of a test with more than one variation by comparing each against the baseline. Put simply: the more we test for the same thing, the more likely we are to falsely identify a result as significant when in fact it isn’t. The more test variations we create, the more likely it is that some of them will demonstrate statistically significant results when there is no discovery to be made. This is one reason why marketers who test more variations per test (falsely) report far more and far better results than expected.
For example: we have a baseline AdWords keyword conversion rate and we are trying to find one or more keywords that outperform the baseline in a statistically significant (and sufficiently powered) way. To do this we decide that we want to be 95% certain and begin comparing the conversion rate of each of 13 keywords to that of the baseline, looking for one that would show an improvement at the chosen level of statistical significance.
If we were only running one keyword and needed to check only that keyword against the baseline, we would have been fine. However, if we have 13 keywords and we check all of them against the baseline, things change. If we get one significant result at the 95% level and decide to direct more of our budget to that keyword, that would be wrong.
Let me demonstrate why. Settling for 95% statistical significance means we are comfortable with a 1 in 20 probability that the observed change in the conversion rate is due to chance alone. However, looking at those 13 keywords’ conversion rates via pairwise comparisons to the baseline gives us about a 50% chance (1 − 0.95^13 ≈ 0.49) that at least one of those differences in conversion rates shows up as statistically significant when in fact it isn’t. This means we would have a false positive among those 13 keywords pretty much every other time.
What this means, of course, is that the 95% statistical significance we are going to report for that keyword is in fact not 95% but significantly lower. This problem is not at all limited to marketing, either. Some shocking papers have come out lately stating that “most published research findings are false“. If you are into comics, xkcd has an awesome comic on the issue.
The desire to avoid those false positives is what justifies the need to adjust the significance values for those pairwise comparisons. We can do so based on the number of tests conducted/variants tested, and their results in terms of significance.
I would argue that for practical applications in the internet marketing field the False Discovery Rate correction method is the most suitable one. With it we apply a correction to the significance level of each variation or group in order to estimate the true statistical significance when a multiple comparisons scenario is present.
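Here is a minimal sketch of the Benjamini-Hochberg false discovery rate procedure, applied to hypothetical p-values from 13 keyword-vs-baseline comparisons:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure. Returns the indices of the
    comparisons that remain significant with the FDR controlled at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value clears its stepped threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# Hypothetical p-values from 13 keyword-vs-baseline comparisons:
pvals = [0.001, 0.007, 0.039, 0.041, 0.16, 0.28, 0.31,
         0.44, 0.52, 0.60, 0.74, 0.81, 0.95]
print(benjamini_hochberg(pvals))  # indices of the surviving discoveries
```

Note that the comparisons with raw p-values of 0.039 and 0.041 would look “significant” at the 95% level in isolation, but do not survive the correction.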
Here is a simple example with only 3 variants (increasing the number of variants only increases the problem):
The differences in significance levels with and without correction are obvious. Bonus points if you’ve noticed that the above statistical significance calculations are likely prematurely stopped (underpowered).

Virtually all tools that report statistical significance that I’ve looked into did NOTHING to correct for the multiple comparisons problem.
Even the otherwise good Qubit paper I’ve mentioned above fails to grasp multiple testing properly. It mentions it is somewhat of a problem, but doesn’t mention any significance level adjustment/correction methods. Instead, it recommends keeping test variants to a minimum and sticking to what is expected to provide significant effects.
While I won’t suggest that we engage in a pointless scattergun approach to testing, I think multiple-variant testing is a completely fine approach when done properly. In conversion optimization we can usually limit the number of tests we do, but it is often impossible (and incorrect) to limit the number of variants/groups analyzed when doing observational studies – for example, when we are examining different characteristics of multiple traffic sources and their performance. Remember, working with a non-randomized subset of the available data on an issue is often cherry-picking.
Most Statistical Significance Calculators are Useless or Flawed
Given the above issue, I deem it fair to say that all free online statistical significance calculators and Excel sheets that I’m aware of are either useless or flawed. Most of them only calculate significance for one variation against a baseline. Thus they are impractical for actual use, as most often we need to compare several ad copies, several landing pages, or many traffic sources, keywords, etc. when looking for ways to improve performance. As explained above, using such a calculator and doing multiple pairwise comparisons opens you up to a much higher error rate than the one reported by the calculators.
Other calculators work on one baseline and several variants. That’s better, but most do not even try to correct for multiple comparisons and this can have visible effects even with only three variants, as demonstrated above. There is one attempt I know of to make a good significance calculator that works on many variants and corrects for errors introduced by multiple comparisons. However, its correction method (Dunnett’s) is unnecessarily strict for practical applications.
Thus the only calculator I know of that handles a multitude of variants and corrects in a practical manner for the errors introduced by multiple comparisons is currently our statistical significance calculator. It was used to produce the example above.
Enough, I Get It! But What To DO About All This?
I understand it might have been hard for many of you to make it this far into this post, but if you did: congratulations, you are one tough marketer!
So here is what I think you can do if you are passionate about improving the quality of your work through real data-driven decisions:
#1 Learn about experiment design and hypothesis testing. Apply the knowledge in everyday tasks.
#2 Learn about statistics and statistical inference, and apply the knowledge in everyday tasks. Things to avoid: Bayesian inference, the Bayesian statistical approach (nothing against the Bayes rule itself).
#3 Demand better statistics from your tool vendors. Reach out and let them know you care about good data and good decisions. They will listen.
#4 Educate your boss, your customers and whoever else benefits from your work about your newly found superpower and how it can work for them.
An awesome quote is due: “Being able to apply statistics is like having a secret superpower. Where most people see averages, you see confidence intervals. When someone says “7 is greater than 5,” you declare that they’re really the same. In a cacophony of noise, you hear a cry for help.” (Evan Miller)
In this blog we will try to help with #1, #2 and possibly, indirectly, with #4, by publishing more (shorter!) follow-up articles like this one on the Simpson’s paradox in online marketing. They will cover some of the concepts in this article more thoroughly; there will be posts on how to set up a test properly, how to choose significance and power levels, sequential testing, analyzing data in Google Analytics or Google AdWords, and basically everything that has to do with statistics and online marketing. Our first pieces will likely be on how to choose your power and significance levels, so stay tuned!
I started writing this post with a great deal of excitement, because I’m really passionate about good marketing for good products & services and I am passionate about achieving real results with what I do.
The purpose of this post is to raise awareness of what I consider to be crucial issues for the whole internet marketing industry and for everyone relying on our expertise. I believe I’ve successfully proven there are issues and I’ve shown how deep those issues go, how pervasive they are and how dangerous their impact is. I think I’ve made a strong case as to why every marketer should be a statistician. Let me know if you agree or disagree with me on this.
I’d like to cite another insightful thought from Evan Miller:
One way to read the history of business in the twentieth century is a series of transformations whereby industries that “didn’t need math” suddenly found themselves critically depending on it. Statistical quality control reinvented manufacturing; agricultural economics transformed farming; the analysis of variance revolutionized the chemical and pharmaceutical industries; linear programming changed the face of supply-chain management and logistics; and the Black-Scholes equation created a market out of nothing. More recently, “Moneyball” techniques have taken over sports management…
If you want to help raise the awareness about these issues and thus turn internet marketing into a better, more fact-based discipline, please, spread the message of this article in any shape or form you see fit. And I appeal to all online marketing tool vendors out there – please, try and make it easier for your users to base their decisions on facts. They’ll be grateful to you as whoever has the better facts has a better chance to succeed.
Founder @ Analytics-Toolkit.com
Full disclosure: I am not a certified statistician. This article contains “popular” usage of some statistical terms, particularly “significance” and “certainty”, in order to make them more accessible to the general public.
Why Every Internet Marketer Should Be a Statistician by Georgi Georgiev