This article deals with the problem of Google Analytics referrer spam (prominent referral spammers are semalt, darodar, buttons-for-website). I explain why and how it actually happens, why currently proposed solutions fail and I describe a working, albeit not perfect solution for the issue.
Contents:
- The problem of Google Analytics referrer spam
- How does referrer spam work?
- How NOT to fix referrer spam
- A working solution (part 1)
- Solution part 2
The problem of Google Analytics referrer spam
A lot of webmaster forums and blogs are lit up by a fairly new issue: Google Analytics referrer spam. This is phony traffic in your analytics reports claiming that users came to your site from a domain that doesn’t actually link to you. Curious about the traffic, which often also has a 100% bounce rate, you go to the referrer URL, only to be bombarded by ads or worse: malware, sneaky affiliate redirects, etc. Some of the spammers want to sell their online marketing services to website owners hungry for traffic.
Here is what it might look like for you:
Here we see some of the worst offenders: semalt, darodar and buttons-for-website. If you are seeing referrer traffic from such domains, you can be sure that it is referrer spam. Ideally you SHOULD NOT VISIT those URLs to check them out. If you do visit, make sure you have good anti-virus software or use a virtual machine.
Why is such traffic a problem? First, it wastes your time, just like plain old e-mail spam. Second, it pollutes your stats and skews your metrics, especially if you have a site with fairly little traffic (which is most sites). Third – see above.
How does referrer spam work?
Not so long ago Google announced the Measurement Protocol, which provides an interface for a server to communicate with Google Analytics directly via an official protocol. This was possible before, but was not documented. Its intent is to allow companies to connect different systems to Google Analytics and to collect offline data to accompany the data collected from web surfers. Examples of systems that would use the protocol are phone-tracking solutions, POS systems, offline reservation systems, CRM systems, etc.
However, since there is no authentication in this whole process, virtually anybody can send hits for your website, just by knowing your UA-ID (UA-XXXXXXX-XX). This is exactly what spammers are using to pollute your data.
Here is a diagram of how all this works:
Everything can be spoofed (forged) during these interactions. Hostname? Yes. Referrer data? Yes. Campaign tracking data? Yes. URL paths? Yes. And so on, you get the idea. This is well documented in the Measurement Protocol parameter reference guide.
As you can see in the diagram above, a spammer doesn’t even need to connect to your server in order to push his referrer spam. He’s connecting straight to the Google Analytics servers, thus leaving no trace of his activities on your side. This also makes it impossible to cut him off by standard security measures on your side.
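To make the point about spoofable fields concrete, here is a minimal sketch of what a Measurement Protocol pageview hit looks like as data. It only builds the URL-encoded body and sends nothing. The parameter names (`v`, `tid`, `cid`, `t`, `dh`, `dp`, `dr`, `cs`, `cm`) come from the Measurement Protocol parameter reference; the function name `build_hit` and all the sample values are assumptions made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

# Public Measurement Protocol endpoint (shown for context; never contacted here).
COLLECT_ENDPOINT = "https://www.google-analytics.com/collect"


def build_hit(tracking_id, client_id, hostname, path, referrer, source, medium):
    """Return the URL-encoded body of a pageview hit. Nothing is sent."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # the property's UA-ID
        "cid": client_id,    # client identifier, chosen freely by the sender
        "t": "pageview",     # hit type
        "dh": hostname,      # document hostname
        "dp": path,          # document path
        "dr": referrer,      # document referrer
        "cs": source,        # campaign source
        "cm": medium,        # campaign medium
    }
    return urlencode(params)


# All values below are hypothetical placeholders.
body = build_hit(
    tracking_id="UA-XXXXXXX-XX",
    client_id="555",
    hostname="yourdomain.com",
    path="/",
    referrer="http://spammer.example/",
    source="spammer.example",
    medium="referral",
)
print(body)
```

Note that every value is plain text supplied by whoever composes the request, and the body contains no secret or signature tied to the site owner. That is why a filter on any single one of these fields can be sidestepped by a sender who simply writes a different value.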
How NOT to fix referrer spam
Now that we know how the spam is actually done, let’s quickly go over some solutions that I see proposed on several blogs and forums and why each of them will not help you:
1. Apply a GA view filter by “Referrer” to exclude traffic from these referrers.
-> This doesn’t work as they are not setting the referrer field, but rather the Campaign Source field.
2. Apply a GA view filter to allow only traffic that originated on your hostname (yourdomain.com).
-> This doesn’t work as they can easily set the hostname to your own. And they actually do:
As you can see, most of the spammers faked the hostname with ease. The rest will soon follow…
EDIT: see Solution UPDATE below.
3. Exclude the traffic from the “Referrer Exclusion List” (if you are using Universal Analytics).
-> Referrer Exclusion doesn’t work like that. It’s for preventing sources from starting a new session while a current session is active, e.g. third-party-hosted shopping or payment solutions like PayPal.
4. Any kind of server traffic monitoring and attempts to exclude traffic on the server level (htaccess, firewall, etc.): by user-agent, by referrer, by IP…
-> Since no interaction with the server is required for the spam to occur, these are all pointless.
5. Use the bot-filtering feature in Google Analytics
-> Tested: doesn’t help one bit with those referrers and I don’t think it was created with such intentions in mind.
6. Exclude whole geographical regions that are deemed responsible for the spammy traffic
-> A drastic solution, but sadly it’s likely that it won’t work, since the geographical data can also be spoofed via the Measurement Protocol. Also, such an option is not available to all website owners.
A working solution
(Last Update August 6-th, 2015 – updating further is impossible due to large volume of spammers and the multiple, constantly changing filters required to keep them out. Use as a starting point only.)
Currently, the best solution is to apply view-level filters to exclude referrer spam (not retroactive) and also report filters and advanced segments (retroactive). You can use this regular expression to block a few of the most annoying spammers out there (pretty much every site I have access to is affected by at least two of these):
You need to construct a filter as shown in the screenshot below:
Or, alternatively, you can construct a custom dimension using the same expression. If you spot a new referral spam domain you need to add a vertical line (“|”) and then the name of the domain. Escape dots by adding a backslash (“\”) before them. As you can see I prefer not to add the whole domain, but just the unique part of it, but this would vary between domains.
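The rule just described (join domain fragments with “|” and escape dots with a backslash) can be sketched in a few lines of Python. This is an illustration under assumptions, not the author’s actual filter expression: the helper names are invented, `spam-example.com` is a made-up entry included only to show dot escaping, the sample rows are invented, and Python’s `re` module stands in for Google Analytics’ own filter matching, which may differ in details.

```python
import re


def build_spam_pattern(fragments):
    """Join domain fragments with '|' and escape dots with a backslash."""
    return "|".join(fragment.replace(".", r"\.") for fragment in fragments)


# semalt, darodar and buttons-for-website are named in the article;
# "spam-example.com" is a hypothetical entry to demonstrate dot escaping.
fragments = ["semalt", "darodar", "buttons-for-website", "spam-example.com"]
pattern = build_spam_pattern(fragments)


def is_spam(source):
    """True if the source value matches any fragment (partial match)."""
    return re.search(pattern, source) is not None


# Hypothetical rows, as they might appear in an exported report.
rows = ["google", "semalt.semalt.com", "forum.darodar.com", "news.ycombinator.com"]
clean = [row for row in rows if not is_spam(row)]

print(pattern)
print(clean)
```

Keeping the fragments in a list and generating the expression makes it easier to add a newly spotted domain without hand-editing a long regex, and the same pattern can be pasted into a view filter, a report filter or an advanced segment.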
Unfortunately, as you probably understand, this is not a permanent solution, since we are ultimately entering a game of exhaustion. How fast and how many spam domains can the spammers register? How fast can you add them to your filter’s regex? How much productive energy will be lost in this uneven battle? It takes a spammer an hour to set up a new domain and start spamming; it takes hundreds of thousands of analysts, marketing specialists and webmasters significant time to weave through stats, identify suspicious traffic and update filters accordingly. This is why it is only a short-term solution.
If you are managing more than a couple of Google Analytics accounts we would strongly recommend checking out our fully automated tool for tackling the problem: the Auto Spam Filters tool. It eliminates & protects against referrer spam & other ghost traffic. It’s a set-and-forget, 1-click solution that works across 100s of properties and views. The filters are frequently updated for protection against new spammers that inevitably show up.
UPDATE: Solution Part 2:
(April 29-th, 2015)
Referrer spam is getting worse day by day, and Google has not come up with anything to help us deal with it (see the end of this post and my other post on the issue linked below as to why), so we are forced to manage this avalanche ourselves. I’ve started deploying an additional filter to clients’ accounts in order to prevent some of the new spam from coming in. This filter INCLUDES ONLY traffic with the hostname field set to a predefined set of values. Yes, it can be spoofed, but the less sophisticated referrer spammers who just leave this field as (not set) or who set a random domain name here will be filtered out. And there are enough of these to justify setting up such a filter.
Here is how to set it up:
1.) Go to your Hostnames report and select a date range of a month or more. This will show you the hostnames you need to consider for inclusion.
From the screenshot above it is obvious that I only need to include www.analytics-toolkit.com for this particular view. I would not recommend including translate.googleusercontent.com in most cases even if that means losing some stats, as more adept spammers will just use this hostname to bypass the filter, without the need for a crawler or a third-party UA-ID to hostname database.
2.) Construct and apply the filter.
You would want to use some basic regex here. For example, if I want to include all traffic to analytics-toolkit.com and a third-party shopping cart, hosted on shopify.com for example, here is what that would look like:
DO NOT APPLY the above filter to your site without modifications!
It is very important to get this filter right, as otherwise it might result in missing statistics. It also requires some maintenance, e.g. when moving domain names or when adding new domain names (third-party hosted shopping carts, for example). As always, keep a view with the raw data just in case.
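As a rough sketch of the kind of hostname include pattern described in step 2, the snippet below builds an expression for analytics-toolkit.com plus a shopify.com-hosted cart and checks a few sample hostnames against it. It is not the author’s original expression. The helper names and the sample hostnames (`myshop.shopify.com`, `random-domain.example`) are hypothetical, and Python’s `re` module is assumed to behave closely enough to Google Analytics’ filter matching for this illustration.

```python
import re


def build_hostname_pattern(hostnames):
    """Join the allowed hostnames with '|' and escape their dots."""
    return "|".join(host.replace(".", r"\.") for host in hostnames)


pattern = build_hostname_pattern(["analytics-toolkit.com", "shopify.com"])


def passes_include_filter(hostname):
    """True if a hit with this hostname would be kept by the include filter."""
    return re.search(pattern, hostname) is not None


# Hypothetical hostname values, including what lazy spammers tend to send.
samples = [
    "www.analytics-toolkit.com",
    "myshop.shopify.com",
    "(not set)",
    "random-domain.example",
]
kept = [host for host in samples if passes_include_filter(host)]

print(pattern)
print(kept)
```

Running a candidate pattern against the values from your own Hostnames report before applying it is a cheap way to confirm that no legitimate hostname is dropped. A hit that fakes one of the allowed hostnames still passes, which is the limitation already noted above.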
Again, if in your line of work you are managing more than a couple Google Analytics accounts (e.g. you are a digital agency, CRO agency, etc.) we would strongly recommend our newly launched fully automated solution to the problem. You can now use the Auto Spam Filters tool to eliminate & protect against referrer spam & other ghost traffic. It’s a set-and-forget, 1-click wonder that works across 100s of properties and views. The filters are frequently updated for continued protection.
Is a long-term solution even possible?
If an authentication mechanism can be used to control who can send data to which UA-ID, then yes. However, the implementation of Google Analytics, no matter if it’s the usual JS tracking or via the Measurement Protocol, ultimately relies on an unidentified client machine to send the HTTP request. Thus, any such request can be spoofed/forged and there is no workaround for this that doesn’t require altering the very core of the Google Analytics tracking functionality in a very significant way.
Google has known about the issue since at least 2013 (likely much earlier), as confirmed by a reply from Nick Michailovski, one of the core people in the Google Analytics team, in a Google Groups thread. His exact quote:
We’ve looked at this. If you are doing a client-based implementation, as long as you can get the http request from the client, then you can spoof the request.
So unless Google decides to dramatically change the very core of the tracking part of GA, we are left to deal with the issue of referrer spam ourselves. And there is precious little we can actually do…
Does it end with referrer spam?
The exploit that is used to send tons of referrer spam with ease can be used with much more malicious intent. The integrity of ALL your Google Analytics data is at risk and it appears that a willing and able attacker has vastly more options to attack than you do to defend yourself.
However, this is a topic for a whole new article: Malicious Attacks on Google Analytics Data Integrity.
Guide to Removing Referrer Spam in Google Analytics by Georgi Georgiev