In a previous post about referrer spam in Google Analytics I hinted that a much bigger issue might stem from this fairly harmless attack. The problem I present in this article extends to basically all Google Analytics data, for any website out there, no matter how big or small. The problem is how to preserve the integrity of all of your Google Analytics data from cyber vandalism, malicious pollution and distortion.
Why are attacks possible and how they work?
Furthermore, not so long ago Google announced the Measurement Protocol which provides an interface for a server to communicate to Google Analytics directly via an official protocol. This was possible and done before that, but was not documented. The purpose of the Measurement Protocol is to allow companies to connect different systems to Google Analytics and to collect offline data to accompany their website visitors’ data. Examples of systems that would use the protocol are phone-tracking solutions, POS systems, offline reservation systems, CRM systems, etc.
However, since there is no authentication in this whole process, virtually anybody can send hits for your website, just by knowing your UA-ID (UA-XXXXXXX-XX). Google analytics will accept the data as it is, putting it in your reports. This is exactly how referrer spammers are currently polluting your data with their “ads”. This is also what a potential malicious attacker would use to compromise your data integrity, effectively carriying out a data pollution attack.
Here is a scheme of how all this works:
Literally everything can be spoofed (forged) during these interactions. Hostname? Yes. Referrer data? Yes. Campaign tracking data? Yes. URL paths? Yes. Geo-location and IP address? Yes! And so on, you get the idea. This is well documented in the Measurement Protocol parameter reference guide.
The worst part of all of the above is that a spammer doesn’t even need to connect to your server in order to pollute your Google Analytics data. He’s connecting straight to the Google Analytics servers, thus leaving no trace of his activities on your side. This also makes it impossible to cut him off by standard security measures on your server, like firewalls and htaccess rules.
Possible Scenarios for Malicious Attacks on Google Analytics*
If something can be abused, it will be abused. It’s just a matter of time. So I believe it won’t be long before cyber vandalism or even malicious, targeted attacks on a businesses data integrity happen. Maybe they are already happening to unsuspecting victims.
The current attacks are by spammers that only try to “advertise” their websites/domain names. They want you to see them and go visit their site. Much like the early years of hacking when an attacker was sure to announce that he broke through your security, e.g. by defacing your site.
However, a malicious attacker can also spoof GA data in a way that is virtually undetectable by the owners of the Google Analytics accounts themselves. E-commerce data, goal conversions, events, pageviews, referrals. Everything. Every dimension can be forged and every metric can be influenced in a direction of choice.
This includes the AdWords data, the main reason we have Google Analytics in the first place. Fake visits with zero engagement and conversions can be sent, thus undermining the performance of the channel. This, of course, applies to any other channel.
Spikes and dips in traffic can be caused pretty much at will. They can be from targeted traffic sources, or to targeted pages, or spread across many sources and landing pages. Spikes and dips in metrics like bounce rate, session duration and pages/visit can be caused just as well. Analysts and marketers would waste tons of man-hours trying to understand what’s happening. Such activities may cause many people to lose their sleep and maybe even their jobs, in some cases.
I’ve come across cases where marketing companies sign up clients on a performance base fee, measured by traffic or traffic from a specific source. In such cases, manipulating the data will cause the business owner to pay for thin air and to wonder why people are coming to the site, but nobody is buying. And while businesses who have historical data and are tech-savvy will realize that this traffic is not like their usual traffic in a month or two, such a tactic can be especially devastating if applied to a new business with no prior history to compare with.
These are just the first scenarios that came to my mind and I’ve not given this too much thought (like an attacker would do).
How easy is it to carry out a data pollution attack?
Very. A single average-parameters machine with internet connectivity can pollute the data of a site with tens or hundreds of thousands of users per day. If you have a really huge site, causing significant damage might require a bit more, but the payoff is much bigger, too. An attacker need not worry about using proxies or bot networks, because it’s unlikely that Google with start investigating such cases any time soon.
In terms of technical difficulty: Getting your UA-ID takes 30 seconds at best. It would take an hour to write a good script that pollutes your data with one fake hit each time it’s run. It would take an attacker at most a couple of days to write a script that permutates the attack hits in such way that it’s impossible to distinguish those from legitimate hits from end users. Then this script can be used on an unlimited number of targets, or it can be sold to other attackers for profit.
What defensive options do we have?
First: detection and prevention. An attack is impossible to detect before it happens. Thus it’s impossible to prevent it. Every parameter can be faked, including IP address, network location, hostname, which are usually used for detection in such scenarios. There is, as far as I understand it, no single piece of information that you can see in a Google Analytics report that can be used to detect an attacker or to distinguish an attack hit from a regular one.
The only option to detect an attack is after the fact, when you see unusual patterns in your data: unusual referrers, spikes in traffic, strange metrics, etc. And this only happen if you have solid historical data to compare with and other analytics tools (based on server logs and other backend stats) that may hint you to it as well. If the attack is carried intelligently enough you might not even be able to do that. Also, even when you can tell you are a victim, you still can’t distinguish the good data from the bad.
Second: clean up. Simply impossible, since you can’t distinguish the garbage data from your real data. Or at least I’m not currently aware of a way to do that.
Is a long-term solution even possible?
A long term solution for data integrity / data pollution attacks is possible only if an authentication mechanism can be used to authenticate who can send data to which UA-ID. However, the technical implementation of Google Analytics – no matter if it’s the usual JS tracking or via the Measurement Protocol ultimately relies on an unidentified client machine to send the HTTP request. Thus, authentication is impossible in the current setup. So any such request can be spoofed/forged and there is no fix for that that doesn’t require altering the core of the Google Analytics tracking functionality in a very, very significant way.
It appears Google knows about the issue from as early as 2013 (likely much earlier). One of the core people in the Google Analytics team, Nick Michailovski, replies to such a question in a Google Groups thread. His exact words:
We’ve looked at this. If you are doing a client-based implementation, as long as you can get the http request from the client, then you can spoof the request.
So unless Google decides to dramatically change the very core of the tracking part of GA, we are left to deal with the issue of referrer spam and malicious data pollution ourselves. And there is precious little we can actually do…
* The issue is not limited to Google Analytics only. As far as I’m aware other tracking solutions such as Omniture, Piwik, OWA, etc. also rely on a very similar technical approach. Thus, they are most likely all prone to the same exploit.All Your Google Analytics Are Belong To Us by Georgi Georgiev