Over the last few months, webmasters across the internet have been seeing a new wave of Google Analytics spam: language spam.

Language spam is a type of attack in which spammers place a message in the language field of your reports in an attempt to direct users to a fake domain, taking advantage of the field's prominent location on the Google Analytics dashboard. The language field is usually reserved for short language codes, such as "en", "en-gb" or "fr", so it's quite noticeable when something out of the ordinary, such as a long message, appears there.

Since November of this year, many people have reported seeing more of this new kind of language spam in their metrics, most notably a message that appeared in the lead-up to the 2016 US Presidential election:

Secret.ɢoogle.com You are invited! Enter only with this ticket URL. Copy it. Vote for Trump!

The traffic that the spammers generate is quite notable on the Analytics dashboard, as it diverges from the usual referral spam we see: while it has a fairly average bounce rate, it can show very long session durations, sometimes over 20 minutes. This is clearly a problem for anyone trying to report on their site traffic and conversions (although this kind of spam usually only appears on your homepage, so most internal pages should not be affected).

That being said, this kind of spam doesn't really differ from the usual referral spam in how it's generated. Generally, a spammer will set up one of two kinds of bots (computer programs) to conduct this attack: either a bot that actually visits your website and imitates a user browsing through it, or a bot that never visits the page but instead sends fake "hits" directly to the Google Analytics servers.

So what can we do about it?

Unfortunately, not a great deal can be done to completely solve this issue. Google has known about this kind of spam since as early as 2013, though without a dramatic overhaul of their core systems behind the tracking part of GA it's unlikely we'll see a complete fix. Luckily, Google Analytics provides all the tools required to filter out contaminated metrics and display more realistic numbers.

The first step in limiting the amount of language spam you see on your dashboard is to use GA's view-level filters. In the Admin panel of your account, click the Filters tab and add a Custom filter set to Exclude, with the filter field set to Language Settings, and enter the following regular expression:

.{15,}|\s[^\s]*\s|\.|,|!|\/

Because legitimate language data will usually be at most five or six characters long, this expression filters out any traffic where the language value is 15 or more characters long, or where it contains characters that never appear in a valid language code, such as spaces between words, commas, exclamation marks or forward slashes.
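To see how the expression behaves before trusting it with real data, it can be tried against a few sample values. The following is a minimal Python sketch, not part of Google Analytics itself: the function name and the sample values are made up for illustration, and it assumes the filter matches anywhere in the field (a partial match). Note that the literal full stop is written as \. here, since an unescaped . would match any character and flag every value.

```python
import re

# The exclude-filter expression, with the literal full stop escaped as \.
LANGUAGE_SPAM = re.compile(r".{15,}|\s[^\s]*\s|\.|,|!|\/")


def is_language_spam(value: str) -> bool:
    """Return True if the language value would be caught by the expression."""
    # search() looks for a match anywhere in the value (assumed partial match).
    return LANGUAGE_SPAM.search(value) is not None


# Hypothetical sample values: real language codes plus spam-like strings.
samples = [
    "en",
    "en-gb",
    "fr",
    "zh-hans-cn",
    "Vote for Trump",
    "Secret.ɢoogle.com You are invited! Enter only with this ticket URL. Copy it. Vote for Trump!",
]

for value in samples:
    label = "spam" if is_language_spam(value) else "ok"
    print(f"{label:4} {value}")
```

The short codes pass through untouched, while the long message is caught by the length rule and the shorter "Vote for Trump" by the rule for multiple spaces.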

You can then use the "Verify Filter" option to preview how it will affect your metrics from the last few days. Because filters only begin working from the time you create them, you can also clean up your historical data using an Advanced Segment. Using the same regular expression, create a new Segment and limit the Demographics so that it only includes Languages that do not match the above expression.
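The Segment applies the same "does not match" logic to data that has already been collected. As a rough sketch of that idea outside of GA, the Python below keeps only the rows of some exported report whose language does not match the expression. The row layout, field names and session counts are invented for illustration and do not come from any real export.

```python
import re

# Same expression as the filter, with the literal full stop escaped as \.
LANGUAGE_SPAM = re.compile(r".{15,}|\s[^\s]*\s|\.|,|!|\/")


def keep_clean_rows(rows):
    """Keep only rows whose language value does NOT match the spam expression."""
    return [row for row in rows if not LANGUAGE_SPAM.search(row["language"])]


# Hypothetical historical rows; the figures are made up.
rows = [
    {"language": "en-gb", "sessions": 120},
    {"language": "fr", "sessions": 30},
    {"language": "Secret.ɢoogle.com You are invited! Enter only with this ticket URL. Copy it. Vote for Trump!", "sessions": 45},
]

clean_rows = keep_clean_rows(rows)
clean_total = sum(row["sessions"] for row in clean_rows)
print(f"{clean_total} sessions after removing language spam")
```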

Once you've set up the filter and the Segment, you can save them and add shortcuts to your dashboard (if you require them). Hopefully you won't be seeing any more crazy "Vote for Trump" spam, or anything like it. It's far too early to tell if this is the last type of language spam we'll see in GA, and the rule of thumb says it won't be. However, with this fix you should be well positioned to avoid most of it in the near future.
