I’ve been logging attempted WordPress spam on one Web site. I grabbed the 500 most recent spams out of the database to which they’re banished and took a look at common characteristics with an eye toward the best strategies for detecting and rejecting spam. During the time period covered by the 500 spams, none of them made it through the spam filters, but I’m tweaking the filtering a bit to optimize the opportunities to squish the nasty stuff. Let’s suppress our gag reflex and look a little closer at this sewage to understand the slime and odors a bit better. (You can tell I’m not a big fan of spam or spammers.)

First off, I’ll mention that the Web site we’re looking at is configured to not allow pingbacks and trackbacks (which, if allowed, would account for about 10% of the spam). Also, all comments are moderated, never automatically accepted, so even if our filtering fails, a review lets us reject the few that might sneak through (although none did in this batch).

1) 45.6% IP Address Repeat Offenders

There were 228 spam attempts originating from IP addresses from which previous spam attempts were made. Our policy is to automatically reject a comment as spam if 2 or more spams originated from the same IP address within the last 90 days. This rule is highly reliable but probably not perfect: a spammer may operate through a proxy, poisoning it for legitimate commenters going through the same proxy. We haven’t seen any case of that yet, so we’ll continue to ban based on recent IP address offenses.

We also have the ability to always block certain addresses or ranges no matter the circumstance. We get a few of those occasionally, usually originating from Eastern European countries. The blog doesn’t have a high Latvian readership, so we feel comfortable automatically rejecting comments originating from IP addresses assigned there.
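
To illustrate the idea, here is a minimal sketch of what a repeat-offender check could look like in WordPress, assuming spam attempts are logged to a custom table. The table and column names here are hypothetical, not the site’s actual schema.

```php
<?php
// Sketch only: assumes a custom table wp_spam_log with columns `ip` and `logged_at`.
// Returns true if the submitting IP has 2 or more logged spams in the last 90 days.
function is_repeat_offender( $ip ) {
    global $wpdb;
    $table = $wpdb->prefix . 'spam_log';   // hypothetical spam-log table
    $count = (int) $wpdb->get_var( $wpdb->prepare(
        "SELECT COUNT(*) FROM {$table}
         WHERE ip = %s AND logged_at > DATE_SUB( NOW(), INTERVAL 90 DAY )",
        $ip
    ) );
    return $count >= 2;
}
```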

2) 7.2% Author Names Too Long

The typical comment form specifies a maximum length of 30 characters in the INPUT tag for the author’s name. We found 36 spams that arrived with that value in excess of 30 characters, a clear indicator of automated form submission. They are undoubtedly spam. A few email values were too long as well (4, or 0.8%), so those were rejected too. A quick check using the Lynx text browser shows that it enforces the input limit as well, so this is a relatively safe and reasonable rejection criterion.
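
As a sketch of the idea, the over-length test amounts to comparing submitted values against the form’s declared limits. The limits below are illustrative assumptions, not the site’s exact configuration.

```php
<?php
// Sketch: reject values longer than the form's declared maxlength attributes.
// The limits below are examples; the name limit mirrors maxlength="30" on the INPUT tag.
function exceeds_field_limits( array $commentdata ) {
    $limits = array(
        'comment_author'       => 30,   // author name field
        'comment_author_email' => 100,  // example limit for the email field
    );
    foreach ( $limits as $field => $max ) {
        if ( isset( $commentdata[ $field ] )
            && mb_strlen( $commentdata[ $field ], 'UTF-8' ) > $max ) {
            return true;  // no browser honoring maxlength would submit this
        }
    }
    return false;
}
```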

3) 4.8% Bad Referer

The Referer (sic) sent along with a comment submission should match the URL of the blog post itself. In other words, a commenter presumably brings up a blog post page and uses the form on that page to submit a comment. Automated spam doesn’t always do this properly, so those submissions can be summarily rejected. Missing referrers are fine and don’t count against the submission; they may or may not be spam.
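
One way such a check could be written, as a sketch only; real code might normalize URLs (scheme, trailing slashes) more carefully.

```php
<?php
// Sketch: a comment's Referer header, when present, should point at the post's own URL.
// Missing referrers pass; only a mismatched one counts against the submission.
function has_bad_referer( $comment_post_ID ) {
    if ( empty( $_SERVER['HTTP_REFERER'] ) ) {
        return false;                               // absent referrer: no penalty
    }
    $expected = get_permalink( $comment_post_ID );  // canonical URL of the post
    // Bad if the referrer does not begin with the post's permalink.
    return strpos( $_SERVER['HTTP_REFERER'], $expected ) !== 0;
}
```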

4) 15.2% Bad Field Names

During the time period that these spams were submitted, the comment forms had a single name change to one of the fields. This was enough to trip up 76 of the submissions, which used the WordPress default field name rather than our renamed one. Straight to the spam pile! Since then, the comment forms have been changed further and, so far, we have had 100% success rejecting spam on that single criterion alone. Furthermore, the spam rate has dropped substantially, so something about the reworked forms is really throwing automated spam off. Too bad.
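
A minimal sketch of this kind of check, with a placeholder standing in for the renamed field (the real field name is, of course, not published here).

```php
<?php
// Sketch: the comment textarea was renamed from the WordPress default ("comment")
// to a site-specific name. "comment_xyz" below is a placeholder, not the real name.
function used_default_field_names( array $post_fields ) {
    // A submission still carrying the stock field name never came from our form.
    return isset( $post_fields['comment'] ) && ! isset( $post_fields['comment_xyz'] );
}
```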

5) 6.2% Bad Author URL

We removed the author URL field from the standard comment form, but that didn’t stop the value from showing up in submissions from some automated spammers; straight to the rejection heap for 31 spam comments.
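
In effect the removed field works like a honeypot. A sketch of the check, assuming the standard WordPress commentdata key:

```php
<?php
// Sketch: the URL field was removed from the rendered form, so any non-empty
// author URL in the submission must have come from an automated poster.
function has_unexpected_url( array $commentdata ) {
    return ! empty( $commentdata['comment_author_url'] );
}
```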

6) 6.6% Multibyte Content

All or part of 33 comments had multibyte characters (typically Russian, Chinese, or Arabic languages).  This blog is in American English and, in keeping with traditional proud American ignorance, they were all rejected out of hand.  I did look at a few via Google Translate and all were clearly spammy so no loss.  More cosmopolitan blogs would be advised to be less arbitrary.
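
A rough sketch of how such a multibyte test could be written; a blog that expects non-English comments would obviously want something narrower than this.

```php
<?php
// Sketch: flag comments whose text contains any byte outside the ASCII range,
// which is how multibyte (e.g. Cyrillic, CJK, Arabic) content shows up.
function has_multibyte_content( $text ) {
    return (bool) preg_match( '/[^\x00-\x7F]/', $text );
}
```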

7) 0.6% Entity Code Disguise

Those of us with devious minds would think that entity-encoding certain characters would be a good way to bypass word filters. Surprisingly, we found only 3 in this sample that did that. But the fact that they did was enough to brand them as too clever by half; off with their spammy heads.

8) 0.8% Entity Name Disguise

In the same vein as the entity code disguise mentioned previously, entity names can be used within words to disguise them from word filters. We found 4 in the sample of 500 spams, and 2 of them overlapped with the entity code cases. These were not so easy to reject outright because one could make a good case for certain blogs, such as technical blogs, to use entity names for Greek characters, for example. If your blog is about Cajun cooking, though, there’s probably little reason why a Greek omicron should appear in a comment about a recipe.
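
A sketch of how both entity disguises could be handled: decode the entities before scoring the text, and separately note that entities were present at all. This is an illustration of the approach, not the site’s actual filter.

```php
<?php
// Sketch: decoding numeric entity codes (e.g. "&#105;" back to "i") exposes words
// hidden from simple filters, while the mere presence of entity names (e.g. "&omicron;")
// in plain prose can itself be treated as suspicious on non-technical blogs.
function decode_entity_disguises( $text ) {
    // Both numeric codes and named entities collapse to literal characters before scoring.
    return html_entity_decode( $text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
}

function contains_entity_names( $text ) {
    // Matches named entities such as &omicron; or &eacute; (but not numeric codes).
    return (bool) preg_match( '/&[a-z][a-z0-9]*;/i', $text );
}
```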

9) 94.4% Missing Metadata

The site has some extra functionality added to gather some very simple view stats, and that functionality uses JavaScript AJAX to do its thing. Some extra handshaking was added there to do some sanity checking on the comment form. A whopping 472 spam comments were missing that metadata (indicating the JavaScript code never ran), and zero legitimate comments were missing it. This is a very reliable indicator of suspicious automation activity. It’s not perfect, though: vision-impaired users could legitimately access the site without the JavaScript running because of the text browser they are using. But, given the ubiquitous presence of JavaScript for all kinds of presentation and function purposes, comment form handling is but one small part of the problem for text browsers.

By the way, 2 of the submitted comments included the JavaScript-managed metadata but used wrong data, indicating an attempt to finesse the processing. The metadata content would be very difficult to fake without understanding what the data pieces were, how they were encoded, and how the signature hash was generated. So anything wrong with the metadata is a guarantee that somebody is trying to get away with something.
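
For illustration, here is a rough sketch of the general idea behind signed metadata verification. The field names, payload, and secret are hypothetical, not the site’s actual scheme.

```php
<?php
// Sketch of the general idea (not the site's actual scheme): the page's JavaScript
// posts an opaque metadata value plus a signature, and the server recomputes the
// signature to verify it. Field names here are placeholders.
function metadata_is_valid( array $post_fields, $secret ) {
    if ( empty( $post_fields['ajax_meta'] ) || empty( $post_fields['ajax_sig'] ) ) {
        return false;                       // JavaScript never ran: no metadata at all
    }
    $expected = hash_hmac( 'sha256', $post_fields['ajax_meta'], $secret );
    // hash_equals() avoids timing side channels when comparing signatures.
    return hash_equals( $expected, $post_fields['ajax_sig'] );
}
```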

10) 59.6% High Spam Score

A simple spam scoring mechanism is used to check the content of comment submissions for certain words and phrases that commonly appear in spam. Even with this fairly simple, and by no means comprehensive, scoring mechanism, 298 comments went over the arbitrary spam score limit. There were no false positives in this sample, so this is a very reliable indicator. Even if somebody had some legitimate reason why Ugg boots had to be mentioned in a comment, I don’t feel bad that it gets rejected anyway. As a spam-loather, I despise Ugg boots. Anybody who has dealt with spam knows why.
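
A toy sketch of such a scorer; the phrases, weights, and threshold are placeholders, not the site’s actual list.

```php
<?php
// Sketch: an intentionally simple weighted word/phrase scorer.
// The terms and weights below are illustrative placeholders.
function spam_score( $text ) {
    $weights = array(
        'ugg boots'   => 5,
        'payday loan' => 5,
        'casino'      => 3,
        'click here'  => 2,
    );
    $score    = 0;
    $haystack = strtolower( $text );
    foreach ( $weights as $phrase => $weight ) {
        $score += substr_count( $haystack, $phrase ) * $weight;
    }
    return $score;   // compare against an arbitrary limit, e.g. a score of 5 or more flags spam
}
```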

11) 82.8% Bad User Agents

A whopping 414 comments included user agent strings that looked suspicious. Most were suspicious because of the “MSIE 6.0” substring. Who uses IE 6 anymore? It’s likely just a lazy carry-over from when the automated spam software was developed. Other suspicious user agent strings contained URLs, references to PHP, or the substring “FunWebProducts” (an old, obnoxious, once-ubiquitous spam-ware infestation from years ago). It’s nice to be able to reject comments based on these user agent strings but, if the practice becomes common, it’s ridiculously easy for spammers to update to more current bogus user agent strings.
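
A sketch of the kind of check involved; the pattern list is illustrative and would need ongoing maintenance.

```php
<?php
// Sketch: flag submissions whose User-Agent matches known-bogus patterns.
function has_suspicious_user_agent( $ua ) {
    if ( $ua === '' ) {
        return true;                    // no user agent at all is also suspect
    }
    $patterns = array(
        '/MSIE 6\.0/',                  // ancient browser claimed by much spam software
        '/FunWebProducts/',             // old spam-ware signature
        '/https?:\/\//',                // URLs have no business in a user agent string
        '/\bPHP\b/',                    // script-driven submitters sometimes leak this
    );
    foreach ( $patterns as $pattern ) {
        if ( preg_match( $pattern, $ua ) ) {
            return true;
        }
    }
    return false;
}
```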

Lessons Learned

Although our WordPress spam-handling methods have been nearly 100% effective, the analysis of these 500 spams shows that a few things we were doing don’t accomplish much (e.g. items 7 and 8 above). Some things we weren’t doing have now been adopted (e.g. items 2 and 11). Our current strategy is to reject comments outright for any one of several offenses (items 1 through 5 above). Then, if things still look good at that point, the code scores the remaining tests (items 6 through 11); if the comment fails any of them, it’s put in the WordPress spam queue for review unless it reaches a 3-strikes-and-you’re-out limit, in which case the spam is rejected outright.
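
Wired into WordPress, that two-stage flow could look roughly like the sketch below, assuming the individual checks sketched earlier exist as functions. The hook usage, thresholds, and secret are illustrative, not the site’s actual implementation.

```php
<?php
// Sketch of the two-stage flow: hard rejections first, then scored soft checks
// with a 3-strikes limit. Uses the check functions sketched in earlier sections.
add_filter( 'pre_comment_approved', function ( $approved, $commentdata ) {
    $ip = $_SERVER['REMOTE_ADDR'];
    $ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';

    // Stage 1: any single hard offense (items 1-5) rejects the comment outright.
    if ( is_repeat_offender( $ip )
        || exceeds_field_limits( $commentdata )
        || has_bad_referer( $commentdata['comment_post_ID'] )
        || used_default_field_names( $_POST )
        || has_unexpected_url( $commentdata ) ) {
        wp_die( 'Comment rejected.' );
    }

    // Stage 2: soft checks (items 6-11) accumulate strikes.
    $content = $commentdata['comment_content'];
    $strikes = 0;
    $strikes += has_multibyte_content( $content ) ? 1 : 0;
    $strikes += contains_entity_names( $content ) ? 1 : 0;
    $strikes += metadata_is_valid( $_POST, AUTH_KEY ) ? 0 : 1;  // AUTH_KEY stands in for the real secret
    $strikes += ( spam_score( $content ) >= 5 ) ? 1 : 0;
    $strikes += has_suspicious_user_agent( $ua ) ? 1 : 0;

    if ( $strikes >= 3 ) {
        wp_die( 'Comment rejected.' );  // 3 strikes and you're out
    }
    if ( $strikes > 0 ) {
        return 'spam';                  // into the WordPress spam queue for review
    }
    return $approved;                   // surviving comments are still moderated
}, 10, 2 );
```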

We need to stay cognizant of usability issues and of the fact that various blogs are more conducive to comments of one type versus another, and our spam-testing code balances that. Further, we need to be aware that spam changes constantly as software adapts or new jerks get into the spamming business; the patterns are always changing. Finally, we have to be aware that many of the tests we do catch things the spammers just dropped the ball on, for example items 2 through 6 above. Most are very easy for spammers to fix if they knew of the problem but, meanwhile, it’s a simple way to filter out the scum.

So far we haven’t had to resort to things like CAPTCHAs or quiz questions (which are an annoyance for legitimate users and now easily circumvented by spammers). If we can all hold on a little longer, maybe the spammers’ customers will dry up as they realize that all this spamming isn’t helping them anyway, even when the spams are successful. If anything, they’re just antagonizing people.

Spam Strategy Going Forward

We will continue to log spam attempts and, perhaps, occasionally do an analysis like this again to see whether adjustments need to be made or new opportunities for filtering present themselves. There are other, more compute-intensive tests we can perform if it comes to that. For example, we’ve been experimenting with a relevancy-scoring test that tries to determine how likely it is that a comment actually pertains to the post it is directed at. That, and some other prototype tests, will be kept in the warehouse until we see spammers cleaning up their acts in other regards.
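
As an illustration of the idea only (not the prototype being tested), a crude relevancy score could simply measure how much vocabulary a comment shares with its target post.

```php
<?php
// Sketch of a crude relevancy test: the fraction of a comment's distinct words
// that also appear in the post text. Illustrative only.
function relevancy_score( $post_text, $comment_text ) {
    $tokenize = function ( $text ) {
        // Lowercase and keep distinct words of 4+ letters, ignoring short filler words.
        preg_match_all( '/[a-z]{4,}/', strtolower( $text ), $matches );
        return array_unique( $matches[0] );
    };
    $post_words    = $tokenize( $post_text );
    $comment_words = $tokenize( $comment_text );
    if ( empty( $comment_words ) ) {
        return 0.0;
    }
    $shared = array_intersect( $comment_words, $post_words );
    return count( $shared ) / count( $comment_words );  // 0.0 = unrelated, 1.0 = fully on-topic
}
```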