23:21 Mon 28 Feb 2011
Since adding reCAPTCHA to my comment forms, the number of spam comments I’m getting has dropped. Initially it dropped to almost nothing, but now it’s back up to several a day, which is annoying but not unmanageable. It’s still too high, and it’s one of the reasons why I don’t have email on my phone.

Before the current point of manageable spam was reached, however, I had accumulated 15,000 pending comments for my blog.

Comment traffic on my blog isn’t too high, and I can’t stand the thought of any spam showing up on it ever, so I moderate every comment that comes through. If it’s not spam, I’ll post it. When I was being inundated with ridiculous amounts of spam, however, I stopped marking them as spam and just ignored them, leaving them to sit in “pending”. Now, while the situation is more tractable, I want to deal with them.

I don’t want to just delete all of them before a certain date, however, because I still worry about wrongly deleting a comment that I should have approved. I’m aware that this is really an issue of control-freakery, but that doesn’t change my course of action.

At first I was steadily reviewing and deleting about 100 per day, but that was too slow. Today I tried reviewing and deleting 500 at a time, but WordPress and Apache both have problems with that, as the size of the URL ends up being unmanageable. I seemed to be able to get about 87 at a time through the admin interface. To make it slightly more interesting, I started searching for spam terms that seemed common within the pending comments: “gold”, “sex”, “money”, “cialis”, “viagra”, “forex”, “market”, “porn”, and “fleshlight”. Most of those had over 100 matches, and I think that “gold” (mostly World of Warcraft-related), “money”, and “sex” were the most common. However, none of them matched as many comments as I expected, and after going through all of them I still have just over 12,000 pending comments.
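The batching described above can be sketched in a few lines of Python. This is only an illustration, not how the WordPress admin works internally: the function name `batches` and the made-up ID list are assumptions, and the batch size of 87 is simply the figure that happened to work for me.

```python
from typing import Iterator, List


def batches(comment_ids: List[int], size: int = 87) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` comment IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(comment_ids), size):
        yield comment_ids[start:start + size]


# Made-up IDs standing in for 500 pending comments.
pending = list(range(1, 501))
chunks = list(batches(pending))

# 500 IDs at 87 per batch: five full batches plus one of 65.
print(len(chunks), len(chunks[-1]))  # prints "6 65"
```

Keeping each batch small keeps whatever request carries the IDs short, which is the point of working in groups of 87 rather than 500.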

Over 12,000 spam comments that don’t contain any of the above strings. I don’t see any for e.g. “v1agra”, either. I admit to having naively thought that the bulk of the comments would be about those major topics. Instead, the spam seems to have a “long tail”, and so we have LeBron James jerseys, auto repair sites, Egyptian love potions, dentists, and blast cabinets being promoted through spam comments.
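To make the "long tail" concrete, here is a rough Python sketch of the tally I was doing by hand: count how many comments contain each term, and how many contain none of them. The function name `tally` and the sample comments are invented for illustration, and case-insensitive substring matching is an assumption about what counts as a match (it will, for instance, count “sex” inside “Sussex”).

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

# The terms I searched for in the pending queue.
SPAM_TERMS = ("gold", "sex", "money", "cialis", "viagra",
              "forex", "market", "porn", "fleshlight")


def tally(comments: Iterable[str],
          terms: Sequence[str] = SPAM_TERMS) -> Tuple[Counter, int]:
    """Return per-term comment counts and the number of comments matching no term."""
    counts: Counter = Counter()
    unmatched = 0
    for text in comments:
        lowered = text.lower()
        # Case-insensitive substring match (an assumption, see above).
        hits = [term for term in terms if term in lowered]
        counts.update(hits)
        if not hits:
            unmatched += 1
    return counts, unmatched


# Invented sample comments, not real data from my queue.
sample = [
    "Cheap WoW GOLD here",
    "Make money in the forex market",
    "LeBron James jerseys on sale",
    "Best blast cabinets",
]
counts, unmatched = tally(sample)
print(counts["gold"], counts["market"], unmatched)  # prints "1 1 2"
```

In this toy sample half the comments match nothing on the list, which is the same shape as my real queue: the obvious terms only skim the top.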

It’s possible that spam comments are even more annoying to deal with than spam emails. Email spam, with the possible exception of some higher-quality phishing attempts, doesn’t generally masquerade as something else. Or at least it doesn’t try to masquerade as something that’s not selling you something, since selling you something is the point. Spam comments, however, try to masquerade as genuine comments on the topic at hand, so that people like me won’t delete them and so that search engines treat them as if they’re content of the same quality as the rest of the page. And they do this very, very badly, layering an insult to your intelligence on top of their other negative characteristics.

I’ve long thought that the vast majority of advertising is cultural pollution, but spam comments are particularly awful. Masquerading as legitimate discussion participation just to push links to whatever awful crap they’re hawking destroys the original discussions unless someone takes the time to deal with it. I’ve seen plenty of extremely useful blog comments become lost in a sea of spam, and it’s always struck me as very sad: the available technology could in theory have kept the possibility of meaningful communication in that space alive for years, even after the original author has abandoned the blog, but instead automated spam descends and destroys that potential.

Perhaps most maddening of all, this ecology generates a fair amount of noise (yes, even something as low-signal as spam comments can have comparatively noisy segments). Either a spammer is testing out software, so that the spam comments don’t have any links but do have alphanumeric identifying strings, or the spammer is entirely incompetent, and in place of URLs there are blank lines or “[[insert link here]]” and the like. So in those cases there’s barely any marginal advantage whatsoever to the polluter.

Of course, on that note, one of the things that irks me most about this is that there actually isn’t any marginal advantage whatsoever to any spammer in spamming my blog particularly. I would sooner shut off comments entirely than let spam on my site, so all they’ve achieved is creating a lot of work (and a lot of negativity) for me without ever helping their own cause in any way.

Yes, I’m aware that they don’t know this, and that it’d be more work for them to exclude my site, and that I really shouldn’t take it personally, but I find that quite difficult. Especially when they switch to new tactics that are still ineffective and pathetic, like simply ripping news headlines and stuffing them into comments, or attempting to match their comments to what they think the post is about. You’d think with that many comments some of them would work just by chance, but I think that’s only happened once, and then only because a) I’m pretty sure a human agent made the comment, rather than it having been automatically generated, and b) I actually have a very lenient policy; if it seems at all like it might not be spam, I’ll publish it.

I should be happy that it’s ineffective, right? But somehow it doesn’t work that way; somehow I find it more angering when it’s stupid. Again, there’s that aspect of taking it personally and thinking, “why are you even making me bother with this?”, which is a dumb question to ask. More typical advertising brings a similar reaction; I dislike it more when it’s dumb, and can appreciate to a certain extent the artistry involved in making it subtle and effective, even if I despise its effects. But spam, of course, works on rather different principles, and artistry would really be beside the point.

    YMMV but a combo of SI CAPTCHA Anti-Spam and Akismet brought my spam problem down to zero.

