A List Apart

Issue № 248

Graceful E-Mail Obfuscation

Published in JavaScript, The Server Side, Accessibility

In “Win the SPAM Arms Race” (A List Apart, May 2002), Dan Benjamin talked about the importance of hiding e-mail addresses on our websites from vicious, e-mail address harvesting bots—or spam bots, as they are more often called. Dan pioneered a JavaScript-based solution for bypassing the indexing mechanisms that spam bots use. Here’s a quote from the article:

Posting a naked e-mail link anywhere on the web (or in a newsgroup, in a chatroom, on a weblog comments page…) is generally the kiss of death for your once-healthy address.

It’s hard to believe, but it’s been more than five years since Dan wrote these words. So, did we win the SPAM Arms Race? As you may have noticed by looking at your own inbox recently, not exactly. The Messaging Anti-Abuse Working Group (MAAWG) estimates that 90 billion spam messages are sent every day, and 80–85% of all incoming mail is abusive.

A shared responsibility

Many web users don’t understand the inevitable consequences of exposing their e-mail address on the web. Experienced web developers and website owners, however, do. Thousands of spam bots tirelessly crawl the web to collect e-mail addresses exposed on websites, in blog comments, and elsewhere. These addresses end up in databases sold to unsavory marketers, who bombard the owners’ inboxes with unsolicited mail.

Of course, spam is an increasingly complicated problem that can never be solved by the efforts of web developers alone. But don’t underestimate your own powers.

An unpleasant surprise

I work for a large non-profit organization that provides social services for the blind and visually impaired. After Wim, our system administrator, complained about the massive amounts of spam our mail server had to process, we started a small investigation. It turned out that 90% of all spam was sent to a mere 5% of the e-mail addresses we own, and guess what? They were exactly the addresses that had been published on our website.

Although most of the damage had been done by then (remember Dan’s quote), I promised Wim I would come up with an effective way to protect the addresses on our upcoming portal, on which we intend to publish even more addresses.

My solution would need to defeat spam and be accessible. We work intensely with and for people who have (mostly visual) disabilities. Accessibility is not an optional add-on.

A few months ago, Wim very unexpectedly passed away (we miss you, Wim!). Since then, I have spent a lot of time thinking about a way to fight spam bots. In this article, I’ll share my ideas on the subject and leave you with a working script to build on or to use in your own projects right away.

The problem with current techniques

Wikipedia has an excellent overview of anti-spam techniques. Their article also includes interesting links to articles about e-mail obfuscation (Google the subject for more). Over the years, I’ve tried more than a dozen of these techniques. Although most seem effective, I can’t use them in my projects, as every one fails to meet one or more essential requirements. My requirements are:

1. No hassle, please

You’ve certainly seen e-mail links that look like “mailto:contact_removethis@company.com” or “mailto:contact(at)company(dot)com”. If you’re like me, you probably don’t like to correct a deliberately misspelled e-mail address after you click on it. Moreover, users who don’t notice what’s wrong with the address will end up frustrated, because their message cannot be sent or delivered. Similar techniques require users to re-type a (correctly spelled) address that’s rendered as an image—which isn’t any better, of course.

Although they don’t require JavaScript, these methods of e-mail obfuscation add an unpleasant barrier to a task as trivial as sending an e-mail. Clearly, this is not the right way to treat visitors or (potential) customers. I want real, clickable e-mail links that work just as expected, but—at the same time—are immune to spam bots.

2. Graceful degradation

JavaScript-based techniques like Dan’s offer the seamless user experience I’m looking for. They’re all based on the simple fact that spam harvesters are incapable of parsing JavaScript or understanding DOM changes initiated by JavaScript events. Instead, spam harvesters try to extract e-mail addresses from raw HTML by using brute force algorithms (even Googlebot chokes on most of the JavaScript it comes upon). Only real browsers know how to handle JavaScript and can undo the obfuscation, either by stitching together document.write() calls or by using a more advanced, unobtrusive, event-based approach.

An important downside is that such solutions are not bulletproof. Visitors who surf the web without JavaScript support—whether by choice or not—are out of luck, because they’re treated as spam bots. These visitors include people using text browsers, old or incapable screenreaders, or mobile devices with limited capabilities. Other users have JavaScript turned off for security reasons or because of company policies. W3Schools estimates that 6% of internet users have no access to JavaScript as of January 2007. As a comparison, if you believe that’s not enough to really care about, then maybe it’s time to reconsider why you strive to make your markup and CSS accommodate the 1.5% of IE 5.x users or the 1.3% of Safari users (again, W3Schools).

3. Install and forget

Most e-mail obfuscation techniques I’ve tried tend to be bothersome and time-consuming to implement because they have to be applied to each and every e-mail address that you want to protect. Most require you to use lengthy inline script elements and inline event handlers. They may also invalidate your markup.

I wanted a transparent and fully automated solution that I can set up once and never worry about again. That’s the only way I can guarantee that all addresses that appear on our website are safe—even the ones that show up in blog comments.

Putting it together

Enough talking. Let’s get our hands dirty.

The ingredients

You’ll need Apache 2 and PHP 4 or later. On the web server, the mod_rewrite module must be enabled and you should be able to set Apache directives through the use of .htaccess files. Most web hosts have this enabled by default, so you probably don’t have to worry about it. For help on these Apache-specific features, check out the Apache documentation.

Put on your masks

Setting up Graceful E-Mail Obfuscation (GEO) involves a few steps. The key is to replace all occurrences of mailto links with innocent-looking URLs. Take this e-mail link as an example:

<a href="mailto:sales@yourcompany.com">
  E-mail our sales department
</a>

After the server-side treatment (I’ll get to that in a minute), that same link will look like this:

<a href="contact/sales+yourcompany+com" rel="nofollow">
  E-mail our sales department
</a>

Let’s just take this one step further and apply some basic ROT13 to it.

<a href="contact/fnyrf+lbhepbzcnal+pbz" rel="nofollow">
  E-mail our sales department
</a>

From the results of web exposure tests I did with freshly created addresses, the ROT13 encryption did not seem to be necessary for the technique to be effective. However, it does add an interesting level of obfuscation that certainly won’t do any harm either. If you’re not familiar with ROT13, I should note that it doesn’t add real cryptographic security. Wikipedia offers an accurate description of what ROT13 does:

Applying ROT13 to a piece of text merely requires examining its alphabetic characters and replacing each one by the letter 13 places further along in the alphabet, wrapping back to the beginning if necessary

There are a couple of other things to note here:

  • I chose “contact” as a faux folder name for this example, but you can choose anything you like. To substitute the “@” and the dot in the address, I opted for a “+”. A “+” rarely appears in real e-mail addresses (it is technically valid before the “@”, but the pattern used here does not match such addresses, so they are simply left untouched) and it doesn’t have to be URL-encoded, which will come in handy later on.
  • The rel=“nofollow” part is added to instruct search engines that they don’t need to follow these links and index subsequent pages. Read more about rel=“nofollow” on Microformats.org.

Away with the mailtos! We’re left with plain old hyperlinks. Well, except that they’re broken, maybe; but we’ll fix that soon enough. As you can imagine, there’s very little chance that a spam bot will identify these links as e-mail links—because…they’re not.

The script

To replace each occurrence of a mailto link in a given webpage with a regular URL, I’ll use a PHP search-and-replace regular expression. The URL notation reuses parts of the original e-mail address so that it can be reconstructed later on. For this, we’ll take the entire HTML page as the subject of a PHP preg_replace() function:

function encrypt_mailto($buffer) {
  // Turn "mailto:name@domain.tld" into "contact/name+domain+tld" rel="nofollow"
  return preg_replace(
    '/"mailto:([A-Za-z0-9._%-]+)@([A-Za-z0-9._%-]+)\.([A-Za-z]{2,4})"/',
    '"contact/\1+\2+\3" rel="nofollow"',
    $buffer
  );
}

With ROT13 enabled, the encrypt_mailto() function looks quite a bit longer, as you’ll see in the finalized PHP class that you can download at the end of the article.
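To give a feel for what the ROT13-enabled encoding involves, here is a minimal sketch written in JavaScript rather than PHP. It is not the code from the downloadable class; the function names rot13 and encodeMailtoLinks are made up for this illustration, and the pattern is the one from the simplified example above.

```javascript
// Hypothetical helper names; a sketch, not the code from the downloadable PHP class.

// ROT13: shift each ASCII letter 13 places, wrapping around the alphabet.
// Digits and punctuation are left as they are.
function rot13(text) {
  return text.replace(/[A-Za-z]/g, function (ch) {
    var base = ch <= 'Z' ? 65 : 97; // char code of 'A' or 'a'
    return String.fromCharCode((ch.charCodeAt(0) - base + 13) % 26 + base);
  });
}

// Replace every "mailto:name@domain.tld" attribute value in an HTML string
// with a ROT13-encoded "contact/name+domain+tld" URL plus rel="nofollow".
function encodeMailtoLinks(html) {
  var pattern = /"mailto:([A-Za-z0-9._%-]+)@([A-Za-z0-9._%-]+)\.([A-Za-z]{2,4})"/g;
  return html.replace(pattern, function (match, name, domain, tld) {
    return '"contact/' + rot13(name) + '+' + rot13(domain) + '+' + rot13(tld) +
      '" rel="nofollow"';
  });
}

console.log(encodeMailtoLinks('<a href="mailto:sales@yourcompany.com">E-mail</a>'));
// prints: <a href="contact/fnyrf+lbhepbzcnal+pbz" rel="nofollow">E-mail</a>
```

Because ROT13 is its own inverse, the same rot13() helper can be reused unchanged for decoding.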

Now I want the script to intercept and parse all HTML pages before they’re sent to the browser. I’ll use PHP’s output buffering mechanism for that. In its simplest form, output buffering is activated by using a callback function:

ob_start("encrypt_mailto");

Using .htaccess, plus PHP’s little-known, but powerful auto_prepend_file directive, we can now automate this process for an entire website or for specific folders only. If you add the following line to your .htaccess file, prepend.inc.php will be automatically included at the top of every PHP document that Apache serves.

php_value auto_prepend_file /yourpath/prepend.inc.php

The prepend.inc.php file itself initiates the output buffering and runs the entire contents of the served pages through the encrypt_mailto() function.

Also note that for this prepending to work properly, you must make sure that PHP code in plain HTML documents (without the .php extension) is parsed by PHP as well. Add this line to the .htaccess file:

AddType application/x-httpd-php .php .htm .html

This might demand a bit more processing power from our web server, but it’s the easiest way to make sure that all our web pages get the server-side special treatment we need. If you’re using a CMS or some sort of application framework, you could opt to cache the server-side encryption.

Fixing the links

Now that we’ve effectively disguised our mailto links, let’s see what happens when someone clicks one of these funny “contact/...” links. Well, except for the Error 404 page: not much.

In the end, visitors shouldn’t notice anything unusual about our e-mail links. A few lines of JavaScript will help us to restore these links into their original shape. But wait: what about those 6% that have no JavaScript support? When JavaScript is not available, our “contact/” URLs will not be “decrypted” on the client side, resulting in a 404 error. Apache to the rescue!

Let’s configure Apache so that its mod_rewrite module will intercept all URL requests that match the pattern we defined earlier. Apache will then derive the segments that make up the e-mail address from the URL and pass them quietly to an intervening PHP script that undoes the ROT13 encryption and prepares the address for further processing. This is what the Apache rewrite rule looks like (line wraps marked » —Ed.):

RewriteRule ^.*contact/([A-Za-z0-9._%-]*)\+ »
([A-Za-z0-9._%-]*)\+([A-Za-z.]{2,4})$ »
/yourpath/mail.php?n=$1&d=$2&t=$3 [L]

Note that I had to split the regular expression to fit on this page, but you can download an example .htaccess file at the end of the article.
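To see what the rewrite rule captures, the same pattern can be tried out in JavaScript, whose regular expression syntax agrees with Apache’s for this expression. This is only an illustration: parseContactPath is a hypothetical name, and the “+” separators are escaped so that they match literally.

```javascript
// Same pattern as the rewrite rule, with the "+" separators escaped.
var CONTACT_PATTERN =
  /^.*contact\/([A-Za-z0-9._%-]*)\+([A-Za-z0-9._%-]*)\+([A-Za-z.]{2,4})$/;

// Returns the n, d and t values the rule would hand to mail.php,
// or null when the requested path is not a masked e-mail link.
function parseContactPath(path) {
  var m = CONTACT_PATTERN.exec(path);
  if (!m) { return null; }
  return { n: m[1], d: m[2], t: m[3] };
}

console.log(parseContactPath('contact/fnyrf+lbhepbzcnal+pbz'));
// n is "fnyrf", d is "lbhepbzcnal", t is "pbz"

console.log(parseContactPath('about/team.html'));
// prints: null
```

The three captured values are still ROT13-encoded at this point; decoding them is left to the script that receives them.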

Providing an elegant fallback solution

Here comes the fun part! Coming up with a safe, elegant and easy-to-use (or “graceful”) alternative for visitors to send an e-mail when JavaScript is unavailable is where your own imagination comes into play. How you do it depends on the type of website you’re using it for, but I don’t suggest using a visual captcha for this purpose: it’s quite likely that people who get to see this non-JavaScript page cannot see the captcha image either, because they’re using a screen reader to compensate for a visual impairment or because they’re using a text browser.

One solution would be to offer users a simple contact form that allows them to send a message without giving away the actual address. And if your website already uses a contact form, you could choose to redirect “unencoded” mailto links to that page.

In most cases, however, people do want the actual address. So, for this example, I decided to prompt the user with a question that’s hard for a spam bot to answer but easy enough for humans. If the right answer is given, the script can safely assume that it’s not dealing with a spam bot and reveal the actual e-mail address.
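The logic of such a question-based check might look roughly like the sketch below. It is written in JavaScript purely for illustration; in practice the check has to run on the server, otherwise the address would be exposed in the page source. The question, the answer and the function names are all invented for this example.

```javascript
// Hypothetical challenge; a real page would probably rotate several questions.
var CHALLENGE = {
  question: 'What is the first letter of the word "mail"?',
  answer: 'm'
};

// Compare answers leniently: ignore case and surrounding whitespace.
function isCorrectAnswer(given) {
  return String(given).trim().toLowerCase() === CHALLENGE.answer;
}

// Rebuild and return the address only when the challenge was answered
// correctly; otherwise return null so nothing is revealed.
function revealAddress(name, domain, tld, given) {
  if (!isCorrectAnswer(given)) { return null; }
  return name + '@' + domain + '.' + tld;
}

console.log(revealAddress('sales', 'yourcompany', 'com', ' M '));
// prints: sales@yourcompany.com

console.log(revealAddress('sales', 'yourcompany', 'com', 'x'));
// prints: null
```

A plain text question like this one stays usable in screen readers and text browsers, which is the reason for preferring it over an image-based captcha here.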

To see how this works, take a look at the demo page I put together. Be sure to turn off JavaScript to see the degradation in action. If you’re using the Web Developer Toolbar for Firefox, choose Disable > JavaScript > All JavaScript.

JavaScript for the rest of us

Now that we’ve implemented a non-JavaScript fallback, let’s make sure that the other 94% of users won’t notice anything “funny” about our carefully masked e-mail addresses. So, let’s revert the page’s DOM to what it looked like before the page’s source code was modified by the PHP script.

First, we need a JavaScript search and replace regex that does exactly the opposite of what our PHP regex did. I wrote a function around it that looks like this:

function geo_decode(anchor) {
  var href = anchor.getAttribute('href');
  // Rebuild "name@domain.tld" from "contact/name+domain+tld"
  var address = href.replace(
    /.*contact\/([a-z0-9._%-]+)\+([a-z0-9._%-]+)\+([a-z.]+)/i,
    '$1' + '@' + '$2' + '.' + '$3');
  // Only rewrite the link if the pattern actually matched
  if (href != address) {
    anchor.setAttribute('href', 'mailto:' + address);
  }
}
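The replacement at the heart of this function can be tried out on plain strings, without a page or a DOM. The sketch below isolates it in a hypothetical helper called decodeHref (not part of the downloadable script) and leaves ROT13 out of the picture.

```javascript
// Hypothetical pure-function variant of the replacement inside geo_decode().
function decodeHref(href) {
  return href.replace(
    /.*contact\/([a-z0-9._%-]+)\+([a-z0-9._%-]+)\+([a-z.]+)/i,
    '$1@$2.$3'
  );
}

console.log(decodeHref('contact/sales+yourcompany+com'));
// prints: sales@yourcompany.com

console.log(decodeHref('http://example.com/contact/sales+yourcompany+com'));
// prints: sales@yourcompany.com

console.log(decodeHref('about/team.html'));
// prints: about/team.html
```

The leading .* swallows whatever precedes “contact/”, so the result is the same whether the href is relative or a full URL. A link that doesn’t match comes back unchanged, which is what the href != address comparison relies on to leave ordinary links alone.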

Next, we must loop through all anchors on the page and tie the geo_decode() function to each anchor’s onclick handler. And finally, because the anchors only exist once the page has loaded, let’s attach that loop to window.onload:

window.onload = function () {
  var links = document.getElementsByTagName('a');
  for (var l = 0; l < links.length; l++) {
    // Decode the link at the moment it is clicked
    links[l].onclick = function () {
      geo_decode(this);
    };
  }
};

To make things run smoothly, a little more code is involved. Take a look at geo.js.php to see how I implemented the ROT13 “decryption.” If you read through geo.phpclass.php, you’ll see that the link to geo.js.php (the file that restores your mailto links) is auto-inserted right before closing the head tag with the help of PHP’s output buffering. This means that you don’t have to add a single line of code to your existing documents to make the script work.
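For readers who want the gist without opening the source files, here is a rough sketch of how the ROT13 step could be folded into the decoding. It is not the code from geo.js.php; rot13 and decodeEncodedHref are hypothetical names, and the sketch assumes that each of the three segments was ROT13-encoded separately.

```javascript
// ROT13: shift each ASCII letter 13 places, wrapping around the alphabet.
// Applying it twice returns the original text, so it also decodes.
function rot13(text) {
  return text.replace(/[A-Za-z]/g, function (ch) {
    var base = ch <= 'Z' ? 65 : 97; // char code of 'A' or 'a'
    return String.fromCharCode((ch.charCodeAt(0) - base + 13) % 26 + base);
  });
}

// Turn an encoded "contact/..." href back into a mailto link.
// Hrefs that don't match the pattern are returned unchanged.
function decodeEncodedHref(href) {
  var m = /contact\/([a-z0-9._%-]+)\+([a-z0-9._%-]+)\+([a-z.]+)$/i.exec(href);
  if (!m) { return href; }
  return 'mailto:' + rot13(m[1]) + '@' + rot13(m[2]) + '.' + rot13(m[3]);
}

console.log(decodeEncodedHref('contact/fnyrf+lbhepbzcnal+pbz'));
// prints: mailto:sales@yourcompany.com
```

Digits, dots and hyphens pass through ROT13 untouched, so only the letters in each segment change.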

Try it yourself

I’ve set up a demo page for you to experiment with, and you can also play around with the source files:

  • .htaccess contains the Apache directives to prepend geo.prepend.php and to redirect page requests using mod_rewrite.
  • geo.prepend.php instantiates the PHP class and sets some custom properties.
  • geo.phpclass.php contains the PHP class that does the “encoding” and inserts a script tag before the closing head element that loads geo.js.php.
  • geo.js.php contains the JavaScript that’s responsible for the “decoding.”
  • mail.php contains an example of a usable fallback script for when JavaScript is unavailable.

...or download the ZIP archive (8 kB).

The script works in all major browsers, including Internet Explorer 5.01.

A solution. For now.

Alas, no e-mail address that appears online is entirely safe. Until all spam is banned from this world, we have to try our best not to make it too easy for spam harvesters to steal our addresses (and make money out of them). Now you can protect your addresses in a fully automated way while at the same time being gracious to all users, so you can focus on what’s really important: getting your content out.

This is only an interim solution. We should all be planning for the day when spam bots get smarter, and outwit them when they do. We should not pretend that legislation alone will be the silver bullet to address the world’s spam problem, so web developers will have to continue to come up with creative solutions to fight the problem—and masking your addresses is one of them. I look forward to reading your comments and suggestions.
