Google’s caching system offers several cool features; one of the most useful is that the words you searched for are highlighted in the page. Most web users don’t read pages carefully; they scan text for what they’re looking for. This is why Google’s cached-page highlighting is so useful. When the page is rendered, users don’t need to read the entire page to find what they came for, because the page shows them where it is. As a quick example, the words highlighted above most likely caught your eye before you actually got to reading them.
Usability heuristics state that users should not have to remember information from one site to the next. Wouldn’t it be great if you could extend search-term highlighting to the pages on your own website any time a visitor came from a search engine? How about also highlighting search terms from your own site’s search tool?
We’ve written a script in PHP that you can add to individual pages or entire websites that will automatically highlight words in your page if the user has followed a link from a search engine results page. You can skip the implementation overview and installation instructions and go straight to the script if you like.
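When a visitor follows a link from a search engine results page, the referring URL usually carries the query in its query string, for example search.php?q=SEARCH+TERMS+HERE&l=en. With these keys and values, you can determine what terms were used on the search engine that listed your site as a result.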
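As a rough illustration of reading those keys and values, here is a minimal sketch, assuming PHP 7 or later. The function name terms_from_referrer and the list of parameter names (q, query, p) are assumptions made for this example; search engines differ, and the downloadable script may work differently.

```php
<?php
// Hypothetical helper: pull the search terms out of a referring URL.
// The parameter names in $keys are assumptions; engines use different ones.
function terms_from_referrer(string $referrer, array $keys = ['q', 'query', 'p']): array
{
    $query = parse_url($referrer, PHP_URL_QUERY);
    if (!is_string($query) || $query === '') {
        return []; // no query string, so nothing to highlight
    }

    // parse_str() decodes "+" and %xx escapes for us.
    parse_str($query, $params);

    foreach ($keys as $key) {
        if (isset($params[$key]) && is_string($params[$key])) {
            // Split on whitespace, lowercase, and drop duplicates.
            $terms = preg_split('/\s+/', trim($params[$key]), -1, PREG_SPLIT_NO_EMPTY);
            return array_values(array_unique(array_map('strtolower', $terms)));
        }
    }
    return [];
}
```

Calling terms_from_referrer() on a URL ending in search.php?q=SEARCH+TERMS+HERE&l=en yields the three lowercased terms; a URL with no recognized key yields an empty array.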
The next step is to find all words in your page that match those that the user searched for on the search engine. Once you have a complete list of terms from the referrer’s query string, you wrap each instance of a term in a span element with a special class. Using your site’s cascading style sheets, you then highlight these terms using background colors, font weights, or different voices (depending on the target medium) so that they are more apparent to the user. We gave each search term a different class so the terms can be highlighted in different ways (e.g. every mention of “color” is highlighted in yellow, every mention of “coding” is highlighted blue, and so on).
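The wrapping step might look something like the sketch below, assuming PHP 7 or later. It works on plain text only (no markup), and the function name wrap_terms and the class names sehl1, sehl2, and so on are invented for illustration; they are not taken from the actual script.

```php
<?php
// Hypothetical sketch: wrap each search term in a span with its own class.
// Assumes $text is plain text with no tags in it.
function wrap_terms(string $text, array $terms): string
{
    if ($terms === []) {
        return $text;
    }

    $classOf = [];
    $alternatives = [];
    foreach (array_values($terms) as $i => $term) {
        $classOf[strtolower($term)] = 'sehl' . ($i + 1); // invented class names
        $alternatives[] = preg_quote($term, '/');
    }

    // One pass over the text, so markup we insert is never searched again.
    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';

    return preg_replace_callback($pattern, function (array $m) use ($classOf): string {
        $class = $classOf[strtolower($m[1])] ?? 'sehl';
        return '<span class="' . $class . '">' . $m[1] . '</span>';
    }, $text);
}
```

Matching all terms in a single pass is a deliberate choice here: replacing one term at a time could end up highlighting the words “span” or “class” inside markup added for an earlier term.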
This sounds fairly easy, but there are complications that need to be considered. If the visitor searches for “div,” you don’t want to wrap the tag name inside every <div> tag in a span element. You also don’t want to add span elements inside any attribute values, or you’ll end up with something like <img src="example.png" alt="This is an example <span class="…">image</span>"/>. We need to separate the tags from the plain text, parse the plain text for search terms and wrap any instances in span tags, and finally put the plain text and the tags back together again, without changing the original structure or rendering of the page.
We accomplished this using regular expressions, a powerful tool that allows you to match patterns of text (see CPAN for a basic tutorial on using regular expressions). If you want to find an HTML tag you could use PHP’s string searching functions to find every possible combination of tags, but that takes a lot of work; with regular expressions you simply search for patterns.
We use a pattern analogous to saying “look for ‘<’ followed by any number of characters that are not ‘>’, followed by ‘>’”. The HTML file acts as the input string the regular expression tries to match the pattern against. Using this we were able to separate the HTML tags and the plain text. We then take the untagged plain text and add the span tags around search terms, then put back the HTML tags in their original positions. This way any semantic meaning and presentation (visual, aural, or otherwise) is preserved, along with the structure and validity of the markup.
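The separate-then-reassemble idea can be sketched as follows, assuming PHP 7 or later. The helper name transform_text_only is hypothetical, and the pattern is the simple one described above, so it does not cope with a ‘>’ inside an attribute value, with comments, or with script and style contents.

```php
<?php
// Hypothetical helper: apply $transform to the text between tags only,
// leaving every tag (and its attributes) untouched.
function transform_text_only(string $html, callable $transform): string
{
    // "<", then any number of characters that are not ">", then ">".
    // The capture group makes preg_split() keep the tags in the result.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $html; // leave the page alone if the split fails
    }

    foreach ($parts as $i => $part) {
        // Pieces alternate: even indexes are text, odd indexes are tags.
        if ($i % 2 === 0) {
            $parts[$i] = $transform($part);
        }
    }

    // Reassemble in the original order, so the structure is unchanged.
    return implode('', $parts);
}
```

Passing a highlighting function as $transform would mark up a searched-for “div” in the body text while leaving every <div> tag and attribute value exactly as it was.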
Considerations for dynamically generated pages
So far we have concentrated on static files, and you may be wondering how the highlighting functionality can be applied to dynamic pages, i.e. those that are not created in full until they are sent to the user-agent. This problem is solved with PHP’s output buffering. Once you call a single function, ob_start (http://www.php.net/manual/en/function.ob-start.php), at the top of your PHP scripts, output is held in a buffer until you choose to output it to the HTTP stream. The ob_start function accepts the name of a callback function as its first argument. As the buffer is about to be output, this function is called with the buffer’s contents passed as a parameter. Whatever the function returns is sent out into the ether to the user-agent. We can use this to modify the buffer by adding our highlighting.
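To see the callback mechanism in isolation, here is a minimal sketch, assuming PHP 7 or later. Both function names (add_marker and render_through) are made up for the example, and the outer buffer exists only so the result can be captured and inspected instead of being sent to the browser.

```php
<?php
// Hypothetical callback: receives the whole buffered page and returns
// whatever should actually be sent. PHP also passes a status flag as a
// second argument, which this function simply ignores.
function add_marker(string $buffer): string
{
    return str_replace('</body>', '<!-- processed --></body>', $buffer);
}

// Hypothetical helper that runs $page through an output-buffer callback.
function render_through(callable $callback, string $page): string
{
    ob_start();            // outer buffer: catches the final output
    ob_start($callback);   // inner buffer: runs $callback when flushed
    echo $page;            // stands in for the page's normal output
    ob_end_flush();        // callback runs here; result goes to outer buffer
    return ob_get_clean(); // collect what would have been sent
}
```

On a real page only the single ob_start call with the callback is needed; PHP flushes the buffer through the callback automatically when the script ends.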
Blimey. That’s enough techie-talk; time for a demonstration. We’ve rigged up a demo search engine: run a search, follow the result, and the resulting page will highlight your search terms.
Adding it to your website
Whether you run a large or small domain, new technology needs to be easily deployed and maintained. There are several ways to include the search engine highlighting function in your PHP code. Here are just two.
The first method depends on how trusting your system administrator is, but if you use the Apache web server, you may be able to add a php_value auto_prepend_file directive to a .htaccess file. This sets PHP’s auto_prepend_file option so that the named file is run at the top of each PHP page Apache serves. So to add the search-engine highlighting functionality to your site, you should add a line like:
php_value auto_prepend_file "/path/to/your/header.inc"
The header.inc file should contain the following code:
<?php include('/absolute/path/to/sehl.php'); ob_start('sehl'); ?>
Notice that ob_start() is called here with one parameter, a callback function named sehl (an abbreviation for “search engine highlight”). This is the function that will be called when the buffer is automatically flushed. The PHP include statement includes sehl.php, which contains the sehl function. Once you’ve finished this minor fiddling you’re good to go. It’s important to note that Apache’s .htaccess file is a complex beastie, so if you want to know more you should read Apache’s .htaccess file tutorial.
If you can’t use .htaccess files or you’re getting server errors, you won’t be able to use php_value auto_prepend_file. That’s not a big problem because there is another method you can use to include the highlighting functionality. In each PHP script you want to have search-engine highlighting, simply add a line at the top of the script that includes the header.inc file.
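A minimal sketch of such a line, assuming header.inc lives at the same placeholder path used in the .htaccess example (substitute your real path):

```php
<?php
// Placeholder path, copied from the .htaccess example; replace with your own.
include('/path/to/your/header.inc');
?>
```

This must come before any other output in the script, so that buffering starts before the page does.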
Notes on efficiencies
There are several points to be aware of before adding the search-engine highlighting script to your site. Regular expression matching can use a lot of CPU time. The larger the body of text, the more work the system has to do, which can harm performance. Output buffering adds a small overhead as well: the system has to hold your page in memory, edit it, then send a copy to the user.
Small- to medium-sized sites should not have any need to worry, but large-scale sites with millions of hits would need to evaluate the best possible way to implement this function. In an attempt at optimization, the sehl function will only execute a bare minimum of code if the referrer is not thought to be a search engine. No regular expressions will be used and no words will be highlighted.
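One way to picture that early exit is the sketch below, assuming PHP 7 or later. It is not the sehl function itself: the names looks_like_search_referrer and highlight_callback are hypothetical, and the list of host fragments is an invented, deliberately short example.

```php
<?php
// Hypothetical host fragments; a real list would be longer.
const SEARCH_HOST_HINTS = ['google.', 'yahoo.', 'bing.'];

// Cheap test using plain string functions, no regular expressions.
function looks_like_search_referrer(string $referrer): bool
{
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // empty or unparseable referrer
    }
    $host = strtolower($host);
    foreach (SEARCH_HOST_HINTS as $hint) {
        if (strpos($host, $hint) !== false) {
            return true;
        }
    }
    return false;
}

// Hypothetical output-buffer callback showing where the early exit sits.
function highlight_callback(string $buffer): string
{
    $referrer = $_SERVER['HTTP_REFERER'] ?? '';
    if (!looks_like_search_referrer($referrer)) {
        return $buffer; // cheap path: page goes out untouched
    }
    // The expensive term extraction and highlighting would go here.
    return $buffer;
}
```

For most requests the callback does one URL parse and a few substring checks, then returns the page unchanged.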
Customizing the script
In its current state, the sehl function will add a short explanation to the top of each page it highlights words in.
A nice extension to this would be to add numbered links to each instance of the highlighted words.
These numbered links would be anchors that jump through the page to the highlighted words. It would also be possible to integrate this into your own site’s search engine (e.g. Atomz site search). You already know the search terms the users are interested in, so you can pass those on to other services.
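The numbered-links idea could be sketched like this, assuming PHP 7 or later. It assumes highlights are already marked up as spans whose class starts with “sehl”; that class prefix, the id format, and the function name number_highlights are all invented for the example.

```php
<?php
// Hypothetical sketch: give each highlight span an id and build a list
// of numbered links that jump to them.
// Returns [page with ids added, array of link strings].
function number_highlights(string $html): array
{
    $count = 0;
    $numbered = preg_replace_callback(
        '/<span class="(sehl[^"]*)">/',
        function (array $m) use (&$count): string {
            $count++;
            return '<span class="' . $m[1] . '" id="sehl-hit-' . $count . '">';
        },
        $html
    );

    $links = [];
    for ($i = 1; $i <= $count; $i++) {
        $links[] = '<a href="#sehl-hit-' . $i . '">' . $i . '</a>';
    }
    return [$numbered, $links];
}
```

The links could then be placed in the explanation at the top of the page.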
The current implementation is clever enough to make sure it does not highlight partial matches; that is, it will not highlight “day” inside “today”. It is also case-insensitive, so a search for “day” will result in “Day”, “DAY”, etc. also being highlighted. Both behaviors can easily be changed, to highlight partial matches or to be case-sensitive, by making small changes to the regular expressions.
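To show what those small changes amount to, here is a sketch of a pattern builder, assuming PHP 7 or later. The function term_pattern and its two switches are hypothetical and are not the expressions used in the script; the point is only that whole-word matching comes from the \b word-boundary anchors and case-insensitivity from the i modifier.

```php
<?php
// Hypothetical helper: build the match pattern for one search term.
function term_pattern(string $term, bool $wholeWord = true, bool $ignoreCase = true): string
{
    $body = preg_quote($term, '/');      // escape regex metacharacters
    if ($wholeWord) {
        $body = '\b' . $body . '\b';     // word boundaries: no partial matches
    }
    return '/' . $body . '/' . ($ignoreCase ? 'i' : '');
}
```

With the defaults, “day” does not match inside “today” but does match “DAY”; passing false for $wholeWord allows the partial match, and passing false for $ignoreCase makes the match case-sensitive.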
How to get the script
We expect this to be an ongoing project; you will always find the latest version of the search engine highlight code on Brian’s site. Additionally, A List Apart hosts the version used at the time of writing (zip file, 7.2KB).
There are probably a million and one different ways that the code could be improved (we’ve already started on a fully object-oriented version ourselves), and any comments are welcome. We’ve released this code under the GNU General Public License, so you’re welcome to port the code to other scripting languages and do with it what you will. Enjoy!