Enhance Usability by Highlighting Search Terms

Google’s caching system offers several cool features; one of the most useful is that the words you searched for are highlighted in the page. Most web users don’t read pages carefully; they scan text for what they’re looking for. This is why Google’s cached-page highlighting is so useful. When the page is rendered, users don’t need to read the entire page to find what they came for, because the page shows them where it is. As a quick example, the words highlighted above most likely caught your eye before you actually got to reading them.

Usability heuristics state that users should not have to remember information from one site to the next. Wouldn’t it be great if you could extend search-term highlighting to the pages on your own website any time a visitor came from a search engine? How about also highlighting search terms from your own site’s search tool?

We’ve written a script in PHP that you can add to individual pages or entire websites that will automatically highlight words in your page if the user has followed a link from a search engine results page. You can skip the implementation overview and installation instructions and go straight to the script if you like.
 

Implementation

When someone visits your site from a search engine results page, that results page’s URL is sent on to your site. This is known as the referring URL or referrer (the HTTP specification misspells this as “referer”), and can be accessed via scripting languages such as PHP, Python, and ECMAScript / JavaScript. In that referrer there is a query string (assuming the search engine uses the HTTP “GET” method, as all the search engines we know do), which contains several keys and values. These look something like search.php?q=SEARCH+TERMS+HERE&l=en. With these keys and values, you can determine what terms were used on the search engine that listed your site as a result.
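As a rough illustration of that step, the sketch below pulls the terms out of a referring URL using PHP’s parse_url and parse_str. The function name extract_search_terms and the list of query keys are assumptions made for this example, not taken from the script; real search engines use a variety of key names. In a live page the referrer would typically come from $_SERVER['HTTP_REFERER'].

```php
<?php
// Hypothetical helper: pull search terms out of a referring URL.
// The key names ('q', 'query', 'p') are assumptions; real engines vary.
function extract_search_terms(string $referrer, array $keys = ['q', 'query', 'p']): array
{
    // Grab only the part after the "?" in the URL.
    $query = parse_url($referrer, PHP_URL_QUERY);
    if (!is_string($query) || $query === '') {
        return [];
    }

    // parse_str() splits key=value pairs and decodes "+" and %XX escapes.
    parse_str($query, $params);

    foreach ($keys as $key) {
        if (isset($params[$key]) && is_string($params[$key])) {
            // Split the search phrase on whitespace and drop duplicates.
            $terms = preg_split('/\s+/', trim($params[$key]), -1, PREG_SPLIT_NO_EMPTY);
            return array_values(array_unique($terms));
        }
    }
    return [];
}

// Example using the URL shape shown in the text.
print_r(extract_search_terms('http://example.com/search.php?q=SEARCH+TERMS+HERE&l=en'));
```

Returning an empty array when no known key is present lets the caller skip highlighting altogether.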

The next step is to find all words in your page that match those that the user searched for on the search engine. Once you have a complete list of terms from the referrer’s query string, you wrap each instance of a term in a span element with a special class. Using your site’s cascading style sheets, you then highlight these terms using background colors, font weights, or different voices (depending on the target medium) so that they are more apparent to the user. We gave each search term a different class so the terms can be highlighted in different ways (e.g. every mention of “color” is highlighted in yellow, every mention of “coding” is highlighted blue, and so on).

This sounds fairly easy, but there are complications that need to be considered. If the visitor searches for “div,” you don’t want to replace all the <div> tags with <<span class="...">div</span>>. You also don’t want to add span elements inside any attribute values, or you’ll end up with something like <img src="example.png" alt="This is an example <span class="...">image</span>"/>. We need to strip out the tags from the plain text, parse the plain text for search terms and wrap any instances in span tags, and finally put the plain text and the tags back together again, all without changing the original structure or rendering of the page.

We accomplished this using regular expressions, a powerful tool that allows you to match patterns of text (see CPAN for a basic tutorial on using regular expressions). If you want to find an HTML tag you could use PHP’s string searching functions to find every possible combination of tags, but that takes a lot of work; with regular expressions you simply search for patterns.

We use a pattern analogous to saying “look for ‘<’, followed by any run of characters that are not ‘>’, followed by ‘>’”. The HTML file acts as the input string the regular expression tries to match the pattern against. Using this we were able to separate the HTML tags and the plain text. We then take the untagged plain text and add the span tags around search terms, then put back the HTML tags in their original positions. This way any semantic meaning and presentation (visual, aural, or otherwise) is preserved, along with the structure and validity of markup.
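The separate-then-reassemble idea can be sketched in a few lines of PHP. This is not the sehl function itself: the name highlight_terms and the class names hl1, hl2, and so on are invented for the example, and the tag pattern is the simple one described above, so it will be confused by a “>” inside an attribute value, a comment, or a script block.

```php
<?php
// Sketch only: wrap whole-word matches in span elements, touching the
// text between tags and leaving the tags themselves alone.
// The class names "hl1", "hl2", ... are made up for this example.
function highlight_terms(string $html, array $terms): string
{
    // Map each lower-cased term to its own class.
    $classes = [];
    foreach (array_values($terms) as $n => $term) {
        $classes[strtolower($term)] = 'hl' . ($n + 1);
    }
    if ($classes === []) {
        return $html;
    }

    // One pattern that matches any of the terms as a whole word.
    $alternatives = array_map(
        fn($term) => preg_quote((string) $term, '/'),
        array_keys($classes)
    );
    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';

    // Split on anything that looks like a tag. The capture group keeps the
    // tags in the result: odd-numbered parts are tags, even ones are text.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // a tag: leave it untouched
        }
        $parts[$i] = preg_replace_callback(
            $pattern,
            function (array $m) use ($classes): string {
                $class = $classes[strtolower($m[1])];
                return '<span class="' . $class . '">' . $m[1] . '</span>';
            },
            $part
        );
    }

    // Put the tags and the (now highlighted) text back together in order.
    return implode('', $parts);
}

echo highlight_terms('<div class="div">A div today</div>', ['div', 'day']), "\n";
```

Matching all the terms in a single pass means the span tags added for one term are never searched again for the next.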

Considerations for dynamically generated pages

So far we have concentrated on static files, and you may be wondering how the highlighting functionality can be applied to dynamic pages, i.e. those that are not created in full until they are sent to the user-agent. This problem is solved with PHP’s output buffering. By calling a single function, ob_start (http://www.php.net/manual/en/function.ob-start.php), at the top of your PHP scripts, output is held in a buffer until you choose to output it to the HTTP stream. The ob_start function accepts the name of a callback function as its first argument. As the buffer is about to be output, this function is called with the buffer’s contents passed as a parameter. Whatever the function returns is sent out into the ether to the user-agent. We can use this to modify the buffer by adding our highlighting span tags.
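To see the buffering mechanism on its own, here is a minimal sketch with a stand-in callback. The function shout is invented for the example and simply upper-cases the page; the real callback would add the highlighting markup instead.

```php
<?php
// Stand-in callback for illustration only: it upper-cases the buffer.
function shout(string $buffer): string
{
    return strtoupper($buffer);
}

ob_start('shout');      // start buffering; shout() runs when the buffer is flushed
echo "hello, world\n";  // held in the buffer, not sent yet
ob_end_flush();         // passes the buffer through shout(), then outputs the result
```

If the script ends without an explicit flush, PHP flushes the buffer itself, which is why a single ob_start call at the top of a page is enough.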

Blimey. That’s enough techie-talk; time for a demonstration. We’ve rigged up a demo search engine: run a search, follow the result, and the resulting page will highlight your search terms.

Adding it to your website

Whether you run a large or small domain, new technology needs to be easily deployed and maintained. There are several ways to include the search engine highlighting function in your PHP code. Here are just two.

The first method depends on how trusting your system administrator is, but if you use the Apache web server, you may be able to add a php_value auto_prepend_file directive to a .htaccess file. This asks PHP, by way of Apache, to run the named file before every PHP page the server handles. So to add the search-engine highlighting functionality to your site you should add a line like:

php_value auto_prepend_file "/path/to/your/header.inc"

The header.inc file should contain the following code:

<?php
  include('/absolute/path/to/sehl.php');
  ob_start('sehl');
?>

Notice that the ob_start() function takes one parameter, in this case a callback function, sehl (an abbreviation for “search engine highlight”). This is the function that will be called when the buffer is automatically flushed. The PHP include statement includes sehl.php, which contains the sehl function. Once you’ve finished this minor fiddling you’re good to go. It’s important to note that Apache’s .htaccess file is a complex beastie, so if you want to know more you should read Apache’s .htaccess file tutorial.

If you can’t use .htaccess files or you’re getting server errors, you won’t be able to use php_value auto_prepend_file. That’s not a big problem because there is another method you can use to include the highlighting functionality. In each PHP script you want to have search-engine highlighting, simply add a line at the top of the script that includes the header.inc file like so:

include('/path/to/your/header.inc');

Notes on efficiencies

There are several points to be aware of before adding the search-engine highlighting script to your site. Regular expressions are very complex and use lots of computer resources in attempting to match strings. The larger the body of text, the more work the system has to do; this can potentially harm performance. Output buffering requires a small overhead as well — the system has to hold your page in memory, edit it, then send a copy to the user.

Small- to medium-sized sites should not have any need to worry, but large-scale sites with millions of hits would need to evaluate the best possible way to implement this function. In an attempt at optimization, the sehl function will only execute a bare minimum of code if the referrer is not thought to be a search engine. No regular expressions will be used and no words will be highlighted.
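The early-exit idea might look something like the sketch below. Both function names and the list of host fragments are assumptions made for illustration; the script’s own test for a search engine referrer may work differently.

```php
<?php
// Hypothetical guard: decide cheaply whether the referrer looks like a
// search engine before doing any regular-expression work.
// The host fragments listed here are examples, not the script's real list.
function looks_like_search_referrer(
    string $referrer,
    array $hosts = ['google.', 'yahoo.', 'altavista.']
): bool {
    if ($referrer === '') {
        return false;
    }
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }
    foreach ($hosts as $fragment) {
        if (stripos($host, $fragment) !== false) {
            return true;
        }
    }
    return false;
}

// Output-buffer callback shape: return the page untouched in the common case.
function maybe_highlight(string $buffer): string
{
    $referrer = $_SERVER['HTTP_REFERER'] ?? '';
    if (!looks_like_search_referrer($referrer)) {
        return $buffer; // no regular expressions run on this path
    }
    // ... the expensive term extraction and highlighting would go here ...
    return $buffer;
}
```

Only plain string functions run on the fast path, so visitors who did not arrive from a search engine pay almost nothing beyond the buffering itself.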

Customizing the script

In its current state, the sehl function will add a short explanation to the top of each page it highlights words in, like so:

Why are some words highlighted in this page?

This site’s search-engine highlighting feature marks the words you just searched for easy identification.

A nice extension to this would be to add links to each instance of the highlighted words as demonstrated below:

You have just searched for search terms here; there are 6 instances on this page: 1, 2, 3, 4, 5, and 6.

These numbered links would be anchors that jump through the page to the highlighted words. It would also be possible to integrate this into your own site’s search engine (e.g. Atomz site search). You already know the search terms the users are interested in; now you can pass those on to other services.

You have just searched for search terms here; there are 6 instances on this page: 1, 2, 3, 4, 5, 6. Our own search engine has found 34 additional pages that match your search terms.
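One way the numbered links could be built is sketched here. It assumes the highlighted words are already wrapped in spans of the form <span class="hl1">, which is a convention invented for this example, as are the function name number_highlights and the id prefix “hit”.

```php
<?php
// Sketch of the extension: give each highlighted span an id and build the
// numbered jump links. Assumes spans look like <span class="hl1">, a
// made-up convention for this example; the id prefix "hit" is made up too.
function number_highlights(string $html): array
{
    $count = 0;

    // Add a sequential id to every highlight span, in document order.
    $html = preg_replace_callback(
        '/<span class="(hl\d+)">/',
        function (array $m) use (&$count): string {
            $count++;
            return '<span class="' . $m[1] . '" id="hit' . $count . '">';
        },
        $html
    );

    // Build one in-page link per highlighted instance.
    $links = [];
    for ($i = 1; $i <= $count; $i++) {
        $links[] = '<a href="#hit' . $i . '">' . $i . '</a>';
    }

    return [$html, $links];
}

[$page, $links] = number_highlights('<span class="hl1">a</span> b <span class="hl2">c</span>');
echo count($links), ' instances: ', implode(', ', $links), "\n";
```

The count and the link list are what the advisory note at the top of the page would need.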

The current implementation is clever enough to make sure it does not highlight partial matches; that is, it will not highlight “day” inside of “today”. It is also case-insensitive, so a search for “day” will result in “Day”, “DAY”, etc. also being highlighted. These can both be easily changed to highlight partial matches and be case-sensitive respectively by making small changes to the regular expressions.
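The kind of change involved can be shown with three small patterns. These are illustrative only and are not copied from the script: the word-boundary marker \b controls partial matching, and the i modifier controls case sensitivity.

```php
<?php
// Illustrative patterns only; the script's own expressions may differ.
$text = 'Day by day, today is the DAY.';
$term = preg_quote('day', '/');

$whole_word_any_case  = '/\b' . $term . '\b/i'; // the behaviour described above
$partial_any_case     = '/' . $term . '/i';     // also matches inside "today"
$whole_word_same_case = '/\b' . $term . '\b/';  // matches "day" but not "Day" or "DAY"

echo preg_match_all($whole_word_any_case, $text), "\n";  // prints 3
echo preg_match_all($partial_any_case, $text), "\n";     // prints 4
echo preg_match_all($whole_word_same_case, $text), "\n"; // prints 1
```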

How to get the script

We expect this to be an ongoing project; you will always find the latest version of the search engine highlight code on Brian’s site. Additionally, A List Apart hosts the version used at the time of writing (zip file, 7.2KB).

There are probably a million and one different ways that the code could be improved (we’ve already started on a fully object-oriented version ourselves), and any comments are welcome. We’ve released this code under the GNU General Public Licence, so you’re welcome to port the code to other scripting languages and do with it what you will. Enjoy!

About the Author

Matt Riggott

Matt Riggott is an informatician dreaming of a semantic web. At the time of writing he’s living in Edinburgh, Scotland trying to do interesting things with PHP and Python. When not being too geeky he enjoys classical philosophy, bike-riding, running, and pretending he knows what he's doing.

27 Reader Comments

  1. Before we go implementing this everywhere, we should ask if this is really a good thing? Usability is mentioned in the article, and it says that search term highlighting is good, but where’s the proof? What’s the reality?

    A quick informal survey this morning (of web developers and web users both) shows that people I know don’t seem to like search term highlighting. In fact, it distracts attention from the content you’re looking for. Each person I talked to said, at best, they ignore the highlighting because they’re in a different mode of scanning the page once they reach it.

    How do people then know the difference between explicitly highlighted items on the page and the ones that are done by the search term highlighter? Also, if they’re really looking for the terms (that they already know are on the page), why not use the “Find” feature in the browser? That’s what it’s for. This way it’s left to the user.

    If nothing else, maybe it would be better to put a little toggle near the top of the page to turn highlighting on and off — so they could use that rather than the “Find” command when they’re actually interested in finding the terms.

    In any case, I just want to make sure that people consider whether it’s really useful to their users before they implement this. I’d rather not have every site I visit highlighting random words throughout the pages.

  2. Better yet, why not just make the text a different colour rather than highlighting it? It would single out the text still, but would do so much more subtly.

  3. The problem with your naive expression is that it is easily broken by perfectly valid HTML. A few examples:

    What about an image with ALT=”> Comment”?

    What about a comment with a > in it?

    What about pages with scripting that does
    numeric > comparisons ?

    I highly recommend looking at a full-on HTML parser if you want to avoid potential problems.
    http://php-html.sourceforge.net/ is one such.

    You can see the general method (albeit implemented in perl) in the second code listing at this link:

    http://perlmonks.org/index.pl?node_id=370246

  4. I agree with Justin Greer. Check out Firefox’s “find” feature. It makes it easy for those who WANT search term highlighting to access it.

  5. One additional problem is a search utilizing inclusionary or exclusionary syntax (‘+’ or ‘-‘). I pretty much always search using advanced filtering techniques. While this works well for basic searches, it fails to take into account common advanced search strategies.

  6. …but it does seem that using Javascript would be more appropriate. Wouldn’t doing it server side have odd results if a browser is using a proxy?

  7. I agree that a clientside scripted solution would be far better, and easier to implement, in that you don’t have to go dabbling with every bit of text that’s output in a serverside script.

    For semantics, it’d be good to have said script run through the text, and wrap the highlighted terms in <strong> elements, as well as dynamically adding a stylesheet with (ideally) something like:

    strong.highlight {font-weight:inherit}
    strong.highlight:contains(keyword1) { background:#8FF }

    strong.highlight:contains(keyword2) { background:#FF0 }

    -or more realistically than ideally, using a different classname for each highlighted term.

    Using :contains would complicate things anyway, in that you’d have to make sure that the rule for :contains(cat) comes before the one for :contains(category).

  8. Thanks for the comments so far; we’ll take them on board for future releases of the script. Here’s our considered response to some of the questions and criticisms so far:

    “Why not just make the text a different colour rather than highlighting it?”
    That’s up to you and your style sheets. The keyword highlighter may be better described as keyword tagger; it surrounds the words the visitor searched for with a span tag that has a special class. You can use this class to highlight the words any way you want with CSS (without needing to touch the PHP code at all). Whether that is by changing the colour of the text or adding verbal stress to the word for screen readers is up to you. You may even want to change the use of the span element to that of strong or em for a more semantic value; it is your decision.

    “How about doing this with Javascript”
    You could implement this in Javascript (as some have done) or any client-side technology. While we have no qualms about doing this, we decided a server-side method held more benefits. For example, you can guarantee that all user-agents will receive the same output on the server-side, something you cannot say about a client-side script (e.g. clients with no understanding of Javascript will not highlight anything). You can also augment the server-side script in ways you couldn’t client-side, such as integrating it with a site-wide search as we mention in the article.

    “Using Javascript allows you to disable the highlighting without an extra trip to the server”
    Let’s polarise this: instead of using Javascript to highlight the search terms, why not continue to highlight server-side, and add Javascript to the page to allow the added span elements to be removed from the DOM? That would remove the need to make a request to the server, without the main functionality relying on client-side code. Another way would be to add an alternative stylesheet that turns off the highlighting.

    “Your regular expressions for parsing the HTML are naive”
    Implementing a full SGML/XML parser was well beyond the scope of our project, although it may be a good idea for a future version. The main focus of the project was on usability and allowing users to find the information they want faster.

    “One additional problem is a search utilising inclusive or exclusive syntax (‘+’ or ‘-‘)”
    We hadn’t thought about this, so thanks for pointing it out. If a user searches for “food -dog”, the highlighting function would try to highlight the words “food” and “-dog”, rather than “dog”. We’ll put it on the to-do list and fix it as soon as we can.

    “Searching for <! breaks the highlighting function”
    Good catch! We missed that one, but it’s been fixed in version 1.8.1 (available from http://suda.co.uk/projects/SEHL/). The problem was that while the special HTML characters were being properly escaped when being searched for, they weren’t when being displayed in the little advisory at the top of the page. So the highlighter was never broken, but the advisory note at the top was.

    “Why not use Firefox’s find-in-page feature instead?”
    You can, but you can only search for one word or phrase at a time. The highlighting function is available to any visitor without extra effort on their part, and it shows any number of distinct phrases on the page simultaneously.

    “Wouldn’t doing it server side have odd results if a browser is using a proxy?”
    If we understand you correctly the implication here is that the referring URL would be removed by the proxy; thus nothing will be highlighted, the user is none the wiser, and the system degrades gracefully. Can anyone think of any other problems with a client behind a proxy?

    On a final note, we hope you see our code as a seed for future ideas, to allow you to think about how best to provide information to web users, to do more than just provide static information. Remember, our code is released under the GPL, so please feel free to take it, fork it, and make it better! If you have any more comments please let us know. Cheers!

  9. resumé seems to break it. This word was on the test search site, but it just highlights the whole thing!

    Says up the top “Why is resum鼯span> highlighted?

  10. Instead of using auto_prepend php file you can achieve the same thing with Apache 2.0 output filters, and this way you can add search term highlighting to static html files also.

    You need to have mod_ext_filter loaded, and then use this in server level config:

    ExtFilterDefine highlight-search-terms cmd="/path/to/php /path/to/highlight_search_terms.php"

    And use this directive in the configuration block that matches where you want this filter to apply:

    SetOutputFilter highlight-search-terms

    The highlighting script (be it PHP, Python or whatever) should read from stdin, do its thing and output to stdout.

  11. Nice script, but when your search term includes a German “Umlaut” like “ü” (ü), *all* text vanishes!

    Also I got strange results when the searched text includes several s, , and <?php ?> tags, don’t know which is responsible. Then the first block is ignored and only text in the second is highlighted…
  12. Highlighting is nice if the words you were looking for are in an obscure part of the page. However, it quickly becomes annoying when the searched words are very common in the main text. Precisely because those words stand out, it gets very hard to read the rest.

  13. I’d recommend using XML/DOM parser instead of risky and costly regular expressions. And here you go – another advantage of using valid XHTML.
