A List Apart

Issue № 136

Build a Search Engine in PERL

Published in The Server Side

It’s hard to work in the web business nowadays without hearing about Perl. (It’s hard to work in the web business nowadays, period, but that’s another story.) Perl is one of the most popular languages in use today; it is completely free, open source, and supported by an extremely enthusiastic community.

In addition to the core features of the language, numerous modules and scripts are available on the Comprehensive Perl Archive Network (CPAN).

These ready–made scripts include everything from streaming MP3 servers to hooks into news clients, while the modules offer powerful generic functionality for almost any task imaginable. Everything on CPAN is completely free, and is redistributable under the same terms as Perl itself.

Getting started

Although Perl has many powerful built–in features, for simplicity this article uses only low– to mid–level ones. However, you may find it helpful to take a few minutes to brush up on the syntax of the language.

Getting Perl itself

In addition, you may not have Perl on your system.  Here are a few places where it can be found:

For Windows users, two excellent choices are ActiveState Perl and Indigo Perl.  For purposes of this article, however,  Indigo Perl may be a better choice as it comes with a built–in version of Apache that can be used to test CGI scripts.

For Macintosh users (OS 9 and below; OS X comes with Perl built in), the recommended version is MacPerl.

Most Linux/Unix machines should have Perl already installed.

Welcome to CGI

One of the many things that Perl does well is processing text. Perl, therefore, is especially good at processing data from the Internet via CGI.

What is CGI, you ask? CGI stands for Common Gateway Interface; it is a way to receive and process information on a server. Think of it this way: a visitor sends information to the server; the server processes it in some way, then sends it back. CGI has many uses; it can do anything from retrieving information from a database to creating a robot capable of crawling the entire Internet.

Processing information sent to CGI scripts by hand can be very difficult; one must worry about different types of requests, transmission errors, security issues, etc. Luckily, Perl provides a module that will do most of the work for you: the CGI module.

Respect for the module

The CGI module has long been bundled with Perl on every platform (since Perl 5.22 it is no longer part of the core distribution and is installed from CPAN instead).  The CGI module makes it incredibly easy to parse form data, and takes into account many error and security issues that a normal programmer would probably overlook. (Not that I know any normal programmers.)

A CGI script written in Perl should always use the CGI module; in fact, I am not going to even show you how to write one without it.

The CGI module in action

Now, on to a quick tutorial of how to use the basic features of the CGI module.  First of all, we are going to need a basic form to pass the data to our script.  For now, it will include just a simple text box.

Crank up your favorite text editor, and enter the following XHTML markup (you can wrap the markup in your favorite styles or headers if you choose; the only important information that the CGI script needs is contained below):

<form action="/cgi-bin/search.pl" method="post">
<input type="text" name="query" size="50" />
<input type="submit" />
</form>

Here, we’ve created an XHTML form that can send data to a script located at /cgi-bin/search.pl via the post method. The form contains only two elements: a text box named “query,” and a submit button. Any text entered into the text box will be sent to /cgi-bin/search.pl when the visitor clicks the submit button.

Into the script

Now that we have our preliminary information down, let’s delve on into the CGI script.

Our first line will be the path to Perl: the place where the Perl executable is located on your system. On most Windows and Mac systems, we can abbreviate it as #!perl. Contact your systems administrator for exact details as to what your path to perl will be, and prevent frustration by finding out whether your server uses a .cgi file extension or a .pl file extension.

Next, we need to load our modules and pragmas.  We are going to enable warnings mode and the strict pragma.  (By enabling warnings and using strict, we will be forced to write clean, safe code. In addition, the Perl interpreter will give more descriptive error messages and help us catch typos as well.) Finally, we need to load the CGI module.

Thus, so far we have:

#!perl -w
use strict;
use CGI qw(:standard);

The first line tells the script where the Perl interpreter is located, and also adds the flag to turn warnings mode on (the “w” is for warnings, get it?).  Next, we use the strict pragma.  Finally, we load the CGI module, using the standard functional interface.

Now, we need to retrieve the form data.  We can do this with the help of CGI’s param function, which becomes available when we use the module:

my $query = param("query");

The param function is available after the CGI module is loaded.  It takes one argument: the name of the query parameter that you want. If you ask for a parameter that doesn’t exist, it returns an undefined value.  Simple, right?

In this instance, “query,” the element we are looking for, does exist, so that value is returned.  This return value is placed in a variable named $query.  It is always a good idea to give form elements and variables in your Perl script similar names; it will help prevent confusion in large applications.

Finally, we want to print this value to the browser so that we can see that the script is working. Before we do that, however, we need to print a proper content–type header to the browser so that it knows that we are passing html, and not an image, movie, applet, etc.

Luckily for us, we don’t have to consult a manual to find out the exact content–type header that we need; we can simply use CGI’s header function to find it out for us. Header is similar to param: it becomes available when the CGI module is loaded. In addition,  we are going to employ the start_html() and end_html() functions (also provided by the CGI module). They’ll output a standard HTML skeleton so we don’t have to.

#!perl -w
use strict;
use CGI qw(:standard);
my $query = param("query");
print header();
print start_html();
print $query;
print end_html();
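One caution about printing $query exactly as it arrived: any markup the visitor types is sent straight back to the browser. Here is a minimal sketch of escaping the value first. The escape_html helper and the sample value are made up for illustration and are not part of the script above; the CGI module also offers its own escapeHTML function for the same job.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the article's script): replaces the
# characters that have special meaning in HTML with entities.
sub escape_html {
    my ($text) = @_;
    return '' unless defined $text;   # a missing parameter becomes an empty string
    $text =~ s/&/&amp;/g;             # ampersand first, so later entities stay intact
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

my $query = '<b>perl</b> & "cgi"';    # stand-in for param("query")
print escape_html($query), "\n";
```

In the script, the only change would be to print the escaped value instead of the raw one.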

“That’s all well and good,” I hear you say, “but not very useful.”  True. Let’s get into something more interesting: a simple search engine.

Building the search engine

“Slow down, coach,” some of you may be grumbling.  “I’m a web designer, not a programmer. It took me some work just to get a handle on the DOM.  I thought this was supposed to be an introduction!”  Well, calm down; even a little programming experience is enough for what we are about to attempt.

We are going to build a quick ’n easy search engine for your site; we won’t be going anywhere near the complexity of a major search engine like Google or Yahoo.  In fact, we won’t need to do anything more complex than write a few calls to functions already provided for us by modules.

Finding the files

In order for us to search the website, we are going to have to recursively crawl through the directories and open the correct types of files.  Normally, this would be an extremely painful task to code, but Perl provides us with the File::Find module so that we won’t have to do it the hard way.

The File::Find module exports a function called find(), which recursively crawls through directories.  find() takes two arguments: a subroutine (a list of instructions of what to do to each file), and a starting directory.  From the starting directory, it will visit each file in the directory and its sub–directories, calling our subroutine once per file and making information about that file (such as its name and path) available to us.
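To watch the crawl in isolation, here is a small sketch that builds a throwaway directory tree and collects every file name it visits. The directory layout and file names are invented for the example; File::Temp is used only so the sketch is safe to run anywhere.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a temporary directory tree; it is deleted when the script exits.
my $root = tempdir(CLEANUP => 1);
mkdir "$root/docs" or die "mkdir failed: $!";
for my $path ("$root/index.html", "$root/docs/about.html") {
    open(my $fh, '>', $path) or die "Cannot create $path: $!";
    print $fh "<html></html>\n";
    close($fh);
}

my @seen;
find( sub {
    # $_ holds the bare file name; $File::Find::name holds the full path.
    push @seen, $_ if -f $_;
}, $root);

print join(", ", sort @seen), "\n";
```

The subroutine is called for directories as well as files, which is why the sketch keeps only the entries that pass the -f (plain file) test.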

Let’s initialize our script:

#!perl -w
use strict;
use CGI qw(:standard);
use File::Find;
my $query = param("query");
print header();
print start_html();

You’ll notice that $query is still there, as well as header() and start_html(). You’ll also notice that we used the File::Find module similarly to the way we used CGI. Next, let’s output a document title:

print "\n<h1>For the query $query, these results were found:</h1>\n<ol>\n";

(Nothing major, just a title.)

Next, we move on to the search process.  We are going to use the find function:

find( sub
{},
  '/home/username/public_html');

The first argument to find() is a reference to a subroutine (in this case, it’s an inline anonymous subroutine).  The second argument is the starting directory.  On most Apache servers, this will be /home/username/public_html, but check on your server to see what it is called.

Next, we are going to define our subroutine. First off, we will want to parse out any files that begin with a period (such as an .htaccess file—not something we want showing up in a search). Also, we will only want to search through files with an .html extension, so we need to parse out everything except them.

find( sub
{
 return if ($_ =~ /^\./);
 return unless ($_ =~ /\.html/i);
},
  '/home/username/public_html');

Since find is recursive, having the function return nothing is equivalent to “move on to the next file.”   Next, we define two regular expressions (or regex, for short).  The first checks to see if the filename begins (^) with a literal period (.), and returns if it does.  The next checks to see if the filename contains a literal period (.) followed by “html,” and to return unless those conditions are met.  The “i” modifier at the end will make the regex case–insensitive.  It is also worth noting that find() puts the current filename in the default variable ($_), which is what we match against.
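If you would like to see the two patterns at work without touching the disk, here is a quick sketch that runs them over a handful of made–up file names (the names are purely illustrative):

```perl
use strict;
use warnings;

# Made-up file names, used only to exercise the two patterns.
my @names = ('.htaccess', 'index.html', 'ABOUT.HTML', 'logo.png', 'notes.txt');

my @kept;
for (@names) {
    next if ($_ =~ /^\./);          # skip names that begin with a period
    next unless ($_ =~ /\.html/i);  # keep only names containing ".html"
    push @kept, $_;
}

print join(", ", @kept), "\n";
```

Note that the second pattern only asks that “.html” appear somewhere in the name, so a file such as page.html.bak would also pass; adding a $ anchor at the end of the pattern would restrict it to names that end in “.html”.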

Testing the files

Next, we need to perform some checks on the file.  File::Find makes the full file name available with $File::Find::name, so we are going to stat that.  Perl also provides a set of file test operators; the two we will use are -d and -r.  -d checks to see if the current file is actually a directory, while -r makes sure the file is readable.  (When given no argument, as here, both operators test the file named in $_.)

find( sub
{
 return if ($_ =~ /^\./);
 return unless ($_ =~ /\.html/i);
 stat $File::Find::name;
 return if -d;
 return unless -r;
},
  '/home/username/public_html');
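The file tests can also be tried on their own, with an explicit argument instead of $_. This sketch creates a temporary directory and file (both invented for the example) and records what -d and -r report for each:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A temporary directory holding one small file; removed on exit.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/page.html";
open(my $fh, '>', $file) or die "Cannot create $file: $!";
print $fh "hello\n";
close($fh);

# Each test returns true or false; store the answers as 1 or 0.
my $dir_is_directory  = -d $dir  ? 1 : 0;
my $file_is_directory = -d $file ? 1 : 0;
my $file_is_readable  = -r $file ? 1 : 0;

print "directory test on the directory: $dir_is_directory\n";
print "directory test on the file:      $file_is_directory\n";
print "readable test on the file:       $file_is_readable\n";
```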

Searching the files

Next, we are going to see if the file contains the terms we are searching for. To do that, we need to open the file and put its contents into a string.  However, since Perl reads a file one line at a time under the default input record separator (a newline), we are going to have to undefine the input record separator in order to slurp the whole file up as a string. (If what I’ve just said confuses you, relax and swipe this code):

undef $/;
find( sub
{
 return if ($_ =~ /^\./);
 return unless ($_ =~ /\.html/i);
 stat $File::Find::name;
 return if -d;
 return unless -r;
 open(FILE, "< $File::Find::name") or return;
 my $string = <FILE>;
 close (FILE);
},
  '/home/username/public_html');

The less–than symbol (<) before the file name is a security measure to ensure that the file is only opened for reading, so that no system commands accidentally get executed if the filename contains odd symbols (such as a pipe (|)). The “or return” at the end is an extra measure in case the file was not opened correctly.
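For the curious, here is a variation on the slurping step, offered as a sketch rather than a replacement for the code above. It swaps in two different techniques: “local $/”, which undefines the separator only inside one subroutine instead of for the whole script, and the three–argument form of open with a lexical filehandle, which keeps the mode separate from the file name. The slurp_file helper and the sample file are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small sample file to read back; it is removed on exit.
my ($out, $path) = tempfile(UNLINK => 1);
print $out "line one\nline two\n";
close($out);

# Hypothetical helper: returns the whole file as one string,
# or nothing if the file cannot be opened.
sub slurp_file {
    my ($file) = @_;
    open(my $in, '<', $file) or return;
    local $/;                 # separator is undefined only until we return
    my $contents = <$in>;     # with no separator, one read takes everything
    close($in);
    return $contents;
}

my $string = slurp_file($path);
print length($string), " characters read\n";
```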

Next, let’s check to see if the file contains our search string:

undef $/;
find( sub
{
 return if ($_ =~ /^\./);
 return unless ($_ =~ /\.html/i);
 stat $File::Find::name;
 return if -d;
 return unless -r;
 open(FILE, "< $File::Find::name") or return;
 my $string = <FILE>;
 close (FILE);
 return unless ($string =~ /\Q$query\E/i);
},
  '/home/username/public_html');

A simple regex (regular expression, remember?) is used to determine if $query is within $string, which holds the contents of our file.  The \Q and \E are special regex escape sequences that make any unsafe special characters between them safe for matching by our regex.
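To see why the quoting matters, consider a visitor who searches for text containing a parenthesis. The sketch below uses an invented sample string and query; quotemeta() is Perl’s built–in function form of \Q, shown here only to reveal what the regex engine actually receives.

```perl
use strict;
use warnings;

my $text  = 'Download perl-5.8.tar.gz (about 10MB) today';
my $query = '(about';   # an unbalanced parenthesis would be a syntax error in a raw regex

# With \Q...\E the parenthesis is matched literally instead of opening a group.
my $found = ($text =~ /\Q$query\E/i) ? 1 : 0;

# quotemeta() applies the same escaping and returns the result as a string.
my $escaped = quotemeta($query);

print "found: $found, escaped: $escaped\n";
```

Without the \Q and \E, the same query would make the pattern fail to compile and the script would die.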

Displaying the results

Thus far, we know whether the file matched or not.  However, before we print our link to it, we will need some additional information: more precisely, a title for the link.

First, we will create a new variable named $page_title, and default its value to the current file name.  However, we can try to be more specific; if the page is written in (X)HTML, it will have a title, which we can capture with another of those regex functions you’ll grow to know and love:

undef $/;
find( sub
{
 return if($_ =~ /^\./);
 return unless($_ =~ /\.html/i);
 stat $File::Find::name;
 return if -d;
 return unless -r;
 open(FILE, "< $File::Find::name") or return;
 my $string = <FILE>;
 close (FILE);
 return unless ($string =~ /\Q$query\E/i);
 my $page_title = $_;
 if ($string =~ /<title>(.*?)<\/title>/is)
 {
     $page_title = $1;
 }
},
'/home/username/public_html');

The results of the match will be contained in the special variable $1 if the match occurs, and $page_title will be assigned its results.  If there wasn’t a match, $page_title is still equal to the current file name, so the link will have a title no matter what.
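The fallback is easy to try on its own. This sketch wraps the same logic in a helper named page_title_for (a name invented for the example) and feeds it two made–up pages, one with a title element and one without:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the fallback logic described above.
sub page_title_for {
    my ($html, $file_name) = @_;
    my $page_title = $file_name;                  # default: the file name
    if ($html =~ /<title>(.*?)<\/title>/is) {
        $page_title = $1;                         # text captured between the tags
    }
    return $page_title;
}

my $with_title = page_title_for(
    "<html><head><TITLE>About Us</TITLE></head></html>", "about.html");
my $without_title = page_title_for(
    "<html><body>No head here</body></html>", "plain.html");

print "$with_title\n$without_title\n";
```

The “i” modifier lets the pattern match upper–case TITLE tags, and the “s” modifier lets the captured text span more than one line.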

Finally, it’s time to output our link:

undef $/;
find( sub
{
 return if($_ =~ /^\./);
 return unless($_ =~ /\.html/i);
 stat $File::Find::name;
 return if -d;
 return unless -r;
 open(FILE, "< $File::Find::name") or return;
 my $string = <FILE>;
 close (FILE);
 return unless ($string =~ /\Q$query\E/i);
 my $page_title = $_;
 if ($string =~ /<title>(.*?)<\/title>/is)
 {
     $page_title = $1;
 }
 print "<li><a href=\"$File::Find::name\">$page_title</a></li>\n";
},
'/home/username/public_html');
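One practical wrinkle: $File::Find::name is a path on the server’s disk, which is not necessarily the address a browser would use to reach the page. Here is a rough sketch of turning one into the other, assuming the site is served from the top of the starting directory. The path_to_url helper and the $document_root variable are invented for illustration; adjust the logic to match how your own server maps directories to URLs.

```perl
use strict;
use warnings;

# Assumed value for illustration; substitute your own starting directory.
my $document_root = '/home/username/public_html';

# Hypothetical helper: turns a path under the document root into a
# site-relative URL, or returns nothing if the path lies elsewhere.
sub path_to_url {
    my ($path) = @_;
    return unless index($path, "$document_root/") == 0;
    return substr($path, length($document_root));
}

my $url = path_to_url('/home/username/public_html/docs/about.html');
print "$url\n";
```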

With our completed find function in hand, we finish out the document with end_html, and come up with the following, all in only 30 lines of Perl:

#!perl -w
use strict;
use File::Find;
use CGI qw(:standard);
my $query = param("query");
print header();
print start_html();
print "\n<h1>For the query $query, these results were found:</h1>\n<ol>\n";
undef $/;
find( sub
{
 return if($_ =~ /^\./);
 return unless($_ =~ /\.html/i);
 stat $File::Find::name;
 return if -d;
 return unless -r;
 open(FILE, "< $File::Find::name") or return;
 my $string = <FILE>;
 close (FILE);
 return unless ($string =~ /\Q$query\E/i);
 my $page_title = $_;
 if ($string =~ /<title>(.*?)<\/title>/is)
 {
     $page_title = $1;
 }
 print "<li><a href=\"$File::Find::name\">$page_title</a></li>\n";
},
'/home/username/public_html');
print "</ol>\n";
print end_html();

End

Perl is powerful and, as programming languages go, fairly straightforward once you learn a few terms and overcome a few fears. And it works in every browser since the Stone Age. Happy programming!
