How to Succeed With URLs

If you’re building or maintaining a dynamic website, you may have considered the problem of how to get rid of unfriendly URLs. You
might also have read Bill Humphries’s ALA article on the topic, which presents one (very good) solution to this problem.


The main difference between Bill Humphries’s article and the
solution I will present here is that I decided to do the actual
URL transformations with a PHP script, whereas his solution uses
regular expressions in an .htaccess file.

If you prefer working with PHP instead of using regular
expressions, and if you want to integrate your solution with your
dynamic PHP sites, this might be the right method for you.

Why worry about URLs?

Good URLs should have a form like /products/cars/bmw/z8/ or
/articles/january.htm and not something like index.php?id=12. But the latter is the kind of URL most publishing systems generate. Are we stuck with bad URLs? No.

The idea is to create “virtual” URLs that look nice and can be indexed
by bots, provided your own links use them too. The URLs for your
dynamic content can have any form you like, while static content
(that might also be on your server) can still be reached by its
regular URL.

When I built my new site, I was looking for a way to keep my URLs
friendly by following these steps:

  1. A user enters a URL like www.mycars.com/cars/bmw/z8/
  2. The code checks to see if the entered URL maps to an existing static HTML file
  3. If yes, the file is loaded; if no, step 4 is executed
  4. The URL string is used to check if there is dynamic content corresponding to the entered URL (e.g. in a database).
  5. If yes, the article will be displayed
  6. If no, an Error 404 or a custom error message will be displayed.
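
The steps above can be sketched as one small decision function. This is only an illustration: classify_request() and its two callback parameters are hypothetical names that stand in for the file-system check and the database lookup described later in the article.

```php
<?php
// Hypothetical helper: decides which of the steps above applies.
// $static_exists and $dynamic_exists are callables supplied by the caller;
// they stand in for the file-system check and the database lookup.
function classify_request(string $path, callable $static_exists, callable $dynamic_exists): string
{
    if ($static_exists($path)) {
        return 'static';    // steps 2-3: an existing file is loaded
    }
    if ($dynamic_exists($path)) {
        return 'dynamic';   // steps 4-5: the stored article is displayed
    }
    return 'not_found';     // step 6: answer with a 404
}
```

Keeping the two checks behind callbacks makes the order of the steps easy to read and easy to try out without a web server.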

A collection of tools

This article will provide you with all the information necessary
to implement this solution, but it’s more a collection of tools
than a complete step-by-step guide to a finished solution. Before you start, make sure you have the following:

  1. Apache with mod_rewrite enabled and .htaccess files permitted
  2. PHP (and a basic understanding of PHP programming)
  3. a database like MySQL (optional)

The index takes it all

After browsing the web and checking some forums, I found the
following solution to be the most powerful: All requests (with
some important exceptions – see below) for the server will be
redirected to a single PHP script, which will handle the
requested URL and decide which content to load, if any.

This redirection is done using a file named .htaccess that
contains the following commands:

RewriteEngine on
RewriteRule !\.(gif|jpg|png|css)$ /your_web_root/index.php

The first line switches the rewrite engine (mod_rewrite) on.  The
second line redirects all requests to a file index.php EXCEPT
for requests for image files or CSS files.

(You will need to enter the path to your web-root directory
instead of “your_web_root”. Important: This is something like
“/home/web/” rather than something like
“http://www.mydomain.com”.)

You can put the .htaccess file either in your root directory or
in a sub-directory, but if you put the file in a sub-directory,
only requests for files and directories “below” this particular
directory will be affected.

The magic inside index.php

Now that we’ve redirected all requests to index.php, we need to
decide how to deal with them.

Have a look at the following PHP code; explanations follow below.

<?php
//register_globals was removed in PHP 5.4, so copy the
//values this script relies on from $_SERVER first.
$DOCUMENT_ROOT=$_SERVER['DOCUMENT_ROOT'];
$REQUEST_URI=$_SERVER['REQUEST_URI'];
$SCRIPT_FILENAME=$_SERVER['SCRIPT_FILENAME'];

//1. check to see if a "real" file exists..
if(file_exists($DOCUMENT_ROOT.$REQUEST_URI)
and ($SCRIPT_FILENAME!=$DOCUMENT_ROOT.$REQUEST_URI)
and ($REQUEST_URI!="/")){
$url=$REQUEST_URI;
include($DOCUMENT_ROOT.$url);
exit();
}

//2. if not, go ahead and check for dynamic content.
$url=strip_tags($REQUEST_URI);
$url_array=explode("/",$url);
array_shift($url_array); //the first one is empty anyway

//no usable entry left: we got a request for the index
if(empty($url_array) or $url_array[0]==""){
include("includes/inc_index.php");
exit();
}

//Look if anything in the Database matches the request
//This is an empty prototype. Insert your solution here.
if(check_db($url_array)==true){
do_some_stuff(); output_some_content();
exit();
}else{
//3. nothing in DB either: Error 404!
header("HTTP/1.1 404 Not Found");
exit();
}

Step 1: check to see if a “real” file exists

First we want to see if an existing file matches the request.
(This might be a static HTML file but also a PHP or CGI script.)
If there is such a file, we just include it.

In the first condition of the if statement, we check to see if a
corresponding file is in the directory tree using $DOCUMENT_ROOT
and $REQUEST_URI. If a request is something like
www.mycars.com/bmw/z8/, then $REQUEST_URI contains /bmw/z8/.
$DOCUMENT_ROOT is a variable set by the server that contains your
document root, the directory where your web files are located.

The second condition is very important: we check that
the request was not for the file index.php itself. If it were,
and we just went ahead, it would lead to an endless loop!

The third condition covers another special case: a REQUEST_URI that
contains a “/” only, which would also be a request for the
actual index file. If you don’t do this check, it will lead to a
PHP error. (We will deal with this case later on.)

If a request passes all these checks, we load the file using
include() and stop the execution of index.php using exit().

Step 2: check for dynamic content

First, we transform the $REQUEST_URI to an array which is easier
to handle:

We use strip_tags() to remove HTML or JavaScript tags from the
request string (basic hack protection), and then use explode() to
split the $REQUEST_URI at the slashes (“/”). Finally, using
array_shift(), we remove the first array entry because it’s
always empty. (Why? Because $REQUEST_URI always starts with a
“/”.)

All the elements of the request string are now stored in
$url_array. If the request was for www.mycars.com/bmw/z8/, then
$url_array[0] contains “bmw” and $url_array[1] contains “z8.”
There is also a third entry $url_array[2] which is empty – if
the user did not forget the trailing slash.

How you deal with this third entry depends on what you want to
do; just do whatever fits your needs.
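
One possible way to deal with the empty entries is to throw them away, so that the array looks the same with and without a trailing slash. The sketch below assumes you want exactly that. url_segments() is a hypothetical helper, not part of the script above, and it also cuts off anything after a “?”.

```php
<?php
// Hypothetical helper (not part of the original script): turns a
// request URI into a clean, zero-indexed list of path segments.
function url_segments(string $request_uri): array
{
    // Keep only the part before "?" so a query never ends up in a segment.
    $path = explode('?', $request_uri, 2)[0];

    // Split at the slashes and discard empty entries
    // (leading slash, trailing slash, doubled slashes).
    $segments = array_filter(explode('/', $path), function ($segment) {
        return $segment !== '';
    });

    // array_filter() keeps the old keys, so renumber from 0.
    return array_values($segments);
}
```

With this helper, /bmw/z8/ and /bmw/z8 both give an array holding “bmw” and “z8”, and a request for “/” gives an empty array.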

What if $url_array contains no usable entries? You may have realized
that this corresponds to the case of the $REQUEST_URI containing only a
slash (“/”), which I mentioned above.

This is the case when the
request is for the index file (www.mycars.com or
www.mycars.com/). My solution is to just include the content for
the mainpage, but you could also load an entry from a database.

Any other request is now ready to use. At this point your
creativity comes into play: now you can use the URL elements to
load your dynamic content. You could, for example, check your
database for content that matches the request; this is
sketched in pseudo code in the check_db() block of the listing above.

Suppose you have a string like /articles/january.htm. In this
case, $url_array[0] contains “articles” and $url_array[1]
contains “january.htm.” If you store your articles in a table
“articles” that includes a column “month,” your code could lead
to a query like this:

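$url_array[1]=str_replace(".htm","",$url_array[1]);
//removes .htm from the url (str_replace returns the new string)
//escape or whitelist both values before running this query
$query="SELECT * FROM $url_array[0] WHERE
month='$url_array[1]'";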
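
Because both parts of this query come straight from the URL, it is worth checking them before they reach the database. The following is a minimal sketch under stated assumptions: build_article_query() and the $allowed_tables whitelist are hypothetical, and the “month” column is taken from the example above. The table name is only accepted if it is on the whitelist, and the month is passed as a parameter instead of being pasted into the SQL string.

```php
<?php
// Hypothetical helper: maps URL segments to a parameterized query.
// Returns null when the request does not name a known table.
function build_article_query(array $url_array, array $allowed_tables): ?array
{
    if (count($url_array) < 2 || !in_array($url_array[0], $allowed_tables, true)) {
        return null; // unknown section: let the caller send a 404
    }

    // Remove ".htm" only at the end of the segment.
    $month = preg_replace('/\.htm$/', '', $url_array[1]);

    return [
        // Safe to interpolate: the table name was matched against the whitelist.
        'sql'    => "SELECT * FROM {$url_array[0]} WHERE month = ?",
        // Bound by the database driver, never pasted into the SQL text.
        'params' => [$month],
    ];
}
```

If you use PDO, and assuming $pdo is your open connection, the result can be run with $stmt = $pdo->prepare($q['sql']); followed by $stmt->execute($q['params']);.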

You could also transform the $url_array and call a script, much
as Bill Humphries suggests in his article. (You need to call the
script via the include() function.)

Step 3: nothing found

The last step deals with the case that we neither found a
matching static file in step one, nor did we find dynamic content
matching the request – that means that we have to output an
Error 404. In PHP this is done using the header() function. (You
can see the syntax to output the 404 above.)

Beware of hackers

One part of this procedure creates a few vulnerabilities. In step
one, when you check for an existing file, you actually access the
file system of your server.

Usually, requests from the web should have very limited rights,
but this depends on how carefully your server is set up. If
someone entered ../../../ or something like
/.a_dangerous_script, this could allow them to access
directories outside your web root or execute scripts on your
server. It’s usually not that easy, but be sure to check for
these possible vulnerabilities.
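
One way to guard the file check in step one against ../ requests is to resolve the path first and make sure it still lies inside the web root. This is a sketch, not a complete defense. resolve_inside_root() is a hypothetical helper, and it relies on PHP’s realpath(), which returns false for paths that do not exist.

```php
<?php
// Hypothetical helper: returns the real path of the requested file if it
// is a regular file inside $document_root, or null otherwise.
function resolve_inside_root(string $document_root, string $request_path): ?string
{
    $root = realpath($document_root);
    $file = realpath($document_root . '/' . ltrim($request_path, '/'));

    if ($root === false || $file === false) {
        return null; // the root or the requested file does not exist
    }

    // The resolved path must begin with the root plus a directory separator.
    $prefix = $root . DIRECTORY_SEPARATOR;
    if (strncmp($file, $prefix, strlen($prefix)) !== 0) {
        return null; // "../" or a symlink led outside the web root
    }

    // Directories are rejected; only regular files may be included.
    return is_file($file) ? $file : null;
}
```

In index.php you would include the returned path only when the result is not null, and carry on with step two otherwise.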

It’s a good idea to strip HTML, JavaScript (and maybe SQL) tags
from the querystring; HTML and Javascript tags can easily be
removed using strip_tags(). Another wise thing to do is limit the
length of the query string, which you could do with this code:

if(strlen($REQUEST_URI)>100){
header("HTTP/1.1 404 Not Found"); exit;
}

If somebody enters a request of more than 100 characters, a 404
is returned and the script execution is stopped. You can just add
these (and other security-related functions) at the beginning of
the script.
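
As one example of such an additional check, the sketch below combines the length limit with a list of allowed characters. The 100-character limit comes from the text above. The character list and the name request_is_acceptable() are assumptions, so adjust them to the URLs your site really uses. Note that this version also rejects requests that contain a “?”.

```php
<?php
// Hypothetical pre-flight check to run before any other processing.
function request_is_acceptable(string $request_uri, int $max_length = 100): bool
{
    if (strlen($request_uri) > $max_length) {
        return false; // too long
    }

    // Allow only letters, digits, slash, dot, underscore and hyphen.
    // \A and \z anchor the pattern to the whole string.
    return preg_match('#\A[A-Za-z0-9/._-]*\z#', $request_uri) === 1;
}
```

If the function returns false, the script can send the 404 header and exit, just as in the length check above.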

How to deal with password-protected directories and cgi-bin

After I had implemented the whole thing, I realized that there
was another problem. I have some password-protected directories,
e.g. for my access statistics. When you want to include a file
in one of these directories, it won’t work because the PHP module
runs as a different user that cannot access this directory.

To solve this problem, you need to add some lines to your
.htaccess file, one for each protected directory (in this example
the directory /stats/):

RewriteEngine on
RewriteRule   ^stats/.*$      -                  [L]
RewriteRule !\.(gif|jpg|png|css)$ /your_web_root/index.php

The new rule on the second line excludes all access for /stats/
from our redirection rule. The “-” means that nothing is done
with the request, and the [L] stops execution of the .htaccess if
the rule at this particular line was applied. The original rule
on the third line is applied to all other requests.

I recommend the same solution for your cgi-bin directory or other
directories where scripts that take GET queries reside.


About the Author

Till Quack

An independent developer and engineering student based in Switzerland, Till Quack performs web design and web application development through his company, Quack Internet Solutions. He chooses web standards rather than proprietary formats for his designs, and Open Source solutions like PHP for his web applications.
