One of the most powerful security tools available to web developers is cryptography—essentially a process by which meaningful information is turned into random noise, unreadable except where specifically intended. Many governments tightly regulated cryptographic software until relatively recently. But cryptography is simply applied math, and it’s proved impossible to suppress. A web developer working on an underpowered netbook in his basement now has access to cryptosystems that major governments could only have dreamed of a few decades ago.
It’s tempting to think that a simple web application doesn’t require industrial-strength security because it’s not going to contain any information worth stealing, like credit card numbers. Unfortunately, people’s tendency to reuse passwords across many different systems means that a compromise of the weakest one can often give access to the rest. In 2009, a hacker famously gained access to a number of internal systems at Twitter by compromising the email account of a single administrative assistant.
Just as any artisan must know his tools and materials to achieve excellence, it’s important for web developers to understand what types of cryptography work in specific contexts, even if we don’t need to grasp the details of the math involved. Poorly applied cryptography provides no benefit and might even be dangerous by creating a false sense of security.
There are many different types of cryptosystems, but three broad categories commonly relate to web applications.
You can’t start with a sausage and work backwards to produce the raw ingredients that went into it. In the same way, a cryptographic hash is designed to take in data and mangle it, so that what comes out the other end is unrecognizable—irreversibly so. Described this way, it doesn’t sound like a particularly difficult or useful operation. But well-designed hashes have two properties that make them both complex and useful.
First, a small change in the input should make a drastic change in the output. In other words, changing one character in the data being hashed will change a lot more than one character of the output. Usually, hashes produce an output of fixed (and relatively short) length, regardless of the input size. This means that multiple inputs could theoretically produce the same result, making it impossible to know what the original data was if an attacker only has the hashed result to work with.
Second, a hash will always produce the same output when given the same input. The most obvious application for hashes is in the storage of passwords. A web application doesn’t really need to know a user’s password—it just needs to verify that the person requesting access knows the password. If we hash passwords the same way at the time of creation and login, we only have to store and compare the result of the hash operation to know that the originals are the same. Then, even if a user database is exposed, the attacker doesn’t have anything terribly useful to work with. At least, that’s the conventional wisdom.
In reality, it’s not quite that simple. First, not all hashing algorithms are created equal. The once-popular MD5 algorithm, for example, is now known to be cryptographically weak in general (although it’s still usable for passwords). Second, we know that a lot of people choose the same common passwords. This means that if an attacker knows the hash value of “123456” as produced by a few popular algorithms, he can easily recognize it in a database. Lists of pre-computed hashes are widely available and known in the security industry as rainbow tables.
To counter this weakness, a hash can be “salted.” Given the two properties of good hash algorithms described above, we can simply append a little data of our own to the user’s password and store the hash of that combined text rather than the password itself. This will create a completely different result that’s still easily verifiable.
Compare the following:
sha1('password') => 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8sha1('x5T1password') => e1f9530af9bde38db0eef386c4d22ec2ba10d2fe
In this example, adding four random characters to the beginning of the word changes 95% of the resulting output. The SHA-1 algorithm, developed by the US National Security Agency, is currently the best available hash function that enjoys broad support in most popular programming languages.
Of course, one-way functions have a fairly narrow use. In many cases, information is encrypted only to ensure that it can’t be read outside its intended context. But within that context—an administrative console, for example—the encryption needs to be reversible.
However, the first question an application developer should always ask is: “Do I really need to collect and store this information?” Keeping data collection to an absolute minimum usually contributes to an improved user experience, simplifies development, and is naturally more secure. After all, data that doesn’t exist can’t be stolen or exploited.
But assuming that sensitive information is really crucial to an application’s function, we have to consider how to handle it securely. Reversible encryption falls into two categories. In the first, a single secret “key” is used both for scrambling and unscrambling the data. A key is somewhat like a password, but since keys are more likely to be used by programs than people, they can (and should) be longer and completely random.
With modern algorithms, symmetric encryption has the advantage of being extremely fast. The strong AES algorithm (also known as the Rijndael cipher) was specifically designed for speed and is well supported, with implementations in both database servers and application frameworks. Encrypting and decrypting data in MySQL, for example, is as simple as the following:
INSERT INTO people (pet_name) VALUES (AES_ENCRYPT('Schmoopie','my-secret-key'));SELECT AES_DECRYPT(pet_name, 'my-secret-key') AS pet_name FROM people;
This doesn’t protect the information from exposure if a malicious user gains access to the web application, since it knows how to decrypt the data. It does, however, protect against accidental disclosure in other contexts, like backup files or an attack on the database itself.
Symmetric encryption works well when we only need to access the information in a single context. However, all of its strength lies in the secrecy of the key. This becomes a challenge when we want to move the data from one place to another. If we have to share the key, especially with multiple recipients, it’s no longer a secret.
To meet this need, asymmetric algorithms rely upon a pair of linked keys that are generated with specific properties. When one key encrypts a piece of information, only the corresponding key in the pair can decrypt it. This type of encryption is also called public-key cryptography because often (not always), one key is made public while the other is kept private.
The math that makes this pairing possible is interesting, but what web developers need to understand is when to use it and what protection it provides. We most commonly encounter the technology in SSL (now called TLS) connections. A web server sends its public key to a web browser, which uses it to encrypt data that only the server can decrypt. It can also be used for sending encrypted e-mail.
Compared with symmetric functions, asymmetric ones are slow and require keys that are several times longer to be effective. In TLS connections, the browser and server only use public-key cryptography to establish a temporary symmetric key that they can use to encrypt subsequent communication.
These functions fulfill a crucial role in the modern web experience, however, by allowing us to protect data in transit between an application and its users. The prevalence of open WiFi makes this a very real concern. On open WiFi networks, users broadcast everything they’re doing in a 360-degree radius for a considerable distance. Without encryption, that data can be easily observed by anyone with a laptop. Two high-profile incidents highlighted this risk in 2010. First, Google ran afoul of privacy authorities by accidentally collecting and storing unencrypted WiFi traffic with its Street View cars. What Google did accidentally, others can do on purpose. Later the same year, the Firesheep plugin made headlines by showing just how simple (and damaging) it can be to eavesdrop on an open network.
At minimum, web applications should require TLS connections when transmitting login information. Using them for all traffic is even better.
Understanding the risk
Given the way people use computers today, all of these technologies will only become more important to web developers. The past few years have shown that the risk is not academic. It’s not enough to say “my application doesn’t have a high enough profile to be a target,” because attacks are frequently automated, not targeted. At the same time, we have to understand and educate others about specific risks and how we use technology to address them. Without that education, clients may consider a site or application “secure” simply because it accepts HTTPS connections, not understanding why it’s unable to e-mail a clear-text password to them.
Security can never be absolute. But used properly, modern cryptography provides us with the tools to eliminate the biggest threats. It’s up to us to put them to use.