URL shorteners have been in the news lately, after the Libyan government, who owns the .ly domain that many URL shorteners like bit.ly and ow.ly are built on, has asked vb.ly, the self-proclaimed world’s only sex-positive URL shortener, to take down their site for violation of Libyan moral standards.
So, how do URL shorteners work, exactly?
I always thought I knew how they work. I mean, it can’t be very hard – what do you normally use to turn a long string into a short, associated string? A hash function.
Hash functions are used in all kinds of lookups that have to be done on large values, database indexes being a prime example. In short, if I calculate the md5 hash function on this sentence (without the hash itself, of course!), I would get the value:
Adding a space to the end of the same sentence yields the value:
As you can see, a hash function outputs a unique short value for every unique long value. It is theoretically impossible to calculate a reverse hash function (ie, get the sentence from the value), but it is trivial to, given a sentence, compare its hash value to a previously stored hash value, and verify that the sentence that resulted in the stored hash value was the same as the given one.
After spending some time on this, though, I realized that a hash function is not the best way to approach URL shortening. Why?
No reverse function
A hash function does not have a reverse function. In other words, it is impossible to calculate the input to a hash function, given the output. It’s a one-directional process.
This makes hash functions ideal for certain security related processes, like password input. A user’s browser can calculate a hash of a password the user enters into a system and then send the hash to the server. The server only stores the hash and can determine if the user has entered the right password, without having to know the user’s password and without the user’s password ever having to leave the browser.
In URL shortening this functionality is not ideal though. We want to be able to look up a URL given the short URL, so we need a simple reverse function, calculating the long URL from the short one. Otherwise we have to store the hash with the URL, which defies the purpose of calculating it.
Most widely available hash functions, like MD5, SHA1, SHA256, etc output a value between 32 and 64 hexadecimal digits wide. That is between 128 and 256 bits. This would result in a BASE64 encoding string of between 22 and 43 characters long. This is clearly too long for URL shorteners.
So – what is the solution?
It’s actually quite simple. Instead of using a complex cryptographically evolved hash function, why not let the database do the work for you? It is custom to create database tables with a numeric field that auto-increments, often called index, or something similar. It’s part of good database design to include such a field, to facilitate easy lookup. This field would automatically be set to 1 for the first entry, 2 for the second, etc.
This index can easily be converted to a string in the following way: let’s assume we’re working with the character set a-z, A-Z, and 0-9. That’s a total of 62 characters. Map it as follows:
1 = A
2 = B
26 = Z
27 = a
28 = b
52 = z
53 = 0
54 = 1
62 = 9
63 = AA
64 = AB
This is a simple, easily implementable function, which is also very easy to calculate in reverse and results in the shortest possible string, given the character set.
Posted by Adriaan Pelzer