Building a Mini URL Service – Part 2 – The Algorithm
Part 1 – Part 2
The first order of business is what URL shortening approach should be used to take some very long URL, which in IE7 is limited to 2,083 characters (KB208427) and provide a nice compact link.
The first part of the link (protocol + server + port) is generally controlled by what domain name you can get – for me, my little demo is http://MyMiniUrl.net. The rest of the URL – the path is something in your control.
Doing a search I came across a few approaches but settled on a Base 62 approach that uses a sequence generation maintained in the persistence tier. The idea is to generate a unique sequential number, then convert that numeric to a Base 62 representation. For the sequence generation, I just relied on the DB layer (MySql or MS SQL) to generate this from an identity column.
Once that identity value is generated, it gets run through a Base 62 conversion to a string representation. That string along with the identity (from the DB), the full Long URL, and a cryptographic hash of the Long URL is stored in the DB. The basic DB table schema is as follows (for MySQL):
Column | Type | Description |
urlId | Bigint(20) | Identity column |
miniUrl | Char(12) | Shortened “path” of the URL |
fullUrlHash | Char(32) | Crypto hash of Full URL using MD5 |
fullUrl | Varchar(4096) | Full URL provided |
The indices are as follows:
- Primary – index on urlID
- fullUrlHash – Unique index
- miniUrl – I left this as “not unique” given my persistence pattern starts off with this value as null.
So, when the URL service is asked to create a short URL, it first checks to see if the URL was already generated. To do that the UrlService uses the basic pattern:
- Checks the URL pattern to a matching regular expression (in the config file)
- Generates a MD5 hash of the full URL
- Checks to see if the hash already exists doing a SQL lookup on the hashed value of the full URL
- If exists, just return the existing shortened URL
- If doesn’t exist
- Insert new Long URL, Hash of URL
- Get new identity key
- Convert new identity key to Base 62
- Return short URL using Base 62 representation
Now, for the Base 62 algorithm, I looked around at a few approaches, and Chris had a good post on various approaches as well – Friendly Unique Id generation.
Starting with his code, I also found another approach located here, then finalized on the following:
static int baseNum = 62; private static readonly String baseDigits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"; public static string Base62ToString(long fromValue) { string toValue = fromValue == 0 ? "0" : ""; int mod = 0;while (fromValue != 0) { mod = (int)(fromValue % baseNum); //should be safe toValue = baseDigits.Substring(mod, 1) + toValue; fromValue = fromValue / baseNum; } return toValue;
}
The basic approach is to loop through the source value, grab the remainder, convert that remainder to Base 62 and append it to a string return value, until there’s nothing left.
So, in terms of “string” vs. StringBuilder performance, I also tried using StringBuilder in place of string concatenation, but performance, in a loop of a billion iterations was far better (about 60%) just with simple string concatenation. Now, I’m not too concerned with garbage collection at this point, I just wanted something quick and efficient – and in the end correct.