DB-id’s to “readable” URLs (as in tinyurl, youtube)

Posted: June 18th, 2009 | Author: | Filed under: PHP | Comments Off on DB-id’s to “readable” URLs (as in tinyurl, youtube)

eyetestTiny urls come in handy in many different occations, such as twitter, chat messages and emails. For example, sharing this article with your friends could mean sending them the url http://nostylesheets.com/2009/06/17/db-ids-to-readable-urls-as-in-tinyurl-youtube, which is long. This length could be a problem, e.g. when the link is split over multiple lines in an email: the mail client may not recognize the continued link on the next line. A tiny url can be a solution to this: http://bit.ly/dirNA links to the same article and is much shorter. Several online services are available online, such as bit.ly or tinyurl.com.

This article is a follow-up on a post by Kevin van Zonneveld on how to create these tiny urls yourself using PHP, for example in your CMS or twitter-application. Here, I will provide some information on the theory behind creating a unique (and reversable) mapping from (lengthy) url to tiny url. I will also alter Kevin’s example code to improve for speed.

Having urls or the content behind them in a database table means there is a unique identifier for this content. These id’s are usually (unsigned) integers, i.e. a base-10 number (represented by a succession of characters from a 10 character-alphabet [0-9]). Numbers can be represented by smaller (larger) strings by using a larger (smaller) alphabet, e.g. hex [0-9A-F] (binary [0,1]).

Tiny urls can be created using the integers identifying the content behind the urls. By representing the integers using a larger alphabet. A blog post by Kevin van Zonneveld shows an example on how to undertake such an endeavour. The code below is an adaptation of his function.

The alphabet used by Kevin consists of 62 characters; all the alphanumerics [a-zA-Z0-9]. By adding adding two characters (any two, please feel free to pick) we give ourselves 64 characters. Trivial? I should think not: whereas the other algorithm uses the bcpow() and other mathematics to switch between base-10 and base-62 integer encoding, using base-64, we can use the faster bit-shifting operations since 64 equals 26.

define('ALPHA_ID_BET', 'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ()'); // length to 64 = 2^6
define('ALPHA_ID_BET_BASE', strlen(ALPHA_ID_BET));

function toInt($in){
	$out = 0;
	$len = strlen($in) - 1;
	$iPow = 1 << (6*$len);
	for ($t = 0; $t <= $len; ++$t){
		$out = $out + strpos(ALPHA_ID_BET, $in{$t}) * $iPow;
		$iPow = $iPow << 6;
	}
	return $out;
}
function fromInt($in){
	$out = '';
	$index = ALPHA_ID_BET;
	$iPow = 1 << (6 * ((int) log($in, ALPHA_ID_BET_BASE) + 1));
	while ($iPow > 0){
		$out = $out . $index{$in / $iPow};
		$iPow = $iPow >> 6;
	}
	return $out;
}

This bit-shifting leaves a faster algorithm which, together with some other optimisations (use ++$i rather than $i++ etc), yield an average runtime of about half the original. Granted, I did remove the padding functionality that guarantees a minimal length of the tiny url. The code to compare both implementations:

define('ENOUGH', 1<<20);
$i = 0;
$fStart = microtime(true);
while (++$i < ENOUGH){
	if (toInt(fromInt($i)) !== $i){
		die('Error (no reversable mapping) at '.$i);
	}
}
echo microtime(true) - $fStart;
$i = 0;
$fStart = microtime(true);
while (++$i < ENOUGH){
	if (alphaID(alphaID($i), true) !== $i){
		die('Error (no reversable mapping) at '.$i);
	}
}
echo microtime(true) - $fStart;

Limitations

The adaptation does have a setback however. Bit-shifting works well on integers, not so well on floats. The thing is that as there are no unsigned ints in php, the range of identifiers that can be used in this implementation runs from 0 to 231-1 (depending on your system), unless you use negative ints as id’s :). php converts integers that are too large to float (leaving precission problems when expecting integers; try Kevin’s implementation on 238328) and this kills bit-shifting. For those of you that do not find this appearant: an int is a number (an integer), a float represents a number: it is an IEEE standard. Just moving around its bits will have highly unexpected results when not carefull.

Anyway, good fun for experimenting and usable for (relatively) modest id-spaces.


Comments are closed.