Searchresults with misspelled Names

Posted: June 23rd, 2009 | Author: | Filed under: PHP | Comments Off on Searchresults with misspelled Names

search resultsOften, we like to present our users with the best care possible. Unfortunately, when these users fill in the forms in applications, they tend to make type-o’s. This leaves us to correct them and “interpret” what our dear users meant. Obviously, in the interest of time (but more importantly: work), this task should be automated. As I sometimes think it is really difficult to interpret what other people mean, so may computers. This article is about correcting type-o’s in city names.


In order to correct type-o’s in the city names provided by users, we have used a database table containing all the possible names of systems in the application. It takes two values: the actual name and a cannonical version used to match input against. In this cannonical representation, all non-literals have been stripped:

	
$asCityName = array();	
preg_match_all('/([a-z]+)/', strtolower($sCityName), $asCityName);	
$sCityName = implode('', $asCityName[0]);

When a user provides input, possibly misspelled, e.g. forgot (Amterdam, no s) a letter or swaps letters (Asmterdam), a cannonical representation is created which is matched to its cannonical counterpart in the database. In this matching, we make two assumptions on the intelligence of the user:

  1. (s)he should have at least the first letter right. This is done to make the set of possible matches smaller, improving speed. You could add more assumptions, e.g. the last letter is right etc.
  2. the length of the cannonical string should be within two letters smaller or greater than the match. This implies that a user can add or forget two letters in the name (actually, it means more, e.g. add four letters, forget two etc, but you get my drift).

Next, each possible match is scored against the input. A character by character comparision is made and with each equality a point is added to the score. These comparisons run from the start of the input string as well as from the end. This allows for a max score of twice the cannonical length of the input (for correctly spelled names). Finally, an ordered list [city-name => score] is returned (better score at lower index).

function find($sInput){
	$oDB = new mysqli(...);

	$asInput = array();
	preg_match_all('/([a-z]+)/', strtolower($sInput), $asInput);
	$sInput = implode('', $asInput[0]);

	$iCount = strlen($sInput);
	$oRes = $oDB->query('SELECT * FROM `town`						 WHERE
							`sMatch` LIKE "'.$sInput{0}.'%" AND
							CHAR_LENGTH(sMatch) BETWEEN '.($iCount - 2).' AND '.($iCount + 2));
	$aiScore = array();
	while (list($sTown, $sMatch) = $oRes->fetch_row()){
		$aiScore[$sTown] = 0;
		// sMatch.length not nec. equal to sInput.length
		$iMatchLenght = strlen($sMatch);
		$iLoopSize = min($iCount, $iMatchLenght);
		for ($i = 0; $i < $iLoopSize; ++$i){
			if ($sMatch{$i} === $sInput{$i}){
				++$aiScore[$sTown];
			}
			if ($sMatch{$iMatchLenght - $i} === $sInput{$iCount - $i}){
				++$aiScore[$sTown];
			}
		}
	}
	asort($aiScore, SORT_NUMERIC);
	return array_reverse($aiScore);
}

Final Thoughts

The approach taken is very straightforward. Alterations can easily be made, e.g. by assuming the last letter in a city name to be correct or taking a smaller range in the length of the match in order to lower the number of possible matches. Also this approach requires a db table, meaning the list of possibilities is limited. More elaborate schemes may have users add new words to the table, but the approach will be the same.


Comments are closed.