Fuzzy-Matching of names

Yesterday at work, I had to try to knock together a quick application to parse a database table and pull out records with a matching name.  Simple at first glance, but more complex when you think about the common mis-spellings and abbreviations of names.  For example, Jonathan Miller may quite rightly be represented as Jon Miller, John Miller, Jo Miller and so on.  What I needed was a way to disambiguate these names for matchin purposes.

Conventional wisdom here would suggest the use of soundex patterns, and I agree that there are compelling arguments for this approach.  However, being the contrary soul that I am I decided soundex wasn’t quite good enough and went looking for alternatives.

I came across this blog post outlining a SourceForge project from the Web Intelligence Group at the University of Sheffield.  I took 20 minutes to implement the pre-requisites to use the patterns, and have to say it seems to give me what I want.

I’m using the Jaro Winkler metric to provide the fuzzy matching I’m looking for, and I am also able to give the users a choice of the confidence level of the match.  A confidence level of 1 will only return data that matches exactly. Confidence level 0 would return everything.  A bit of trial andd error showed me that a confidence level of around 0.85 produced the best result for my purposes.  

An example of the query I used would be something like this…

PROCEDURE [dbo].[GetResultsByFuzzyName]
	@FamilyName varchar(35),
	@GivenName varchar(35),
	@DateOfBirth datetime,
	@certaintyLevel int

  DECLARE @cert float;
  SET @cert = CAST(@certaintyLevel as float) /10 
    ,dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) as FamilyNameScore
    ,dbo.JaroWinkler(upper(Givenname),upper(@GivenName)) as GivenNameScore
    dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) >= @cert
    dbo.JaroWinkler(upper(givenname),uppeR(@GivenName))  >= @cert
  order by 
    dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) desc,
    dbo.JaroWinkler(upper(givenname),upper(@GivenName)) desc

I could go further with this and do a bit more analysis with a second (or even third) matching algorithm, but for now I’m getting pretty good results.