Yesterday at work, I had to try to knock together a quick application to parse a database table and pull out records with a matching name. Simple at first glance, but more complex when you think about the common mis-spellings and abbreviations of names. For example, Jonathan Miller may quite rightly be represented as Jon Miller, John Miller, Jo Miller and so on. What I needed was a way to disambiguate these names for matchin purposes.
Conventional wisdom here would suggest the use of soundex patterns, and I agree that there are compelling arguments for this approach. However, being the contrary soul that I am I decided soundex wasn’t quite good enough and went looking for alternatives.
I came across this blog post outlining a SourceForge project from the Web Intelligence Group at the University of Sheffield. I took 20 minutes to implement the pre-requisites to use the patterns, and have to say it seems to give me what I want.
I’m using the Jaro Winkler metric to provide the fuzzy matching I’m looking for, and I am also able to give the users a choice of the confidence level of the match. A confidence level of 1 will only return data that matches exactly. Confidence level 0 would return everything. A bit of trial andd error showed me that a confidence level of around 0.85 produced the best result for my purposes.
An example of the query I used would be something like this…
PROCEDURE [dbo].[GetResultsByFuzzyName]
	@FamilyName varchar(35),
	@GivenName varchar(35),
	@DateOfBirth datetime,
	@certaintyLevel int
AS
BEGIN
  SET NOCOUNT ON;
  DECLARE @cert float;
  SET @cert = CAST(@certaintyLevel as float) /10 
    
	select 
    <myFields>
    ,dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) as FamilyNameScore
    ,dbo.JaroWinkler(upper(Givenname),upper(@GivenName)) as GivenNameScore
  from 
    <myTableName> 
  where
    dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) >= @cert
  and
    dbo.JaroWinkler(upper(givenname),uppeR(@GivenName))  >= @cert
  order by 
    dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) desc,
    dbo.JaroWinkler(upper(givenname),upper(@GivenName)) desc
	
END
I could go further with this and do a bit more analysis with a second (or even third) matching algorithm, but for now I’m getting pretty good results.