Yesterday at work, I had to try to knock together a quick application to parse a database table and pull out records with a matching name. Simple at first glance, but more complex when you think about the common mis-spellings and abbreviations of names. For example, Jonathan Miller may quite rightly be represented as Jon Miller, John Miller, Jo Miller and so on. What I needed was a way to disambiguate these names for matchin purposes.
Conventional wisdom here would suggest the use of soundex patterns, and I agree that there are compelling arguments for this approach. However, being the contrary soul that I am I decided soundex wasn’t quite good enough and went looking for alternatives.
I came across this blog post outlining a SourceForge project from the Web Intelligence Group at the University of Sheffield. I took 20 minutes to implement the pre-requisites to use the patterns, and have to say it seems to give me what I want.
I’m using the Jaro Winkler metric to provide the fuzzy matching I’m looking for, and I am also able to give the users a choice of the confidence level of the match. A confidence level of 1 will only return data that matches exactly. Confidence level 0 would return everything. A bit of trial andd error showed me that a confidence level of around 0.85 produced the best result for my purposes.
An example of the query I used would be something like this…
PROCEDURE [dbo].[GetResultsByFuzzyName] @FamilyName varchar(35), @GivenName varchar(35), @DateOfBirth datetime, @certaintyLevel int AS BEGIN SET NOCOUNT ON; DECLARE @cert float; SET @cert = CAST(@certaintyLevel as float) /10 select <myFields> ,dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) as FamilyNameScore ,dbo.JaroWinkler(upper(Givenname),upper(@GivenName)) as GivenNameScore from <myTableName> where dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) >= @cert and dbo.JaroWinkler(upper(givenname),uppeR(@GivenName)) >= @cert order by dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) desc, dbo.JaroWinkler(upper(givenname),upper(@GivenName)) desc END
I could go further with this and do a bit more analysis with a second (or even third) matching algorithm, but for now I’m getting pretty good results.