Yesterday at work, I had to try to knock together a quick application to parse a database table and pull out records with a matching name. Simple at first glance, but more complex when you think about the common mis-spellings and abbreviations of names. For example, Jonathan Miller may quite rightly be represented as Jon Miller, John Miller, Jo Miller and so on. What I needed was a way to disambiguate these names for matchin purposes.
Conventional wisdom here would suggest the use of soundex patterns, and I agree that there are compelling arguments for this approach. However, being the contrary soul that I am I decided soundex wasn’t quite good enough and went looking for alternatives.
I came across this blog post outlining a SourceForge project from the Web Intelligence Group at the University of Sheffield. I took 20 minutes to implement the pre-requisites to use the patterns, and have to say it seems to give me what I want.
I’m using the Jaro Winkler metric to provide the fuzzy matching I’m looking for, and I am also able to give the users a choice of the confidence level of the match. A confidence level of 1 will only return data that matches exactly. Confidence level 0 would return everything. A bit of trial andd error showed me that a confidence level of around 0.85 produced the best result for my purposes.
An example of the query I used would be something like this…
PROCEDURE [dbo].[GetResultsByFuzzyName]
@FamilyName varchar(35),
@GivenName varchar(35),
@DateOfBirth datetime,
@certaintyLevel int
AS
BEGIN
SET NOCOUNT ON;
DECLARE @cert float;
SET @cert = CAST(@certaintyLevel as float) /10
select
<myFields>
,dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) as FamilyNameScore
,dbo.JaroWinkler(upper(Givenname),upper(@GivenName)) as GivenNameScore
from
<myTableName>
where
dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) >= @cert
and
dbo.JaroWinkler(upper(givenname),uppeR(@GivenName)) >= @cert
order by
dbo.JaroWinkler(upper(familyname),upper(@FamilyName)) desc,
dbo.JaroWinkler(upper(givenname),upper(@GivenName)) desc
END
I could go further with this and do a bit more analysis with a second (or even third) matching algorithm, but for now I’m getting pretty good results.