22 May 2014
What's in a Match

Recently I was faced with a problem that I’m sure has been solved many times before, but curiously, I didn’t find any open source, public-domain solutions. A potentially large group of users is given a form that contain a set of input fields into which they can enter information. The fields include first and last name, gender, email address, occupation, employer, a short personal description, a country, state and city, etc. A searcher is then given a set of fields to optionally enter values into that are used to select a resulting set of matching users.

It’s simply a database query, right? A conjunction of simple queries that qualify which records are returned. Well, yes, except for the simple part.

What Won’t Work

The simple solution is to just do string matching. For each field the searcher enters, it is compared to each user field and if it matches exactly, the user is added to the result set of the search.

This turns out to be based on a terrible assumption: that users all specify things the same way. If you give a user a text field, the likelihood of them making the same meaningful unambiguous entries is just about zero. For instance, if I enter my first name as ‘David’ and the searcher specifies ‘Dave’, that should be a match. But if strings are compared exactly, it will miss.

First names are just once facet of the match. The problem as stated actually contains many types of matches that taken together, should select the intended set of users.

Break It Down Now

Examining each field in the match and determining what method should be used to match the content seems to be the best approach.

Field Field Type Method
Prefix Text Input Variant Match
First Name Text Input Variant Root Match
Middle Initial Text Input Ignore
Last Name Text Input Soundex Match
Suffix Text Input Ignore
Gender Select Exact Match
Email Address Text Input Email Component Match
Occupation Text Input Hierarchical Network Match
Employer Text Input Soundex Match
Description Text Input Keyword Match
Country Select Exact Match
State Select Exact Match
City Select Exact Match

I’m interpreting the problem as: any search field that is empty qualifies that match. If the searcher were to just enter a few fields, the result should be all users who matched on just those fields. In the end, all of the users matching on all of the populated search fields would be returned in the result.

Note the Middle Initial and Suffix fields are ignored in matching. The middle initial is usually unknown to a searcher, and the searcher would likely be able to identify a user in the results from other information. The suffix is potentially more matchable, as it could be used for a generational title (Sr, Jr, III, etc.), an academic degree, an honorary or professional title, or a religious order. Again, the searcher is not likely to know this at the time of the search and is more likely to be able to use the information in the results to identify specific individuals.

1. Exact Match

This is exactly what didn’t work - but it will work for some components. Particularly the components where there are a limited number of choices available to the user, and the matching choices are available to the searcher. If the searcher chooses what the user selected, the match qualifies. Perfect for an html select element.

2. Variant Match

The variant match is slightly indecisive as it may depend on other information entered by the user. The Prefix in a name is an honorific - either a common, formal, professional, religious or historical title. The problem here is that a user who is a physician could enter her prefix as Dr, Ms or Mrs - the first prefix is professional, the second two are common and depend on gender.

So this is a tricky one. While the variants can be built at search time and matched against the user information, often it would be better for the searcher to leave this field blank and qualify any match.

3. Variant Root Match

Here is the first name matching problem. The mechanism needs a table of variants and the roots they’re variants of. When the user enters their information, the root of the first name is found (it may be the name itself - a variant not found is considered a root) and recorded.

When the searcher matches on the first name, the root of their entry is found, and the matching is against roots. That way ‘Dave’ and ‘Davey’ match, since they both root to ‘David’ and the match would qualify.

Note this can go the other way. ‘Al’ may be short for the roots ‘Albert’, ‘Alan’, ‘Alfred’, etc. The searcher may look for a specific root, or just ‘Al’ - and the match will qualify.

4. Soundex Match

Surnames are one of the toughest matches because unless they’re common, they could be spelled any sort of way. Soundex matching, gives one a fighting chance to get close.

The Soundex Algorithm is based on taking a string and producing a code that uses the sound each letter to build a code. Then during the match, the codes are compared. If they match - if they are spelled the same or close - the match will qualify.

If the searcher were on the phone and the user told him his last name, and it was unusual, he might be able to still find it based on it’s phonetic classification. If the searcher knows how the name sounds, they’ll likely get a match.

5. Keyword Match

This one can go to a lot of different levels.

Imagine the user does a three sentence bio in the description field. It’s known that the user owns a speedboat. If the searcher entered the keyword ‘speedboat’, that would qualify the match - if the user happened to mention it in the bio, that is. They might have mentioned it in other terms, so exact matches on keywords may not be enough.

To mitigate this, the keywords could be broken down into sets of synonyms. Speedboat might also imply the keyword ‘boat’, which the user may have included. Of course, this sort of treatment would require a full blown thesaurus in the code, and even then it’s still hit or miss. The user might also go the other way: a Baja 35 Outlaw is a nice speedboat. If they mention the boat by make, no generic match may hit.

The one case where this really makes a difference is when looking for someone whose description has been read before and something striking stands out. A word, a phrase, the way something was said. Here, a keyword match should be enough to qualify the match.

6. Hierarchical Network Match

This one can also go to a lot of different levels.

Consider a user who the occupation ‘Wedding Singer.’ A searcher would likely expect ‘Wedding Singer’, ‘Singer’, and ‘Performer’ to match. These occupations are part of a hierarchy that becomes more abstract toward the root.

This sort of matching relies on encoding real-world knowledge into the network, which doesn’t come cheap. Imagine a network of all the occupations and their hierarchical roots - everything from Neurosurgeon to Park Ranger to Telemarketer - it would be expansive.

The alternative for this is one seen on many trade magazine response cards, asking to check off the job closest to the actual position rather than accepting a free-form input. If this normalization is acceptable, selecting from a list would replace text input.

7. Email Component Match

The email component match is a compound match. The user name may be known, the domain name may be known, the domain may be known, some or all. Generally using soundex on the user name and domain name will get past unusual ones, and an exact match on the domain is straightforward, but with the advent of all the new domains coming into existence, that too may need to be soundex matched.

Given the entire email address was likely entered, the searcher may wish to match on just a part to qualify the match. Multiple part matches will generally reduce the size of the result set.

Takeaways

It’s a rough road to try to do a really good match. The end of the mechanism is to ‘and’ together each part of the search (remember, empty search criteria are an automatic qualify that part of the match) to get the result set.