This project involved multiple departments and was a collaboration between a DBA, a project manager, and myself (the developer).
Current process: We use the Levenshtein Distance formula to detect possible duplicate data in our system. After a user inputs information for a new record, we also use Google's APIs to validate the address and offer a Google-sourced version of it.
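The write-up doesn't name the specific Google API, so as a minimal sketch, here is what such a validation call could look like against Google's Address Validation API, with the API key and address purely illustrative:

```python
# Hypothetical sketch: validating a user-entered address with Google's
# Address Validation API (assumed here; the original doesn't name the API).
# GOOGLE_API_KEY is a placeholder environment variable.
import os
import requests

def validate_address(address_lines: list[str], region_code: str = "US") -> dict:
    """Send an address to Google and return the parsed JSON response."""
    url = "https://addressvalidation.googleapis.com/v1:validateAddress"
    payload = {"address": {"regionCode": region_code, "addressLines": address_lines}}
    resp = requests.post(
        url,
        params={"key": os.environ["GOOGLE_API_KEY"]},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Offer the user Google's formatted version of what they typed.
result = validate_address(["1600 Amphitheatre Pkwy", "Mountain View, CA 94043"])
print(result["result"]["address"]["formattedAddress"])
```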
The Levenshtein Distance formula measures the number of substitutions, insertions, and deletions it takes to transform one string into another. For example, turning 'cat' into 'bats' requires one substitution ('c' to 'b') and one insertion of 's', so the distance is 2.
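For reference, here is the standard dynamic-programming implementation of that formula (this is the textbook algorithm, not our production code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to turn string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("cat", "bats"))  # 2: substitute 'c'->'b', insert 's'
```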
Like many other companies, we used this method to find possible matches and make the choices presented to our users somewhat smart. Our issue was how we measured those matches and ranked them for our users to choose from.
Introducing a Smarter Matching Algorithm: Although the Levenshtein Distance formula was an effective way to find possibly matching data, users were still able to create duplicate records. We opted for tabular comparisons built on two-character (bigram) matches. We also scrubbed the data of common words and removed all spaces, so 'Taylorsville High School' would be reduced to 'Taylorsville'; we don't evaluate 'High School' because common words like those would inflate the match score on their own. Without writing a novel about our process and how we refined it to return smarter matching results, we can now score each candidate as a match percentage, which lets us set thresholds that disable the creation of new data in our system entirely. A sketch of the comparison is below.
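As a rough sketch of that bigram comparison (reading "2 char match" as a Sørensen-Dice-style overlap of two-character slices; the scrub list and threshold here are illustrative placeholders, not our production values):

```python
from collections import Counter

COMMON_WORDS = {"high", "school"}  # illustrative scrub list

def scrub(name: str) -> str:
    """Drop common words and all spaces before comparing."""
    words = [w for w in name.lower().split() if w not in COMMON_WORDS]
    return "".join(words)

def bigrams(s: str) -> list[str]:
    """All overlapping two-character slices of s."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def match_percent(a: str, b: str) -> float:
    """Percentage of shared bigrams between the scrubbed strings."""
    ba, bb = Counter(bigrams(scrub(a))), Counter(bigrams(scrub(b)))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 100.0 * 2 * shared / total

BLOCK_THRESHOLD = 85.0  # placeholder: above this, disable record creation

score = match_percent("Taylorsville High School", "Taylorsville")
print(score)                     # 100.0 after scrubbing common words
print(score >= BLOCK_THRESHOLD)  # True: block the new record
```

Scoring as a percentage rather than a raw edit distance is what makes a hard threshold practical: a distance of 2 means very different things for a 4-character string and a 40-character one, while a percentage is comparable across record lengths.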