The procedures discussed here use various textual similarity metrics and can be used in fault-tolerant search. Most of them express the similarity of two strings as a number between 0 and 1: the value 0 means the strings are completely different, the value 1 means a perfect match, and intermediate values correspond to a partial match.
Similarity metrics are useful for spell checking, speech recognition, DNA analysis, plagiarism detection, file comparison, and the remote screen update problem.
For two strings of the same length, the Hamming distance is the number of positions in which the two strings have different characters.
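This definition translates directly into code. A minimal sketch in Python (the function name is my own choice):

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Number of positions where two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

# "karolin" and "kathrin" differ in positions 2, 3 and 4
print(hamming_distance("karolin", "kathrin"))  # → 3
```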
The edit distance of two strings is defined as the minimum number of insertions, deletions and substitutions required to transform the first string into the second. The distance 0 means the strings are identical.
The algorithm was devised in 1965 by the Soviet scientist Vladimir Levenshtein, and the edit distance is therefore also called the Levenshtein distance.
Modern versions of the edit distance algorithm assign different weights to insertions, deletions and substitutions.
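The minimum over insertions, deletions and substitutions is usually computed with dynamic programming. A sketch in Python, using unit weights (each of the three `+ 1` costs could be replaced by a different weight to obtain the weighted variants mentioned above):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via dynamic programming, row by row."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))          # distances from "" to prefixes of s2
    for i in range(1, m + 1):
        curr = [i] + [0] * n           # distance from s1[:i] to ""
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

print(edit_distance("kitten", "sitting"))  # → 3
```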
The trigram similarity between two strings is determined by the number of matching letter triples in both strings. The algorithm can be generalized to n-grams, where n=1, 2, ...
The strings string1 and string2 are each padded on the left and on the right with one space character. Then the strings are divided into letter triplets (trigrams). Finally, the similarity is calculated as:
s = 2*m / (a + b).
Here:
m is the number of matching trigrams
a is the number of trigrams in string1
b is the number of trigrams in string2
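The padding and counting steps can be sketched as follows. The original text does not say how repeated trigrams are counted, so counting them with multiplicity (via a `Counter` intersection) is an assumption here:

```python
from collections import Counter

def trigram_similarity(s1: str, s2: str) -> float:
    """s = 2*m / (a + b) over space-padded letter triplets."""
    def trigrams(s: str) -> list[str]:
        padded = " " + s + " "          # one space on each side, as described
        return [padded[i:i + 3] for i in range(len(padded) - 2)]

    t1, t2 = trigrams(s1), trigrams(s2)
    if not t1 and not t2:
        return 1.0                      # two empty strings: treat as identical
    m = sum((Counter(t1) & Counter(t2)).values())  # matches, with multiplicity
    return 2 * m / (len(t1) + len(t2))

print(trigram_similarity("hello", "hello"))  # → 1.0
```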
The Ratcliff/Obershelp algorithm computes the similarity of two strings as the doubled number of matching characters divided by the total number of characters in the two strings. Matching characters are those in the longest matching substring plus, recursively, matching characters in the unmatched regions on either side of it.
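Python's standard library already ships a close implementation of this measure: `difflib.SequenceMatcher.ratio()` is documented as being based on Ratcliff/Obershelp "gestalt pattern matching", so a thin wrapper suffices:

```python
from difflib import SequenceMatcher

def ratcliff_obershelp(s1: str, s2: str) -> float:
    """2*M / (len(s1) + len(s2)), where M counts characters in the
    longest matching block plus matches found recursively around it."""
    return SequenceMatcher(None, s1, s2).ratio()

# longest matching block "bcd" gives M = 3, so 2*3 / (4 + 4) = 0.75
print(ratcliff_obershelp("abcd", "bcde"))  # → 0.75
```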
The Jaro-Winkler similarity was developed at the U.S. Census Bureau and used in post-enumeration analysis.
Given strings string1 and string2, their similarity is:
s = (m/a + m/b + (m - t)/m) / 3.
Here:
m is the number of matching characters
a is the length of string1
b is the length of string2
t is the number of transpositions
Two characters are considered matching only if they are no further apart than max(a, b)/2 - 1 positions. The first matched character of string1 is compared with the first matched character of string2, the second with the second, and so on. The number of pairs that disagree, divided by two, gives the number of transpositions t.
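The matching-window and transposition rules above can be sketched as follows (the window is clamped at zero so that very short strings still compare; this clamping is a practical assumption, not stated in the text):

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity: (m/a + m/b + (m - t)/m) / 3."""
    a, b = len(s1), len(s2)
    if a == 0 and b == 0:
        return 1.0
    window = max(max(a, b) // 2 - 1, 0)   # maximum allowed distance
    match1, match2 = [False] * a, [False] * b
    m = 0
    for i, ch in enumerate(s1):           # greedily pair up matching chars
        for j in range(max(0, i - window), min(b, i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # compare matched characters in order; half the disagreements are t
    seq1 = (s1[i] for i in range(a) if match1[i])
    seq2 = (s2[j] for j in range(b) if match2[j])
    t = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) // 2
    return (m / a + m / b + (m - t) / m) / 3

# m = 6, one transposition (TH vs HT): (1 + 1 + 5/6) / 3 = 17/18
print(jaro_similarity("MARTHA", "MARHTA"))  # → 0.9444...
```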
The improved Jaro-Winkler method uses weights different from 1/3. It also applies a smaller penalty to certain types of error: visual scanning errors, keypunch errors, and errors at the end of a string.
Some algorithms also take into account errors caused by pressing the wrong key on the keyboard: V instead of B, 6 instead of Y, etc.
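One way to model such adjacent-key errors is to plug a reduced substitution cost into a weighted edit distance. The adjacency table and the 0.5 weight below are illustrative assumptions, not taken from any particular algorithm:

```python
# Hypothetical QWERTY neighbor table for a few keys (illustration only).
ADJACENT = {"b": "vn", "v": "cb", "y": "6tu", "6": "5y7"}

def keyboard_sub_cost(c1: str, c2: str) -> float:
    """Substitution cost: 0 for a match, 0.5 (assumed weight) for
    characters on neighboring keys, 1 otherwise."""
    if c1 == c2:
        return 0.0
    if c2 in ADJACENT.get(c1, "") or c1 in ADJACENT.get(c2, ""):
        return 0.5
    return 1.0

print(keyboard_sub_cost("v", "b"))  # → 0.5 (adjacent keys)
print(keyboard_sub_cost("a", "b"))  # → 1.0
```

Such a cost function would replace the constant substitution cost in the weighted edit distance mentioned earlier.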
The choice among the following string-matching algorithms depends mainly on the nature of the errors affecting the text data.
| Algorithm | Type | When to use |
|---|---|---|
| Soundex | phonetic | misspelled English words |
| Daitch-Mokotoff | phonetic | differently spelled European surnames |
| NYSIIS | phonetic | misspelled English and some foreign words |
| Metaphone, Double metaphone | phonetic | misspelled English words |
| Caverphone 2.0 | phonetic | misspelled English words |
| Hamming, Edit distance | similarity | local spelling errors |
| Trigram, n-gram | similarity | spelling errors, edited text |
| Ratcliff/Obershelp | similarity | spelling errors, edited text |
| Jaro-Winkler | similarity | spelling and typewriting errors |
Approximate string-matching algorithms