Part 1, Part 2
Nowadays computers keep huge amount of data. However the retrieval of textual information is difficult when is misspelled or not exactly known.
This article is devoted to approximate string-matching algorithms. It gives a simplified but clear description of algorithms with examples.
First, we are going to explain the procedures constructing the phonetic codes for the searched text that sound the same, but spelled differently. Then we explain procedures that give different types of textual similarity metric used in a fault-tolerant search. In conclusion, the algorithms are compared so that the reader can choose the right one. See also resources later in this article.
Comments and questions are welcome: please reach me at alexandre.rodichevski@chiappani.it.
The procedures discussed here are used to simplify searches in databases when one knows how the text is pronounced but not how it is spelled. For this purpose the phonetic codes are constructed for the searched text, while the database is preventively indexed using those codes. Surnames that sound the same, but are spelled differently, like SMITH and SMYTH, have the same code and can be filed together. The use of phonetic codes reduces matching problems from wrong or different spelling.
Phonetic codes are very useful in order to match lists of either personal names or enterprise names. They are also used for speech recognition and in search engines of databases like the anti-terrorist ones.
Soundex builds a phonetic code for people's name. The resulting code is made of one letter and three digits, each digit being one of six consonant sounds. The algorithm was first applied in the U.S. census in 1880.
In 1985 the new Daitch-Mokotoff Soundex algorithm was applied for phonetic coding of Slavic and Yiddish surnames with similar pronunciation and different spelling. The most important improvements are the followings: the code is composed of six characters; there are two different codes when a name has two possible pronunciations; the code is composed of ten figures from 0 to 9.
The New York State Identification and Intelligence System was developed in 1970 through rigorous empirical analysis. In this algorithm a phonetic code of up to 6 letters is constructed.
The algorithm phonetically codes words by reducing them to 16 consonant sounds: B, X, S, K, J, T, F, H, L, M, N, P, R, 0, W, Y. Zero represents the "th" sound; X stands for "sh".
Double metaphone is the improved version of Metaphone. This algorithm phonetically codes words by reducing them to 12 consonant sounds. It returns two keys if a word has two possible pronunciations.
The Caverphone phonetic matching algorithm was created in the Caversham Project at the University of Otago in New Zealand in 2002. The algorithm allows accents present in the study area (southern part of the city of Dunedin, New Zealand).
A new version, Caverphone 2.0, has been built as a more general phonetic match.
Approximate string-matching algorithms
Part 1, Part 2