- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: V4.02
- Fix Version/s: V4.02
- Component/s: Protocol, URL Conventions
- Labels: None
Proposal:
Introduction
This is a proposal to introduce phonetic comparison functionality to the Open Data Protocol (OData). The feature relies on the service implementing a phonetic algorithm that indexes strings by sound, such as SOUNDEX, which indexes strings according to their English pronunciation.
The goal is for homophones to be encoded to the same representation, so that they can be matched despite minor differences in spelling, and for that matching to be exposed through RESTful OData API calls.
This proposal walks through the details of the feature and its potential implementation in OData.
SOUNDEX Algorithm
SOUNDEX converts an alphanumeric string to a four-character code based on how the string sounds when spoken. The first character of the code is the first character of the character expression, converted to upper case. The second through fourth characters of the code are digits that represent the remaining letters in the expression. The letters A, E, I, O, U, H, W, and Y are ignored unless they are the first letter of the string. Zeroes are added at the end, if necessary, to produce a four-character code.
For example, the names Michelle and Michael both return a SOUNDEX value of M240, while David returns D130, which makes Michael a closer phonetic match to Michelle than David is.
Rules
SOUNDEX follows the NARA coding rules, which are as follows:
1. Coding consists of a letter followed by three numerals. Examples: L123, C472, S160.
2. The first letter of a surname is not coded; it is retained as the initial letter.
3. A, E, I, O, U, Y, W, and H are not coded.
4. Double letters are coded as one letter (as in Lloyd).
5. Prefixes to surnames like "van", "Von", "Di", "de", "le", "D", "dela" or "du" are sometimes disregarded in coding.
6. Code the letters that follow the initial letter to three digits, padding with 0 at the end if needed.
The SOUNDEX system is based on the coding guide shown in the following table:
Number      Represents the Letters
1           B, F, P, V
2           C, G, J, K, Q, S, X, Z
3           D, T
4           L
5           M, N
6           R
Not Coded   A, E, I, O, U, Y, W, H
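The rules and coding guide above can be sketched in a few lines of Python. This is a minimal illustration, not a normative implementation: the function name soundex is hypothetical, the optional prefix handling of rule 5 is left out, and non-ASCII characters are simply skipped. It also assumes the usual NARA convention, not spelled out in the rules above, that two letters with the same code separated only by H or W are coded once, while a vowel between them causes both to be coded.

```python
# Letter-to-digit table taken from the coding guide above.
CODES = {}
for digit, letters in {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return a four-character SOUNDEX code, or "" if name has no A-Z letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    first = letters[0]           # rule 2: the initial letter is retained
    digits = []
    prev = CODES.get(first, "")  # the initial letter still suppresses a repeat

    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != prev:     # rule 4: adjacent same-coded letters count once
                digits.append(code)
            prev = code
        elif ch not in "HW":
            prev = ""            # a vowel (A, E, I, O, U, Y) ends the run
        # Assumption: H and W are skipped without ending the run.

    # Rules 1 and 6: one letter plus three digits, padded with zeroes.
    return (first + "".join(digits) + "000")[:4]


for sample in ("Michelle", "Michael", "David", "Lloyd"):
    print(sample, soundex(sample))
```

Under these assumptions, Michelle and Michael both encode to M240 and David to D130, matching the example given earlier. Services that rely on a database SOUNDEX function may see slightly different codes for some inputs, since implementations differ in how they treat H and W.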