What started as a simple question has led me down an interesting path. I was asked to help with a text searching question and the developer wanted to know if SOUNDEX would work for their name searching.
That kind of work is exactly what SOUNDEX is for so I agreed it could be used but suggested there may be other, better options. SOUNDEX is good, but Oracle Text indexing and the functions of the UTL_MATCH package also provide mechanisms to do fuzzy searching. Furthermore, there are many newer algorithms with more sophisticated phonetic matching than SOUNDEX and pursuing those is where this adventure took me.
The first method I examined was the New York State Identification and Intelligence System, or NYSIIS for short; originally published by Robert L. Taft in “Name Search Techniques”, 1970. The algorithm for it is:
- Translate the leading characters of the name:
- MAC -> MCC
- KN -> NN
- K -> C
- PF or PH -> FF
- SCH -> SS
- Translate the trailing characters of the name:
- EE or IE -> Y
- DT, RT, RD, NT, or ND -> D
- Use the first character of the name as the first character of the return code.
- Loop through the remaining characters from the second to the end, advancing one character at a time (even if you replace more than one character), performing one of the following steps before proceeding to 13.
- If current position is EV then replace with AF
- If current position is E (not followed by V), I, O, or U replace with A
- If current position is Q then replace with G
- If current position is Z then replace with S
- If current position is M then replace with N
- If current position is KN then replace with N
- If current position is K (not followed by N) then replace with C
- If current position is SCH then replace with SSS
- If current position is PH then replace with FF
- If current position is H either not preceded or not followed by a vowel then replace with the preceding character, otherwise keep H
- If current position is W but preceded by a vowel then replace with the preceding character otherwise keep W
- If none of the above then keep the current character.
- If the current character is not the same as the last character of the return code, then add the current character to the return code.
- Increment to the next character
- If the last character of the return code is S, remove it.
- If the last two characters of the return code are AY, replace them with Y.
- If the last character of the return code is A, remove it.
There are several steps but it’s still mostly straightforward. As you might expect there are versions of this algorithm already implemented in a variety of languages and tools. As of this writing, there are 15 different implementations found at Rosetta Code. Plus there are versions in packages in shared libraries such as Python’s “Fuzzy” in PyPi and “phonics” for R in CRAN. I didn’t see a PL/SQL implementation, so I decided to write on of my own.
-- New York State Identification and Intelligence System (NYSIIS) Phonetic Encoder FUNCTION nysiis(p_name IN VARCHAR2, p_max_length IN PLS_INTEGER DEFAULT 0) RETURN VARCHAR2 DETERMINISTIC IS v_name VARCHAR2(32767) := clean(p_name); v_key VARCHAR2(32767); v_prev VARCHAR2(1); v_next VARCHAR2(1); BEGIN -- Translate first characters CASE WHEN v_name LIKE 'MAC%' THEN v_name := 'MCC' || SUBSTR(v_name, 4); WHEN v_name LIKE 'KN%' THEN v_name := 'NN' || SUBSTR(v_name, 3); WHEN v_name LIKE 'K%' THEN v_name := 'C' || SUBSTR(v_name, 2); WHEN v_name LIKE 'PH%' THEN v_name := 'FF' || SUBSTR(v_name, 3); WHEN v_name LIKE 'PF%' THEN v_name := 'FF' || SUBSTR(v_name, 3); WHEN v_name LIKE 'SCH%' THEN v_name := 'SSS' || SUBSTR(v_name, 4); ELSE NULL; END CASE; -- Translate last characters CASE SUBSTR(v_name, -2) WHEN 'EE' THEN v_name := SUBSTR(v_name, 1, LENGTH(v_name) - 2) || 'Y'; WHEN 'IE' THEN v_name := SUBSTR(v_name, 1, LENGTH(v_name) - 2) || 'Y'; WHEN 'DT' THEN v_name := SUBSTR(v_name, 1, LENGTH(v_name) - 2) || 'D'; WHEN 'RT' THEN v_name := SUBSTR(v_name, 1, LENGTH(v_name) - 2) || 'D'; WHEN 'RD' THEN v_name := SUBSTR(v_name, 1, LENGTH(v_name) - 2) || 'D'; WHEN 'NT' THEN v_name := SUBSTR(v_name, 1, LENGTH(v_name) - 2) || 'D'; WHEN 'ND' THEN v_name := SUBSTR(v_name, 1, LENGTH(v_name) - 2) || 'D'; ELSE NULL; END CASE; -- Save the first character of the name as the start of the return key v_key := SUBSTR(v_name, 1, 1); -- Iterate through each remaining position in the string. -- For each position check if it matches a substring to be replaced. FOR i IN 2 .. LENGTH(v_name) LOOP CASE WHEN SUBSTR(v_name, i, 2) = 'EV' THEN replace_substring(v_name, i, 'AF'); WHEN SUBSTR(v_name, i, 1) IN ('E', 'I', 'O', 'U') THEN replace_substring(v_name, i, 'A'); WHEN SUBSTR(v_name, i, 1) = 'Q' THEN replace_substring(v_name, i, 'G'); WHEN SUBSTR(v_name, i, 1) = 'Z' THEN replace_substring(v_name, i, 'S'); WHEN SUBSTR(v_name, i, 1) = 'M' THEN replace_substring(v_name, i, 'N'); WHEN SUBSTR(v_name, i, 2) = 'KN' THEN replace_substring(v_name, i, 'N'); WHEN SUBSTR(v_name, i, 1) = 'K' THEN replace_substring(v_name, i, 'C'); WHEN SUBSTR(v_name, i, 3) = 'SCH' THEN replace_substring(v_name, i, 'SSS'); WHEN SUBSTR(v_name, i, 2) = 'PH' THEN replace_substring(v_name, i, 'FF'); WHEN SUBSTR(v_name, i, 1) = 'H' THEN v_prev := SUBSTR(v_name, i - 1, 1); v_next := SUBSTR(v_name, i + 1, 1); -- The rules for handling H are not clear -- what to do if the last character is H, -- thus not naving a "next" character -- to check if it is or is not a vowel. IF v_prev NOT IN ('A', 'E', 'I', 'O', 'U') OR NVL(v_next, '-') NOT IN ('A', 'E', 'I', 'O', 'U') THEN replace_substring(v_name, i, v_prev); END IF; WHEN SUBSTR(v_name, i, 1) = 'W' THEN v_prev := SUBSTR(v_name, i - 1, 1); IF v_prev IN ('A', 'E', 'I', 'O', 'U') THEN replace_substring(v_name, i, v_prev); END IF; ELSE NULL; -- keep current character as is END CASE; -- If the current character is not the same as the last character -- of the return key, then add it to the key IF SUBSTR(v_name, i, 1) != SUBSTR(v_key, -1) THEN v_key := v_key || SUBSTR(v_name, i, 1); END IF; END LOOP; -- If the last letter is an S, remove it v_key := REGEXP_REPLACE(v_key, 'S$'); -- If the last two letters are AY, replace them with Y v_key := REGEXP_REPLACE(v_key, 'AY$', 'Y'); -- If the last letter is an A, remove it v_key := REGEXP_REPLACE(v_key, 'A$'); -- Return key and remaining transcoded name RETURN CASE WHEN p_max_length = 0 THEN v_key WHEN p_max_length > 0 THEN SUBSTR(v_key, 1, p_max_length) ELSE NULL END; END;
Of course, when I completed my implementation I wanted to test it. I’ve found small lists of a few dozen names for various implementations; but no canonical sample set to test against. Lacking that I decided to compare my results against some of these other published implementations.
It was at this point my trek took an unexpected turn.
Using R, I ran this simple test:
> library(phonics) > nysiis('owsley') [1] "ASLY"
Using Python, first with the package “Fuzzy” and then again with the Rosetta Code implementation.
>>> import fuzzy >>> print (fuzzy.nysiis('owsley')) OWSLY ##Rosetta Code function >>> print (nysiis('owsley')) OASLY
And finally, my own result:
SQL> select name_phonics.nysiis('owsley') from dual; NAME_PHONICS.NYSIIS('OWSLEY') ---------------------------------------------------- OSLY
Amazingly each of the 4 implementations generates a different answer. I’d really like to declare my implementation “correct” but I think that’s premature without some definitive or at least convincing corroboration. Plus, even if I did get this one name correct, it’s just one of many differing results between the implementations above. I had considered some or all of the other versions might be a newer “modified-NYSIIS” algorithm; but each of these is purported to be the original so they should all return the same results. Since they don’t – some or all of them must be wrong (EDIT- “wrong” is not so easy to declare – see comments and part 2). To what degree they differ and which is most-correct I don’t know yet. I will continue to test other variants as well. As I noted in the comments of my code, the disposition of a trailing H is not clearly defined and different versions handle it in different ways.
In Part 2 I will post the results of those searches and, ideally, I will start building a list of verified, or at least consistently agreed upon results.
Questions and comments, as always, are welcome; but in this case are particularly encouraged if others can point to canonical testing suites or links to source and results of other implementations.
Hi, Author of the Rosetta Code Python here. I wonder if its down to one of two things:
1: The possible need to vary a strings length within a loop, leading to unfamiliar edge cases.
2: Or the lack of a (clearly needed), test suite with the algorithm description – as you have noted?
I attempted to follow the WP article. Maybe that article needs clarification too?
Thank you for reading and your comments.
In my followup article I made note that the algorithm left some ambiguity in its definition and, for several decades already, has been known to have varying implementations.
To my knowledge, there was never a clarifying publication about the implementation.
So, other than syntax errors issues or obvious transcoding errors of the wrong letters, I’ve reversed my initial declaration that some of them must be “wrong.”
If we had a rigorous definition from the original author or at least some comments and examples to help derive the intent behind the ambiguities then we could- but we do not.
For example –
Using your code but my reading of the rules it’s tempting to say your implementation of rules 5,6,7 is not correct.
Because by my reading, 5,6,7 should be performed sequentially and each resolved individually before going to the next.
In your code you exit on the first condition of the 3 you find.
For example ‘RAYS’ would be returned as ‘RAY’, remove the S, done.
But my reading it should be ‘RY’, remove S for rule 5, rule 6 sees ‘AY’ remaining at the end and replaces it with Y.
Thus, by my reading of the rules, your implementation is wrong – but I can’t say my reading of the rules is correct.
I’ve read other descriptions of the rules that similarly combined those ending transcodings as a single rule instead of three distinct rules.
In those cases, my implementation would be wrong but yours would be right for the endings.
It is surprising that the RosettaCode page doesn’t use the same set of rules for all of the implementations.
I’m showing differences because that’s the point of my writing.
In the Rosetta case though, I thought it was trying to show various means of writing the same algorithm, but the current listings do not all follow the same rules.
So, at this point I’m mostly concerned with an encoding scheme that produces usable results and the adherence to a particular reading of the rules is less important to me.
Also, while NYSIIS is still a recognized and used algorithm, there have been improvements in the phonetic encoding space so if algorithmic rigor is essential then it probably makes sense to explore some of those methods instead.