In my previous post I mentioned the difficulty in finding consistent implementations of the NYSIIS algorithm. Upon further reading, it appears the ambiguity in some of the rule descriptions is a known problem. The issue was documented as far back as January 1977 in the National Bureau of Standards Special Publication 500-1, page 24:
“The NYSSIS[sic] coding rules are, in several instances, subject to different interpretations. As a result, implementation in software may differ slightly from one system to another.”
Despite these differences, NYSIIS has persevered as a useful text-searching tool. It’s easy to code (in a variety of interpretations), and it has relatively few rules, each of which is a simple operation for the CPU. Most versions should therefore be fairly quick to execute, which is useful when processing large volumes of data.
It should also be noted that two systems may acceptably produce different encodings, as long as each is consistent within itself. That is, if my system always encodes Akerman to ACARNAN, that value works as a repeatable search criterion. If your system always encodes it to ACARNA, that will work for you because you will also be able to search your results consistently. The variations in implementation only cause a problem if there is an attempt to share encodings between systems, or if an implementation is so far off that it skews the encoded groupings into uselessness.
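As a rough illustration of that point, here is a minimal Python sketch. It does not implement NYSIIS. The two dictionaries simply copy a few encodings from the PL/SQL and R Phonics columns of the comparison table later in this post, and `build_index` and `search` are hypothetical helper names of my own. Each system finds its own matches consistently, but a key produced by one system is useless against the other's index.

```python
from collections import defaultdict

# Encodings copied from the PL/SQL and R Phonics columns of the table in this post.
# These dictionaries stand in for two real encoders; they are not NYSIIS implementations.
PLSQL = {"MCDANIEL": "MCDANAL", "MCDONALD": "MCDANALD", "HUNT": "HAD", "HURD": "HAD"}
R_PHONICS = {"MCDANIEL": "MCDANA", "MCDONALD": "MCDANA", "HUNT": "HAD", "HURD": "HAD"}


def build_index(names, encodings):
    """Group names by their encoded value within a single system."""
    index = defaultdict(set)
    for name in names:
        index[encodings[name]].add(name)
    return index


def search(index, encodings, name):
    """Return, sorted, every indexed name that shares the query's encoding."""
    return sorted(index.get(encodings[name], set()))


names = list(PLSQL)
plsql_index = build_index(names, PLSQL)
r_index = build_index(names, R_PHONICS)

# Each system is self-consistent, even though the groupings differ.
print(search(plsql_index, PLSQL, "MCDANIEL"))  # ['MCDANIEL']
print(search(r_index, R_PHONICS, "MCDANIEL"))  # ['MCDANIEL', 'MCDONALD']

# Sharing encodings across systems fails: a PL/SQL key looked up in the R index.
print(search(r_index, PLSQL, "MCDANIEL"))      # []
```

Note that the two systems also group the names differently: in this sample the R Phonics values put MCDANIEL and MCDONALD together while the PL/SQL values keep them apart. Neither grouping is wrong in itself, only incompatible with the other.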
With that realization, I now understand that it is no longer possible to define a “correct” version. The algorithm’s outline spread faster than it could be standardized. Below are the results of the versions I tested, including my own PL/SQL implementation. Fortunately, there is often a majority agreement, and for about 60% of the names (36 of the 61 below) all five implementations agree unanimously. I’ve coded my version so that it always matches at least a plurality of the implementations I sampled.
Original | Plurality Agreement | Differ/Match (25/36) | PL/SQL | Python Fuzzy | R Phonics | RosettaCode Python | Apache Commons Codec |
---|---|---|---|---|---|---|---|
BART | BAD | Match | BAD | BAD | BAD | BAD | BAD |
BISHOP | BASAP | Match | BASAP | BASAP | BASAP | BASAP | BASAP |
BOWMAN | BANAN | Match | BANAN | BANAN | BANAN | BANAN | BANAN |
BROWN | BRAN | Match | BRAN | BRAN | BRAN | BRAN | BRAN |
BROWNE | BRAN | Match | BRAN | BRAN | BRAN | BRAN | BRAN |
CARLSON | CARLSAN | Differ | CARLSAN | CARLSAN | CARLSA | CARLSAN | CARLSAN |
CARR | CAR | Match | CAR | CAR | CAR | CAR | CAR |
CARRAWAY | CARY | Differ | CARY | CARAY | CARY | CARY | CARY |
CASSTEVENS | CASTAFAN | Differ | CASTAFAN | CASTAFAN | CASTAF | CASTAFAN | CASTAFAN |
CHAPMAN | CAPNAN | Match | CAPNAN | CAPNAN | CAPNAN | CAPNAN | CAPNAN |
DEUTSCH | DAT | Differ | DAT | DATS | DAT | DAT | DAT |
FRANKLIN | FRANCLAN | Differ | FRANCLAN | FRANCLAN | FRANCL | FRANCLAN | FRANCLAN |
FRAZIER | FRASAR | Match | FRASAR | FRASAR | FRASAR | FRASAR | FRASAR |
GREENE | GRAN | Match | GRAN | GRAN | GRAN | GRAN | GRAN |
HARPER | HARPAR | Match | HARPAR | HARPAR | HARPAR | HARPAR | HARPAR |
HEITSCHMIDT | HATSNAD | Differ | HATSNAD | HATSNAD | HATSNA | HATSNAD | HATSNAD |
HUNT | HAD | Match | HAD | HAD | HAD | HAD | HAD |
HURD | HAD | Match | HAD | HAD | HAD | HAD | HAD |
JACOBS | JACAB | Match | JACAB | JACAB | JACAB | JACAB | JACAB |
JEREMIAH | JARAN | Differ | JARAN | JARAN | JARAN | JARANAH | JARAN |
JILES | JAL | Differ | JAL | JAL | JAL | JALA | JAL |
KNIGHT | NAGT | Match | NAGT | NAGT | NAGT | NAGT | NAGT |
KNUTH | NAT | Differ | NAT | NATH | NAT | NAT | NAT |
KOEHN | CAN | Match | CAN | CAN | CAN | CAN | CAN |
KUHL | CAL | Match | CAL | CAL | CAL | CAL | CAL |
LARSON | LARSAN | Match | LARSAN | LARSAN | LARSAN | LARSAN | LARSAN |
LAWRENCE | LARANC | Match | LARANC | LARANC | LARANC | LARANC | LARANC |
LAWSON | LASAN | Match | LASAN | LASAN | LASAN | LASAN | LASAN |
LOUIS | L | Differ | L | L | L | LA | L |
LYNCH | LYNC | Differ | LYNC | LANCH | LYNC | LYNC | LYNC |
MACINTOSH | MCANT | Differ | MCANT | MCANTAS | MCANT | MCANTA | MCANT |
MACKENZIE | MCANSY | Match | MCANSY | MCANSY | MCANSY | MCANSY | MCANSY |
MACKIE | MCY | Match | MCY | MCY | MCY | MCY | MCY |
MATTHEWS | MAT | Differ | MAT | MATAW | MAT | MATA | MAT |
MCCORMACK | MCARNAC | Differ | MCARNAC | MCARNAC | MCARNA | MCARNAC | MCARNAC |
MCDANIEL | MCDANAL | Differ | MCDANAL | MCDANAL | MCDANA | MCDANAL | MCDANAL |
MCDONALD | MCDANALD | Differ | MCDANALD | MCDANALD | MCDANA | MCDANALD | MCDANALD |
MCKEE | MCY | Match | MCY | MCY | MCY | MCY | MCY |
MCKNIGHT | MCNAGT | Match | MCNAGT | MCNAGT | MCNAGT | MCNAGT | MCNAGT |
MCLAUGHLIN | MCLAGLAN | Differ | MCLAGLAN | MCLAGHLAN | MCLAGL | MCLAGLAN | MCLAGLAN |
MITCHELL | MATCAL | Match | MATCAL | MATCAL | MATCAL | MATCAL | MATCAL |
MORRISON | MARASAN | Differ | MARASAN | MARASAN | MARASA | MARASAN | MARASAN |
OBANION | OBANAN | Match | OBANAN | OBANAN | OBANAN | OBANAN | OBANAN |
OBRIEN | OBRAN | Match | OBRAN | OBRAN | OBRAN | OBRAN | OBRAN |
ODANIEL | ODANAL | Match | ODANAL | ODANAL | ODANAL | ODANAL | ODANAL |
OWSLEY | OSLY | Differ | OSLY | OWSLY | ASLY | OASLY | OSLY |
PFEISTER | FASTAR | Match | FASTAR | FASTAR | FASTAR | FASTAR | FASTAR |
PHILLIPSON | FALAPSAN | Differ | FALAPSAN | FALAPSAN | FALAPS | FALAPSAN | FALAPSAN |
RAWSON | RASAN | Match | RASAN | RASAN | RASAN | RASAN | RASAN |
RICHARDS | RACARD | Differ | RACARD | RACAD | RACARD | RACARD | RACARD |
RICKERT | RACAD | Match | RACAD | RACAD | RACAD | RACAD | RACAD |
SCHOENHOEFT | SANAFT | Match | SANAFT | SANAFT | SANAFT | SANAFT | SANAFT |
SHRIVER | SRAVAR | Match | SRAVAR | SRAVAR | SRAVAR | SRAVAR | SRAVAR |
SILVA | SALV | Match | SALV | SALV | SALV | SALV | SALV |
VASQUEZ | VASG | Differ | VASG | VASG | VASG | VASGA | VASG |
WATKINS | WATCAN | Match | WATCAN | WATCAN | WATCAN | WATCAN | WATCAN |
WESTERLUND | WASTARLAD | Differ | WASTARLAD | WASTARLAD | WASTAR | WASTARLAD | WASTARLAD |
WESTPHAL | WASTFAL | Differ | WASTFAL | WASTFAL | WASTFA | WASTFAL | WASTFAL |
WHEELER | WALAR | Match | WALAR | WALAR | WALAR | WALAR | WALAR |
WILLIS | WAL | Differ | WAL | WAL | WAL | WALA | WAL |
YAMADA | YANAD | Match | YANAD | YANAD | YANAD | YANAD | YANAD |
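The Plurality Agreement and Differ/Match columns above can be derived mechanically from the five implementation columns. The following is a minimal Python sketch of that tally, not the script I actually used. The `summarize` function and the `ROWS` layout are my own illustrative names, and only four rows of the table are reproduced. It assumes there is a single most common value in each row, which holds for the rows shown.

```python
from collections import Counter

# Column order: PL/SQL, Python Fuzzy, R Phonics, RosettaCode Python, Apache Commons Codec.
# Values are copied from four rows of the table above.
ROWS = {
    "BART": ("BAD", "BAD", "BAD", "BAD", "BAD"),
    "DEUTSCH": ("DAT", "DATS", "DAT", "DAT", "DAT"),
    "MACINTOSH": ("MCANT", "MCANTAS", "MCANT", "MCANTA", "MCANT"),
    "OWSLEY": ("OSLY", "OWSLY", "ASLY", "OASLY", "OSLY"),
}


def summarize(encodings):
    """Return (plurality value, vote count, 'Match' or 'Differ') for one row."""
    counts = Counter(encodings)
    winner, votes = counts.most_common(1)[0]
    # "Match" means every implementation produced the same encoding.
    status = "Match" if len(counts) == 1 else "Differ"
    return winner, votes, status


for name, encodings in ROWS.items():
    print(name, *summarize(encodings))
# BART BAD 5 Match
# DEUTSCH DAT 4 Differ
# MACINTOSH MCANT 3 Differ
# OWSLEY OSLY 2 Differ
```

OWSLEY shows why I say "plurality" and not "majority": the winning value has only two of five votes, because four different encodings were produced.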
I extended my test to a larger list of 162,253 unique names pulled from the 2010 U.S. Census results. With this larger set, the split between differing and unanimous results remained about the same as in the smaller sample, with 60% of the names being encoded the same way by all implementations. For the remaining 40%, my implementation still matched a plurality of the other results.
I am content with those results. My version appears usable and true to the general idea of the algorithm, but I cannot state that mine is “correct” or even that it is “better” than the others.
The PL/SQL version produces odd results for some inputs. They are consistent with my reading of the encoding rules but still seem odd when found in the output. In particular, some names are encoded to an empty string, which is NULL in Oracle. Even though those results are consistent, an encoding that strips away all of the data still seems “wrong” to me. It is possible to implement a search for a null value, but it’s going to be contrived.
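To show what "contrived" means here, the following Python sketch uses the standard-library `sqlite3` module as a stand-in for Oracle, which is an assumption made purely for illustration. In Oracle the empty string becomes NULL on its own; in this sketch the NULL is stored explicitly as `None`. The table, the placeholder names, and the helper functions are all hypothetical. The point is only that a plain equality test never matches NULL, so the search needs a special case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, code TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        ("HUNT", "HAD"),            # encodings taken from the table in this post
        ("HURD", "HAD"),
        ("PLACEHOLDER_1", None),    # hypothetical name whose encoding came out empty
        ("PLACEHOLDER_2", None),    # hypothetical name whose encoding came out empty
    ],
)


def find_naive(code):
    """Ordinary equality search; NULL = NULL is never true, so NULL codes are missed."""
    sql = "SELECT name FROM people WHERE code = ? ORDER BY name"
    return [row[0] for row in conn.execute(sql, (code,))]


def find_null_aware(code):
    """Equality search with an extra branch so a NULL code matches other NULL codes."""
    sql = """
        SELECT name FROM people
        WHERE code = :code OR (code IS NULL AND :code IS NULL)
        ORDER BY name
    """
    return [row[0] for row in conn.execute(sql, {"code": code})]


print(find_naive("HAD"))      # ['HUNT', 'HURD']
print(find_naive(None))       # []
print(find_null_aware(None))  # ['PLACEHOLDER_1', 'PLACEHOLDER_2']
```

The null-aware version works, but it also treats every name with an empty encoding as a match for every other such name, which may or may not be meaningful.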
Furthermore, agreement with the plurality is confirmed only for the set of implementations I tested. If I were to gather more samples across other languages and tools, I might find even more disagreement in the results, and, again, because of the ambiguities in the algorithm description, it might not be possible to say whether they are right or wrong.
At this juncture I will set aside the NYSIIS algorithm and move on to other phonetic encodings. In Part 3 I’ll start into the Metaphone family of algorithms.