SOUNDEX Alternatives Part 2: Dealing with NYSIIS Discrepancies

In my previous post I mentioned the difficulty in finding consistent implementations of the NYSIIS algorithm. Upon further reading, it appears the ambiguity in some of the rule descriptions is a known problem. The issue was documented as far back as January 1977 in the National Bureau of Standards Special Publication 500-1, page 24:

“The NYSSIS[sic] coding rules are, in several instances, subject to different interpretations. As a result, implementation in software may differ slightly from one system to another.”

Despite these differences, NYSIIS has persevered as a useful text-searching tool. It’s easy to code (in a variety of interpretations) and has relatively few rules, each a simple operation for the CPU to process while encoding. Most versions should therefore be fairly quick to execute, which is useful when processing large volumes of data.

It should also be noted that between any two systems, it may be acceptable to have different encoding results as long as the encoding is consistent within each of them. That is, if my system always encodes Akerman to ACARNAN, that value works as a repeatable search criterion. If your system always encodes it to ACARNA, that will work for you because you will also be able to search your results consistently. The variations in implementation only cause a problem if there is an attempt to share encodings between systems or if an implementation is so far off that it skews the encoded groupings into uselessness.
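To make that concrete, here is a minimal Python sketch. The two "encoders" are plain lookup tables standing in for two systems, not real NYSIIS implementations. The AKERMAN codes are the two hypothetical results from the paragraph above, and the BROWN/BROWNE codes come from the results table in this post. The helper names (`build_index`, `search`) are my own illustration.

```python
from collections import defaultdict

# Stand-in encoders: lookup tables, NOT real NYSIIS implementations.
# AKERMAN uses the two hypothetical encodings from the paragraph above.
SYSTEM_A_CODES = {"AKERMAN": "ACARNAN", "BROWN": "BRAN", "BROWNE": "BRAN"}
SYSTEM_B_CODES = {"AKERMAN": "ACARNA", "BROWN": "BRAN", "BROWNE": "BRAN"}


def build_index(names, encode):
    """Group names by phonetic code, using one encoder consistently."""
    index = defaultdict(set)
    for name in names:
        index[encode(name)].add(name)
    return index


def search(index, encode, query):
    """Encode the query with the same encoder that built the index."""
    return index.get(encode(query), set())


names = ["AKERMAN", "BROWN", "BROWNE"]
encode_a = SYSTEM_A_CODES.__getitem__
encode_b = SYSTEM_B_CODES.__getitem__
index_a = build_index(names, encode_a)
index_b = build_index(names, encode_b)

# Each system finds AKERMAN when it uses its own encoding throughout.
found_in_a = search(index_a, encode_a, "AKERMAN")
found_in_b = search(index_b, encode_b, "AKERMAN")

# Sharing breaks down: system A's code looked up in system B's index.
shared = index_b.get(encode_a("AKERMAN"), set())

print(found_in_a, found_in_b, shared)
```

Within either system the lookup succeeds. Only the cross-system lookup comes back empty, which is the sharing problem described above.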

With that realization, I understood that it is no longer possible to define a “correct” version. The algorithm’s outline spread faster than it could be standardized. Below are the results of the versions I tested, including my own PL/SQL implementation. Fortunately, there is often majority agreement, and about 60% of the time (36 of the 61 names below) the implementations agree unanimously. I’ve coded my version to always match at least a plurality of the samples I worked with.

| Original | Plurality | Agreement (25 Differ / 36 Match) | PL/SQL | Python Fuzzy | R Phonics | RosettaCode Python | Apache Commons Codec |
|---|---|---|---|---|---|---|---|
| BART | BAD | Match | BAD | BAD | BAD | BAD | BAD |
| BISHOP | BASAP | Match | BASAP | BASAP | BASAP | BASAP | BASAP |
| BOWMAN | BANAN | Match | BANAN | BANAN | BANAN | BANAN | BANAN |
| BROWN | BRAN | Match | BRAN | BRAN | BRAN | BRAN | BRAN |
| BROWNE | BRAN | Match | BRAN | BRAN | BRAN | BRAN | BRAN |
| CARLSON | CARLSAN | Differ | CARLSAN | CARLSAN | CARLSA | CARLSAN | CARLSAN |
| CARR | CAR | Match | CAR | CAR | CAR | CAR | CAR |
| CARRAWAY | CARY | Differ | CARY | CARAY | CARY | CARY | CARY |
| CASSTEVENS | CASTAFAN | Differ | CASTAFAN | CASTAFAN | CASTAF | CASTAFAN | CASTAFAN |
| CHAPMAN | CAPNAN | Match | CAPNAN | CAPNAN | CAPNAN | CAPNAN | CAPNAN |
| DEUTSCH | DAT | Differ | DAT | DATS | DAT | DAT | DAT |
| FRANKLIN | FRANCLAN | Differ | FRANCLAN | FRANCLAN | FRANCL | FRANCLAN | FRANCLAN |
| FRAZIER | FRASAR | Match | FRASAR | FRASAR | FRASAR | FRASAR | FRASAR |
| GREENE | GRAN | Match | GRAN | GRAN | GRAN | GRAN | GRAN |
| HARPER | HARPAR | Match | HARPAR | HARPAR | HARPAR | HARPAR | HARPAR |
| HEITSCHMIDT | HATSNAD | Differ | HATSNAD | HATSNAD | HATSNA | HATSNAD | HATSNAD |
| HUNT | HAD | Match | HAD | HAD | HAD | HAD | HAD |
| HURD | HAD | Match | HAD | HAD | HAD | HAD | HAD |
| JACOBS | JACAB | Match | JACAB | JACAB | JACAB | JACAB | JACAB |
| JEREMIAH | JARAN | Differ | JARAN | JARAN | JARAN | JARANAH | JARAN |
| JILES | JAL | Differ | JAL | JAL | JAL | JALA | JAL |
| KNIGHT | NAGT | Match | NAGT | NAGT | NAGT | NAGT | NAGT |
| KNUTH | NAT | Differ | NAT | NATH | NAT | NAT | NAT |
| KOEHN | CAN | Match | CAN | CAN | CAN | CAN | CAN |
| KUHL | CAL | Match | CAL | CAL | CAL | CAL | CAL |
| LARSON | LARSAN | Match | LARSAN | LARSAN | LARSAN | LARSAN | LARSAN |
| LAWRENCE | LARANC | Match | LARANC | LARANC | LARANC | LARANC | LARANC |
| LAWSON | LASAN | Match | LASAN | LASAN | LASAN | LASAN | LASAN |
| LOUIS | L | Differ | L | L | L | LA | L |
| LYNCH | LYNC | Differ | LYNC | LANCH | LYNC | LYNC | LYNC |
| MACINTOSH | MCANT | Differ | MCANT | MCANTAS | MCANT | MCANTA | MCANT |
| MACKENZIE | MCANSY | Match | MCANSY | MCANSY | MCANSY | MCANSY | MCANSY |
| MACKIE | MCY | Match | MCY | MCY | MCY | MCY | MCY |
| MATTHEWS | MAT | Differ | MAT | MATAW | MAT | MATA | MAT |
| MCCORMACK | MCARNAC | Differ | MCARNAC | MCARNAC | MCARNA | MCARNAC | MCARNAC |
| MCDANIEL | MCDANAL | Differ | MCDANAL | MCDANAL | MCDANA | MCDANAL | MCDANAL |
| MCDONALD | MCDANALD | Differ | MCDANALD | MCDANALD | MCDANA | MCDANALD | MCDANALD |
| MCKEE | MCY | Match | MCY | MCY | MCY | MCY | MCY |
| MCKNIGHT | MCNAGT | Match | MCNAGT | MCNAGT | MCNAGT | MCNAGT | MCNAGT |
| MCLAUGHLIN | MCLAGLAN | Differ | MCLAGLAN | MCLAGHLAN | MCLAGL | MCLAGLAN | MCLAGLAN |
| MITCHELL | MATCAL | Match | MATCAL | MATCAL | MATCAL | MATCAL | MATCAL |
| MORRISON | MARASAN | Differ | MARASAN | MARASAN | MARASA | MARASAN | MARASAN |
| OBANION | OBANAN | Match | OBANAN | OBANAN | OBANAN | OBANAN | OBANAN |
| OBRIEN | OBRAN | Match | OBRAN | OBRAN | OBRAN | OBRAN | OBRAN |
| ODANIEL | ODANAL | Match | ODANAL | ODANAL | ODANAL | ODANAL | ODANAL |
| OWSLEY | OSLY | Differ | OSLY | OWSLY | ASLY | OASLY | OSLY |
| PFEISTER | FASTAR | Match | FASTAR | FASTAR | FASTAR | FASTAR | FASTAR |
| PHILLIPSON | FALAPSAN | Differ | FALAPSAN | FALAPSAN | FALAPS | FALAPSAN | FALAPSAN |
| RAWSON | RASAN | Match | RASAN | RASAN | RASAN | RASAN | RASAN |
| RICHARDS | RACARD | Differ | RACARD | RACAD | RACARD | RACARD | RACARD |
| RICKERT | RACAD | Match | RACAD | RACAD | RACAD | RACAD | RACAD |
| SCHOENHOEFT | SANAFT | Match | SANAFT | SANAFT | SANAFT | SANAFT | SANAFT |
| SHRIVER | SRAVAR | Match | SRAVAR | SRAVAR | SRAVAR | SRAVAR | SRAVAR |
| SILVA | SALV | Match | SALV | SALV | SALV | SALV | SALV |
| VASQUEZ | VASG | Differ | VASG | VASG | VASG | VASGA | VASG |
| WATKINS | WATCAN | Match | WATCAN | WATCAN | WATCAN | WATCAN | WATCAN |
| WESTERLUND | WASTARLAD | Differ | WASTARLAD | WASTARLAD | WASTAR | WASTARLAD | WASTARLAD |
| WESTPHAL | WASTFAL | Differ | WASTFAL | WASTFAL | WASTFA | WASTFAL | WASTFAL |
| WHEELER | WALAR | Match | WALAR | WALAR | WALAR | WALAR | WALAR |
| WILLIS | WAL | Differ | WAL | WAL | WAL | WALA | WAL |
| YAMADA | YANAD | Match | YANAD | YANAD | YANAD | YANAD | YANAD |
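
The Plurality and Agreement columns can be derived mechanically from the five result columns. The following is a rough Python sketch of that bookkeeping, not the script I actually used. The `summarize` function and the tie handling (returning every code that shares the top count) are my own assumptions, and the sample rows are copied from the table above.

```python
from collections import Counter

# Column order matches the table: PL/SQL, Python Fuzzy, R Phonics,
# RosettaCode Python, Apache Commons Codec.
RESULTS = {
    "BROWN": ["BRAN", "BRAN", "BRAN", "BRAN", "BRAN"],
    "CARLSON": ["CARLSAN", "CARLSAN", "CARLSA", "CARLSAN", "CARLSAN"],
    "KNUTH": ["NAT", "NATH", "NAT", "NAT", "NAT"],
    "OWSLEY": ["OSLY", "OWSLY", "ASLY", "OASLY", "OSLY"],
}


def summarize(codes):
    """Return (plurality codes, votes for them, 'Match' or 'Differ')."""
    counts = Counter(codes)
    top_count = max(counts.values())
    # Assumption: on a tie, report every code that shares the top count.
    winners = sorted(code for code, n in counts.items() if n == top_count)
    agreement = "Match" if len(counts) == 1 else "Differ"
    return winners, top_count, agreement


unanimous = 0
for name, codes in RESULTS.items():
    winners, votes, agreement = summarize(codes)
    unanimous += agreement == "Match"
    # codes[0] is the PL/SQL column; check that it sits in the plurality.
    in_plurality = codes[0] in winners
    print(name, winners, votes, agreement, in_plurality)

print(f"{unanimous} of {len(RESULTS)} sample rows are unanimous")
```

OWSLEY is the interesting row: no code gets a majority, and the plurality is only two votes out of five.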

I extended my test to a larger list of 162,253 unique names pulled from the 2010 U.S. Census results. With this larger set, the ratio of differing to unanimous results stayed about the same as in the smaller sample: roughly 60% of the names were encoded the same way by all implementations. For the remaining 40%, my implementation still matched the plurality value. I am content with those results. My version appears usable and true to the general idea of the algorithm, but I cannot state that mine is “correct” or even that it is “better” than the others. The PL/SQL version produces odd results for some inputs. They are consistent with my reading of the encoding rules but still seem odd when found in the output. In particular, some names are encoded to an empty string, which is NULL in Oracle. Even though those results are consistent, an encoding that strips away all of the data still seems “wrong” to me. It is possible to implement a search for a null value, but it’s going to be contrived.
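
To show why the null case feels contrived, here is a small Python sketch using the standard-library `sqlite3` module as a stand-in for Oracle. SQLite does not treat an empty string as NULL, so the hypothetical `to_db_value` helper does that conversion by hand to mimic Oracle. The table layout and the `PLACEHOLDER_*` names are invented for illustration; they are not real names from my test data.

```python
import sqlite3


def to_db_value(code):
    """Mimic Oracle, where an empty string is stored as NULL."""
    return code if code else None


def find_by_code(conn, code):
    """Return names sharing a code; NULL needs its own predicate."""
    value = to_db_value(code)
    if value is None:
        # "code = NULL" is never true in SQL, so branch to IS NULL.
        sql = "SELECT name FROM names WHERE code IS NULL ORDER BY name"
        params = ()
    else:
        sql = "SELECT name FROM names WHERE code = ? ORDER BY name"
        params = (value,)
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT PRIMARY KEY, code TEXT)")
rows = [
    ("BROWN", "BRAN"),
    ("BROWNE", "BRAN"),
    ("PLACEHOLDER_1", ""),  # hypothetical input that encodes to nothing
    ("PLACEHOLDER_2", ""),  # hypothetical input that encodes to nothing
]
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [(name, to_db_value(code)) for name, code in rows],
)

# The ordinary equality search silently finds nothing for a NULL code.
naive = conn.execute(
    "SELECT name FROM names WHERE code = ?", (None,)
).fetchall()

print(naive)
print(find_by_code(conn, ""))
print(find_by_code(conn, "BRAN"))
```

The search works once the NULL branch is added, but every name that encodes to nothing lands in one bucket regardless of how it sounds, which is why the result still seems wrong to me.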

Furthermore, my agreement with the plurality is only confirmed for the set of implementations I tested. If I were to gather more samples across other languages and tools, I might find even more disagreement in the results. Again, due to the ambiguities in the algorithm description, it might not be possible to say whether they are right or wrong.

At this juncture I will set aside the NYSIIS algorithm and move on to other phonetic encodings. In Part 3 I’ll start into the Metaphone family of algorithms.