SOUNDEX Alternatives Part 2: Dealing with NYSIIS Discrepancies

In my previous post I mentioned the difficulty in finding consistent implementations of the NYSIIS algorithm. Upon further reading, it appears the ambiguity in some of the rule descriptions is a known problem. The issue was documented as far back as January 1977 in the National Bureau of Standards Special Publication 500-1, page 24:

“The NYSSIS[sic] coding rules are, in several instances, subject to different interpretations. As a result, implementation in software may differ slightly from one system to another.”

Despite these differences, NYSIIS has persevered as a useful text-searching tool. It’s easy to code (in a variety of interpretations), with relatively few rules and simple operations for the CPU to process while encoding. Most versions should therefore be fairly quick to execute, which is useful when processing large volumes of data.

It should also be noted that two systems can acceptably produce different encodings, as long as each one is internally consistent. That is, if my system always encodes Akerman to ACARNAN, that value works as a repeatable search criterion. If your system always encodes it to ACARNA, that will work for you, because you will also be able to search your results consistently. The variations in implementation only cause a problem if there is an attempt to share encodings between systems, or if an implementation is so far off that it skews the encoded groupings into uselessness.
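The point about internal consistency can be sketched in a few lines of Python. The two "encoders" here are plain lookup tables standing in for two systems, not NYSIIS implementations. The AKERMAN codes come from the paragraph above; the ACKERMAN spelling and its codes are assumptions added only so each group has more than one member.

```python
from collections import defaultdict

# Stand-in encoders: lookup tables, NOT real NYSIIS implementations.
# AKERMAN values are from the text; the ACKERMAN rows are assumed for illustration.
SYSTEM_A = {"AKERMAN": "ACARNAN", "ACKERMAN": "ACARNAN"}
SYSTEM_B = {"AKERMAN": "ACARNA", "ACKERMAN": "ACARNA"}

def encode_a(name):
    return SYSTEM_A[name.upper()]

def encode_b(name):
    return SYSTEM_B[name.upper()]

def build_index(names, encode):
    """Group names under whatever code the given encoder assigns them."""
    index = defaultdict(list)
    for name in names:
        index[encode(name)].append(name)
    return index

def search(index, encode, query):
    """Look up a query by encoding it and fetching that group."""
    return index.get(encode(query), [])

names = ["AKERMAN", "ACKERMAN"]
index_a = build_index(names, encode_a)
index_b = build_index(names, encode_b)

# Each system finds both spellings when it uses its own encoder throughout.
print(search(index_a, encode_a, "Akerman"))
print(search(index_b, encode_b, "Ackerman"))

# Mixing systems fails: B's code for the query is not a key in A's index.
print(search(index_a, encode_b, "Akerman"))
```

The first two searches return both names; the third returns an empty list, which is the "sharing encodings" problem in miniature.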

With that realization, I now understand that it is no longer possible to define a “correct” version. The algorithm’s outline spread faster than it could be standardized. Below are the results of the versions I tested, including my own PL/SQL implementation. Fortunately there is often majority agreement, and for 36 of the 61 sample names (about 60%) all five implementations agree unanimously. I’ve coded my version to always match at least a plurality of the implementations I sampled.

Original	Plurality Agreement	Differ/Match (25/36)	PL/SQL	Python Fuzzy	R Phonics	RosettaCode Python	Apache Commons Codec
BART	BAD	Match	BAD	BAD	BAD	BAD	BAD
BISHOP	BASAP	Match	BASAP	BASAP	BASAP	BASAP	BASAP
BOWMAN	BANAN	Match	BANAN	BANAN	BANAN	BANAN	BANAN
BROWN	BRAN	Match	BRAN	BRAN	BRAN	BRAN	BRAN
BROWNE	BRAN	Match	BRAN	BRAN	BRAN	BRAN	BRAN
CARLSON	CARLSAN	Differ	CARLSAN	CARLSAN	CARLSA	CARLSAN	CARLSAN
CARR	CAR	Match	CAR	CAR	CAR	CAR	CAR
CARRAWAY	CARY	Differ	CARY	CARAY	CARY	CARY	CARY
CASSTEVENS	CASTAFAN	Differ	CASTAFAN	CASTAFAN	CASTAF	CASTAFAN	CASTAFAN
CHAPMAN	CAPNAN	Match	CAPNAN	CAPNAN	CAPNAN	CAPNAN	CAPNAN
DEUTSCH	DAT	Differ	DAT	DATS	DAT	DAT	DAT
FRANKLIN	FRANCLAN	Differ	FRANCLAN	FRANCLAN	FRANCL	FRANCLAN	FRANCLAN
FRAZIER	FRASAR	Match	FRASAR	FRASAR	FRASAR	FRASAR	FRASAR
GREENE	GRAN	Match	GRAN	GRAN	GRAN	GRAN	GRAN
HARPER	HARPAR	Match	HARPAR	HARPAR	HARPAR	HARPAR	HARPAR
HEITSCHMIDT	HATSNAD	Differ	HATSNAD	HATSNAD	HATSNA	HATSNAD	HATSNAD
HUNT	HAD	Match	HAD	HAD	HAD	HAD	HAD
HURD	HAD	Match	HAD	HAD	HAD	HAD	HAD
JACOBS	JACAB	Match	JACAB	JACAB	JACAB	JACAB	JACAB
JEREMIAH	JARAN	Differ	JARAN	JARAN	JARAN	JARANAH	JARAN
JILES	JAL	Differ	JAL	JAL	JAL	JALA	JAL
KNIGHT	NAGT	Match	NAGT	NAGT	NAGT	NAGT	NAGT
KNUTH	NAT	Differ	NAT	NATH	NAT	NAT	NAT
KOEHN	CAN	Match	CAN	CAN	CAN	CAN	CAN
KUHL	CAL	Match	CAL	CAL	CAL	CAL	CAL
LARSON	LARSAN	Match	LARSAN	LARSAN	LARSAN	LARSAN	LARSAN
LAWRENCE	LARANC	Match	LARANC	LARANC	LARANC	LARANC	LARANC
LAWSON	LASAN	Match	LASAN	LASAN	LASAN	LASAN	LASAN
LOUIS	L	Differ	L	L	L	LA	L
LYNCH	LYNC	Differ	LYNC	LANCH	LYNC	LYNC	LYNC
MACINTOSH	MCANT	Differ	MCANT	MCANTAS	MCANT	MCANTA	MCANT
MACKENZIE	MCANSY	Match	MCANSY	MCANSY	MCANSY	MCANSY	MCANSY
MACKIE	MCY	Match	MCY	MCY	MCY	MCY	MCY
MATTHEWS	MAT	Differ	MAT	MATAW	MAT	MATA	MAT
MCCORMACK	MCARNAC	Differ	MCARNAC	MCARNAC	MCARNA	MCARNAC	MCARNAC
MCDANIEL	MCDANAL	Differ	MCDANAL	MCDANAL	MCDANA	MCDANAL	MCDANAL
MCDONALD	MCDANALD	Differ	MCDANALD	MCDANALD	MCDANA	MCDANALD	MCDANALD
MCKEE	MCY	Match	MCY	MCY	MCY	MCY	MCY
MCKNIGHT	MCNAGT	Match	MCNAGT	MCNAGT	MCNAGT	MCNAGT	MCNAGT
MCLAUGHLIN	MCLAGLAN	Differ	MCLAGLAN	MCLAGHLAN	MCLAGL	MCLAGLAN	MCLAGLAN
MITCHELL	MATCAL	Match	MATCAL	MATCAL	MATCAL	MATCAL	MATCAL
MORRISON	MARASAN	Differ	MARASAN	MARASAN	MARASA	MARASAN	MARASAN
OBANION	OBANAN	Match	OBANAN	OBANAN	OBANAN	OBANAN	OBANAN
OBRIEN	OBRAN	Match	OBRAN	OBRAN	OBRAN	OBRAN	OBRAN
ODANIEL	ODANAL	Match	ODANAL	ODANAL	ODANAL	ODANAL	ODANAL
OWSLEY	OSLY	Differ	OSLY	OWSLY	ASLY	OASLY	OSLY
PFEISTER	FASTAR	Match	FASTAR	FASTAR	FASTAR	FASTAR	FASTAR
PHILLIPSON	FALAPSAN	Differ	FALAPSAN	FALAPSAN	FALAPS	FALAPSAN	FALAPSAN
RAWSON	RASAN	Match	RASAN	RASAN	RASAN	RASAN	RASAN
RICHARDS	RACARD	Differ	RACARD	RACAD	RACARD	RACARD	RACARD
RICKERT	RACAD	Match	RACAD	RACAD	RACAD	RACAD	RACAD
SCHOENHOEFT	SANAFT	Match	SANAFT	SANAFT	SANAFT	SANAFT	SANAFT
SHRIVER	SRAVAR	Match	SRAVAR	SRAVAR	SRAVAR	SRAVAR	SRAVAR
SILVA	SALV	Match	SALV	SALV	SALV	SALV	SALV
VASQUEZ	VASG	Differ	VASG	VASG	VASG	VASGA	VASG
WATKINS	WATCAN	Match	WATCAN	WATCAN	WATCAN	WATCAN	WATCAN
WESTERLUND	WASTARLAD	Differ	WASTARLAD	WASTARLAD	WASTAR	WASTARLAD	WASTARLAD
WESTPHAL	WASTFAL	Differ	WASTFAL	WASTFAL	WASTFA	WASTFAL	WASTFAL
WHEELER	WALAR	Match	WALAR	WALAR	WALAR	WALAR	WALAR
WILLIS	WAL	Differ	WAL	WAL	WAL	WALA	WAL
YAMADA	YANAD	Match	YANAD	YANAD	YANAD	YANAD	YANAD
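For anyone who wants to reproduce the "Plurality Agreement" and "Differ/Match" columns, here is a minimal Python sketch of how they can be derived from the per-implementation outputs. The four sample rows are copied from the table above; the function name `summarize` and the `ROWS` structure are my own illustrative choices, not part of any of the tested libraries.

```python
from collections import Counter

# Column order matches the table: PL/SQL, Python Fuzzy, R Phonics,
# RosettaCode Python, Apache Commons Codec.
ROWS = {
    "BART": ["BAD", "BAD", "BAD", "BAD", "BAD"],
    "CARLSON": ["CARLSAN", "CARLSAN", "CARLSA", "CARLSAN", "CARLSAN"],
    "MACINTOSH": ["MCANT", "MCANTAS", "MCANT", "MCANTA", "MCANT"],
    "OWSLEY": ["OSLY", "OWSLY", "ASLY", "OASLY", "OSLY"],
}

def summarize(encodings):
    """Return (plurality value, its vote count, 'Match' or 'Differ')."""
    counts = Counter(encodings)
    value, votes = counts.most_common(1)[0]
    status = "Match" if len(counts) == 1 else "Differ"
    return value, votes, status

for name, encodings in ROWS.items():
    value, votes, status = summarize(encodings)
    print(f"{name}: {value} ({votes} of {len(encodings)} votes, {status})")
```

Note that OWSLEY shows why "plurality" rather than "majority" is the right word: OSLY wins with only two of five votes because the other three implementations each produce something different.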

I extended my test to a larger list of 162,253 unique names pulled from the 2010 U.S. Census results. With this larger set the split between differing and unanimous results stayed about the same as in the smaller sample, with 60% of the names being encoded the same way by all implementations. For the remaining 40% I was still able to match the plurality value with my implementation. I am content with those results. My version appears usable and true to the general idea of the algorithm, but I cannot state that mine is “correct” or even that it is “better” than the others.

The PL/SQL version produces odd results for some inputs. They are consistent with my reading of the encoding rules but still look odd when found in the output. In particular, some names are encoded to an empty string, which is NULL in Oracle. Even though those results are consistent, an encoding that strips away all of the data still seems “wrong” to me. It is possible to implement a search for a null value, but it’s going to be contrived.
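To show what "contrived" looks like, here is a small sketch using Python's built-in sqlite3 module as a stand-in for Oracle. Several things are assumptions for illustration: the `names` table, the helper names, and the two placeholder rows (I am not listing which real names encode to nothing). SQLite does not treat an empty string as NULL the way Oracle does, so the sketch converts empty codes to NULL explicitly to mimic that behavior.

```python
import sqlite3

def to_db(code):
    """Mimic Oracle: an empty encoding is stored (and searched) as NULL."""
    return code if code else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT, code TEXT)")  # hypothetical table

# HUNT/HURD codes are from the table above; the placeholder rows are
# hypothetical names whose encoding came out empty.
rows = [("HUNT", "HAD"), ("HURD", "HAD"),
        ("PLACEHOLDER_1", ""), ("PLACEHOLDER_2", "")]
conn.executemany("INSERT INTO names VALUES (?, ?)",
                 [(name, to_db(code)) for name, code in rows])

def naive_search(code):
    """Plain equality: never matches NULL, because NULL = NULL is not true."""
    cur = conn.execute(
        "SELECT name FROM names WHERE code = ? ORDER BY name", (to_db(code),))
    return [row[0] for row in cur]

def null_aware_search(code):
    """Equality plus an explicit branch for the 'both are NULL' case."""
    key = to_db(code)
    cur = conn.execute(
        "SELECT name FROM names "
        "WHERE code = ? OR (code IS NULL AND ? IS NULL) ORDER BY name",
        (key, key))
    return [row[0] for row in cur]

print(naive_search(""))          # finds nothing
print(null_aware_search(""))     # finds both placeholder rows
print(null_aware_search("HAD"))  # ordinary codes still work
```

The extra `OR (code IS NULL AND ? IS NULL)` branch is the contrived part: every search has to carry it just to handle names whose encoding vanished entirely.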

Furthermore, my agreement with the plurality is confirmed only for the set I tested. If I were to gather more samples across other languages and tools, I might find even more disagreement in implementation results. And, again, because of the ambiguities in the algorithm description, it might not be possible to say whether they are right or wrong.

At this juncture I will set aside the NYSIIS algorithm and move on to other phonetic encodings. In Part 3 I’ll start into the Metaphone family of algorithms.
