How many words should be in the spell checker's dictionary?

Wintertree Software's American and British English dictionaries each contain about 100,000 words. We're frequently asked if this is enough. Sometimes customers or potential customers call and mention that a competitor's product comes with a dictionary containing 130,000 words, or 150,000 words, or some other large number, and they want to know why Sentry's dictionary doesn't contain that many.

The short answer to this question may surprise you: The other dictionaries probably contain too many words. Read on to find out why.

You probably know that spell checkers work by checking words against a dictionary containing words known to be correct. If a word isn't found in the dictionary, the word is reported as a misspelling. If a word is found, it is skipped over without being reported. Two key measures of a spell checker's accuracy are its detection rate, which is the number of misspelled words reported divided by the number of words actually misspelled, and its false-positive rate, which is the number of valid words incorrectly reported as misspelled divided by the number of words checked. A high detection rate and a low false-positive rate are desirable.
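To make the two measures concrete, here is a minimal sketch in Python. It assumes we already know which positions in the text are truly misspelled; the function names check and rates are ours, chosen for illustration, and do not describe how any particular spell checker is implemented.

```python
def check(words, dictionary):
    """Return the positions of words the checker reports as misspelled.

    A word is reported when it is not found in the dictionary.
    """
    return {i for i, w in enumerate(words) if w.lower() not in dictionary}


def rates(words, truly_misspelled, dictionary):
    """Return (detection_rate, false_positive_rate) for one text.

    truly_misspelled is the set of positions that really are misspelled
    (assumed known here, for the sake of the illustration).
    """
    reported = check(words, dictionary)
    detected = reported & truly_misspelled          # real misspellings caught
    false_positives = reported - truly_misspelled   # valid words flagged
    detection_rate = (
        len(detected) / len(truly_misspelled) if truly_misspelled else 1.0
    )
    false_positive_rate = len(false_positives) / len(words) if words else 0.0
    return detection_rate, false_positive_rate


dictionary = {"the", "cat", "sat", "on", "mat"}
words = ["the", "cat", "szt", "on", "the", "rug"]  # "szt" is the only typo
print(rates(words, {2}, dictionary))
```

In this toy text the one real misspelling is caught, so the detection rate is 1.0, while the valid word "rug" is flagged because it is missing from the dictionary, giving a false-positive rate of 1 in 6.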

The number of words in the dictionary has a strong bearing on both of these measures. If the dictionary contains too many words, a misspelled word is more likely to match one of the words in the dictionary and therefore go unreported. This decreases the spell checker's detection rate. If the dictionary contains too few words, more valid words will be reported because they aren't in the dictionary. This increases the spell checker's false-positive rate.
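The trade-off can be seen in a small Python sketch. The three dictionaries and the sample sentence are invented for illustration. The example relies on the fact that "calender" (a machine that presses cloth or paper between rollers) is a genuine but rare English word, and also a common misspelling of "calendar".

```python
def rates(words, misspelled, dictionary):
    """Return (detection_rate, false_positive_rate) for one text."""
    reported = {i for i, w in enumerate(words) if w not in dictionary}
    detection = len(reported & misspelled) / len(misspelled)
    false_positive = len(reported - misspelled) / len(words)
    return detection, false_positive


# The writer meant "calendar"; position 2 is the only misspelling.
words = ["check", "the", "calender", "for", "the", "quarterly", "meeting"]
misspelled = {2}

too_small = {"check", "the", "for", "meeting"}   # lacks "quarterly"
balanced = too_small | {"quarterly", "calendar"}
too_large = balanced | {"calender"}              # adds the obscure word

for name, dictionary in [("too small", too_small),
                         ("balanced", balanced),
                         ("too large", too_large)]:
    detection, false_positive = rates(words, misspelled, dictionary)
    print(f"{name:10} detection={detection:.2f} "
          f"false-positive={false_positive:.2f}")
```

The small dictionary catches the misspelling but also flags "quarterly"; the large one flags nothing at all, including the misspelling; only the balanced dictionary reports the misspelling and nothing else.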

The ideal dictionary for you would contain every word in your vocabulary, but no other words. This dictionary would yield an excellent detection rate and a false-positive rate of 0%. The detection rate would not be 100% because you could still misspell a word and match a different valid word -- you might accidentally leave the "e" off "stare" and match "star", for example. The false-positive rate would be 0% because every word reported by the spell checker would necessarily be misspelled, a condition which would remain true until you learned a new word.

Unfortunately, a dictionary that is ideal for you would likely be less than ideal for someone else, since different people have different vocabularies. Moreover, creating a dictionary containing the words in only one person's vocabulary would be prohibitively expensive. A cost-effective dictionary contains the words most commonly used by the population of its users.

To maintain a high detection rate, the dictionary should contain only words common to a large portion of the population. If the dictionary contains technical terms used only by the small portion of the population who are taxidermists, for example, there is an increased chance that a misspelling made by an average user will match one of these specialized terms and therefore not be reported.

To maintain a low false-positive rate, the dictionary should contain most of the words used by the population. If the dictionary does not contain a word commonly used by the population, people will experience frustration when the spell checker reports the word as a misspelling.

Incidentally, a dictionary in a spell checker isn't like a paper dictionary such as Webster's or the OED. Paper dictionaries have an obligation to include as many words as possible, no matter how obscure. One could even argue that paper dictionaries should focus on obscure words and not waste space on common words such as "the", "or", and "of", since most users of a language will have an intuitive understanding of the meanings of these words.

Of the two measures, the detection rate is more important. A spell checker that flags valid words as misspellings may be annoying, but a spell checker that lets a misspelled word pass unreported has failed to do its job. For this reason, the dictionary should contain as many common words as are needed to maintain a reasonable false-positive rate, but no more. Putting it another way, the dictionary should contain the minimum number of words needed to avoid incorrectly reporting common valid words.
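The "minimum number of words" idea can be sketched as follows. This is an illustration under simplifying assumptions, not a description of any production process: the sample text is assumed to be correctly spelled, words are added in order of frequency, and the function name smallest_dictionary and the target rate are hypothetical.

```python
from collections import Counter


def smallest_dictionary(tokens, max_false_positive_rate):
    """Return the shortest frequency-ranked word list whose false-positive
    rate on tokens (assumed correctly spelled) is within the target.
    """
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    dictionary = []
    uncovered = total  # tokens that would currently be flagged
    for word, n in counts.most_common():
        if uncovered / total <= max_false_positive_rate:
            break  # target met; adding more words only hurts detection
        dictionary.append(word)
        uncovered -= n
    return dictionary


sample = "the cat and the dog and the bird saw the plenipotentiary".split()
print(smallest_dictionary(sample, 0.2))
```

With a 20% target on this tiny sample, the loop stops after five words: the most frequent words go in first, and the words that occur only once near the end of the ranking are left out because the target has already been met.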

This is the goal Wintertree Software has established for our dictionaries. We build our dictionaries by statistically analyzing vast amounts of text from many sources to ensure that the most common words -- and only the most common words -- are included, with words ranging from "the", "a", and "of" to less common but still far from obscure words like "plenipotentiary" and "disenfranchisement". Even a person with a large vocabulary is unlikely to use a word not in our American or UK English dictionary, unless that word is a highly specialized technical term, such as the name of a disease or a rare insect. Specialized terms are best handled by supplemental dictionaries, and we carry medical and legal dictionaries for just this purpose. We could easily dump words willy-nilly into our dictionaries, beating our competitors' counts by hundreds of thousands of words, but that would serve only to lower our detection rate. The count of words in a spell checker's dictionary is like body weight: Once an optimum level has been achieved, adding or taking away will just make things worse.
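As a rough illustration of what frequency analysis across many sources can look like, here is a toy Python sketch. It is not Wintertree's actual process; the function name build_dictionary, the thresholds, and the sample texts are all assumptions made for the example. A word is kept only if it occurs often enough overall and appears in more than one source, so that a term confined to a single specialized text is left out.

```python
import re
from collections import Counter


def build_dictionary(sources, min_sources=2, min_count=2):
    """Keep words that are frequent overall and spread across sources."""
    total = Counter()   # occurrences of each word across all sources
    spread = Counter()  # number of sources each word appears in
    for text in sources:
        words = re.findall(r"[a-z']+", text.lower())
        total.update(words)
        spread.update(set(words))
    return {
        w for w in total
        if total[w] >= min_count and spread[w] >= min_sources
    }


sources = [
    "the council met and the mayor spoke",
    "the mayor and the council argued",
    "mount the hide on the armature and let the armature dry",
]
print(sorted(build_dictionary(sources)))
# → ['and', 'council', 'mayor', 'the']
```

Note that "armature" occurs twice but only in the taxidermy text, so it is excluded; a supplemental dictionary would be the natural home for a term like that.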

So the next time you come across a company offering a spell checker with a dictionary containing 150,000 or more words, ask them one question: Why?

Copyright © 2015 Wintertree Software Inc.