Sentry Spelling Checker Engine for Java

Technical details

Contents:

Synopsis
Using the Sentry API
Sentry class details

Synopsis

Packaging: Java class library compatible with JDK 1.x and Java 2 (J2SE and J2EE)
Performance: Checks spelling at over 20,000 words per second on modest hardware.
Class library size: 93K
Dictionary file size: Average of 0.5 MB per language.
Run-time memory requirements: Typically 1 to 2 MB.
Platforms supported: Any JVM-compliant platform.

Using the Sentry API

The Sentry API (application program interface) provides spelling-checker capabilities without a user interface. An application that uses the Sentry engine provides (if necessary) a user interface for the user to dispose of any detected spelling errors. Although the Sentry engine itself contains no user interface and does not directly communicate with user-interface components (such as TextArea or JTextArea), it does provide mechanisms for communicating with such components through interface classes. Also, a set of spelling-related user-interface components (e.g., dialog boxes) is provided in the examples included with the Sentry Java SDK. You can incorporate these components into your applications, modifying them as necessary to suit your application's needs.

Sentry Class Overview

Following is a brief description of the classes and interfaces that form the Sentry engine.

CompressedLexicon: (Class) Provides access to compressed lexicon (.clx) files or streams. Normally, you would construct a CompressedLexicon from a file or stream, then pass the CompressedLexicon object to SpellingSession to check the spelling of text against it. If you use the PropSpellingSession class, you don't need to construct a CompressedLexicon object at all; PropSpellingSession does that for you.

EditableLexicon: (Interface) A Lexicon to which words can be added and from which words can be removed.

EnglishPhoneticComparator: (Class) A WordComparator derivative that compares one word with another, producing a score indicating the degree of closeness between the two words. The degree of closeness is based on phonetic similarity using English pronunciation. The phonetic signature of each word is obtained using the "Metaphone" algorithm. The comparison works only for English pronunciation. This class can be passed to the SpellingSession.suggest method to locate suggestions based on English pronunciation.

FileLexicon: (Interface) A file-based lexicon.

FileTextLexicon: (Class) A file-based, editable lexicon in text format. When words are added to or removed from a FileTextLexicon, the associated file is automatically updated.

HTMLStringWordParser: (Class) A WordParser derivative that parses words from a String while ignoring any HTML markups that may be present.

LexCompressor: (Class) A class that builds a compressed lexicon from one or more word list files. Normally, you would use the SqLex program to do this, but LexCompressor can also be used if you need to create compressed lexicons from your application at run time.

Lexicon: (Interface) A searchable collection of words. The Lexicon interface specifies methods to determine if a word exists and to obtain suggested replacements.

MemTextLexicon: (Class) A memory-based, editable lexicon in text format. MemTextLexicons are not persistent, meaning any words added to them will be lost when the object goes out of scope. MemTextLexicons are useful for storing words marked "Ignore All" and "Replace All" by the user.

PropSpellingSession: (Class) An extension of SpellingSession that is initialized from a java.util.Properties collection. PropSpellingSession reads the Properties collection to open lexicons and set spelling options.

SpellingSession: (Class) The spelling engine. You will use SpellingSession (or PropSpellingSession) directly or indirectly to check the spelling of text and look up suggested replacements for misspelled words.

StreamTextLexicon: (Class) A stream-based, editable lexicon in text format. StreamTextLexicons are loaded from InputStreams and the contents can be saved to OutputStreams.

StringWordParser: (Class) A WordParser derivative that identifies words in a String. StringWordParser also includes methods to replace and delete words in the string, as might happen when misspelled words are corrected.

SuggestionSet: (Class) A collection of suggested replacements for a misspelled word. The number of words contained by a SuggestionSet is fixed. When a candidate word is submitted for inclusion in the set, SuggestionSet will reject the word (silently) if the words already in the set are better suggestions, or will "bump" an existing suggestion if the new one is better. As a result, SuggestionSet maintains the best set of suggestions.

TypographicalComparator: (Class) A WordComparator derivative that compares one word against another, producing a score that indicates how closely one word resembles the other (in terms of the characters contained within each word). This class can be passed to the SpellingSession.suggest method to locate suggestions based on typographical similarity.

WordComparator: (Interface) An interface that defines a means of determining how closely matched one word is with another. WordComparators are used when searching for suggested replacements for misspelled words.

WordParser: (Interface) An interface that provides a means of accessing and updating a collection of words. WordParser is derived from java.util.Enumeration, and a WordParser enumerates the words in some collection of words. (For example, StringWordParser enumerates the words contained by a String.) WordParser also includes methods to replace and delete specific words in the collection, as would happen when misspelled words are corrected. By creating new WordParser derivatives, the Sentry engine can be used to check the spelling of words contained within objects and components that do not yet exist.

How to check a single word

To check the spelling of a single word, call SpellingSession's check method. The check method searches for the word in all open lexicons. If it finds the word in a lexicon, it returns a code indicating that the word is correctly spelled. If the check method fails to find the word in any open lexicon, it returns a code indicating the word is misspelled.

The check method returns a check-result mask: a bit mask indicating how your application should respond. An application typically tests the MISSPELLED_WORD_RSLT bit in the check-result mask; when set, it means the word was not found in any of the lexicons, and so is considered to be misspelled. In some cases, the bit mask returned by the check method indicates an action requiring another word. The other word is returned to your application through the check method's (optional) second parameter.
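
The bit-mask test described above can be sketched as follows. This is a minimal stand-in, not the Sentry SpellingSession class: the class name, the use of plain Sets as lexicons, the lower-casing of the word, and the numeric value given to MISSPELLED_WORD_RSLT are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for a spelling session; NOT the Sentry API. */
final class SingleWordCheckSketch {
    // Bit name taken from the text above; the numeric value is an assumption.
    static final int MISSPELLED_WORD_RSLT = 0x01;

    /** Returns 0 if any open lexicon contains the word, else MISSPELLED_WORD_RSLT. */
    static int check(String word, List<Set<String>> openLexicons) {
        String key = word.toLowerCase(Locale.ROOT);
        for (Set<String> lexicon : openLexicons) {
            if (lexicon.contains(key)) {
                return 0; // found: treated as correctly spelled
            }
        }
        return MISSPELLED_WORD_RSLT; // not found in any open lexicon
    }

    /** Tests the misspelled bit rather than comparing the whole mask. */
    static boolean isMisspelled(int resultMask) {
        return (resultMask & MISSPELLED_WORD_RSLT) != 0;
    }
}
```

Testing the bit with a mask, rather than comparing the result for equality, keeps the test valid if other bits are set in the same result.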

How to obtain suggestions

Typically, spelling checkers offer suggested replacements for misspelled words. The suggest method of the SpellingSession class is used to locate suggestions for a misspelled word from words in the lexicon.

The first parameter to the suggest method is the word for which suggestions are needed. Normally, but not necessarily, this word will be misspelled. The second parameter controls the depth of the search. The depth parameter can range from 1 (shallow but fast) to 100 (deep but slow). The deeper the search, the more likely the correct spelling will be located, particularly if the error occurs near the start of the word.

The third parameter is a class used to determine how closely one word matches another. The Sentry class library includes two such classes: TypographicalComparator, which matches words using typographical (looks like) matching, and EnglishPhoneticComparator, which matches words using phonetic (sounds like) matching. The "Suggestions" property in the Properties object passed to PropSpellingSession determines which comparator is used to locate suggestions.

The TypographicalComparator class is the best choice for general use. The EnglishPhoneticComparator class may be better suited for users who are very poor spellers, such as children or people learning English as a second language. Note that the TypographicalComparator class is usable with any language, while EnglishPhoneticComparator is usable only with English.
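
To make "looks like" matching concrete, here is a minimal sketch of a scorer that rates two words from 0 (nothing in common) to 100 (identical) using Levenshtein edit distance. It is an illustration only: the source does not describe the algorithm TypographicalComparator uses, and the class and method names here are hypothetical.

```java
/** Illustrative "looks like" scorer; NOT Sentry's TypographicalComparator. */
final class EditDistanceScoreSketch {
    /** Levenshtein distance computed with two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Maps the distance onto 0..100, where 100 means the words are identical. */
    static int score(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 100;
        }
        return 100 - (100 * distance(a, b)) / longest;
    }
}
```

With this scoring, "flas" against "flash" differs by one edit over five characters and scores 80.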

The fourth parameter holds the best set of suggestions collected by the suggest method. In the example, the SuggestionSet holds the 10 best suggestions. You can call the suggest method several times using different depths or comparators with the same SuggestionSet; if better suggestions are found, they will replace less acceptable suggestions so the SuggestionSet will always contain the best suggestions. The words in the SuggestionSet are ordered such that the first suggestion is the best, and remaining suggestions are typically less likely to be the correct replacement.

Because the suggest method takes more time to execute than the check method, your application should call the suggest method only after the check method reports that a word is misspelled.

The suggest method fills the SuggestionSet object with the best suggestions found, given the search depth and the nature of the misspelling. The WordComparator object produces a score ranging from 0 to 100 that describes how closely the suggestion matches the misspelling. The words in the SuggestionSet object are ordered by decreasing word score -- i.e., the first word is the most likely replacement, and the last word is the least likely replacement.
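
The "fixed size, best suggestions win" behavior can be sketched with a small bounded collection kept in decreasing score order. This is an assumed illustration of the idea, not the SuggestionSet class: the class name, the offer method, and the tie-breaking rule (an equal score does not displace an existing word) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Fixed-capacity best-N collection; a stand-in, NOT Sentry's SuggestionSet. */
final class BoundedSuggestionsSketch {
    private final int capacity;
    private final List<String> words = new ArrayList<>();
    private final List<Integer> scores = new ArrayList<>();

    BoundedSuggestionsSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Offers a candidate; returns false if it was silently rejected. */
    boolean offer(String word, int score) {
        int existing = words.indexOf(word);
        if (existing >= 0) {
            if (scores.get(existing) >= score) {
                return false; // already present with an equal or better score
            }
            words.remove(existing);
            scores.remove(existing);
        }
        // Find the slot that keeps scores in decreasing order.
        int pos = 0;
        while (pos < scores.size() && scores.get(pos) >= score) {
            pos++;
        }
        if (pos >= capacity) {
            return false; // every kept word is at least as good
        }
        words.add(pos, word);
        scores.add(pos, score);
        if (words.size() > capacity) {
            // "Bump" the weakest suggestion to stay within capacity.
            words.remove(words.size() - 1);
            scores.remove(scores.size() - 1);
        }
        return true;
    }

    /** Returns the kept words, best first. */
    List<String> words() {
        return new ArrayList<>(words);
    }
}
```

Because rejected candidates never enter the collection, the same instance can be offered candidates from several searches and will still hold the best words seen overall.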

Note that the suggest method cannot determine the one correct replacement for a misspelled word, nor can it guarantee it will find the correct replacement. People misspell words for a variety of reasons, including dropped letters, extra letters, transposed letters, replaced letters, or simply a mistaken understanding of the correct spelling. Some of these errors may result in "words" that have several perfectly reasonable replacements. For example, the misspelling "flas" could be "flag," "flask," "flash," etc. Because of the ambiguity, the suggest method locates several alternative words, and ranks them based on their similarity to the misspelled word. The suggest method produces the best results with longer words containing a single error.

Your application can examine the scores in the SuggestionSet object to eliminate words that are unlikely to be satisfactory replacements, using one or more of the following suggested approaches:

Eliminate any word whose score falls below some threshold -- e.g., 60%.
Accept only words in the top percentage range -- e.g., the top word and any other word whose score falls within 20% of the top word.
If the difference between a word's score and the score of the word preceding it exceeds some threshold (e.g., 20%), eliminate that word and all following words.
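
The three approaches above can be sketched as follows. The sketch assumes the scores are already available as an int array in decreasing order, and it reads "within 20%" as "within 20 points on the 0 to 100 scale"; both are assumptions, as are the class and method names. Because the scores are sorted, each approach reduces to choosing how many leading suggestions to keep.

```java
/** Illustrative score filters over scores sorted in decreasing order. */
final class SuggestionFilterSketch {
    /** Approach 1: keep suggestions whose score is at least minScore. */
    static int keepAboveThreshold(int[] scores, int minScore) {
        int n = 0;
        while (n < scores.length && scores[n] >= minScore) {
            n++;
        }
        return n;
    }

    /** Approach 2: keep the top suggestion and any within `window` points of it. */
    static int keepNearTop(int[] scores, int window) {
        if (scores.length == 0) {
            return 0;
        }
        int n = 1;
        while (n < scores.length && scores[0] - scores[n] <= window) {
            n++;
        }
        return n;
    }

    /** Approach 3: stop at the first drop between neighbors larger than maxGap. */
    static int keepBeforeGap(int[] scores, int maxGap) {
        if (scores.length == 0) {
            return 0;
        }
        int n = 1;
        while (n < scores.length && scores[n - 1] - scores[n] <= maxGap) {
            n++;
        }
        return n;
    }
}
```

For scores of 95, 80, 72, 45, and 40, a threshold of 60 keeps three suggestions, a 20-point window below the top keeps two, and a 20-point gap limit keeps three.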

How to check a string of words

Often, an application needs to check a string of words: a sentence, a paragraph, or an entire document. The SpellingSession class includes a check method that accepts as a parameter an object that extracts words from some text source. The Sentry engine includes the StringWordParser class, which extracts words from Strings, and the HTMLStringWordParser class, which extracts words from Strings containing HTML markups.

You may already be familiar with the class StringTokenizer in the java.util package. Like StringWordParser, StringTokenizer can extract words from Strings. But StringWordParser is a little more sophisticated than StringTokenizer. For example, StringWordParser treats most punctuation as a word delimiter (StringTokenizer can do that too, but you'd have to explicitly list each of the hundreds of punctuation characters in the Unicode character set). Also, StringWordParser knows about initialisms -- abbreviations formed from the first letters of words, terminated with periods, such as "R.C.M.P.". StringWordParser treats initialisms as words. StringWordParser deals intelligently with the special characters appearing in e-mail addresses and URLs, too.

StringWordParser can also make changes to words in strings via its deleteWord and replaceWord methods. These methods facilitate checking and correcting text.

The check method of SpellingSession checks the form and spelling of each word in turn, stopping when it finds a word that requires the attention of your application or when it reaches the end of the text. Your application should call the check method in a loop; the loop terminates when the check method indicates that the end of the string has been reached (by returning END_OF_TEXT_RSLT). If the check method returns because it has encountered a problem with a word (e.g., a misspelling), your application should respond to the problem (e.g., by replacing the misspelled word in the string with a correctly spelled word), then continue with the loop in which the check method is called. Note that if the reported word is not corrected, the check method will immediately report the same word when it is called again. To prevent this, call StringWordParser's nextWord method to skip over the reported word.

The check method returns a bit mask which your application can examine to determine how to respond. Some responses require another word, which is returned to your application as the second parameter to the check method.

Associated with each StringWordParser is a cursor that marks the current position within the string. When the StringWordParser is initially constructed, the cursor is set to the start of the string. The cursor is advanced by calling StringWordParser's nextWord method. When SpellingSession's check method returns to report a problem with a word, the StringWordParser's cursor is positioned at the word's first character. Other methods in StringWordParser act on the word at the cursor position. The getWord method obtains the word at the cursor position. The cursor position can be obtained via the getCursor method, and it can be changed via the setCursor method. To ignore (skip) a word reported by the check method, call StringWordParser's nextWord method.
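
The loop-and-cursor protocol described in this section can be sketched end to end with small stand-ins. Nothing here is the Sentry API: the nested Parser class only imitates the cursor behavior attributed to StringWordParser, the check method imitates SpellingSession's check, and the numeric values of the result bits are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-ins illustrating the check loop; NOT the Sentry classes. */
final class CheckLoopSketch {
    static final int MISSPELLED_WORD_RSLT = 0x01; // value assumed
    static final int END_OF_TEXT_RSLT = 0x02;     // value assumed

    /** Minimal cursor-based parser that treats runs of letters as words. */
    static final class Parser {
        private final String text;
        private int cursor = 0;

        Parser(String text) {
            this.text = text;
        }

        private void skipNonWord() {
            while (cursor < text.length() && !Character.isLetter(text.charAt(cursor))) {
                cursor++;
            }
        }

        boolean hasMoreWords() {
            skipNonWord();
            return cursor < text.length();
        }

        String getWord() {
            skipNonWord();
            int end = cursor;
            while (end < text.length() && Character.isLetter(text.charAt(end))) {
                end++;
            }
            return text.substring(cursor, end);
        }

        void nextWord() {
            String word = getWord(); // also moves the cursor to the word's start
            cursor += word.length();
        }

        int getCursor() {
            return cursor;
        }
    }

    /** Skips correct words; stops at the first misspelling or at end of text. */
    static int check(Parser parser, Set<String> lexicon) {
        while (parser.hasMoreWords()) {
            if (!lexicon.contains(parser.getWord().toLowerCase(Locale.ROOT))) {
                return MISSPELLED_WORD_RSLT; // cursor rests on the word's first character
            }
            parser.nextWord();
        }
        return END_OF_TEXT_RSLT;
    }

    /** Application loop: records "offset:word" for each problem, then skips it. */
    static List<String> findMisspellings(String text, Set<String> lexicon) {
        Parser parser = new Parser(text);
        List<String> found = new ArrayList<>();
        int result;
        while (((result = check(parser, lexicon)) & END_OF_TEXT_RSLT) == 0) {
            if ((result & MISSPELLED_WORD_RSLT) != 0) {
                found.add(parser.getCursor() + ":" + parser.getWord());
            }
            parser.nextWord(); // without this, check would report the same word again
        }
        return found;
    }
}
```

A real application would replace or ignore each reported word at the point where this sketch merely records it.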

The AUTO_CHANGE_WORD_RSLT bit in the result mask returned by the check method indicates to your application that a word in the string should be automatically (i.e., without user intervention) replaced with another word. This result is returned when a word in the string matches a word in a text lexicon marked with AUTO_CHANGE_ACTION or AUTO_CHANGE_PRESERVE_CASE_ACTION. In response, your application calls StringWordParser's replaceWord method to replace the current word with the other word.

The question of what to do with the cursor then arises. The simplest approach is to advance the cursor past the replacement word (by calling StringWordParser's nextWord method). The problem with this approach is that the replacement word may be misspelled. If the cursor is advanced past it, the string will contain misspellings that are never reported. The obvious solution is to leave the cursor where it is, so the replacement word will be checked the next time the check method is called. But this introduces even more serious problems.

What if the original word and replacement word are the same? Suppose the word xyz was added to a text lexicon with AUTO_CHANGE_ACTION and the other word was also xyz. If the cursor was not advanced, the check method would tell your application to replace xyz with xyz indefinitely. The endless loop could be broken by advancing the cursor if the original and replacement words happened to be the same. But what if the recursive replacements were indirect? Suppose xyz was auto-replaced with abc, and abc was auto-replaced with xyz. The check method would tell your application to replace xyz with abc, then abc with xyz, then xyz with abc, and so on.

Two potential solutions to these problems are listed below. You can use the approach that makes the most sense for your application:

Limit the number of automatic replacements made at the same cursor position to some small number such as five. If the count is exceeded, treat the automatic replacement as a conditional replacement.
Check the spelling of the replacement word separately using the check method, and treat the replacement as conditional if it is misspelled (conditional replacements require the user's confirmation). If the replacement word is correctly spelled, apply the replacement automatically and leave the cursor where it is.
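
The first approach (capping the number of automatic replacements at one cursor position) can be sketched as follows. The sketch models the auto-change entries of a text lexicon as a plain Map from a word to its replacement, which is an assumption; the class name, the resolve method, and the use of a null return to mean "treat as a conditional replacement" are likewise hypothetical.

```java
import java.util.Map;

/** Illustrative guard against endless auto-change loops; NOT the Sentry API. */
final class AutoChangeGuardSketch {
    static final int MAX_AUTO_CHANGES = 5; // "some small number such as five"

    /**
     * Follows auto-change entries for the word at one cursor position.
     * Returns the word that finally settles, or null when the limit is
     * exceeded and the replacement should be confirmed by the user.
     */
    static String resolve(String word, Map<String, String> autoChanges) {
        String current = word;
        for (int count = 0; count < MAX_AUTO_CHANGES; count++) {
            String replacement = autoChanges.get(current);
            if (replacement == null) {
                return current; // no further auto-change applies
            }
            current = replacement;
        }
        // Limit reached: if another change is still pending, give up.
        return autoChanges.containsKey(current) ? null : current;
    }
}
```

Both the direct cycle (xyz to xyz) and the indirect cycle (xyz to abc to xyz) hit the limit and return null, while an ordinary correction settles after one step.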

The Sentry engine can detect doubled words: the same word appearing twice in a row. In some cases a repeated word is an error. The error can be corrected by deleting the second occurrence of the word using StringWordParser's deleteWord method. To keep the formatting correct, the deleteWord method deletes the word and any white space characters before the word. The deleteWord method returns the cursor position (offset) of the first deleted character, which is handy when undoing changes made to strings.
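
The deletion rule described above (remove the word together with the white space before it, and report where the deletion began) can be sketched on a StringBuilder. The class and method here are hypothetical stand-ins that only mimic the described behavior; words are assumed to be runs of letters.

```java
/** Illustrates deleting a doubled word plus the white space preceding it. */
final class DeleteDoubledWordSketch {
    /**
     * Deletes the word starting at wordStart and any white space before it.
     * Returns the offset of the first deleted character.
     */
    static int deleteWord(StringBuilder text, int wordStart) {
        int end = wordStart;
        while (end < text.length() && Character.isLetter(text.charAt(end))) {
            end++; // extend over the word itself
        }
        int start = wordStart;
        while (start > 0 && Character.isWhitespace(text.charAt(start - 1))) {
            start--; // extend back over the preceding white space
        }
        text.delete(start, end);
        return start;
    }
}
```

An application that supports undo can save the returned offset together with the deleted text, then reinsert that text at the offset to reverse the change.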

When the SPLIT_HYPHENATED_WORDS_OPT option is set, the check method checks the spelling of words containing hyphens as follows. First, the word is checked in its entirety, hyphens included. If the word is found in an open lexicon, it is processed normally, so no special action takes place. If the word is not found, the check method breaks the word into sub-words at the hyphens and checks the spelling of each sub-word individually. If all sub-words are correctly spelled, the entire word is skipped over and no further special action takes place. If any of the sub-words is misspelled, the check method returns to your application. The StringWordParser's cursor is set to the offset of the first character of the misspelled sub-word. Calling StringWordParser's getWord method returns just the misspelled sub-word, not the entire hyphenated term. The StringWordParser constructor accepts a parameter that indicates whether hyphens should be treated as word delimiters. When SPLIT_HYPHENATED_WORDS_OPT is enabled, this parameter passed to the StringWordParser constructor should be false.
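
The whole-word-first, then sub-word, sequence can be sketched as follows. The lexicon is modeled as a plain Set and the method returns the offset of the first misspelled sub-word within the hyphenated term (or -1 if the term is accepted); these choices and all names are assumptions for illustration, not the Sentry implementation.

```java
import java.util.Set;

/** Illustrates the hyphenated-word checking order; NOT the Sentry API. */
final class HyphenSplitSketch {
    /** Returns the offset of the first misspelled sub-word, or -1 if accepted. */
    static int firstBadSubWordOffset(String word, Set<String> lexicon) {
        if (lexicon.contains(word)) {
            return -1; // found in its entirety, hyphens included
        }
        int start = 0;
        while (start <= word.length()) {
            int hyphen = word.indexOf('-', start);
            int end = hyphen < 0 ? word.length() : hyphen;
            String sub = word.substring(start, end);
            if (!sub.isEmpty() && !lexicon.contains(sub)) {
                return start; // where the cursor would be placed
            }
            start = end + 1; // continue after the hyphen
        }
        return -1; // every sub-word was found
    }
}
```

For "well-knwon" with a lexicon containing "well" and "known", the result is 5, the offset of the sub-word "knwon".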

A similar process happens when the SPLIT_CONTRACTED_WORDS_OPT option is set and the word contains apostrophes.

If the text to be checked contains HTML markups, use the HTMLStringWordParser class instead of StringWordParser. HTMLStringWordParser ignores any text surrounded by angle brackets (<...>), and also skips any text following an ampersand (&) until a semicolon (;) is encountered. The initial character of each markup type (the < and &) must be present in the String passed to HTMLStringWordParser's constructor or the text may be interpreted as normal words and possibly reported as misspelled. When HTMLStringWordParser is used, the String must contain valid HTML. For example, a less-than sign (<) must be represented using the &lt; character entity code.
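
The two skipping rules (ignore everything from < to >, and everything from & to ;) can be sketched with a simple scanner. This is an illustration of the rules as stated, not HTMLStringWordParser itself; the class name is hypothetical and words are assumed to be runs of letters.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrates skipping tags and character entities while collecting words. */
final class HtmlWordSkipSketch {
    static List<String> words(String html) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                // Skip a tag: everything up to and including the closing '>'.
                int close = html.indexOf('>', i);
                i = close < 0 ? html.length() : close + 1;
            } else if (c == '&') {
                // Skip an entity: everything up to and including the ';'.
                int semi = html.indexOf(';', i);
                i = semi < 0 ? html.length() : semi + 1;
            } else if (Character.isLetter(c)) {
                int end = i;
                while (end < html.length() && Character.isLetter(html.charAt(end))) {
                    end++;
                }
                out.add(html.substring(i, end));
                i = end;
            } else {
                i++; // white space, digits, punctuation
            }
        }
        return out;
    }
}
```

The sketch also shows why the input must be valid HTML: an unescaped < or & in ordinary text would cause the words after it to be skipped until the next > or ; appears.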

How to check text from other sources

In some applications, the text to be checked may be contained within complex data structures or it may be in a particular form that requires specific parsing rules.

The solution in each of these cases is to create your own class that implements the WordParser interface.

The WordParser interface is based on the Enumeration (java.util.Enumeration) interface. A key role of a WordParser-derived class is to enumerate the words contained within some text source. The WordParser interface also defines other useful functions, including identifying the position of words within the text and modifying the text (e.g., to replace a misspelled word with the correct spelling).

One of SpellingSession's check methods accepts a WordParser-derived object as a parameter. The check method invokes the object's getWord and nextWord methods to extract words from the text source and to advance from word to word. When the check method indicates that a misspelled word has been found, your application can call the WordParser object's methods to determine which word was misspelled and where it is located in the text source. If desired, your application can call the WordParser object's replaceWord method to replace the misspelled word with another word, or it can call the nextWord method to skip (ignore) the misspelled word.

The WordParser-derived class must perform whatever steps are necessary to carry out the actions requested by each of these methods. A class derived from WordParser must keep track of the cursor, which is the zero-based position (offset) of the start of the current word. The getWord method must return the current word (i.e., the word at the cursor position). The class must be able to identify a word in the text at or after the cursor position, meaning it must be able to distinguish characters that form a word (e.g., letters) from characters that separate words (e.g., spaces or punctuation). When the getWord method is called, the cursor may not be pointing at the first character of a word. It might be necessary to advance the cursor over any non-word characters in the text until the beginning of a word is found. Similarly, the nextWord method must advance past the current word, then search through the text for the beginning of the next word.
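
The cursor-keeping duties listed above can be sketched as a class that implements java.util.Enumeration directly. The real WordParser interface is not reproduced here, so the method set, the class name, and the choice of word characters (letters, digits, and apostrophes) are assumptions; only the responsibilities come from the text.

```java
import java.util.Enumeration;
import java.util.NoSuchElementException;

/** Hypothetical parser-contract sketch; the real WordParser interface may differ. */
final class CursorWordEnumerationSketch implements Enumeration<String> {
    private final CharSequence source;
    private int cursor; // zero-based offset of the start of the current word

    CursorWordEnumerationSketch(CharSequence source) {
        this.source = source;
    }

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '\'';
    }

    /** Advances the cursor over non-word characters to the next word start. */
    private void seekWordStart() {
        while (cursor < source.length() && !isWordChar(source.charAt(cursor))) {
            cursor++;
        }
    }

    private int wordEnd() {
        int end = cursor;
        while (end < source.length() && isWordChar(source.charAt(end))) {
            end++;
        }
        return end;
    }

    @Override
    public boolean hasMoreElements() {
        seekWordStart();
        return cursor < source.length();
    }

    /** Returns the word at the cursor without advancing past it. */
    String getWord() {
        seekWordStart();
        return source.subSequence(cursor, wordEnd()).toString();
    }

    /** Returns the current word and advances past it. */
    @Override
    public String nextElement() {
        if (!hasMoreElements()) {
            throw new NoSuchElementException();
        }
        int end = wordEnd();
        String word = source.subSequence(cursor, end).toString();
        cursor = end;
        return word;
    }

    int getCursor() {
        return cursor;
    }

    void setCursor(int offset) {
        cursor = offset;
    }
}
```

Note that getWord first moves the cursor to a word start, which covers the case where the cursor was left on a separator.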

The StringWordParser class does a good job of identifying words in text. The easiest way to create a WordParser-derived class that enumerates words in other sources is to extend the new class from StringWordParser. This way, you will capitalize on all the features made available by StringWordParser. (Alternatively, if the text source may contain text with HTML markups, you can base your new class on the HTMLStringWordParser class.) This is in fact what TextAreaWordParser and JTextComponentWordParser (WordParser derivatives used in examples provided with Sentry Java SDK) do: Both of these classes extend from StringWordParser.

For this approach to work, your StringWordParser derivative must be able to do the following:

Obtain the contents of the text source as a String;
Access a word in the text source using its position within the String. For example, if a particular word starts at position 30 in the String and is five characters long, it should be possible to replace the word in the text source by deleting the five characters at positions 30 through 34, then inserting a new word at position 30.

The JTextArea component, and other Swing components derived from JTextComponent, meet both of these criteria. The JTextComponentWordParser class will be used as an example of the technique presented here.

JTextComponentWordParser extends (i.e., is a subclass of) StringWordParser. JTextComponentWordParser's constructor calls the StringWordParser constructor, passing the text contained by the JTextComponent whose contents are being enumerated. JTextComponentWordParser's constructor also keeps a private reference to the JTextComponent being enumerated. This will be used later to modify the JTextComponent's contents when misspellings are corrected.

JTextComponentWordParser overrides the following StringWordParser methods:

deleteText
insertText
replaceText

Each of the overridden methods performs an edit operation on the portion of the JTextComponent containing the current word. Each overridden method also invokes the base class (StringWordParser) method to perform the corresponding operation on the String containing a copy of the JTextComponent's contents.
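
The "edit the component, then edit the String copy" pattern can be sketched against javax.swing.text.Document, the text model that a JTextComponent exposes through getDocument(). This is not the JTextComponentWordParser source: the class below does not extend StringWordParser, its method signatures are assumptions, and a StringBuilder stands in for the base class's copy of the text.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.PlainDocument;

/** Illustrates mirroring each edit into a component's Document and a text copy. */
final class MirroredTextSketch {
    private final Document component; // text model of the component being edited
    private final StringBuilder copy; // plays the role of the base class's String copy

    MirroredTextSketch(Document component) {
        this.component = component;
        this.copy = new StringBuilder(textOf(component));
    }

    void deleteText(int offset, int length) {
        try {
            component.remove(offset, length); // edit the component first
        } catch (BadLocationException e) {
            throw new IllegalArgumentException(e);
        }
        copy.delete(offset, offset + length); // then the copy, as the base class would
    }

    void insertText(int offset, String text) {
        try {
            component.insertString(offset, text, null);
        } catch (BadLocationException e) {
            throw new IllegalArgumentException(e);
        }
        copy.insert(offset, text);
    }

    void replaceText(int offset, int length, String text) {
        deleteText(offset, length);
        insertText(offset, text);
    }

    String copyText() {
        return copy.toString();
    }

    String componentText() {
        return textOf(component);
    }

    static String textOf(Document doc) {
        try {
            return doc.getText(0, doc.getLength());
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Convenience for trying the sketch without creating a visible component. */
    static Document documentOf(String text) {
        PlainDocument doc = new PlainDocument();
        try {
            doc.insertString(0, text, null);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return doc;
    }
}
```

Keeping both edits inside one method is what keeps word offsets in the copy valid for the component, so a correction made at a given offset lands on the same characters in both places.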

JTextComponentWordParser defines a new public method called highlightWord, which is not in the base StringWordParser class. The highlightWord method selects the current word in the JTextComponent. The highlightWord method is called by the application (in this case, JTextAreaInteractiveDemo) to highlight the misspelled word in the JTextArea so the user can spot the word easily.

By employing this approach, a new WordParser-derived class can be created to enumerate and update other kinds of text sources, such as components, containers, and documents.

Sentry Class Details

The Sentry Spelling-Checker Engine's API provides direct, straightforward calling from Java applications, applets, and servlets.

Sentry's application program interface consists of various Java classes:

Class SpellingSession: Performs general-purpose spell checking of words and strings. Major methods include:

check: Check the spelling of a word or String

getLexicons, setLexicons: Get/set the set of lexicons (dictionaries) used to check spelling.

getOption, setOption: Get/set option values:

CASE_SENSITIVE_OPT: Enable if words with different letter-case patterns should be treated as different words.

IGNORE_ALL_CAPS_WORD_OPT: Enable if checked words consisting entirely of upper-case letters should be ignored.

IGNORE_CAPPED_WORD_OPT: Enable if checked words should be ignored if they begin with an upper-case letter.

IGNORE_DOMAIN_NAMES_OPT: Enable to ignore (skip) words that appear to be Internet domain names.

IGNORE_MIXED_CASE_OPT: Enable if checked words containing an unusual mixture of upper- and lower-case letters should be ignored.

IGNORE_MIXED_DIGITS_OPT: Enable if checked words containing a mixture of letters and digits or other symbols should be ignored.

REPORT_DOUBLED_WORD_OPT: Enable if two occurrences of the same word in a row should be reported.

REPORT_MIXED_CASE_OPT: Enable if checked words containing an unusual combination of upper- and lower-case letters should be reported.

REPORT_MIXED_DIGITS_OPT: Enable if checked words containing a combination of letters and digits or other symbols should be reported.

REPORT_UNCAPPED_OPT: Enable if checked words whose first character is not capitalized should be reported.

SPLIT_CONTRACTED_WORDS_OPT: Enable if apostrophes should, if necessary, be treated as word separators, and each sub-word spell checked individually.

SPLIT_HYPHENATED_WORDS_OPT: Enable if hyphens ("-") should, if necessary, be treated as word separators, and each sub-word spell checked individually.

SPLIT_WORDS_OPT: Enable if words should if necessary be treated as a series of concatenated sub-words, and each sub-word spell checked individually.

STRIP_POSSESSIVES_OPT: Enable if possessives of the form xxx's and xxxs' should be removed from words before checking their spelling.

SUGGEST_SPLIT_WORDS_OPT: Enable if suggest() should attempt to split words into two valid sub-words.

suggest: Locate suggested alternate spellings for a misspelled word

Classes FileTextLexicon, StreamTextLexicon, and MemTextLexicon represent permanent (file or stream based) or temporary (memory based) lexicons (dictionaries). Major methods include:

addWord: Add a word to the lexicon.

deleteWord: Remove a word from the lexicon.

words: Enumerate the words in the lexicon.

Classes StringWordParser and HTMLStringWordParser are used to access and edit the words contained in a String. HTMLStringWordParser is used to spell check HTML, skipping over the markups and checking just the text. Major methods include:

deleteText: Delete a specified number of characters from the text starting at the current cursor position.

deleteWord: Delete the word at the cursor position.

getCursor: Obtain the current cursor position (the position of the current word), expressed as an offset from the start of the text.

getNumReplacements: Get the number of words replaced so far.

getWord: Obtain the word at the WordParser's current cursor position.

insertText: Insert text at a specified position.

isDoubledWord: Determine whether the current word and the previous word are identical, with no punctuation appearing between them.

nextWord: Obtain the current word and advance to the next word.

replaceWord: Replace the word at the current position with a new word.

setCursor: Set the cursor to a given position.

toString: Convert the text to String form.
