Wintertree Software Inc.

Sentry Spelling Checker Engine C Source Code



Technical details


Application program interface

CheckString

Spell checks the words in a string -- a phrase, sentence, paragraph, or an entire document. CheckString can also optionally detect uncapitalized words, repeated words, words with embedded digits, etc.

CheckWord

Like CheckString, but spell checks one word at a time.

Suggest

Locates suggested alternative spellings for a misspelled word. Also produces a score, expressed as a percentage, showing the degree of correlation for each alternative word found. The set of words is returned in order of decreasing score, so the first word is the best choice.
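The scoring and ordering described for Suggest can be illustrated with a small sketch. It assumes an edit-distance-based similarity, which is only one plausible way to derive a percentage; the source does not say how Sentry computes its scores. The Suggestion type and all function names below are hypothetical and are not part of the Sentry API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical suggestion record; not a Sentry data type. */
typedef struct {
    const char *word;
    int score;                      /* similarity as a percentage, 0..100 */
} Suggestion;

/* Levenshtein edit distance using a single rolling row.
 * Assumption for this sketch: candidate words are shorter than 64 chars;
 * longer ones fall back to the worst-case distance. */
static size_t edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t row[64];
    size_t i, j;

    if (lb >= 64) return la > lb ? la : lb;
    for (j = 0; j <= lb; j++) row[j] = j;
    for (i = 1; i <= la; i++) {
        size_t diag = row[0];       /* distance for (i-1, j-1) */
        row[0] = i;
        for (j = 1; j <= lb; j++) {
            size_t up = row[j];     /* distance for (i-1, j) */
            size_t best = diag + (a[i - 1] == b[j - 1] ? 0 : 1);
            if (up + 1 < best) best = up + 1;
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1;
            row[j] = best;
            diag = up;
        }
    }
    return row[lb];
}

/* Converts the distance into a percentage relative to the longer word. */
static int similarity_percent(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t longest = la > lb ? la : lb;
    if (longest == 0) return 100;
    return (int)(100 - (100 * edit_distance(a, b)) / longest);
}

/* qsort comparator: higher scores first. */
static int by_score_desc(const void *x, const void *y)
{
    const Suggestion *a = x, *b = y;
    return (b->score > a->score) - (b->score < a->score);
}

/* Scores every candidate against the misspelled word, then sorts the
 * list so that the best choice ends up at index 0. */
void rank_suggestions(const char *misspelled, Suggestion *list, size_t n)
{
    size_t i;
    assert(misspelled != NULL && (list != NULL || n == 0));
    for (i = 0; i < n; i++)
        list[i].score = similarity_percent(misspelled, list[i].word);
    qsort(list, n, sizeof list[0], by_score_desc);
}
```

After rank_suggestions returns, a caller can take list[0] as the default replacement and offer the remaining entries as alternatives.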

ReplaceStringWord, DelStringWord

Replace one word in a string with another (used to correct misspellings), or delete a word in a string (used to delete doubled words).
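As a rough illustration of what replacing or deleting a word in a string involves, the sketch below edits a NUL-terminated buffer in place. The names replace_span and delete_word, their parameters, and their return values are assumptions made for this example; they are not the signatures of ReplaceStringWord or DelStringWord.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: replaces the len bytes starting at offset in the
 * NUL-terminated string str (buffer capacity cap bytes) with repl.
 * Returns 0 on success, or -1 if the span is invalid or the result
 * would not fit; the buffer is left unchanged on failure. */
int replace_span(char *str, size_t cap, size_t offset, size_t len,
                 const char *repl)
{
    size_t cur, rlen, tail;

    assert(str != NULL && repl != NULL);
    cur = strlen(str);
    rlen = strlen(repl);
    if (offset > cur || len > cur - offset) return -1;
    if (cur - len + rlen + 1 > cap) return -1;

    tail = cur - offset - len + 1;  /* bytes after the span, including NUL */
    memmove(str + offset + rlen, str + offset + len, tail);
    memcpy(str + offset, repl, rlen);
    return 0;
}

/* Hypothetical helper: deletes a word and, if present, the single space
 * that follows it, so removing a doubled word leaves normal spacing. */
int delete_word(char *str, size_t cap, size_t offset, size_t len)
{
    size_t cur;

    assert(str != NULL);
    cur = strlen(str);
    if (offset > cur || len > cur - offset) return -1;
    if (str[offset + len] == ' ') len++;
    return replace_span(str, cap, offset, len, "");
}
```

Using memmove rather than memcpy for the tail matters here because the source and destination regions overlap whenever the replacement differs in length from the original word.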

GetOption, SetOption

Get and set option values. See Options for more information.

CreateLex, OpenLex

Creates a new dictionary (lexicon) or opens an existing one. Any number of dictionaries can be open at once. Once opened, dictionaries are searched automatically by CheckString, CheckWord and Suggest.

AddToLex, DelFromLex

Adds words to, or removes words from, text (user) dictionaries.

GetLexInfo, GetLex

Obtains information about a dictionary, or obtains a copy of its contents.

CompressLexInit, CompressLexFile, CompressLexEnd

Create a new compressed dictionary from words stored in one or more word list files.

Options

The Sentry engine supports the following options. All of Sentry's options have reasonable defaults, so you need to change or set only a small number of options to suit your application's requirements.

CASE_SENSITIVE_OPT

Controls whether Sentry checks spelling with regard to differences in letter case (e.g., treating america and America as two separate words) or ignores differences in letter case (e.g., treating america and America as the same word). Keeping this option enabled improves performance (because Wintertree Software's dictionaries are optimized for case-sensitive access) and ensures that capitalization errors are detected. You might disable this option when checking text entered in all-caps, for example.

IGNORE_HTML_MARKUPS_OPT

Controls whether Sentry automatically skips over (ignores) HTML markup appearing in the text being spell checked.

IGNORE_ALL_CAPS_WORD_OPT

Controls whether Sentry automatically skips (ignores) words containing just capital letters. If the text contains many abbreviations or acronyms, setting this option prevents them from being reported as spelling errors.

IGNORE_CAPPED_WORD_OPT

Controls whether Sentry automatically skips (ignores) words starting with a capital letter. If the text contains many proper names, setting this option prevents them from being reported as spelling errors.

IGNORE_MIXED_CASE_OPT

Controls whether Sentry automatically skips (ignores) words which contain a mixture of upper- and lower-case letters. If the text contains variable names, technical jargon, etc. (e.g., YearToDate, FmtString), setting this option prevents them from being reported as spelling errors.

IGNORE_MIXED_DIGITS_OPT

Controls whether Sentry automatically skips (ignores) words which contain embedded digits. When checking general text, enabling this option will prevent strings like product codes from being reported as spelling errors. Sentry is flexible enough that it can be used for purposes such as validating product codes and part numbers, in which case this option would be disabled.

IGNORE_DOMAIN_NAMES_OPT

Controls whether Sentry automatically skips (ignores) words which appear to be Internet domain names, URLs, e-mail addresses, etc.

IGNORE_NON_ALPHA_WORDS_OPT

Controls whether Sentry automatically skips (ignores) words which contain no alphabetic letters. When checking general text, enabling this option will prevent non-alphabetic strings like phone numbers and ZIP codes from being reported as spelling errors. Sentry is flexible enough that it can be used for purposes such as validating part numbers, in which case this option would be disabled.
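The word categories named by the IGNORE_ALL_CAPS_WORD_OPT, IGNORE_CAPPED_WORD_OPT, IGNORE_MIXED_CASE_OPT, IGNORE_MIXED_DIGITS_OPT and IGNORE_NON_ALPHA_WORDS_OPT options can be illustrated with a small classifier. This is a minimal sketch of one way to draw the distinctions, not Sentry's implementation. The flag names and functions are hypothetical, and the sketch relies on the C library's character classification, which in the default locale recognizes ASCII letters only.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical category flags; a word can carry more than one. */
enum {
    W_ALL_CAPS     = 1,   /* every letter is upper case, e.g. NASA       */
    W_CAPPED       = 2,   /* only the first letter is upper case         */
    W_MIXED_CASE   = 4,   /* other mixtures, e.g. YearToDate, TUesday    */
    W_MIXED_DIGITS = 8,   /* letters plus digits, e.g. Monday14          */
    W_NON_ALPHA    = 16   /* no letters at all, e.g. 555-1234            */
};

/* Returns a bitmask of the categories that apply to word.
 * An all-lower-case word with no digits returns 0. */
unsigned classify_word(const char *word)
{
    size_t upper = 0, lower = 0, digits = 0, i;
    unsigned flags = 0;

    assert(word != NULL);
    for (i = 0; word[i] != '\0'; i++) {
        unsigned char c = (unsigned char)word[i];
        if (isupper(c)) upper++;
        else if (islower(c)) lower++;
        else if (isdigit(c)) digits++;
    }

    if (upper + lower == 0) return W_NON_ALPHA;
    if (digits > 0) flags |= W_MIXED_DIGITS;

    if (lower == 0)
        flags |= W_ALL_CAPS;
    else if (upper == 1 && isupper((unsigned char)word[0]))
        flags |= W_CAPPED;
    else if (upper > 0)
        flags |= W_MIXED_CASE;
    return flags;
}

/* A word is skipped when it falls into any category the caller has
 * chosen to ignore (ignore_mask is a combination of the flags above). */
int should_ignore(unsigned word_flags, unsigned ignore_mask)
{
    return (word_flags & ignore_mask) != 0;
}
```

An application checking source-code comments might pass W_ALL_CAPS | W_MIXED_CASE as the mask, while one validating part numbers would leave W_MIXED_DIGITS and W_NON_ALPHA out of the mask so those strings are still checked.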

REPORT_DOUBLED_WORD_OPT

Sentry can automatically detect the same word appearing twice in a row (e.g., Minutes of the the meeting were filed yesterday). This option controls whether doubled words are reported to your application.
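Doubled-word detection can be sketched as a single pass over the text that remembers the previous word. The function below is a hypothetical illustration, not Sentry code. It assumes that words are runs of ASCII letters, that the comparison ignores case, and that two identical words count as doubled only when nothing but whitespace separates them.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of two n-character words. */
static int same_word(const char *a, const char *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Returns the offset of the second word of the first doubled pair in
 * text, or -1 if no doubled word is found. */
long find_doubled_word(const char *text)
{
    const char *prev = NULL;
    size_t prev_len = 0;
    size_t i = 0;

    assert(text != NULL);
    while (text[i] != '\0') {
        size_t start, len;
        int only_space = 1;

        /* Skip separators, noting whether anything other than
         * whitespace (such as a comma) sits between the two words. */
        while (text[i] != '\0' && !isalpha((unsigned char)text[i])) {
            if (!isspace((unsigned char)text[i])) only_space = 0;
            i++;
        }
        start = i;
        while (isalpha((unsigned char)text[i])) i++;
        len = i - start;
        if (len == 0) break;        /* reached the end of the text */

        if (prev != NULL && only_space && len == prev_len &&
            same_word(prev, text + start, len))
            return (long)start;

        prev = text + start;
        prev_len = len;
    }
    return -1;
}
```

Returning the offset of the second occurrence fits the way doubled words are usually corrected: the application deletes the word at that offset and keeps the first one.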

REPORT_MIXED_CASE_OPT

Sentry can automatically detect words which have an unusual combination of upper- and lower-case letters (e.g., TUesday). This option controls whether mixed-case words are reported to your application.

REPORT_MIXED_DIGITS_OPT

Sentry can automatically detect words which contain embedded digits, which might result from accidentally omitting a space (e.g., Monday14 August). This option controls whether words with embedded digits are reported to your application.

SPLIT_CONTRACTED_WORDS_OPT

When this option is enabled, apostrophes will, if necessary, be treated as word separators, and each sub-word checked individually. This option is intended for use with Wintertree Software's French and Italian dictionaries.

SPLIT_HYPHENATED_WORDS_OPT

When this option is enabled, hyphens ("-") will, if necessary, be treated as word separators, and each sub-word checked individually.

SPLIT_WORDS_OPT

When this option is enabled, words will, if necessary, be treated as a series of concatenated sub-words, and each sub-word checked individually. This option is intended for use with Wintertree Software's German and Finnish dictionaries.

STRIP_POSSESSIVES_OPT

When this option is enabled, possessives of the form 's and s' will be removed from words before checking their spelling.
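A possessive-stripping step might look like the sketch below. The function name is hypothetical. The treatment of the s' form is an assumption: the sketch drops only the apostrophe so that the plural (dogs) is what gets checked, and the source does not say whether Sentry does the same.

```c
#include <assert.h>
#include <string.h>

/* Returns the number of leading characters of word that should be
 * checked once a trailing possessive marker is ignored.
 *   dog's -> 3 (dog)     dogs' -> 4 (dogs)     dogs -> 4 (unchanged)
 * The word itself is not modified. */
size_t stripped_length(const char *word)
{
    size_t n;

    assert(word != NULL);
    n = strlen(word);
    if (n >= 3 && word[n - 2] == '\'' && word[n - 1] == 's')
        return n - 2;               /* drop the trailing 's */
    if (n >= 3 && word[n - 2] == 's' && word[n - 1] == '\'')
        return n - 1;               /* drop only the apostrophe */
    return n;
}
```

Returning a length instead of editing the string lets the caller look up the shortened word while leaving the original text untouched.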

SUGGEST_PHONETIC_OPT

When this option is enabled, the Suggest function locates suggested replacements for misspelled words using phonetic (sounds like) matching. Phonetic matching is best used with words which are badly misspelled, as can happen when checking text entered by children or people learning a second language. (Currently, phonetic matching works only with English pronunciation rules.)

SUGGEST_SPLIT_WORDS_OPT

When this option is enabled, the Suggest function will attempt to split the misspelled word into two, and will offer the split words as suggestions if both are valid words. This is useful for correcting words incorrectly joined by a missing space -- e.g., if this option is enabled, the Suggest function would suggest the dog as a replacement for thedog.
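The split-word idea can be sketched by trying every split point and accepting the first one where both halves are known words. The tiny DEMO_WORDS list stands in for a real dictionary, and the function names are hypothetical; this illustrates the technique and is not Sentry's implementation.

```c
#include <assert.h>
#include <string.h>

/* Stand-in dictionary for this sketch only. */
static const char *const DEMO_WORDS[] = { "the", "dog", "he", "do" };

/* Returns 1 if the first n characters of s form a known word. */
static int is_known(const char *s, size_t n)
{
    size_t i;
    for (i = 0; i < sizeof DEMO_WORDS / sizeof DEMO_WORDS[0]; i++)
        if (strlen(DEMO_WORDS[i]) == n && memcmp(DEMO_WORDS[i], s, n) == 0)
            return 1;
    return 0;
}

/* Tries each split point from left to right. On success writes
 * "left right" into out and returns 1; returns 0 if no split yields
 * two known words or out is too small. */
int suggest_split(const char *word, char *out, size_t out_size)
{
    size_t n, i;

    assert(word != NULL && out != NULL);
    n = strlen(word);
    if (n + 2 > out_size) return 0; /* room for the space and the NUL */

    for (i = 1; i < n; i++) {
        if (is_known(word, i) && is_known(word + i, n - i)) {
            memcpy(out, word, i);
            out[i] = ' ';
            memcpy(out + i + 1, word + i, n - i + 1);  /* copies the NUL */
            return 1;
        }
    }
    return 0;
}
```

With the demo list above, thedog splits after the third character because both the and dog are known, whereas thecat yields no suggestion.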

SUGGEST_TYPOGRAPHICAL_OPT

When this option is enabled, the Suggest function locates suggested replacements for misspelled words using typographical (looks like) matching. Typographical matching is best used with words which contain one or two spelling errors. See the SUGGEST_PHONETIC_OPT option for more information.

About character sets

The Sentry engine can be built to support two character set types: the single-byte ISO-8859 character set collection, and the double-byte Unicode set.

NOTE: Unicode is supported only in Sentry Source SDK, and only at the Core API and Stateless API levels.

A character set is a formalized mapping of character shapes to the numeric codes that represent those shapes. For example, in the familiar ASCII character set, the letter "A" is represented as numeric code 65. Computers and software that support ASCII all agree that numeric code 65 represents "A". ASCII is very widespread, and if all the text to be spell-checked contained only the 26 letters "A" through "Z" plus the digits and punctuation contained within the ASCII set, character sets would not be an issue. Such a narrow approach would work only for English, however. Words in other languages require characters outside the ASCII set, such as "Ø" and "ü".

The ISO-8859 character set family uses a single byte to represent characters. With one byte, 256 characters can be represented, with numeric codes ranging from 0 to 255. The ASCII character set also uses a single byte, and in fact uses only half (128) of the available numeric values, from 0 to 127. The first 128 characters, with numeric codes 0 to 127, of each ISO-8859 character set are identical to ASCII. The upper 128 characters, with numeric codes 128 to 255, contain (among other things) letters with diacritical marks (accents) appropriate to a specific human language or group of languages. Because the lower half of each ISO-8859 character set is identical to ASCII, an application which needs only ASCII can use any of the ISO-8859 character sets. Ten ISO-8859 character sets exist, named ISO-8859-1 to ISO-8859-10.

The Unicode character set is an attempt to combine character shapes used by most human languages into a single set. In Unicode, each character is represented as a two-byte value, known as UCS-2. (A four-byte version of Unicode, known as UCS-4, also exists but is not supported by the Sentry engine.) A range of 65536 values can be expressed in two bytes. The lower 256 characters in Unicode are identical to the ISO-8859-1 character set, and the lower 128 characters are identical to ASCII.
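The relationship between ASCII, ISO-8859-1 and UCS-2 described above means that ISO-8859-1 text can be widened to UCS-2 without a lookup table: each byte value becomes the 16-bit code with the same numeric value. The sketch below shows this along with a test for pure-ASCII text. The function names are hypothetical and the code is not taken from Sentry. The shortcut applies to ISO-8859-1 only, because the upper halves of the other ISO-8859 sets map to different Unicode codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if all n bytes are in the ASCII range 0..127, which is
 * shared by every ISO-8859 character set. */
int is_ascii(const unsigned char *src, size_t n)
{
    size_t i;
    assert(src != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (src[i] > 127) return 0;
    return 1;
}

/* Widens n ISO-8859-1 bytes into n UCS-2 code values. The caller must
 * supply room for n elements in dst. Returns the number written. */
size_t latin1_to_ucs2(const unsigned char *src, size_t n, uint16_t *dst)
{
    size_t i;
    assert((src != NULL && dst != NULL) || n == 0);
    for (i = 0; i < n; i++)
        dst[i] = src[i];            /* same numeric value, 16 bits wide */
    return n;
}
```

For example, the ISO-8859-1 bytes 0x41, 0xD8 and 0xFC ("A", "Ø" and "ü") become the UCS-2 values 0x0041, 0x00D8 and 0x00FC.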

When checking the spelling of strings and text blocks, the Sentry engine uses information about the character set to classify characters as alphabetic, spaces, punctuation, etc., and so determine where word boundaries lie. It also uses information about the character set to determine character similarity when searching for suggestions (e.g., "Á" is more similar to "A" and "á" than it is to "É").

Windows SDK: The Sentry Windows DLL supports the ISO-8859 character sets only.

ISO-8859

By default, the Sentry engine uses the single-byte ISO-8859 character set family, and by default the ISO-8859-1 member of that family, also known as Latin 1. Other ISO-8859 character sets can be selected through a run-time setting. Following is a description of each character set:

Note that the use of these character sets does not imply any particular capabilities in the Sentry engine with regard to support for languages which can be represented using the character sets. For example, although the Sentry engine contains information about ISO-8859-6, it does not contain information about Arabic languages that can be expressed using that character set.

Detailed information on the ISO-8859 character sets is available in documents published by the International Organization for Standardization (ISO). A helpful though unofficial reference is available at http://czyborra.com/charsets/iso8859.html.

Unicode

The Sentry source code can be built to support Unicode through a compile-time setting. The Unicode build of the Sentry engine interprets all character data (the SSCE_CHAR data type) as 2-byte UCS-2 characters. The Unicode Sentry engine does not directly support any of the UCS Transformation Formats (UTFs).

Text lexicons written by the Unicode Sentry engine will contain UCS-2 characters in "big endian" (most significant byte first) order. If the first character of a text lexicon is a byte-order mark (BOM), the Sentry engine will use the indicated byte ordering when reading the lexicon.
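The byte-order handling described above can be sketched as follows. The code checks the first two bytes of a buffer for a byte-order mark (the code 0xFEFF, stored as FE FF in big-endian order or FF FE in little-endian order) and then decodes 2-byte characters accordingly. The type and function names are hypothetical. Falling back to big-endian when no BOM is present is an assumption based on the order Sentry is said to write; the source does not state what the engine does in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN } ByteOrder;

/* Looks for a byte-order mark at the start of buf. Sets *skip to the
 * number of bytes the BOM occupies (2 or 0) and returns the byte order
 * to use for the characters that follow. */
ByteOrder detect_byte_order(const unsigned char *buf, size_t len,
                            size_t *skip)
{
    assert(buf != NULL && skip != NULL);
    *skip = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            *skip = 2;
            return ORDER_BIG_ENDIAN;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            *skip = 2;
            return ORDER_LITTLE_ENDIAN;
        }
    }
    return ORDER_BIG_ENDIAN;        /* assumed default when no BOM */
}

/* Decodes one UCS-2 character from the two bytes at p. */
uint16_t read_ucs2(const unsigned char *p, ByteOrder order)
{
    assert(p != NULL);
    if (order == ORDER_BIG_ENDIAN)
        return (uint16_t)((p[0] << 8) | p[1]);
    return (uint16_t)((p[1] << 8) | p[0]);
}
```

Decoding byte by byte, instead of casting the buffer to a 16-bit pointer, keeps the result independent of the byte order of the machine running the code.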

Lexicons compressed using the Unicode Sentry engine will contain UCS-2 characters.

English compressed lexicons included with the Sentry Source SDK use single-byte ISO-8859-1 characters. The Unicode Sentry engine will read either single-byte compressed lexicons or 2-byte Unicode lexicons.




Copyright © 2015 Wintertree Software Inc.