Sentry Spelling Checker Engine C Source Code
Packaging: ANSI C source code
The Sentry engine supports a wide range of options. All of Sentry's options have reasonable defaults, so you need to set only the few properties that your application requires.
The Sentry engine can be built to support two character set types: the single-byte ISO-8859 character set family and the double-byte Unicode character set.
NOTE: Unicode is supported only in Sentry Source SDK, and only at the Core API and Stateless API levels.
A character set is a formalized mapping between character shapes and the numeric codes that represent those shapes. For example, in the familiar ASCII character set, the letter "A" is represented as numeric code 65. Computers and software that support ASCII all agree that numeric code 65 represents "A". ASCII is very widespread, and if all the text to be spell-checked contained only the 26 letters "A" through "Z" plus the digits and punctuation contained within the ASCII set, character sets would not be an issue. Such a narrow approach would work only for English, however. Words in other languages require characters outside the ASCII set, such as "Ø" and "ü".
The ISO-8859 character set family uses a single byte to represent each character. One byte can represent 256 characters, with numeric codes ranging from 0 to 255. The ASCII character set also fits in a single byte, but uses only half (128) of the available numeric values, from 0 to 127. The first 128 characters of each ISO-8859 character set, with numeric codes 0 to 127, are identical to ASCII. The upper 128 characters, with numeric codes 128 to 255, contain (among other things) letters with diacritical marks (accents) appropriate to a specific human language or group of languages. Because the lower half of every ISO-8859 character set is identical to ASCII, an application that needs only ASCII can use any of the ISO-8859 character sets. This page covers ten ISO-8859 character sets, named ISO-8859-1 to ISO-8859-10; later parts have since been added to the ISO-8859 standard.
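The split between the shared ASCII half and the set-specific upper half can be illustrated with a short C sketch. The function names `is_ascii_byte` and `is_pure_ascii` are hypothetical and are not part of the Sentry API; the code only shows the numeric ranges described above.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the byte lies in the ASCII half (codes 0 to 127), which is
   shared by every ISO-8859 character set, and 0 if it lies in the upper
   half (codes 128 to 255), whose meaning depends on the set in use. */
int is_ascii_byte(unsigned char c)
{
    return c < 128;
}

/* Returns 1 if every byte in the buffer is ASCII, meaning the text is
   interpreted identically under any ISO-8859 character set. */
int is_pure_ascii(const unsigned char *text, size_t len)
{
    size_t i;

    assert(text != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (!is_ascii_byte(text[i])) {
            return 0;
        }
    }
    return 1;
}
```

Text for which `is_pure_ascii` returns 1 could be checked with any ISO-8859 setting; text containing a byte such as 0xFC ("ü" in ISO-8859-1) needs the matching character set to be interpreted correctly.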
The Unicode character set is an attempt to combine the character shapes used by most human languages into a single set. In the form of Unicode supported by the Sentry engine, known as UCS-2, each character is represented as a two-byte value. (A four-byte form of Unicode, known as UCS-4, also exists but is not supported by the Sentry engine.) Two bytes can express 65,536 distinct values. The lower 256 characters in Unicode are identical to the ISO-8859-1 character set, and the lower 128 characters are identical to ASCII.
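Because the lower 256 Unicode codes match ISO-8859-1, converting ISO-8859-1 text to UCS-2 amounts to widening each byte to 16 bits. The following is a minimal sketch of that idea; the type name `ucs2_char` and the function `latin1_to_ucs2` are hypothetical and do not come from the Sentry source code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 16-bit character type; unsigned short holds at least
   16 bits on every conforming C implementation. */
typedef unsigned short ucs2_char;

/* Widens each ISO-8859-1 byte to a UCS-2 value. The numeric code does not
   change, because codes 0 to 255 mean the same thing in both sets.
   The caller must supply a destination buffer of at least len elements. */
void latin1_to_ucs2(const unsigned char *src, size_t len, ucs2_char *dst)
{
    size_t i;

    assert((src != NULL && dst != NULL) || len == 0);
    for (i = 0; i < len; i++) {
        dst[i] = (ucs2_char)src[i];
    }
}
```

This shortcut applies only to ISO-8859-1. The upper halves of the other ISO-8859 sets map to different Unicode codes and would need a lookup table.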
When checking the spelling of strings and text blocks, the Sentry engine uses information about the character set to classify characters as alphabetic, spaces, punctuation, etc., and from that classification determines where word boundaries lie. It also uses information about the character set to determine character similarity when searching for suggestions (e.g., "Á" is more similar to "A" and "á" than it is to "É").
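To illustrate how character classification leads to word boundaries, here is a simplified C sketch for ISO-8859-1 text. It is not Sentry's actual classification logic: the functions `latin1_is_alpha` and `count_words` are hypothetical, and the rule used (ASCII letters plus codes 0xC0 to 0xFF, excluding the multiplication and division signs) is an assumption that ignores details such as apostrophes and digits.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified alphabetic test for ISO-8859-1 (an assumption made for this
   sketch). ASCII letters are alphabetic, as are the accented letters in
   the range 0xC0 to 0xFF, except 0xD7 (multiplication sign) and
   0xF7 (division sign). Assumes an ASCII-compatible execution character set. */
int latin1_is_alpha(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) {
        return 1;
    }
    return c >= 0xC0 && c != 0xD7 && c != 0xF7;
}

/* Counts words by treating each maximal run of alphabetic characters as
   one word; any non-alphabetic character acts as a word boundary. */
size_t count_words(const unsigned char *text, size_t len)
{
    size_t i;
    size_t words = 0;
    int in_word = 0;

    assert(text != NULL || len == 0);
    for (i = 0; i < len; i++) {
        int alpha = latin1_is_alpha(text[i]);
        if (alpha && !in_word) {
            words++; /* a new word starts at this character */
        }
        in_word = alpha;
    }
    return words;
}
```

The same bytes could split differently under another ISO-8859 set, since the upper 128 codes are not all letters in every set. This is why the classification has to be driven by the selected character set.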
Windows SDK: The Sentry Windows DLL supports the ISO-8859 character sets only.
By default, the Sentry engine uses the single-byte ISO-8859 character set family, and by default the ISO-8859-1 member of that family, also known as Latin 1. Other ISO-8859 character sets can be selected through a run-time setting. Following is a description of each character set:
ISO-8859-1 (Latin 1): Western European languages, including English, French, Italian, German, Spanish, Dutch, Swedish, Finnish, Norwegian, Portuguese (Brazilian and Iberian), and Danish.
ISO-8859-2 (Latin 2): Central and Eastern European languages, including Czech, Hungarian, Polish, Romanian, Croatian, Slovak, and Slovenian.
ISO-8859-3 (Latin 3): Maltese.
ISO-8859-4 (Latin 4): North European languages, including Estonian, Latvian, Lithuanian, Greenlandic, and Lappish.
ISO-8859-5 (Cyrillic): Bulgarian, Byelorussian, Macedonian, Russian, and Serbian.
ISO-8859-6: Arabic.
ISO-8859-7: Greek.
ISO-8859-8 (Hebrew): Hebrew and Yiddish.
ISO-8859-9 (Latin 5): Turkish.
ISO-8859-10 (Latin 6): Nordic languages, including Greenlandic Inuit, Lappish, and Icelandic.
Note that the use of these character sets does not imply any particular capabilities in the Sentry engine with regard to support for languages which can be represented using the character sets. For example, although the Sentry engine contains information about ISO-8859-6, it does not contain information about Arabic languages that can be expressed using that character set.
Detailed information on the ISO-8859 character sets is available in documents published by the International Organization for Standardization (ISO). A helpful though unofficial reference is available at http://czyborra.com/charsets/iso8859.html.
The Sentry source code can be built to support Unicode through a compile-time setting. The Unicode build of the Sentry engine interprets all character data (the SSCE_CHAR data type) as 2-byte UCS-2 characters. The Unicode Sentry engine does not directly support any of the UCS Transformation Formats (UTFs).
Text lexicons written by the Unicode Sentry engine will contain UCS-2 characters in "big endian" (most significant byte first) order. If the first character of a text lexicon is a byte-order mark (BOM), the Sentry engine will use the indicated byte ordering when reading the lexicon.
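The byte-order handling described above can be sketched in C as follows. This is an illustration only, not code from the Sentry engine: the names `byte_order`, `detect_byte_order`, and `read_ucs2` are hypothetical, and falling back to big-endian when no BOM is present is an assumption based on the order in which the engine writes text lexicons.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN } byte_order;

/* Examines the first two bytes of a buffer for a byte-order mark (U+FEFF).
   Bytes FE FF indicate big-endian data; FF FE indicate little-endian data.
   *skip receives the number of BOM bytes to step over (2 or 0).
   Without a BOM, big-endian is assumed (an assumption of this sketch). */
byte_order detect_byte_order(const unsigned char *buf, size_t len, size_t *skip)
{
    assert(buf != NULL && skip != NULL);
    *skip = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            *skip = 2;
            return ORDER_BIG_ENDIAN;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            *skip = 2;
            return ORDER_LITTLE_ENDIAN;
        }
    }
    return ORDER_BIG_ENDIAN;
}

/* Assembles one UCS-2 character from two bytes in the given order. */
unsigned short read_ucs2(const unsigned char *p, byte_order order)
{
    assert(p != NULL);
    if (order == ORDER_BIG_ENDIAN) {
        return (unsigned short)((p[0] << 8) | p[1]);
    }
    return (unsigned short)((p[1] << 8) | p[0]);
}
```

A reader built along these lines would call `detect_byte_order` once on the start of the file, skip the reported number of bytes, and then decode the rest of the file two bytes at a time with `read_ucs2`.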
Lexicons compressed using the Unicode Sentry engine will contain UCS-2 characters.
English compressed lexicons included with the Sentry Source SDK use single-byte ISO-8859-1 characters. The Unicode Sentry engine will read either single-byte compressed lexicons or 2-byte Unicode lexicons.
Copyright © 2015 Wintertree Software Inc.