Sentry Spelling Checker Engine - Support

Converting version 4.x (and earlier) text lexicon files to version 5.x format

Product: Sentry Spelling Checker Engine Windows SDK, Sentry Spelling Checker Engine Source SDK

Problem: The format of text lexicon files changed when Sentry version 5.1 was released. This article describes how older-format .tlx files can be converted to the new format.

Solution:

Prior to version 5.1, each text lexicon file had a single purpose or type: Change type, Exclude type, Ignore type, or Suggest type. The type of a text lexicon file determined how the Sentry engine behaved when it encountered a word contained in that lexicon. Every word in a text lexicon produced the behavior indicated by the lexicon's type.

Starting with version 5.1, the text lexicon file format was changed so actions could be assigned to specific words in a text lexicon. A single text lexicon can now contain words that are to be automatically changed, excluded, ignored, or conditionally changed.

The most common use of text lexicons is to hold words that are to be ignored (skipped) when the engine encounters them. Prior to version 5.1, these words were stored in Ignore-type text lexicons. In version 5.1 and later, these words have the SSCE_IGNORE_ACTION action assigned to them. The "ignore" action is the default, so any word that has no action specified is treated as though it had the "ignore" action. Therefore, any version 4.x Ignore-type text lexicon may be safely used by a version 5.x Sentry engine. This includes the "uignore.tlx" sample text lexicon file shipped with the Sentry SDKs.
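The default-action rule can be illustrated with a minimal Perl sketch. It is not the engine's parser: parse_entry is a hypothetical helper, and the line layout it assumes (word, tab, one-letter action code, optional replacement word) and the action letters A, C, e, and i are inferred from the output of the conversion script later in this article.

```perl
use strict;
use warnings;

# Hypothetical helper: split one 5.x lexicon line into word, action code,
# and replacement word. The layout is an assumption inferred from the
# conversion script's output, not a documented Sentry API.
sub parse_entry {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                        # strip the line terminator
    my ($word, $rest) = split(/\t/, $line, 2);   # word, then action + replacement
    # A line with no action behaves as though it had the "ignore" action.
    $rest = 'i' unless defined $rest && length $rest;
    my $action = substr($rest, 0, 1);
    my $other  = substr($rest, 1);
    return ($word, $action, $other);
}

# Made-up sample lines: no action, auto-change, and exclude.
for my $line ("Wintertree\n", "teh\tAthe\n", "ain't\te\n") {
    my ($word, $action, $other) = parse_entry($line);
    print "$word: action=$action other=$other\n";
}
```

The first sample line carries no action, so the sketch reports it with the action code i, which is how an unconverted Ignore-type word list would be read under this assumption.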

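Words in Change-type, Exclude-type, and Suggest-type text lexicons require conversion. The following Perl script converts a lexicon of any of these types to the 5.x format. It reads the old lexicon from the file named on the command line (or from standard input) and writes the converted lexicon to standard output: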

# Convert a pre-5.x text lexicon (.tlx) file to the version 5.x format.
# The converted lexicon is written to stdout.
# Copyright (c) 1999 Wintertree Software Inc.
# www.wintertree-software.com

# Set up a map to convert the old language ids to the 5.x values.
%langMap = (
    1033, 24941,
    2057, 25202,
    1027, 29539,
    1029, 25466,
    1030, 25697,
    1043, 25717,
    1035, 26217,
    1036, 26226,
    1031, 26469,
    1038, 26741,
    1040, 26996,
    1044, 25442,
    2068, 25444,
    1045, 28780,
    1046, 28770,
    2070, 28783,
    1049, 29301,
    1034, 29552,
    1053, 29559
);

# Read the header line by itself.
$header = <>;
($lid, $oldLangId, $lexType) = split(" ", $header);

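# Convert the header line. Warn if the old language id is not in the map,
# because the converted header would then have no language id.
warn "No 5.x language id for old id '$oldLangId'\n"
    unless exists $langMap{$oldLangId};
print "$lid $langMap{$oldLangId}\n";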

# Convert the words.
while (<>) {
    chomp;  # remove only the newline; chop would eat a letter if the last line has none
    if ($lexType == 0) {
        # change-type => SSCE_AUTO_CHANGE_PRESERVE_CASE_ACTION
        ($word, $otherWord) = split(/:/);
        print "$word\tA$otherWord\n";
    }
    elsif ($lexType == 1) {
        # ignore-type => SSCE_IGNORE_ACTION
        print "$_\ti\n";
    }
    elsif ($lexType == 2) {
        # suggest-type => SSCE_CONDITIONAL_CHANGE_PRESERVE_CASE_ACTION
        ($word, $otherWord) = split(/:/);
        print "$word\tC$otherWord\n";
    }
    elsif ($lexType == 3) {
        # exclude-type => SSCE_EXCLUDE_ACTION
        print "$_\te\n";
    }
}
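
As a worked example of the conversion, the following sketch applies the same rules to a small in-memory lexicon instead of a file. The sample words, the "#LID" header token, and the convert_lexicon helper are assumptions made up for illustration; only the language id pair 1033/24941, the type numbers, and the action letters come from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: convert the lines of a pre-5.1 lexicon (header first)
# and return the lines of the 5.x lexicon. $lang_map is a hash reference
# from old language ids to 5.x language ids.
sub convert_lexicon {
    my ($lang_map, @lines) = @_;
    my ($lid, $old_lang, $type) = split(' ', shift @lines);
    my @out = ("$lid $lang_map->{$old_lang}");
    for my $entry (@lines) {
        if ($type == 0 || $type == 2) {
            # change-type (A) and suggest-type (C) entries are "word:otherWord"
            my ($word, $other) = split(/:/, $entry);
            my $code = $type == 0 ? 'A' : 'C';
            push @out, "$word\t$code$other";
        }
        else {
            # ignore-type (i) and exclude-type (e) entries are a single word
            push @out, "$entry\t" . ($type == 1 ? 'i' : 'e');
        }
    }
    return @out;
}

# Made-up Change-type lexicon: type 0, old language id 1033.
my @old = ("#LID 1033 0", "teh:the", "recieve:receive");
print "$_\n" for convert_lexicon({ 1033 => 24941 }, @old);
```

For this sample the sketch prints the header "#LID 24941" followed by the two words, each separated by a tab from the letter A and its replacement. To run the article's script on a real file, save it under a name of your choice (for example, the hypothetical convert.pl) and invoke it as "perl convert.pl old.tlx > new.tlx", where both file names are placeholders.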
