Wintertree Thesaurus Engine Java SDK
The Wintertree Thesaurus Engine API allows access to the thesaurus database from applications. The API is implemented as a collection of Java classes in a class library. The major classes are listed below.
ThesaurusSession
This is the main class through which applications access Wintertree Thesaurus Engine. It performs mapping of key words to word categories and word categories to synonyms. Major methods:
categoryNames: Enumerate the set of category names associated with a given key word.
getCategoryAntonym: Obtain the name of the category which is the antonym (opposite meaning) of a given category.
getSuggestions: Obtain a set of suggested replacements for a misspelled (or unknown) key word.
setThesauri: Define the set of thesauri to be used when looking up key words and categories in this session.
synonyms: Enumerate the set of synonyms associated with a given category name.
ThesaurusDialog
AWT dialog used to interact with the user to look up synonyms and antonyms.
JThesaurusDialog
Swing dialog used to interact with the user to look up synonyms and antonyms.
TextThesaurus
A persistent, editable thesaurus. TextThesaurus can be used to implement user thesauri, where users can add their own synonyms. Major methods:
addSynonym: Add a synonym to a given word category.
deleteSynonym: Remove a synonym from a given word category.
categoryNames: Enumerate the names of categories contained by the text thesaurus.
load: Load the contents of a text thesaurus from a stream.
open: Load the contents of a text thesaurus from a disk file.
setCategoryAntonym: Associate an antonym (opposite) category with another category.
Thesauri used by Wintertree Thesaurus Engine have a simple structure: A thesaurus contains one or more categories, and categories contain one or more terms. Put another way, a category is a collection of terms, and a thesaurus is a collection of categories.
A category usually (but not necessarily) contains terms with the same specific meaning. A category also has a descriptive name and optionally the name of an antonym category. For example, a category named "happy" might contain terms such as "joyful," "delighted," "elated," etc. It might have a category named "sad" as its antonym. Typically (but not necessarily) the "happy" category would contain all of the terms that mean "happy" and are therefore synonyms (or near synonyms) for "happy."
Given a key term (a term for which you want to locate synonyms), Wintertree Thesaurus Engine can determine which categories contain terms related to it. Given a category name, Wintertree Thesaurus Engine can produce the set of terms the category contains and the name of its antonym category (if one exists). You might wonder why the thesaurus engine doesn't skip the category part and simply return a set of synonyms for the key term. The reason is many words in English (and other languages) have multiple, different meanings. Each meaning is represented by its own category. For example, the word "set" has several meanings:
Set the lamp on the table. (A verb meaning "to place")
I just bought a new set of dishes. (A noun meaning "collection")
We have a set way of processing requests. (An adjective meaning "fixed")
The following list shows some possible categories and synonyms related to the key term "set."
place (verb): locate, place, post, situate, stand
collection (noun): assemblage, assembly, assortment, band, bloc, body, bunch, collection, collage, corps...
fixed (adjective): established, fixed, unvarying...
Synonyms for "fixed" are of little use if the user is seeking synonyms for the verb "place."
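The two-step lookup described above (key term to categories, then category to terms) can be sketched with plain Java collections. This is an illustrative model only, not the engine's implementation or its API: the class `FlatThesaurus` and its methods are hypothetical names, and "related to" is simplified to "the category lists the key term."

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model: a thesaurus is a flat map of
// category name -> terms. Not part of the Wintertree API.
final class FlatThesaurus {
    private final Map<String, List<String>> categories = new LinkedHashMap<>();

    void addCategory(String name, String... terms) {
        categories.put(name, List.of(terms));
    }

    // Step 1: find the names of the categories that list the key term.
    List<String> categoriesFor(String keyTerm) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : categories.entrySet()) {
            if (entry.getValue().contains(keyTerm)) {
                names.add(entry.getKey());
            }
        }
        return names;
    }

    // Step 2: return the terms contained by one category (empty if unknown).
    List<String> termsIn(String categoryName) {
        return categories.getOrDefault(categoryName, List.of());
    }

    public static void main(String[] args) {
        FlatThesaurus t = new FlatThesaurus();
        // Assumption for this sketch: each category also lists "set" itself.
        t.addCategory("place (verb)", "locate", "place", "post", "set", "situate", "stand");
        t.addCategory("collection (noun)", "assortment", "bunch", "collection", "set");
        t.addCategory("fixed (adjective)", "established", "fixed", "set", "unvarying");
        for (String category : t.categoriesFor("set")) {
            System.out.println(category + ": " + t.termsIn(category));
        }
    }
}
```

Presenting the category names first lets the user pick the intended meaning before any synonyms are shown.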
Antonym categories: Each category can optionally have an antonym category associated with it. The antonym category contains terms which have a meaning opposite to the original category. The antonym category name is simply another category name associated with each category, so it's possible to use the antonym category for other, application-specific purposes.
Category names: Each category has a unique name. The name is used to identify a specific category. Category names can be up to 511 characters long, and can include any printable character. Certain conventions are used in category names to encode specific kinds of information, but otherwise category names are free-format.
Parts of speech: One of the conventions used for category names in the general-purpose thesaurus files included with Wintertree Thesaurus Engine is a part-of-speech designation. The part-of-speech designation appears at the end of the category name and is represented by the following abbreviations:
Adjective: (adj.)
Adverb: (adv.)
Noun: (noun)
Verb: (verb)
For example, a category containing adjectives which are synonyms for "happy" might be named
happy (adj.)
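Because the designation is a naming convention rather than a separate field, an application that wants the part of speech has to read it from the end of the category name. A minimal sketch, assuming the designator is the last parenthesized group in the name; the class `PartOfSpeech` and its methods are hypothetical helpers, not part of the SDK.

```java
import java.util.Optional;

// Hypothetical helper that reads the trailing "(...)" designator
// from a category name such as "happy (adj.)".
final class PartOfSpeech {
    // Returns the designator without parentheses, e.g. "adj.",
    // or empty if the name does not end in a parenthesized group.
    static Optional<String> designatorOf(String categoryName) {
        String name = categoryName.trim();
        int open = name.lastIndexOf('(');
        if (open < 0 || !name.endsWith(")")) {
            return Optional.empty();
        }
        return Optional.of(name.substring(open + 1, name.length() - 1));
    }

    // Returns the category name with the trailing designator removed.
    static String baseNameOf(String categoryName) {
        String name = categoryName.trim();
        int open = name.lastIndexOf('(');
        if (open < 0 || !name.endsWith(")")) {
            return name;
        }
        return name.substring(0, open).trim();
    }

    public static void main(String[] args) {
        String name = "happy (adj.)";
        System.out.println(baseNameOf(name) + " -> " + designatorOf(name).orElse("(none)"));
    }
}
```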
Using certain conventions for category names and terms, it is possible to create the illusion of a hierarchal thesaurus, where categories appear to contain other categories. The hierarchal structure is an illusion because there is no hierarchy in the way the thesaurus is actually organized: a thesaurus is simply a flat collection of categories, and each category is a collection of terms. The pseudo-hierarchal structure is nevertheless flexible and useful.
The hierarchal thesaurus structure is effected through the following conventions:
Root: The top-most category (also known as the "root" category) is named "/".
Category names: The hierarchy is created using category names. Child categories are separated from parent categories by a forward slash (/). (Note that a forward slash, not a backslash (\), is used.) For example, in the following category name:
/animals/mammals/dogs
category "dogs" is a child of category "mammals," which is a child of category "animals." The "animals" category is a child of the root category ("/").
Child category references: A parent category's terms may contain references to child categories by prefixing the child categories' names with ">". The ">" character appearing at the beginning of a term indicates that the term is actually the name of a category which is a child of the current category.
Here's an example showing how these conventions are used to form a hierarchal thesaurus (note that only part of the thesaurus is shown):
/ (root category) contains:
>animals (reference to category /animals)
>plants (reference to category /plants)
/animals contains:
>amphibians
>birds
>fish
>mammals
>reptiles
/animals/mammals contains:
>cats
>dogs
/animals/mammals/dogs contains:
Airedale
boxer
golden retriever
The example above shows the contents of four categories (/, /animals, /animals/mammals, and /animals/mammals/dogs). A number of other categories, including /plants and /animals/mammals/cats are referenced but not shown. The first three categories contain only references to child categories, denoted by the ">" character. The /animals/mammals/dogs category contains three terms. The terms do not begin with ">".
Here are some other points to note about the pseudo-hierarchal thesaurus structure:
It is possible to refer to child categories that do not exist. When a thesaurus is compiled, unknown child category references are logged as warnings but do not result in errors.
When a term beginning with ">" is encountered, it is the calling application's responsibility (i.e., your application's responsibility) to translate that reference into a non-relative category name by replacing the ">" with "/" and concatenating the result onto the containing category's name. For example, category /animals/mammals contains term ">dogs", which is actually a reference to a child category. The full name of the child category, derived using the translation rule given previously, is /animals/mammals/dogs.
The maximum length of a category name is 511 characters. This limit includes all child categories in a particular branch of the hierarchy. The longest non-relative category name, from the root category to the deepest categories in the hierarchy, must not exceed the maximum.
Typically, the hierarchy in a thesaurus moves from general to specific: categories at the top of the hierarchy are general and broad, while categories at the bottom are specific. Child categories are specific "kinds of" their parents. For example, a dog is a kind of mammal, and a mammal is a kind of animal.
Given a complete, non-relative child category name (e.g., /animals/mammals/dogs), the name of the parent category can be derived by removing the last component of the category name, that is, everything from the last "/" to the end of the name.
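The translation rules in the points above (child reference to full name, full name to parent name, and the 511-character limit) can be sketched as a few string helpers. The class `CategoryPaths` and its method names are hypothetical and not part of the SDK. Two details are assumptions of this sketch rather than documented behavior: a child of the root category is written "/animals" (not "//animals"), and the root category has no parent.

```java
import java.util.Optional;

// Hypothetical helpers for the pseudo-hierarchal naming conventions.
final class CategoryPaths {
    static final String ROOT = "/";
    static final int MAX_NAME_LENGTH = 511;

    // A term beginning with ">" is a reference to a child category.
    static boolean isChildReference(String term) {
        return term.startsWith(">");
    }

    // Replace ">" with "/" and append the result to the containing
    // category's name. Assumption: avoid a double slash under the root.
    static String resolveChild(String parentName, String term) {
        if (!isChildReference(term)) {
            throw new IllegalArgumentException("Not a child reference: " + term);
        }
        String child = term.substring(1);
        String full = parentName.equals(ROOT) ? ROOT + child : parentName + "/" + child;
        if (full.length() > MAX_NAME_LENGTH) {
            throw new IllegalArgumentException("Category name exceeds " + MAX_NAME_LENGTH + " characters");
        }
        return full;
    }

    // Remove everything from the last "/" to the end of the name.
    // Assumption: the root has no parent; top-level categories have the root as parent.
    static Optional<String> parentOf(String fullName) {
        if (fullName.equals(ROOT)) {
            return Optional.empty();
        }
        int slash = fullName.lastIndexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("Not a full category name: " + fullName);
        }
        return Optional.of(slash == 0 ? ROOT : fullName.substring(0, slash));
    }

    public static void main(String[] args) {
        String dogs = resolveChild("/animals/mammals", ">dogs");
        System.out.println(dogs + " has parent " + parentOf(dogs).orElse("(none)"));
    }
}
```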
The Wintertree Thesaurus Engine SDKs come with several thesauri, each in American, British (International), and Canadian spelling variants. This topic describes each included file.
Large general purpose: This is a large, comprehensive, non-hierarchal thesaurus useful for locating synonyms and related words. This thesaurus is intended to be presented to a human user who will interactively explore categories to find the right term. Categories contain a broad selection of terms. Category names end in one of the following parts-of-speech designators:
(adj.): Adjective
(adv.): Adverb
(noun): Noun. If the category name is a singular noun, then terms contained by the category will be singular. If the category name is a plural noun, the terms will be plural. In cases where the name is ambiguous (e.g., "fish"), the name will contain "(singular)" or "(plural)," as in "fish (singular) (noun)."
(verb): Verb. The category name indicates the tense of the terms contained by the category: Present tense (e.g., "walk"), past tense (e.g., "walked") or present participle (e.g., "walking").
Contains about 5,900 categories and 36,000 unique key terms.
Small general purpose: This thesaurus is similar in structure to the large general-purpose thesaurus, but the terms contained by each category are more closely related. Consequently, categories typically contain fewer terms. This thesaurus is better suited than the large general-purpose thesaurus to automated uses, such as expanding terms in a search engine. The conventions used in the large thesaurus for parts-of-speech designators in category names are used in this thesaurus as well. Contains about 5,900 categories and 12,400 unique key terms.
Product name: Hierarchal thesaurus containing names of consumer products such as one might find in a department store or shopping mall.
Very few brand names are used. Brand names appear only where they are popularly synonymous with a particular product type (e.g., "Kleenex" and "Walkman").
The terms contained by a category usually have an "is a" or "kind of" relationship with the category. For example, a category named "cosmetics" might contain terms "makeup" and "makeup remover," even though makeup remover is not strictly speaking a cosmetic. This is done to avoid creating categories that contain just one term. Usually, where a single product has multiple names (synonyms), a separate category has been created for that product.
Category names are usually plural nouns. No parts-of-speech designators are used in category names.
Terms listed in categories are in the singular form only, unless the plural form cannot be derived by simply appending "s" to the term, in which case both singular and plural forms are included. For example, "mint" is listed in singular form only, because the plural form ("mints") can be obtained by simply appending "s". "Candy" is listed in both singular and plural form ("candies") because the plural form is irregular.
Different common spellings of the same term are usually included (e.g., "makeup," "make up," "make-up").
The thesaurus has a hierarchal structure. Categories at the top of the hierarchy are very general, such as departments in a department store. Categories at the bottom represent specific types of consumer products.
The top-level hierarchy of the product thesaurus is shown below:
/products
    /apparel
    /art
    /automotive
    /baby care
    /beauty
    /books
    /collectibles
    /computers
    /electronics
    /food and beverages
    /garden and patio
    /gifts
    /hardware
    /health
    /home
    /jewelry (spelled "jewellery" in the British and Canadian versions)
    /movies and video
    /music
    /office supplies
    /photography
    /sports and outdoors
    /toys and diversions
    /travel
Wintertree Thesaurus Engine searches for key terms, categories, and antonyms in open thesaurus files. All of the open thesaurus files are treated like one large thesaurus (except when searching for antonyms, which is described below). Categories from each thesaurus file are combined to form a large pool of categories which are searched.
Duplicate category names: If a category with the same name exists in two or more thesaurus files, the terms from each category are effectively merged into a single category of that name. For example, if thesaurus file "A" contains category "happy (adj.)" with terms "happy" and "bubbly," and thesaurus file "B" contains category "happy (adj.)" with terms "joyful" and "delighted", then the two instances of "happy (adj.)" are effectively combined into a single category containing terms "bubbly," "delighted," "happy," and "joyful."
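The merging of duplicate category names can be sketched as a union of term sets keyed by category name. This is a model of the observable behavior described above, not the engine's internal implementation; the class `CategoryMerge` is a hypothetical name, and sorting the merged terms alphabetically is a choice made for this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model: each thesaurus file is a map of category name -> terms.
final class CategoryMerge {
    // Combine all open thesauri into one pool of categories. Terms from
    // categories with the same name are merged into a single sorted set.
    static Map<String, SortedSet<String>> merge(List<Map<String, List<String>>> thesauri) {
        Map<String, SortedSet<String>> merged = new TreeMap<>();
        for (Map<String, List<String>> thesaurus : thesauri) {
            for (Map.Entry<String, List<String>> entry : thesaurus.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), name -> new TreeSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> fileA = Map.of("happy (adj.)", List.of("happy", "bubbly"));
        Map<String, List<String>> fileB = Map.of("happy (adj.)", List.of("joyful", "delighted"));
        System.out.println(merge(List.of(fileA, fileB)));
    }
}
```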
Antonym categories: A category may optionally contain the name of an antonym category. Your application can request Wintertree Thesaurus Engine to locate the name of an antonym category associated with another category. Wintertree Thesaurus Engine searches all open thesauri for a category with the specified name. When it finds one, it checks whether the category has an antonym name defined. If so, Wintertree Thesaurus Engine stops searching and returns that antonym name. If not, the search continues through other thesaurus files. The thesaurus files are searched in the order in which they were opened. Note that if a category with the same name exists in two different thesaurus files, and each category has a different antonym name, then the antonym name of the first category will be returned.
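The antonym search order can be sketched as a first-match scan over the open thesauri. In this simplified model each thesaurus is reduced to a map from category name to antonym name, holding only the categories that define an antonym; the class `AntonymLookup` and its method are hypothetical and do not reflect the engine's API or internals.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of the antonym search across open thesauri.
final class AntonymLookup {
    // openThesauri must be in the order in which the files were opened.
    // The first thesaurus that defines an antonym for the category wins.
    static Optional<String> findAntonym(List<Map<String, String>> openThesauri, String category) {
        for (Map<String, String> thesaurus : openThesauri) {
            String antonym = thesaurus.get(category);
            if (antonym != null) {
                return Optional.of(antonym); // stop searching at the first match
            }
        }
        return Optional.empty(); // no open thesaurus defines an antonym
    }

    public static void main(String[] args) {
        Map<String, String> first = Map.of("happy (adj.)", "sad (adj.)");
        Map<String, String> second = Map.of("happy (adj.)", "unhappy (adj.)");
        System.out.println(findAntonym(List.of(first, second), "happy (adj.)").orElse("(none)"));
    }
}
```

Because the first match wins, the order in which thesaurus files are opened determines which antonym name an application sees when files disagree.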
Text and compiled: Wintertree Thesaurus Engine thesaurus files come in two formats: text and compiled. Text thesaurus files can be modified at run time. Compiled thesaurus files contain binary data and are read-only at run time. We use the extensions "tth" for text thesaurus files and "cth" for compiled thesaurus files. These file-name extensions are conventions only.
Text-format thesaurus files are stored in a special layout defined by Wintertree Thesaurus Engine. Text files submitted to the CompileFile function must also be stored in this layout. This section defines the layout of text-format thesaurus files.
A thesaurus file contains zero or more categories. Each category contains a category name, an optional antonym category name, and a set of zero or more terms.
A category is represented in the file by a category definition line followed by zero or more lines containing terms. A category definition line starts with a colon (":") in column one. The category name starts in column two.
The antonym category name, if defined, follows the category name and is separated from the category name by one or more space characters. If the category name contains spaces, it must either be enclosed in double-quotation marks (") or the spaces must be preceded by a backslash (\).
Following are some example valid category definition lines:
:"happy (adj.)"
:"happy (adj.)" "sad (adj.)"
:happy\ (adj.) sad\ (adj.)
Note that if double-quotation marks are used to surround category names that contain spaces, the first mark appears after the colon (:) in the category definition line. The colon identifies the category definition line, but is not part of the category name.
The category name and antonym name can contain any printable character (including spaces). Case is significant.
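For applications that generate or check text-format thesaurus files, the category definition line rules above can be sketched as a small parser. The class `CategoryLine` is a hypothetical helper, not part of the SDK. It assumes a backslash makes the next character literal and that quotation marks themselves are not part of the name; the engine's own parser may differ in edge cases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for a category definition line such as
//   :"happy (adj.)" "sad (adj.)"
final class CategoryLine {
    // Returns [name] or [name, antonym].
    static List<String> parse(String line) {
        if (!line.startsWith(":")) {
            throw new IllegalArgumentException("Category definition lines start with ':'");
        }
        List<String> names = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean started = false; // true once the current name has any content
        for (int i = 1; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < line.length()) {
                current.append(line.charAt(++i)); // escaped character is literal
                started = true;
            } else if (c == '"') {
                inQuotes = !inQuotes; // quotes delimit a name but are not part of it
                started = true;
            } else if (c == ' ' && !inQuotes) {
                if (started) { // an unquoted space ends the current name
                    names.add(current.toString());
                    current.setLength(0);
                    started = false;
                }
            } else {
                current.append(c);
                started = true;
            }
        }
        if (started) {
            names.add(current.toString());
        }
        if (names.isEmpty() || names.size() > 2) {
            throw new IllegalArgumentException("Expected a name and an optional antonym: " + line);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(parse(":\"happy (adj.)\" \"sad (adj.)\""));
        System.out.println(parse(":happy\\ (adj.) sad\\ (adj.)"));
    }
}
```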
In a hierarchal thesaurus, certain shortcuts can be used to refer to previously defined categories to avoid having to fully specify a long series of categories and child categories. The string "./" (period followed by a forward slash) appearing at the beginning of a category name refers to the name of the last fully specified category appearing in the same thesaurus file. A fully specified category name is one that begins with a forward slash (/). For example, suppose thesaurus file mammals.tth contains the following categories:
:/animals/mammals
>canines
>felines
:./canines
>dogs
...
:./felines
>cats
...
:./canines/dogs
golden retriever
greyhound
...
:./felines/cats
Manx
Persian
Siamese
...
Category ./canines is a shortcut for /animals/mammals/canines, and category ./felines/cats is a shortcut for /animals/mammals/felines/cats. The "./" refers to the last fully specified category name in the same file, and the last (and only) fully specified category name in mammals.tth is /animals/mammals. Category names such as "./felines" and "./canines/dogs" are not fully specified, because the "./" part of the category name is relative (fully specified category names begin with "/", not "./"). All of the "./" references in this file refer to /animals/mammals. If another fully specified category name was defined in the middle of mammals.tth, then all relative category names appearing after it would refer to that name.
To take advantage of this shortcut, it's useful to create one thesaurus file for each branch of the hierarchy. For example, we organized the product name thesaurus as one thesaurus file for every top-level category: apparel, art, automotive, baby care, and so on. Each of these top-level categories is a major branch of the product-name hierarchy, and each branch exists in its own thesaurus file. The first category in each file defines the top-level category in that branch, and the category name is fully specified. For example, the beginning of the file containing the "apparel" branch of the hierarchy looks something like this:
:/products/apparel
>accessories
>athletic wear
>costumes
>dresses
>evening wear
>footwear
>hangers
>infant
>jackets
>neckwear
>outerwear
>pants
>shirts
>sleepwear
>sweaters
>underwear
>uniforms
:./accessories
>bags
>hosiery
>wraps
:./accessories/bags
...
Another shortcut that can be used to form relative category names is "../" (two periods followed by a forward slash). This shortcut is similar to "./", except that it refers to the last fully specified category name minus the last component in the name. For example, if "/a/b/c/d" is the last fully specified category name, then "../e" is a shortcut for "/a/b/c/e". The last component ("d") is removed. A number of "../" can be strung together to eliminate successive components from the end of the category name: "../../f" is a shortcut for "/a/b/f".
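The "./" and "../" shortcuts can be sketched as a resolver that remembers the last fully specified category name while a file is read from top to bottom. The class `RelativeNames` is a hypothetical helper for tools that read or write text thesaurus files; it is not part of the SDK, and the handling of a root-level base name is an assumption of this sketch.

```java
// Hypothetical resolver for "./" and "../" category-name shortcuts.
// Feed it category names in file order; it returns full names.
final class RelativeNames {
    private String lastFullName; // last name seen that began with "/"

    String resolve(String name) {
        if (name.startsWith("/")) {
            lastFullName = name; // fully specified: remember it
            return name;
        }
        if (!name.startsWith("./") && !name.startsWith("../")) {
            return name; // not a shortcut, e.g. "happy (adj.)"
        }
        if (lastFullName == null) {
            throw new IllegalStateException("No fully specified category name precedes " + name);
        }
        // Assumption: treat the root "/" as an empty prefix to avoid "//x".
        String base = lastFullName.equals("/") ? "" : lastFullName;
        String rest = name;
        if (rest.startsWith("./")) {
            rest = rest.substring(2);
        } else {
            // Each "../" removes one trailing component from the base name.
            while (rest.startsWith("../")) {
                base = base.substring(0, Math.max(base.lastIndexOf('/'), 0));
                rest = rest.substring(3);
            }
        }
        return base + "/" + rest;
    }

    public static void main(String[] args) {
        RelativeNames names = new RelativeNames();
        names.resolve("/animals/mammals");
        System.out.println(names.resolve("./felines/cats"));
        names.resolve("/a/b/c/d");
        System.out.println(names.resolve("../../f"));
    }
}
```

Relative names never update the remembered name, so every shortcut in a file keeps resolving against the most recent name that began with "/".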
The set of terms contained by the category immediately follows the category definition line. Terms are delimited by the end of the line or by commas (,) if multiple terms are defined on the same line. All words in a multi-word term must appear on the same line. No fixed limit exists on the number of terms per category.
Each term can contain any printable character, including spaces. If the term contains commas, the term must either be surrounded by double-quotation marks (") or the commas must be preceded by a backslash (\). The terms can appear in any order. Case is significant. Following are some valid example terms:
dog
cat, monkey,
zebra, "lions, tigers, and bears"
cows\, pigs\, and chickens
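The term-line rules can be sketched as a splitter that treats commas as delimiters unless they are quoted or escaped. The class `TermLine` is a hypothetical helper, not part of the SDK. It assumes that whitespace around each term is not significant and that an empty term (for example, after a trailing comma) is ignored; the engine's own behavior in those cases is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical splitter for one line of terms in a text thesaurus file.
final class TermLine {
    static List<String> parse(String line) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < line.length()) {
                current.append(line.charAt(++i)); // escaped character is literal
            } else if (c == '"') {
                inQuotes = !inQuotes; // quotes protect commas inside a term
            } else if (c == ',' && !inQuotes) {
                addTerm(terms, current); // unquoted comma ends the term
            } else {
                current.append(c);
            }
        }
        addTerm(terms, current);
        return terms;
    }

    // Assumption: trim surrounding whitespace and skip empty terms.
    private static void addTerm(List<String> terms, StringBuilder current) {
        String term = current.toString().trim();
        if (!term.isEmpty()) {
            terms.add(term);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(parse("zebra, \"lions, tigers, and bears\""));
        System.out.println(parse("cows\\, pigs\\, and chickens"));
    }
}
```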
In a hierarchal thesaurus, a parent category can contain terms which refer to child categories. This is done through the convention of prefixing ">" onto the child category's name. The ">" is completely ignored by Wintertree Thesaurus Engine. When the calling application requests the set of terms contained by a category, the presence of ">" at the beginning of a term indicates that the term is actually a child category reference. In response, the calling application can derive the full name of the child category by replacing the ">" with "/" and concatenating the result onto the parent category's name.
Copyright © 2015 Wintertree Software Inc.