Wintertree Software Inc.

Wintertree Thesaurus Engine Source SDK

 



Technical information


Wintertree Thesaurus Engine Application-Program Interface (API) Summary

The Core Wintertree Thesaurus Engine API allows access to the thesaurus database from applications.

GetFirstCategory, GetNextCategory

Obtain the set of word categories matching a given key term.

GetFirstTerm, GetNextTerm

Obtain the set of terms (synonyms) contained by a given word category.

OpenThesaurus

Open a thesaurus file, making its contents accessible for searches.

AddTerm, DelTerm

Add or remove terms to or from a user thesaurus file.

GetFirstSuggestion, GetNextSuggestion

Obtain suggested alternatives for misspelled or unknown terms.

CompileInit, CompileFile, CompileEnd

Compile a thesaurus file. Compiling reduces the time required to load the thesaurus, assembles several files into one, and offers some security.

GetCategoryAntonym

Obtain a word category with a meaning opposite to a given category.

Thesaurus Structure

Thesauri used by Wintertree Thesaurus Engine have a simple structure: A thesaurus contains one or more categories, and categories contain one or more terms. Put another way, a category is a collection of terms, and a thesaurus is a collection of categories.

A category usually (but not necessarily) contains terms with the same specific meaning. A category also has a descriptive name and optionally the name of an antonym category. For example, a category named "happy" might contain terms such as "joyful," "delighted," "elated," etc. It might have a category named "sad" as its antonym. Typically (but not necessarily) the "happy" category would contain all of the terms that mean "happy" and are therefore synonyms (or near synonyms) for "happy."

Given a key term (a term for which you want to locate synonyms), Wintertree Thesaurus Engine can determine which categories contain terms related to it. Given a category name, Wintertree Thesaurus Engine can produce the set of terms the category contains and the name of its antonym category (if one exists). You might wonder why the thesaurus engine doesn't skip the category part and simply return a set of synonyms for the key term. The reason is that many words in English (and other languages) have several different meanings, and each meaning is represented by its own category. For example, the word "set" has several meanings:

The following list shows some possible categories and synonyms related to the key term "set."

Synonyms for "fixed" are of little use if the user is seeking synonyms for the verb "place."
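The two-step lookup described above (key term to categories, then category to terms) can be sketched as follows. This is a minimal model in Python, not the engine's API: the Category class, the function names, and the sample terms are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Category:
    name: str
    terms: set = field(default_factory=set)
    antonym: Optional[str] = None  # name of the antonym category, if any


# Illustrative data only; not taken from the shipped thesaurus files.
THESAURUS = {
    "fixed": Category("fixed", {"set", "firm", "rigid"}),
    "place": Category("place", {"set", "put", "position"}),
    "happy": Category("happy", {"happy", "joyful", "delighted", "elated"}, antonym="sad"),
}


def categories_for(key_term):
    """Return the names of all categories that contain key_term, sorted."""
    return sorted(name for name, cat in THESAURUS.items() if key_term in cat.terms)


def terms_in(category_name):
    """Return the sorted terms of one category."""
    return sorted(THESAURUS[category_name].terms)


# The key term "set" belongs to two categories, one per meaning.
print(categories_for("set"))  # ['fixed', 'place']
# The user picks the meaning they want, then sees only its synonyms.
print(terms_in("place"))      # ['position', 'put', 'set']
```

Keeping the category step in between lets the application show the user one list per meaning instead of a single mixed list.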

Antonym categories: Each category can optionally have an antonym category associated with it. The antonym category contains terms which have a meaning opposite to the original category. The antonym category name is simply another category name associated with each category, so it's possible to use the antonym category for other, application-specific purposes.

Category names: Each category has a unique name. The name is used to identify a specific category. Category names can be up to 511 characters long, and can include any printable character. Certain conventions are used in category names to encode specific kinds of information, but otherwise category names are free-format.

Parts of speech: One of the conventions used for category names in the general-purpose thesaurus files included with Wintertree Thesaurus Engine is a part-of-speech designation. The part-of-speech designation appears at the end of the category name and is represented by the following abbreviations:

For example, a category containing adjectives which are synonyms for "happy" might be named

happy (adj.)

Hierarchal thesaurus structure

Using certain conventions for category names and terms, it is possible to create the illusion of a hierarchal thesaurus, where categories appear to contain other categories. The hierarchal structure is an illusion because there is no hierarchy in the way the thesaurus is actually organized: a thesaurus is simply a flat collection of categories, and each category is a collection of terms. The pseudo-hierarchal structure is nevertheless flexible and useful.

The hierarchal thesaurus structure is effected through the following conventions:

Here's an example showing how these conventions are used to form a hierarchal thesaurus (note that only part of the thesaurus is shown):

/ (root category) contains:

>animals (reference to category /animals)
>plants (reference to category /plants)

/animals contains:

>amphibians
>birds
>fish
>mammals
>reptiles

/animals/mammals contains:

>cats
>dogs

/animals/mammals/dogs contains:

Airedale
boxer
golden retriever

The example above shows the contents of four categories (/, /animals, /animals/mammals, and /animals/mammals/dogs). A number of other categories, including /plants and /animals/mammals/cats are referenced but not shown. The first three categories contain only references to child categories, denoted by the ">" character. The /animals/mammals/dogs category contains three terms. The terms do not begin with ">".

Here are some other points to note about the pseudo-hierarchal thesaurus structure:

Thesaurus files included with the Software Development Kit

The Wintertree Thesaurus Engine SDKs come with several thesauri, each in American, British (International), and Canadian spelling variants. This topic describes each included file.

Very few brand names are used. Brand names appear only where they are popularly synonymous with a particular product type (e.g., "Kleenex" and "Walkman").

The terms contained by a category usually have an "is a" or "kind of" relationship with the category, but the relationship is sometimes loose. For example, a category named "cosmetics" might contain the terms "makeup" and "makeup remover," even though makeup remover is not, strictly speaking, a cosmetic. This is done to avoid creating categories that contain just one term. Usually, where a single product has multiple names (synonyms), a separate category has been created for that product.

Category names are usually plural nouns. No parts-of-speech designators are used in category names.

Terms listed in categories are in the singular form only, unless the plural form cannot be derived by simply appending "s" to the term, in which case both singular and plural forms are included. For example, "mint" is listed in singular form only, because the plural form ("mints") can be obtained by simply appending "s". "Candy" is listed in both singular and plural form ("candies") because the plural form is irregular.
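Because regular plurals are not stored, an application that wants "mints" to match "mint" has to try the singular form itself. The sketch below shows one way to do that in Python. The function names and the sample term set are hypothetical, and the source does not say whether the engine performs any such matching on its own.

```python
# Illustrative term set following the convention above: "mint" is stored in
# singular form only, while "candy" is stored with its plural "candies".
TERMS = {"mint", "candy", "candies"}


def lookup_forms(query):
    """Return the forms to try: the query itself, then the query minus a trailing "s"."""
    forms = [query]
    if len(query) > 1 and query.endswith("s"):
        forms.append(query[:-1])
    return forms


def find_term(query):
    """Return the stored term matching the query, or None if there is none."""
    for form in lookup_forms(query):
        if form in TERMS:
            return form
    return None


print(find_term("mints"))    # mint
print(find_term("candies"))  # candies
```

Trying the query unchanged first means stored plural forms such as "candies" are matched directly.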

Different common spellings of the same term are usually included (e.g., "makeup," "make up," "make-up").

The thesaurus has a hierarchal structure. Categories at the top of the hierarchy are very general, such as departments in a department store. Categories at the bottom represent specific types of consumer products.

The top-level hierarchy of the product thesaurus is shown below:

/products/apparel
  >accessories
  >athletic wear
  >costumes
  >dresses
  >evening wear
  >footwear
  >hangers
  >infant
  >jackets
  >neckwear
  >outerwear
  >pants
  >shirts
  >sleepwear
  >sweaters
  >underwear
  >uniforms

/art

/automotive
  >accessories
  >audio
  >car care products
  >consumables
  >parts
  >security
  >tarps
  >tools
  >towing needs

/baby care
  >bathing
  >diapers
  >linens
  >nursery
  >feeding
  >safety
  >strollers
  >toys

/beauty
  >accessories
  >aromatherapy
  >bath products
  >fragrance
  >hair care
  >makeup
  >nail care
  >skin care

/books
  >art
  >entertainment
  >biographies
  >business
  >children
  >computers
  >cooking
  >health
  >history
  >horror
  >house and garden
  >literature
  >magazines
  >mystery
  >parenting
  >philosophy
  >reference
  >religion
  >romance
  >science and nature
  >science fiction and fantasy
  >society and culture
  >sports and recreation
  >suspense and thrillers
  >transportation
  >travel
  >westerns
  >young adults

/collectibles
  >coins and currency
  >memorabilia
  >miniatures
  >posters and prints
  >stamps

/computers
  >accessories
  >components
  >desktops
  >digital cameras
  >input devices
  >Internet appliances
  >modems
  >monitors
  >networking
  >notebooks
  >PDAs and handhelds
  >printers
  >scanners
  >software
  >sound cards
  >storage
  >video cards

/electronics
  >accessories
  >communications
  >home stereo
  >portable audio
  >television and video

/food and beverages
  >beverages
  >food

/garden and patio
  >accents
  >patio furniture
  >seeds and bulbs
  >tools and equipment
  >pools and spa supplies
  >gardening apparel
  >growing supplies
  >indoor gardening
  >pest control
  >pottery and planters
  >soil and soil amendments
  >barbecue and grill
  >patio fireplaces and heaters

/gifts
  >flowers
  >executive gifts
  >gift baskets
  >gift wrap
  >novelties
  >personalized gifts
  >thematic gifts
  >greeting cards
  >occasions

/hardware
  >adhesives
  >batteries
  >door hardware
  >electrical
  >general hardware
  >plumbing supplies
  >tools

/health
  >conditions
  >eye care
  >family planning
  >first aid
  >hearing
  >home test kits
  >infant
  >massage products
  >men's health
  >oral hygiene
  >personal hygiene
  >vitamins and supplements
  >weight loss
  >women's health

/home
  >appliances
  >bathroom
  >bedroom
  >flooring
  >furnishings
  >housewares
  >kitchen
  >outside
  >pet and animal supplies
  >security

/jewelry (spelled "jewellery" in the British and Canadian versions)
  >accessories
  >bracelets
  >earrings
  >necklaces
  >precious gems
  >rings
  >semi-precious gems
  >watches

/movies and video
  >movies
  >television
  >how-to
  >documentary

/music
  >disk jockey equipment
  >instruments
  >karaoke
  >memorabilia
  >recordings
  >sheet music

/office supplies
  >adhesives
  >banking and cash handling
  >binders and lamination
  >boards
  >boxes
  >business forms
  >business machines
  >calendars and organizers
  >fasteners
  >filing
  >furniture
  >letters
  >luggage
  >paper
  >shipping
  >stamps
  >staplers
  >stationery
  >writing instruments

/photography
  >accessories
  >camcorders
  >cameras
  >optics

/sports and outdoors
  >athletic shoes
  >aviation
  >boating
  >camping and hiking
  >climbing
  >fishing
  >fitness
  >individual sports
  >navigation
  >paddling
  >sports supplements
  >team sports
  >scuba diving and snorkeling

/toys and diversions
  >baby toys
  >construction
  >dolls
  >games
  >hobbies and crafts
  >outdoor toys
  >stuffed toys
  >vehicles

/travel
  >accessories
  >appliances and electronics
  >luggage
  >maps
  >personal care
  >security

How Wintertree Thesaurus Engine searches open thesaurus files

Wintertree Thesaurus Engine searches for key terms, categories, and antonyms in open thesaurus files. All of the open thesaurus files are treated like one large thesaurus (except when searching for antonyms, which is described below). Categories from each thesaurus file are combined to form a large pool of categories which are searched.

Duplicate category names: If a category with the same name exists in two or more thesaurus files, the terms from each category are effectively merged into a single category of that name. For example, if thesaurus file "A" contains category "happy (adj.)" with terms "happy" and "bubbly," and thesaurus file "B" contains category "happy (adj.)" with terms "joyful" and "delighted," then the two instances of "happy (adj.)" are effectively combined into a single category containing terms "bubbly," "delighted," "happy," and "joyful."
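The merging of duplicate category names can be modelled as a union of term sets. The following Python sketch reproduces the example above. It assumes each open file is represented as a plain dictionary (an illustrative choice) and says nothing about how the engine stores categories internally.

```python
def merge_categories(*files):
    """Pool categories from several files; same-named categories share one term set."""
    pool = {}
    for categories in files:
        for name, terms in categories.items():
            pool.setdefault(name, set()).update(terms)
    return pool


# Thesaurus files "A" and "B" from the example above.
file_a = {"happy (adj.)": ["happy", "bubbly"]}
file_b = {"happy (adj.)": ["joyful", "delighted"]}

merged = merge_categories(file_a, file_b)
print(sorted(merged["happy (adj.)"]))  # ['bubbly', 'delighted', 'happy', 'joyful']
```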

Antonym categories: A category may optionally contain the name of an antonym category. Your application can request Wintertree Thesaurus Engine to locate the name of an antonym category associated with another category. Wintertree Thesaurus Engine searches all open thesauri for a category with the specified name. When it finds one, it checks whether the category has an antonym name defined. If so, Wintertree Thesaurus Engine stops searching and returns that antonym name. If not, the search continues through other thesaurus files. The thesaurus files are searched in the order in which they were opened. Note that if a category with the same name exists in two different thesaurus files, and each category has a different antonym name, then the antonym name of the first category will be returned.
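The antonym search order can be sketched in a few lines of Python. Each open file is modelled here as a dictionary from category name to antonym name (None when no antonym is defined). That representation, the function name, and the "unhappy (adj.)" category are assumptions made for illustration.

```python
def find_antonym(open_files, category_name):
    """Search files in the order they were opened; return the first antonym name found."""
    for antonyms in open_files:
        antonym = antonyms.get(category_name)
        if antonym:  # the category exists in this file and defines an antonym
            return antonym
    return None  # no open file defines an antonym for this category


first = {"happy (adj.)": None}               # category present, no antonym defined
second = {"happy (adj.)": "sad (adj.)"}      # first file that defines an antonym
third = {"happy (adj.)": "unhappy (adj.)"}   # never reached for this category

print(find_antonym([first, second, third], "happy (adj.)"))  # sad (adj.)
```

Unlike terms, antonym names are not merged: the order in which the files were opened decides which name is returned.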

How Wintertree Thesaurus Engine formats text thesaurus files

Text and compiled: Wintertree Thesaurus Engine thesaurus files come in two formats: text and compiled. Text thesaurus files are ASCII files (more accurately, Latin 1, or ISO-8859-1, files) that can be modified at run time. Compiled thesaurus files contain binary data and are read-only at run time. We use the extensions "tth" for text thesaurus files and "cth" for compiled thesaurus files. These file-name extensions are conventions only.

Text-format thesaurus files are stored in a special layout defined by Wintertree Thesaurus Engine. Text files submitted to the CompileFile function must also be stored in this layout. This section defines the layout of text-format thesaurus files.

A thesaurus file contains zero or more categories. Each category contains a category name, an optional antonym category name, and a set of zero or more terms.

Category definitions

A category is represented in the file by a category definition line followed by zero or more lines containing terms. A category definition line starts with a colon (":") in column one. The category name starts in column two.

The antonym category name, if defined, follows the category name and is separated from the category name by one or more space characters. If the category name contains spaces, it must either be enclosed in double-quotation marks (") or the spaces must be preceded by a backslash (\).

Following are some example valid category definition lines:

:"happy (adj.)"
:"happy (adj.)" "sad (adj.)"
:happy\ (adj.) sad\ (adj.)

Note that if double-quotation marks are used to surround category names that contain spaces, the first mark appears after the colon (:) in the category definition line. The colon identifies the category definition line, but is not part of the category name.

The category name and antonym name can each contain up to WTHES_MAX_CAT_NAME_LEN (511) characters. The names can contain any printable Latin 1 (ISO-8859-1) character (including spaces). Case is significant.
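The rules for category definition lines can be illustrated with a small parser. This is a sketch in Python, not the engine's own parser. The function name is hypothetical, and two behaviours are assumptions the text above does not settle: a backslash also escapes characters inside quotation marks, and only the space character separates the two names.

```python
WTHES_MAX_CAT_NAME_LEN = 511  # limit stated above


def parse_category_line(line):
    """Split a category definition line into (category_name, antonym_name_or_None)."""
    if not line.startswith(":"):
        raise ValueError("a category definition line starts with ':' in column one")
    names, current, in_quotes, has_token = [], [], False, False
    i = 1  # skip the colon; it is not part of the category name
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            current.append(line[i + 1])  # escaped character is taken literally
            has_token = True
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes  # quotation marks delimit, but are not kept
            has_token = True
        elif ch == " " and not in_quotes:
            if has_token:  # an unquoted, unescaped space ends the current name
                names.append("".join(current))
                current, has_token = [], False
        else:
            current.append(ch)
            has_token = True
        i += 1
    if has_token:
        names.append("".join(current))
    if not 1 <= len(names) <= 2:
        raise ValueError("expected a category name and an optional antonym name")
    if any(len(name) > WTHES_MAX_CAT_NAME_LEN for name in names):
        raise ValueError("name exceeds WTHES_MAX_CAT_NAME_LEN")
    return names[0], (names[1] if len(names) == 2 else None)


print(parse_category_line(':"happy (adj.)" "sad (adj.)"'))  # ('happy (adj.)', 'sad (adj.)')
```

The quoted form and the backslash form of the same line yield the same pair of names.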

Hierarchal thesaurus shortcuts

In a hierarchal thesaurus, certain shortcuts can be used to refer to previously defined categories to avoid having to fully specify a long series of categories and child categories. The string "./" (period followed by a forward slash) appearing at the beginning of a category name refers to the name of the last fully specified category appearing in the same thesaurus file. A fully specified category name is one that begins with a forward slash (/). For example, suppose thesaurus file mammals.tth contains the following categories:

:/animals/mammals
  >canines
  >felines

:./canines
  >dogs
  ...

:./felines
  >cats
  ...

:./canines/dogs
  golden retriever
  greyhound
  ...

:./felines/cats
  Manx
  Persian
  Siamese
  ...

Category ./canines is a shortcut for /animals/mammals/canines, and category ./felines/cats is a shortcut for /animals/mammals/felines/cats. The "./" refers to the last fully specified category name in the same file, and the last (and only) fully specified category name in mammals.tth is /animals/mammals. Category names such as "./felines" and "./canines/dogs" are not fully specified, because the "./" part of the category name is relative (fully specified category names begin with "/", not "./"). All of the "./" references in this file refer to /animals/mammals. If another fully specified category name were defined in the middle of mammals.tth, then all relative category names appearing after it would refer to that name instead.

To take advantage of this shortcut, it's useful to create one thesaurus file for each branch of the hierarchy. For example, we organized the product name thesaurus as one thesaurus file for every top-level category: apparel, art, automotive, baby care, and so on. Each of these top-level categories is a major branch of the product-name hierarchy, and each branch exists in its own thesaurus file. The first category in each file defines the top-level category in that branch, and the category name is fully specified. For example, the beginning of the file containing the "apparel" branch of the hierarchy looks something like this:

:/products/apparel
  >accessories
  >athletic wear
  >costumes
  >dresses
  >evening wear
  >footwear
  >hangers
  >infant
  >jackets
  >neckwear
  >outerwear
  >pants
  >shirts
  >sleepwear
  >sweaters
  >underwear
  >uniforms

:./accessories
  >bags
  >hosiery
  >wraps

:./accessories/bags
  ...

Another shortcut that can be used to form relative category names is "../" (two periods followed by a forward slash). This shortcut is similar to "./", except that it refers to the last fully specified category name minus the last component in the name. For example, if "/a/b/c/d" is the last fully specified category name, then "../e" is a shortcut for "/a/b/c/e". The last component ("d") is removed. A number of "../" can be strung together to eliminate successive components from the end of the category name: "../../f" is a shortcut for "/a/b/f".
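The "./" and "../" shortcuts amount to simple string manipulation on the last fully specified category name. The Python sketch below shows one way to expand them. The function names are hypothetical, and treating the base as the root ("/") before any fully specified name has been seen is an assumption, since the text above does not cover that case.

```python
def resolve(name, last_full):
    """Expand a "./" or "../" category name against the last fully specified name."""
    if name.startswith("/"):
        return name  # already fully specified
    parts = [p for p in last_full.split("/") if p]
    if name.startswith("./"):
        rest = name[2:]
    elif name.startswith("../"):
        rest = name
        while rest.startswith("../"):
            rest = rest[3:]
            if parts:
                parts.pop()  # each "../" removes one trailing component
    else:
        return name  # not a hierarchal name; leave it unchanged
    return "/" + "/".join(parts + [rest])


def resolve_all(names):
    """Resolve category names in file order, tracking the last fully specified one."""
    last_full, resolved = "/", []  # assumed starting base: the root category
    for name in names:
        full = resolve(name, last_full)
        if name.startswith("/"):
            last_full = full  # only fully specified names change the base
        resolved.append(full)
    return resolved


print(resolve("./felines/cats", "/animals/mammals"))  # /animals/mammals/felines/cats
print(resolve("../e", "/a/b/c/d"))                    # /a/b/c/e
print(resolve("../../f", "/a/b/c/d"))                 # /a/b/f
```

Note that relative names never update the base, so every "./" in mammals.tth keeps referring to /animals/mammals.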

Terms

The set of terms contained by the category immediately follows the category definition line. Terms are delimited by the end of the line or by commas (,) if multiple terms are defined on the same line. All words in a multi-word term must appear on the same line. No fixed limit exists on the number of terms per category.

Each term can contain up to WTHES_MAX_TERM_LEN (127) characters. The term can contain any printable Latin 1 (ISO-8859-1) character, including spaces. If the term contains commas, the term must either be surrounded by double-quotation marks (") or the commas must be preceded by a backslash (\). The terms can appear in any order. Case is significant. Following are some valid example terms:

dog
cat, monkey,
zebra, "lions, tigers, and bears"
cows\, pigs\, and chickens
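The term rules can be sketched as a small line splitter. This Python example is illustrative only. The function name is hypothetical, and it makes two assumptions the text above does not state: whitespace around a term is discarded, and an empty term (such as the one after a trailing comma) is dropped.

```python
def split_terms(line):
    """Split one line of a category body into its terms."""
    terms, current, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            current.append(line[i + 1])  # "\," becomes a literal comma
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes  # commas inside quotation marks do not split
        elif ch == "," and not in_quotes:
            terms.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    terms.append("".join(current).strip())
    return [t for t in terms if t]  # assumed: empty terms are dropped


for example in ["dog", "cat, monkey,", 'zebra, "lions, tigers, and bears"']:
    print(split_terms(example))
# ['dog']
# ['cat', 'monkey']
# ['zebra', 'lions, tigers, and bears']
```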

Child category references

In a hierarchal thesaurus, a parent category can contain terms which refer to child categories. This is done through the convention of prefixing ">" onto the child category's name. Wintertree Thesaurus Engine attaches no special meaning to the ">"; interpreting it is left to the calling application. When the calling application requests the set of terms contained by a category, the presence of ">" at the beginning of a term indicates that the term is actually a child category reference. The calling application can then derive the full name of the child category by replacing the ">" with "/" and concatenating the result onto the parent category's name.
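The derivation of a child category's name, and its use to walk the pseudo-hierarchy, can be sketched as follows. The Python code is an application-side illustration with hypothetical function names. The category data is the animals example from earlier in this document, and stripping the trailing "/" of the root category before concatenating is an assumption made so that ">animals" in "/" yields "/animals".

```python
# Flat collection of categories (name -> terms), as in the animals example.
CATEGORIES = {
    "/": [">animals", ">plants"],
    "/animals": [">amphibians", ">birds", ">fish", ">mammals", ">reptiles"],
    "/animals/mammals": [">cats", ">dogs"],
    "/animals/mammals/dogs": ["Airedale", "boxer", "golden retriever"],
}


def child_category_name(parent, term):
    """Return the full child category name if term is a child reference, else None."""
    if not term.startswith(">"):
        return None  # an ordinary term
    # Replace ">" with "/" and append to the parent's name (root handled specially).
    return parent.rstrip("/") + "/" + term[1:]


def walk(category, depth=0, lines=None):
    """Collect an indented outline of a category and everything below it."""
    if lines is None:
        lines = []
    lines.append("  " * depth + category)
    for term in CATEGORIES.get(category, []):  # categories not shown are treated as empty
        child = child_category_name(category, term)
        if child is not None:
            walk(child, depth + 1, lines)
        else:
            lines.append("  " * (depth + 1) + term)
    return lines


print("\n".join(walk("/animals/mammals")))
```

The walk relies only on naming conventions; the underlying collection of categories stays flat.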




Copyright © 2015 Wintertree Software Inc.