Wintertree Software Inc.

WGrammar Grammar Checker Engine


Pattern Matching in WGrammar

This document describes WGrammar's pattern matching capabilities, and includes detailed information on the syntax used to match words.

At its core, WGrammar is a search engine that locates patterns in text. WGrammar comes with a large set of patterns that match common English grammatical problems. You can supplement this set by adding your own patterns. You can even use the WGrammar engine to search text for purposes other than grammar checking.

The WGrammar engine has several important characteristics that define its capabilities:

How patterns are formatted

The key unit of searching in WGrammar is the target. A target contains:

  1. A collection of sequences (explained below)
  2. A description string
  3. A replacement string
  4. A target ID

A target contains one or more sequences. A sequence is a series of word patterns (patterns that match words). The word patterns in a sequence must match in order for the sequence to match. When any of a target's sequences match, the target itself matches.

A target typically exists to match text for some purpose, such as detecting a specific grammar error. The error may exist in several forms, so the target may contain sequences to match the error in its different forms.
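The relationship between a target and its sequences can be sketched in Python as follows. This is only an illustration of the structure described above, not WGrammar's actual API: the class name Target, its field names, the integer id, and the contains_run helper (which handles literal words only) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


def contains_run(sequence: list[str], words: list[str]) -> bool:
    """Return True if the words of `sequence` appear consecutively in `words`."""
    n = len(sequence)
    return any(words[i:i + n] == sequence for i in range(len(words) - n + 1))


@dataclass
class Target:
    """Hypothetical stand-in for a WGrammar target (field names are assumed)."""
    target_id: int
    description: str
    replacement: str
    sequences: list[list[str]] = field(default_factory=list)

    def matches(self, text: str) -> bool:
        # The target matches as soon as any one of its sequences matches.
        words = text.split()
        return any(contains_run(seq, words) for seq in self.sequences)


# One target covering several forms of the same error.
could_of = Target(
    target_id=1,
    description="Use 'have' rather than 'of' after could, should, or would",
    replacement="have",
    sequences=[["could", "of"], ["should", "of"], ["would", "of"]],
)
```

Here could_of.matches("I should of known") is true because the second sequence matches, even though the other two do not.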

Sequences

The sequence is the actual pattern that matches text. A word pattern is a pattern that matches one word, and a sequence is a set of one or more word patterns that must be matched in order. When all of the word patterns in the sequence are matched in the proper order, the sequence itself is matched.

For example, the sequence

hello

contains one word pattern that would match the word "hello" wherever it appeared in the text being searched. The sequence

hello world

would or would not match the example text shown in the table below:

Text               Match?  Notes
-----------------  ------  -------------------------------------------------
hello world        Yes     Each word in sequence appears in correct order
world hello        No      Incorrect word order
hello there world  No      Intervening word breaks the sequence
world              No      Not all words matched
hello, world?      No      Punctuation must be matched or explicitly ignored
Hello world        No      WGrammar is case sensitive
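The rows of the table above can be reproduced with a short Python sketch. It assumes that text is split into words on white space, so that punctuation stays attached to the neighboring word, and it handles literal word patterns only. The function name sequence_matches is hypothetical and is not part of WGrammar.

```python
def sequence_matches(sequence: str, text: str) -> bool:
    """Return True if every word pattern in `sequence` matches consecutive
    words of `text`, in order. Literal, case-sensitive comparison only."""
    patterns = sequence.split()
    words = text.split()  # assumption: punctuation stays attached to words
    n = len(patterns)
    return any(words[i:i + n] == patterns for i in range(len(words) - n + 1))


# Each entry pairs a text from the table with the expected result.
rows = [
    ("hello world", True),
    ("world hello", False),
    ("hello there world", False),
    ("world", False),
    ("hello, world?", False),
    ("Hello world", False),
]
results = [sequence_matches("hello world", text) for text, _ in rows]
```

The sequence can match anywhere in the text, so "say hello world now" also matches.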

The preceding table introduces some important facts about how WGrammar matches text:

Word patterns can optionally contain wildcards, as described in the following table.

Wildcard: *
Description: Matches any zero or more characters (unless it forms the entire word pattern, in which case it matches any word of 1 or more characters).
Examples:
  app* matches app, apple, apply, application, apple!, apply."
  *ing matches singing, dancing, wing, ing
  a*t matches at, apt, abrupt

Wildcard: ?
Description: Matches any single character.
Examples:
  ?ing matches king, ping, ring, sing, wing
  ?? matches any 2-character word (e.g., at, my, so)

Wildcard: [...]
Description: Matches any character in the specified set. The set may contain a range of characters separated by a dash.
Examples:
  [A-Z]* matches any word starting with a capital letter
  *[aeiou] matches any word ending with a vowel
  *[0-9]* matches any word containing a digit
  [Aa]pple matches Apple or apple

Wildcard: [^...]
Description: Matches any character not in the specified set. The set may contain a range of characters separated by a dash.
Examples:
  [^aeiouAEIOU]* matches any word which doesn't start with a vowel
  [^0-9] matches any single-character word which isn't a digit

Wildcard: {...}
Description: Matches any of the comma-separated strings appearing within the braces, including empty strings. Note that this wildcard is handled in a special way if it appears at the start of a word pattern, as described below.
Examples:
  watch{ing,ed,} matches watch, watched, watching
  {dog,cat} matches dog or cat

Wildcard: {^...}
Description: Matches strings not in the comma-separated strings appearing within the braces; i.e., the word will not match if it contains any of the strings within the braces. Note that this wildcard is handled in a special way if it appears at the start of a word pattern, as described below.
Examples:
  watch{^ing,ed,} matches watcher and watches but not watch, watched, or watching

The backslash character ("\") can be used to remove the special meaning of the wildcard characters. For example, to match "[dog]" use the word pattern "\[dog\]".
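One way to experiment with these wildcards is to translate a word pattern into a regular expression. The Python sketch below is an approximation built for illustration and says nothing about how WGrammar is implemented internally. The function names are hypothetical, the {^...} wildcard is deliberately left out because its exact semantics are not fully specified here, and a leading {...} is treated as an ordinary set wildcard.

```python
import re


def word_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard word pattern into a compiled regex (sketch)."""
    if pattern == "*":
        # A lone * must match a word of one or more characters.
        return re.compile(r".+")
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # backslash removes special meaning
            i += 2
        elif ch == "*":
            out.append(".*")
            i += 1
        elif ch == "?":
            out.append(".")
            i += 1
        elif ch == "[":
            end = pattern.index("]", i)
            out.append(pattern[i:end + 1])  # [a-z] and [^a-z] are valid regex as-is
            i = end + 1
        elif ch == "{":
            end = pattern.index("}", i)
            body = pattern[i + 1:end]
            if body.startswith("^"):
                raise NotImplementedError("{^...} is not covered by this sketch")
            alternatives = "|".join(re.escape(s) for s in body.split(","))
            out.append("(?:" + alternatives + ")")
            i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out))


def word_matches(pattern: str, word: str) -> bool:
    """Return True if the whole word matches the wildcard pattern."""
    return word_pattern_to_regex(pattern).fullmatch(word) is not None
```

For example, word_matches("watch{ing,ed,}", "watch") is true because the empty string is one of the alternatives, while word_matches("watch{ing,ed,}", "watcher") is false.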

Word patterns are separated in a sequence by one or more white-space characters. The white space serves only to separate word patterns and is otherwise ignored by WGrammar. In particular, the number or type of white-space characters separating word patterns has no bearing on how the sequence matches text.

Word patterns can match words as specific character strings (such as "hello") and they can match words according to part-of-speech (POS) categories. POS categories are matched by POS codes (shown in parentheses) as follows:

Most English verbs have the same form in a given tense regardless of the point of view (first, second, or third person) or number (singular or plural). For example, the verb call has the same form (call) in all present tense points of view and numbers except third-person singular, where its form is calls. Similarly, call has the same form (called) in all past-tense points of view and numbers. Only forms of the verb to be change with the point of view and number. To avoid redundancy, most present-tense verbs are matched by the %V1SP and %V3SP POS codes, and most past-tense verbs are matched by the %VPAT POS code. Only forms of to be are listed in more specific present- and past-tense categories, such as %V2SP and %V3PA.

In English, all future tense verbs are matched by the %V1SP POS code, but are preceded by auxiliary verbs shall (in the first person) or will (in the second and third persons). Similarly, all present-perfect, past-perfect, and future-perfect verbs are matched by the %VPAP POS code. The tense, point of view, and number are indicated by auxiliary verbs have, has, had, shall have, and will have.

WGrammar matches words in a POS category in a simple way: If the word can be a member of the POS category in any context, it matches. A particular word may match several POS categories. For example, "set" might match %NS, %V1SP, %VPAT, %VPAP, and %ADJ. Sometimes this ambiguity creates problems. For example, the word fish is both a singular and plural noun, and might cause an undesirable match with %NS when used in its plural sense.
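The context-free nature of POS matching can be illustrated with a small Python sketch. The lexicon below is a made-up, three-word stand-in for whatever dictionary WGrammar actually uses, and the function name matches_pos is hypothetical. The entries for "set" and "fish" follow the examples in the preceding paragraph.

```python
# Hypothetical miniature lexicon: word -> every POS code the word can ever have.
LEXICON = {
    "set": {"%NS", "%V1SP", "%VPAT", "%VPAP", "%ADJ"},
    "fish": {"%NS", "%NP"},
    "car": {"%NS"},
}


def matches_pos(word: str, pos_code: str) -> bool:
    """A word matches a POS code if it can belong to that category in any
    context; the surrounding words are never consulted."""
    return pos_code in LEXICON.get(word, set())


# "fish" matches %NS even in "three fish", where it is used as a plural.
ambiguous = matches_pos("fish", "%NS") and matches_pos("fish", "%NP")
```

Because the lookup ignores context, any pattern that relies on %NS alone will also fire on words such as fish that share a singular and plural form.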

It is often useful to match several possible word patterns at the same point in a sequence. WGrammar supports word pattern sets, which provide this function. A word pattern set is a collection of comma-separated word patterns surrounded by braces. For example, the set

{red,green,blue}

matches the words red, green, or blue. The word pattern sequence

{hello,goodbye} {sailor,there,world}

matches hello sailor, hello there, hello world, goodbye sailor, goodbye there, and goodbye world.
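The six combinations listed above can be enumerated with a short Python sketch. It expands word pattern sets that contain literal words only, and it is meant purely to show which texts a sequence covers; the function name expand_sequence is hypothetical, and nothing here implies that WGrammar expands sets internally.

```python
from itertools import product


def expand_sequence(sequence: str) -> list[str]:
    """List every literal text matched by a sequence whose word patterns
    are plain words or {a,b,c} word pattern sets of plain words."""
    choices = []
    for word_pattern in sequence.split():
        if word_pattern.startswith("{") and word_pattern.endswith("}"):
            choices.append(word_pattern[1:-1].split(","))
        else:
            choices.append([word_pattern])
    # One element is chosen from each position, in order.
    return [" ".join(combo) for combo in product(*choices)]


combinations = expand_sequence("{hello,goodbye} {sailor,there,world}")
```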

Word pattern sets are similar to the set wildcard, but they have somewhat different behavior and generally result in faster searches. WGrammar recognizes a word pattern set when the opening brace ("{") appears at the start of the word pattern. If the opening brace appears within the word pattern, a wildcard set is recognized. The following two word patterns match the same text, but are processed in different ways by WGrammar:

The elements of a word pattern set are word patterns. A word pattern set element cannot be a word pattern set (although it can be a word pattern that uses the set wildcard).

The default relation between elements in a word pattern set is "or," meaning the word pattern set matches if any one or more of its elements match. The "&" modifier, when applied to a word pattern set, makes matching every element mandatory for the word pattern set as a whole to match. When "&" is applied, the relation between the elements changes to "and": the set matches if and only if the first element matches and the second element matches and the third element matches, etc. Following are some examples:

&{%NS,^%NP} Matches singular nouns that are not also plural nouns; i.e., nouns that have different singular and plural forms. Car would match, because car is singular but not plural. Fish would not match, because fish is both singular and plural, and the plural form would fail the ^%NP word pattern. The word pattern set can be read as "singular noun and not plural noun."

&{%ADJ,{*est,*er}} Matches only adjectives ending in est or er, such as quicker and slowest.

&{hello,goodbye} This would never be matched, since it requires an impossible situation: a word that is both hello and goodbye.

Generally, the "&" modifier should be applied to all elements of a word pattern set if it is applied to any. Without the "&" modifier, additional elements in a word pattern set broaden the possible range of text that may be matched. With the "&" modifier, additional elements narrow the range. For example,

{[A-Z]*,*[^aeiou],*ea*}

matches Constantinople, brother, deal, Deal, Reading, Bread, etc., whereas

&{[A-Z]*,*[^aeiou],*ea*}

matches only capitalized words that don't end in a vowel and contain the letters ea. None of Constantinople, brother, or deal meets all of these criteria, but Deal, Reading, and Bread do.
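The difference between the two sets above comes down to "any element matches" versus "all elements match." The Python sketch below checks this with regular expressions written by hand as equivalents of the three elements; the function names are hypothetical and the code is an illustration, not WGrammar's matching logic.

```python
import re

# Hand-written regex equivalents of [A-Z]*, *[^aeiou], and *ea*.
ELEMENTS = [r"[A-Z].*", r".*[^aeiou]", r".*ea.*"]


def or_set_matches(word: str) -> bool:
    """Without "&": the set matches if any element matches."""
    return any(re.fullmatch(p, word) for p in ELEMENTS)


def and_set_matches(word: str) -> bool:
    """With "&": the set matches only if every element matches."""
    return all(re.fullmatch(p, word) for p in ELEMENTS)


words = ["Constantinople", "brother", "deal", "Deal", "Reading", "Bread"]
matched_by_or = [w for w in words if or_set_matches(w)]
matched_by_and = [w for w in words if and_set_matches(w)]
```

Here matched_by_or contains all six words, while matched_by_and contains only Deal, Reading, and Bread.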

Description and Replacement Strings

Description and replacement strings are associated with each pattern and are available to your application for display to the user or for other purposes. Whether and how the description and replacement strings are used is entirely up to your application.

The strings can optionally contain references to the actual words matched by the associated word pattern sequence. This is especially useful for word pattern sequences containing wildcard and POS code word patterns, since the wildcards or codes may match many different words. If a word pattern is preceded by the modifier =n, the text matched by that word pattern is saved in a "mark" numbered n. A placeholder of the form "%n" in the string refers to the contents of mark n. For example, to replace could of, should of, or would of with could have, should have, or would have, you might specify the following sequence and replacement strings:

Sequence: =1{could,should,would} =2of
Replacement string: %1 have
Matched text: should of
Resulting replacement: should have

The replacement string in the preceding example will replace "%1" with the actual word matched by the first word pattern in the sequence (should).
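The mark mechanism in this example can be sketched in Python. The matcher below is hard-coded for the single sequence =1{could,should,would} =2of, and both function names are hypothetical; only the placeholder expansion step is general. How the expanded string is spliced back into the text is left to the application, and the last lines show one possible choice.

```python
import re


def find_could_of(words: list[str]):
    """Find the first match of "=1{could,should,would} =2of" and return
    (start index, marks), or None if the sequence does not match."""
    for i in range(len(words) - 1):
        if words[i] in {"could", "should", "would"} and words[i + 1] == "of":
            return i, {1: words[i], 2: words[i + 1]}
    return None


def expand_marks(template: str, marks: dict[int, str]) -> str:
    """Replace each %n placeholder with the text saved in mark n."""
    return re.sub(r"%(\d+)", lambda m: marks[int(m.group(1))], template)


words = "I should of known".split()
start, marks = find_could_of(words)
replacement = expand_marks("%1 have", marks)

# One way an application might apply the replacement to the matched words.
corrected = " ".join(words[:start] + [replacement] + words[start + 2:])
```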

WGrammar contains modifiers to edit mark references, as shown in the following table.

Modifier: %d
Description: Make the following word lower case.
Examples:
  %d%1
  Mark 1: Hello
  Result: hello

Modifier: %m%n
Description: Apply the case pattern of the word in mark n to the following word.
Examples:
  %m1hello
  Mark 1: Now
  Result: Hello

Modifier: %n%n
Description: Apply the noun form of referenced mark n to the following word. For example, if mark n contains a plural noun, make the following word plural. If either the referenced word or the following word is not a noun, the following word is left unchanged. If the referenced word is both singular and plural, the following word is made singular.
Examples:
  %n%1truck
  Mark 1: cars
  Result: trucks

  %n%1%2
  Mark 1: car
  Mark 2: trucks
  Result: truck

  %n%1dogs
  Mark 1: fish
  Result: dog

  %n%1dogs
  Mark 1: going
  Result: dogs

Modifier: %p%n
Description: Apply the pronoun case of mark n to the following word. For example, if mark n contains a nominative-case pronoun, make the following word nominative case. If either the referenced word or the following word is not a pronoun or is a pronoun with no specific case, the following word is left unchanged. If the case of the referenced word is ambiguous, the following word is made nominative, objective, or possessive case, whichever first matches the referenced word.
Examples:
  %p%1him
  Mark 1: their
  Result: his

  %p%1they
  Mark 1: her
  Result: them

  %p%1both
  Mark 1: me
  Result: both

Modifier: %v%n
Description: Apply the verb form of referenced word n to the following word. The form of the referenced word is tested in the following order: V1SP, V3SP, VPAT, VPAP, VING. The first matching form is applied to the following word. If either the referenced word or the following word is not a verb, the following word is left unchanged.
Examples:
  %v%1steal
  Mark 1: borrowed
  Result: stole

  %v%1steal
  Mark 1: is
  Result: steals

  %v%1steal
  Mark 1: unusual
  Result: steal

Modifier: %NP, %NS
Description: Transform the following word to a singular or plural noun. If the following word is not a noun or is already in the specified form, it is left unchanged.
Examples:
  %NS%1
  Mark 1: bacteria
  Result: bacterium

  %NP%1
  Mark 1: truck
  Result: trucks

Modifier: %PRON, %PROO, %PROP
Description: Transform the following word to a pronoun of the specified case. If the following word is not a nominative-, objective-, or possessive-case pronoun, or is already in the specified case, it is left unchanged.
Examples:
  %PROOwho
  Result: whom

  %PROPwhoever
  Result: whosever

  %PROPher
  Result: her

  %PRONher
  Result: she

Modifier: %V1SP, %V3SP, %VPAT, %VPAP, %VING
Description: Transform the following word to a verb of the indicated form (1st person singular present tense, 3rd person singular present tense, past tense, past-perfect tense, present participle). If the following word is not a verb, it is left unchanged. Homonyms with different past- or perfect-tense forms such as hang (hanged, hung) or ring (ringed, rung) are transformed to the most common form.
Examples:
  %VPAPhang
  Result: hung

  %VPAPhanged
  Result: hung

  %V1SPstolen
  Result: steal

  %VPATfortune
  Result: fortune
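The ordered test described for the %v%n modifier can be sketched in Python. The two small dictionaries below are invented for this illustration and stand in for the verb data WGrammar would consult; the function name apply_verb_form is likewise hypothetical. The sketch reproduces the three %v%1steal examples above.

```python
# Order in which the referenced word's verb form is tested.
VERB_FORM_ORDER = ["V1SP", "V3SP", "VPAT", "VPAP", "VING"]

# Hypothetical data: the verb forms each referenced word can take.
FORMS_OF = {
    "borrowed": {"VPAT", "VPAP"},
    "is": {"V3SP"},
}

# Hypothetical data: the conjugation of each verb that can be transformed.
CONJUGATIONS = {
    "steal": {
        "V1SP": "steal", "V3SP": "steals", "VPAT": "stole",
        "VPAP": "stolen", "VING": "stealing",
    },
}


def apply_verb_form(reference: str, word: str) -> str:
    """Give `word` the first verb form, in test order, that `reference` has.
    If either word is not a known verb, `word` is returned unchanged."""
    if word not in CONJUGATIONS:
        return word
    reference_forms = FORMS_OF.get(reference, set())
    for form in VERB_FORM_ORDER:
        if form in reference_forms:
            return CONJUGATIONS[word][form]
    return word
```

Because VPAT is tested before VPAP, a reference word such as borrowed, which could be either, yields stole rather than stolen.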

The preceding rules apply to both the description and replacement strings; both are processed in an identical manner. In the grammar pattern file shipped with WGrammar, alternative replacements in the replacement string are separated by the text " -OR- " (note that one space precedes and follows the string). This is a convention only; if you define your own patterns, you can use any mechanism you prefer to separate alternative replacements. A typical replacement string with alternative replacements might look like this:

had %V00E%2 -OR- have %V00E%2 -OR- has %V00E%2
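An application that follows the " -OR- " convention might split a replacement string into its alternatives as in the Python sketch below. The function name split_alternatives is hypothetical, and the placeholders in the sample string are copied from the example above and left unexpanded.

```python
SEPARATOR = " -OR- "  # one space before and after, as in the shipped pattern file


def split_alternatives(replacement: str) -> list[str]:
    """Split a replacement string into its alternative replacements."""
    return replacement.split(SEPARATOR)


alternatives = split_alternatives("had %V00E%2 -OR- have %V00E%2 -OR- has %V00E%2")
```

Each alternative can then be expanded and offered to the user as a separate suggestion.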




Copyright © 2015 Wintertree Software Inc.