
VisCCG and OpenCCG

Implementing grammars using VisCCG and OpenCCG

Main UT OpenCCG page

At present this page is simply a collection of things which may be useful when writing grammars in OpenCCG. Soon these scattered bits of information will be organized into something more coherent.

Predicates

The logical forms produced by the OpenCCG parser (recall that the semantic derivation occurs in parallel with the syntactic derivation) use predicates supplied by word declarations. The predicate is specified as an attribute of the word. If you're interested in the details, see the section below on word declarations.

Here are some examples from an itty-bitty Spanish grammar:

word tortuga:N (animal): sg fem;

word tortuga:N (pred=turtle class=animal): sg fem;
word pajaro:N (pred=bird class=animal): sg masc;

Note the two different declarations for ''tortuga'': in the first, no predicate is supplied, so this word will be represented in the logical form with the default predicate ''tortuga'' (its stem); in the second, the predicate is ''turtle''. If semantic class is the only attribute specified, there's no need to include the name of the attribute. (This is the only attribute for which this is true.) To use only the predicate attribute and not the semantic class attribute, you would use this sort of declaration:
 
word tortuga:N (pred=turtle): sg fem;


Inheriting a subset of features rather than all

Recall from the determiner section of the basic tutorial that feature structures can be inherited in the process of the derivation. This inheritance is indicated in the category definition by giving the two elements the same feature structure number:

family Det(indexRel=det) {
   entry:  np<2> /^ n<2>[X]:
         X:sem-obj(<det>*);
}

Sometimes you only want some of the features to be shared rather than inheriting the whole feature structure. In that case, use the inheritsFrom construct, indicated with ''~''. The determiner category shown below inherits the ''PERS'' feature from the ''n'' it combines with, but any ''CASE'' feature associated with the ''n'' is overridden, and the result ''np'' will have accusative case.

family Det(indexRel=det) {
   entry:  np<~2>[CASE=acc] /^ n<2>[X PERS]:
         X:sem-obj(<det>*);
}


Rules

Ben Wing's notes from the tiny.ccg grammar

Each statement specifies a single rule; it is also possible for statements to cancel some or all rules.

Note that some rules are enabled by default; this includes application, composition and crossed composition (forward and backward in each case), as well as forward type-raising from ''np'' to ''s/(s\np)'' and backward type-raising from ''np'' to ''s$1\(s$1/np)''.

rule {
  # turn off forward cross-composition
  no xcomp +;

  # this is how we could turn off all type-raising rules.
  # no typeraise;

  # Declare a backward type-raising rule from pp to s$1\(s$1/pp).
  # The $ causes a dollar-sign raise category to be created, as shown;
  # without it, we'd just get s\(s/pp).
  typeraise - $: pp => s;

  # Declare a type-changing rule to enable pro-drop (not useful in English!)
  # typechange: s[finite]\np[nom]$1 => s[finite]$1 ;
}

The next example shows how you can turn off all of the defaults and specify your own set of rules from scratch, if you want.

 rule {
   no; # remove all defaults
   app +-;
   comp +-; # +- means both forward and backward
   xcomp -;
   sub +-;
   xsub +-;
   # Defaults for typeraising are np => s, if omitted.
   typeraise +;
   typeraise - $;
 }


More on expansions — basic English nominal morphology

Here's a more complex version of expansions for English nouns. Note that expansions are processed recursively: if the text of an expansion contains calls to other expansions, they will also be processed. This makes 'inheritance' very easy to implement.

Remember that arguments functioning as variables within the expansion must be written with an initial upper-case letter, like ''Stem'' and ''Class'' below.

Inside of an expansion, the operator ''.'' can be used to concatenate two words together into a single word. For example, look at this expansion called ''normal-noun'':

def normal-noun(Stem, Class) {
  word Stem:N(Class) {
    *: sg sg-X;
    Stem . s: pl pl-X;
  }
}

We can declare regular nouns in English as simply as this:

normal-noun(book, thing)
normal-noun(car, thing)
normal-noun(bee, animal)

Or we could do this with two nested expansions, ''basic-noun'' and ''normal-noun'':

def basic-noun(Sing, Plur, Class) {
  word Sing:N(Class) {
    *: sg sg-X;
    Plur: pl pl-X;
  }
}

def normal-noun(Stem, Class) {
      basic-noun(Stem, Stem . s, Class)
}

And again, the same ''normal-noun'' declarations work:

normal-noun(book, thing)
normal-noun(car, thing)
normal-noun(bee, animal)
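
To make the effect of nested expansions concrete, here is a rough Python sketch of the text substitution they perform. It is only an illustration under assumptions: ''basic_noun'' and ''normal_noun'' are hypothetical Python stand-ins for the expansions above, and the real substitution is done when the grammar is compiled, not by code like this.

```python
def basic_noun(sing: str, plur: str, cls: str) -> str:
    """Hypothetical stand-in for the basic-noun expansion:
    returns the word declaration text it would expand to."""
    return (f"word {sing}:N({cls}) {{\n"
            f"  *: sg sg-X;\n"
            f"  {plur}: pl pl-X;\n"
            f"}}")

def normal_noun(stem: str, cls: str) -> str:
    """Hypothetical stand-in for normal-noun: 'Stem . s' becomes
    plain string concatenation before calling basic_noun."""
    return basic_noun(stem, stem + "s", cls)

print(normal_noun("book", "thing"))
```

The inner call is expanded first, which is why ''normal-noun'' can be defined entirely in terms of ''basic-noun''.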

Built-in expansion functions

We can do something even more clever to handle pluralization. This section discusses three built-in expansion functions. All three do some sort of text replacement, and all three follow normal Python conventions for regular expressions.

----

''regsub''

This is a conditional replacement function. It takes three arguments, in this order: a regular expression (''PATTERN''), a replacement text (''REPLACEMENT''), and a text (''TEXT'') to be searched for the regular expression. Every instance of ''PATTERN'' found in ''TEXT'' is replaced with ''REPLACEMENT''; if ''PATTERN'' does not occur in ''TEXT'', no replacements are made.

This is the syntax of the function:

regsub(PATTERN, REPLACEMENT, TEXT)

A simple example — the ''regsub'' function shown below will replace any occurrence of ''a'', ''b'', or ''c'' with ''d''.

regsub('[abc]','d',TEXT)

If we apply this to the text ''bad'', we get the result ''ddd''. (Why one would ever want such a function is a different question...) A more realistic use of ''regsub'' is illustrated below in the expansion ''pluralize''.
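
Since the built-ins follow Python regular-expression conventions, the behaviour of ''regsub'' can be approximated in Python with ''re.sub''. The sketch below assumes that correspondence; the Python function ''regsub'' is an illustrative stand-in, not part of OpenCCG.

```python
import re

def regsub(pattern: str, replacement: str, text: str) -> str:
    """Assumed equivalent of the regsub built-in: replace every
    match of `pattern` in `text` with `replacement`."""
    return re.sub(pattern, replacement, text)

print(regsub('[abc]', 'd', 'bad'))   # ddd
print(regsub('[abc]', 'd', 'moon'))  # moon (no match, text unchanged)
```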

----

''ifmatch''

This function differs from ''regsub'' in two important ways. First, ''regsub'' does a localized replacement on any occurrence of ''PATTERN'' that it finds within ''TEXT''. If ''PATTERN'' does not occur in ''TEXT'', no replacements are made. ''ifmatch'' instead does a global replacement of ''TEXT'' which is triggered only when the regular expression ''PATTERN'' is found at the beginning of ''TEXT''.

The second major difference is that ''ifmatch'' works like an ''if-else'' statement. It requires specification of one replacement text (''IF-TEXT'') to be used when there is a match between ''PATTERN'' and the beginning of ''TEXT'' and a second replacement text (''ELSE-TEXT'') to be used when there is not such a match.

This is the syntax of the function:

ifmatch(PATTERN, TEXT, IF-TEXT, ELSE-TEXT)


If the regular expression ''PATTERN'' is found at the beginning of ''TEXT'', the function will replace ''TEXT'' with ''IF-TEXT''. If ''PATTERN'' is not found at the beginning of ''TEXT'', the function will replace ''TEXT'' with ''ELSE-TEXT''.

And here's another weird example: imagine a group of publishers has decided that the world of linguistics is suffering from too much negativity and wants to remove every word that starts with ''un'' from their publications. Words such as ''unhappy'' will be replaced with the text ''CENSORED'', and all other words will be left unchanged.

ifmatch('un',TEXT,'CENSORED',TEXT)

Again, a more relevant example of ''ifmatch'' appears below in ''pluralize''.
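
In Python terms, the behaviour described above resembles ''re.match'', which only matches at the beginning of a string. The sketch below assumes that correspondence; ''ifmatch'' and ''censor'' are illustrative Python functions, not part of OpenCCG.

```python
import re

def ifmatch(pattern: str, text: str, if_text: str, else_text: str) -> str:
    """Assumed equivalent of the ifmatch built-in: return if_text when
    `pattern` matches at the beginning of `text`, else else_text."""
    return if_text if re.match(pattern, text) else else_text

def censor(word: str) -> str:
    # Same shape as ifmatch('un', TEXT, 'CENSORED', TEXT)
    return ifmatch('un', word, 'CENSORED', word)

print(censor('unhappy'))  # CENSORED
print(censor('happy'))    # happy
print(censor('sunny'))    # sunny ('un' is present, but not at the beginning)
```

Note that the whole text is replaced, not just the matching part, which is the main contrast with ''regsub''.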

----

''ifmatch-nocase''

This third built-in function works just like ''ifmatch'' but with case-insensitive pattern matching. So a case-insensitive version of the ''un''-censorship function would censor ''unhappy'', ''Unhappy'', ''UNHAPPY'', etc.

ifmatch-nocase('un',TEXT,'CENSORED',TEXT)
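
Assuming the same correspondence with Python's ''re'' module, case-insensitive matching can be sketched with the ''re.IGNORECASE'' flag. The function ''ifmatch_nocase'' below is a hypothetical Python stand-in for the built-in.

```python
import re

def ifmatch_nocase(pattern: str, text: str, if_text: str, else_text: str) -> str:
    """Assumed equivalent of ifmatch-nocase: like ifmatch, but the
    pattern is matched without regard to case."""
    matched = re.match(pattern, text, re.IGNORECASE)
    return if_text if matched else else_text

for word in ['unhappy', 'Unhappy', 'UNHAPPY', 'Sunny']:
    print(word, '->', ifmatch_nocase('un', word, 'CENSORED', word))
```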



----

Bringing these all together

Now we'll show how these built-in expansion functions can be used to write a very powerful expansion to handle plural morphology in English. The expansion ''pluralize'' shows a complicated expression using the built-ins ''ifmatch'' and ''regsub''. Here are the parts of the expression, in the order they appear:
  • If the word ends in a vowel + ''o'' or ''y'', the plural is formed by adding ''s''      
  • Else if the word ends in a consonant + ''o'' or ''y'', or if it ends in ''s'', ''sh'', ''ch'', or ''x'', the plural is formed by adding ''es'' (and in the case of words ending in ''y'', we first change the ''y'' to an ''i'')
  • Else (so in all other cases) the plural is formed simply by adding ''s''
Examples of each of these cases:
  • ''buy -> buys'', ''boy -> boys'', ''goo -> goos''
  • ''go -> goes'', ''try -> tries'', ''lady -> ladies''
  • ''cat -> cats'', etc.
Of course there are some exceptions which would need to be handled manually, such as the usual irregular plurals (''children'', ''deer'', etc.) and other forms which don't follow the rules described above (''piano -> pianos'', where the rules would produce ''pianoes'').

def pluralize(Word) {
  ifmatch('^.*[aeiou][oy]$', Word, Word . s,
    ifmatch('^.*([sxoy]|sh|ch)$', Word, regsub('^(.*)y$', '\1i', Word) . es,
            Word . s))
}


This expansion uses nested ''ifmatch'' statements, as the ''ELSE-TEXT'' argument of the first instance of ''ifmatch'' is itself an ''ifmatch'' statement. The ''IF-TEXT'' argument of the second ''ifmatch'' statement is a use of ''regsub'', followed by concatenation with ''es''. If the regular expressions aren't making sense, take a look at a tutorial on regular expressions in Python (the documentation for Python's ''re'' module is a good starting point). If they still don't make sense, ask someone for assistance.
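
One way to check the regular expressions is to transcribe ''pluralize'' into Python and try it on a few words. The sketch below assumes ''ifmatch'' behaves like ''re.match'' and ''regsub'' like ''re.sub''; all three Python functions are illustrative stand-ins, not OpenCCG code.

```python
import re

def regsub(pattern, replacement, text):
    # Assumed equivalent of the regsub built-in
    return re.sub(pattern, replacement, text)

def ifmatch(pattern, text, if_text, else_text):
    # Assumed equivalent of the ifmatch built-in
    return if_text if re.match(pattern, text) else else_text

def pluralize(word):
    # Same structure as the expansion: 'Word . s' becomes word + 's'
    return ifmatch(r'^.*[aeiou][oy]$', word, word + 's',
        ifmatch(r'^.*([sxoy]|sh|ch)$', word,
                regsub(r'^(.*)y$', r'\1i', word) + 'es',
                word + 's'))

for w in ['boy', 'goo', 'go', 'try', 'lady', 'church', 'bath', 'cat', 'piano']:
    print(w, '->', pluralize(w))
```

The last word in the list comes out as ''pianoes'', which illustrates why some nouns still have to be declared by hand.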

Now we can replace the ''normal-noun'' expansion we wrote above with this ''noun'' expansion which, together with the ''pluralize'' expansion above and the ''basic-noun'' expansion discussed earlier, allows for very concise word declarations for nouns.

def noun(Sing, Class) {
   basic-noun(Sing, pluralize(Sing), Class)
}

noun(book, thing)
noun(DVD, thing)
noun(glass, thing)
noun(church, thing)
noun(flower, thing)
noun(bath, thing)
noun(teacher, person)
noun(lady, person) # Pluralized (correctly) to 'ladies'
noun(boy, person)  # Pluralized (correctly) to 'boys'


Nouns with irregular plurals are declared with the ''basic-noun'' expansion, bypassing the ''pluralize'' expansion, which of course doesn't handle the irregular cases.

basic-noun(policeman, policemen, person)
basic-noun(piano, pianos, thing)
basic-noun(deer, deer, thing)



Complex morphology using expansions

The English pluralization example above shows in great detail how to use expansions and the built-in expansion functions to perform complex morphological analysis with OpenCCG.

Now here are some examples from a truly complex morphological system — Arabic nominal morphology.

UNDER CONSTRUCTION


Testbed

The testbed function of OpenCCG provides a convenient way to test the effects of changes in analysis throughout the grammar. A well-designed testbed contains a set of sentences, both grammatical and ungrammatical, which cover the range of phenomena you want your grammar to handle. Running it lets you check that the grammar accepts all of the examples you want it to accept and doesn't overgenerate by accepting the ungrammatical ones.

To run the testbed, run the following command from the command line:

 $ ccg-test -norealization tinytiny-grammar.xml

Word declarations

This text is from Ben Wing's comments in ''tiny.ccg''.

The format of word declarations is

word STEM:FAMILY ...(ATTRS): FEATURES;

or

word STEM:FAMILY ...(ATTRS) { INFLECTED-FORM: FEATURES; ...}


where ''STEM'' is the word's stem, ''FAMILY'' is a list of the families that a word is part of, and ''ATTRS'' specifies any other attributes associated with the word. 

''FEATURES'' gives the word's features; these come from the ''feature{}'' declarations in the grammar file. (NOTE: Only feature values whose features specify a "macro-tie" value, that is, something in <> following the feature's name, can be used.)

''ATTRS'' is a list; each attribute is either a specification ''ATTRIBUTE=VALUE''  or a single ''VALUE'' (equivalent to ''class=VALUE'').  The useful attributes are
  • ''class'': Semantic class of a word.
  • ''pred'': Semantic predicate of a word, used in the logical form; if omitted, defaults to the word's stem.
  • ''excluded'': List of excluded lexical categories.
  • ''coart'': Boolean indicating that this entry is a coarticulation, e.g., a pitch accent, gesture, or other word-associated element.

Any of ''FAMILY'', ''ATTRS'' and/or ''FEATURES'' can be omitted.
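
The two rules for ''ATTRS'' described above (a bare ''VALUE'' means ''class=VALUE'', and ''pred'' defaults to the stem) can be sketched in Python as follows. This is only an illustration of those rules: ''parse_attrs'' is a hypothetical helper, not the actual ''ccg2xml'' parser, and it treats every attribute value as a single whitespace-free token.

```python
def parse_attrs(attrs: str, stem: str) -> dict:
    """Hypothetical helper: turn an ATTRS string into a dict,
    applying the two defaults described in the text."""
    result = {}
    for item in attrs.split():
        if '=' in item:
            name, value = item.split('=', 1)
        else:
            # A bare VALUE is shorthand for class=VALUE
            name, value = 'class', item
        result[name] = value
    # If no predicate is given, the stem is used in the logical form
    result.setdefault('pred', stem)
    return result

print(parse_attrs('animal', 'tortuga'))
print(parse_attrs('pred=turtle class=animal', 'tortuga'))
print(parse_attrs('pred=turtle', 'tortuga'))
```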

The second form above, with braces, is used for words with different inflections. Instead of specifying the features directly after the word,  you list the features for each inflection separately.  Note that ''*'' is  shorthand for the stem itself.

Note that there can be more than one ''word{}'' declaration for a single stem.

The families in ''FAMILY'' can be either a family name, from a ''family{}'' block, or a part of speech. (''ccg2xml'' will derive the appropriate parts of speech from any families given when creating the XML file.)  Note that the words associated with a particular family can be specified either by tagging each word with its family, by listing a family's words explicitly using the ''member'' declaration inside of a ''family{}'' block, or by a combination of the two.
