
Writing a grammar from scratch

Main UT OpenCCG page

This page addresses the process of writing a new grammar with OpenCCG and VisCCG, from the ground up. There are a number of tutorials to familiarize yourself with the system before starting a new grammar. These can be found from the Main UT OpenCCG page.

Grammar writing is a complex process, and there is no single right procedure. Rather than being a step-by-step tutorial, this page aims to raise and discuss a number of the decisions and choices to be made during grammar development.
It is by no means exhaustive.

At the same time, it really doesn't take long to get a new grammar going with VisCCG. Just start small and build the grammar up as you go.

A new grammar

Imagine we have a talking robot with the ability to generate speech based on the type of logical forms produced by OpenCCG, and we want to provide the option of having the robot speak in Spanish. So we want to develop a small Spanish grammar in order to produce logical forms for simple sentences. (This description ignores many, many details....)

Here's an example of a logical form produced by the ''tiny'' grammar for the sentence ''she bought the policeman a flower'':

s:
 @b1:action(buy ^ <tense>past ^
            <Actor>(p1:animate-being
                       ^ pro3f ^ <num>sg) ^
            <Beneficiary>(p2:person
                       ^ policeman ^ <det>the ^ <num>sg) ^
            <Patient>(f1:thing
                       ^ flower ^ <det>a ^ <num>sg))

Detailed discussion of logical forms appears later.

OpenCCG uses hybrid logic dependency semantics (HLDS) in its logical forms. The logical form is produced in parallel with the parse, and the output is a very detailed meaning representation.

The first step in writing a computational grammar is to determine the scope of that grammar. Unless you're hopelessly unrealistic, you won't at this point start by trying to write a broad-coverage grammar for your language. Rather you'll start with a small set of phenomena and build out from that core grammar.

The grammar most likely will be for a single language, and you may narrow the scope further to discussion of a particular domain in a language (like the World Cup grammar in Baldridge 2002, for example).

Phenomena
Always start your grammar small and build out from a basic core. It's helpful to think about what you want to cover, but you don't need to figure everything out before you start writing the grammar.

One place to start is determining the sentences or strings to be accepted by the grammar. (The testbed is a convenient way of organizing your thoughts about this.)

Next, think about what sorts of grammatical phenomena you want the grammar to cover. Do you want to handle transitive and intransitive verbs? Are those even relevant concepts for analyzing the language you're working with? Do you want to handle some particular phenomenon, like clitic climbing or long-distance extraction? Whether you're working on a language you speak or a language you're working on for the first time, you'll need to get a basic handle on the syntactic facts in the language.
Lexicon
One way of clarifying the phenomena you want to handle is to develop a set of clauses and/or phrases to be parsed by the grammar; this is your testbed. You are free to pick whatever lexical items you like for these examples, and in this way you begin to build up your lexicon.

Simple start: intransitive verbs, bare NPs

For the Spanish grammar, we'll start with basic intransitive clauses and bare NPs. We don't want to handle agreement and inflection yet, but we do want to keep these things in mind. There are three basic regular verb types (''-ar'', ''-er'', ''-ir'') in Spanish, so I've chosen one verb of each type (''bailar'', ''comer'', and ''vivir'') for this first set of sentences. I'm using the plural form ''tortugas'' so that I don't have to deal with determiners yet.

tortugas bailan ('turtles dance')
tortugas comen ('turtles eat')
tortugas viven  ('turtles live')

The section on predicates below explains how to get English (or any other language, really) predicates into the representation.
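One way to keep such a list of strings honest as the grammar grows is to record, for each string, how many parses you expect. The Python sketch below only illustrates that bookkeeping; it is not OpenCCG's testbed format. ''TESTBED'', ''check'', and ''toy_count_parses'' are hypothetical names, and the parse counter is a stand-in for whatever actually drives your parser.

```python
# Hypothetical testbed: each string is paired with the number of parses we expect.
TESTBED = [
    ("tortugas bailan", 1),
    ("tortugas comen", 1),
    ("tortugas viven", 1),
    ("tortugas tortugas", 0),  # two nouns in a row should not parse
]

def check(testbed, count_parses):
    """Return (sentence, expected, actual) for every item whose counts disagree."""
    failures = []
    for sentence, expected in testbed:
        actual = count_parses(sentence)
        if actual != expected:
            failures.append((sentence, expected, actual))
    return failures

def toy_count_parses(sentence):
    """Stand-in parser: accepts 'tortugas' followed by one of three known verbs."""
    words = sentence.split()
    verbs = {"bailan", "comen", "viven"}
    ok = len(words) == 2 and words[0] == "tortugas" and words[1] in verbs
    return 1 if ok else 0

print(check(TESTBED, toy_count_parses))
```

An empty list of failures means the grammar (here, the stand-in) behaves as the testbed expects.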

Running the grammar

See the tutorial for a refresher on using ''tccg''. Also check out some tips and tricks.

We want to get started parsing as soon as possible. In order to successfully parse these sentences, OpenCCG needs these things:
  • lexical categories for the words in the sentences
  • declaration of the words in the sentences
  • any features involved in either the word declarations or the category definitions mentioned above
You can start with either the categories or the word declarations, but you won't be able to parse until you have them both. I'll start with categories.

Lexical families
Much of the material in this section has been taken pretty much directly from the Rough Guide to OpenCCG.

Traditionally, the lexicon for a categorial grammar specifies for each word its own category. In OpenCCG, categories are instead organized into lexical families, which are related to whole sets of words. This makes it possible to avoid giving the same specification over and over again in a lexicon.

The simplest way in which words can be related to families is through their parts of speech: for a word we have to specify its part of speech, and for a family we have to specify the part of speech a word must have for the family to be applicable.

In each family, we define one or more entries, using an ''entry'' statement. Each entry defines a category with accompanying feature structure and logical form. An example of a family with multiple (ok, two) entries is the ditransitive verb in English. One entry would specify two NP complements, the other one NP and for-PP complement.

To declare the lexical families, of course, we need a CCG analysis of the sentences. No surprises here; I'll use the category ''n'' for  the noun, and ''s\n'' for the intransitive verb.

family N {
  entry: n;
}

family IntransV(V) {
  entry: s \ n ;
}
Word declarations
Now the words need to be declared. It's up to you how detailed to be with both the category definitions and  the word declarations. In particular, you can deal with features now or later, and you can deal with semantics now or later. Simple word declarations such as these are sufficient for parsing:

word tortugas:N;
word bailan:IntransV;
word comen:IntransV;
word viven:IntransV;

The grammar can now be loaded, and these three sentences can be parsed.
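If it helps to see why these two pieces are enough, here is a toy Python sketch of what happens when ''tortugas bailan'' is parsed: the verb's category ''s\n'' looks to its left for an ''n'' and yields ''s'' (backward application). This is only an illustration under simplifying assumptions (one category per word, two-word strings, no features); ''LEXICON'', ''backward_apply'', and ''parse_two_words'' are made-up names, not part of OpenCCG.

```python
# Toy lexicon mirroring the word declarations: each word maps to one category.
# "s\\n" is the intransitive verb category: it looks left for an n and yields s.
LEXICON = {
    "tortugas": "n",
    "bailan": "s\\n",
    "comen": "s\\n",
    "viven": "s\\n",
}

def backward_apply(left, right):
    """Backward application (<): Y  X\\Y  =>  X. Returns None if it does not apply."""
    if "\\" not in right:
        return None
    result, argument = right.split("\\", 1)
    return result if argument == left else None

def parse_two_words(sentence):
    """Return the category of a two-word string, or None if there is no parse."""
    words = sentence.split()
    if len(words) != 2 or any(w not in LEXICON for w in words):
        return None
    return backward_apply(LEXICON[words[0]], LEXICON[words[1]])

print(parse_two_words("tortugas bailan"))
```

Strings in the wrong order, such as ''bailan tortugas'', get no parse in this sketch because the verb only looks to its left.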


Features and morphology

For an introduction to features in OpenCCG, see the VisCCG tutorial.

Now we've got the grammar working and parsing a (very) small set of sentences. But now I want to be able to talk about just one ''tortuga'' as well as multiple ''tortugas''. The brute-force way to do this would be to write a separate declaration:

word tortuga:N;

Of course, simply adding forms this way would overgenerate. If we also declare the singular form of ''dance'' (''baila''), all four of these sentences would be allowed by the grammar:

tortugas bailan
*tortugas baila
tortuga baila
*tortuga bailan

There are two problems here:
  • we need a way of handling morphology efficiently
  • we need a way of constraining subject-verb agreement
Features are involved in both cases. First we need to incorporate features into word declarations. Then we need to appeal to these features in the lexical families.

The VisCCG tutorial provides an introduction to complex word declarations and expansions. There's even more information in the advanced topics page on word declarations.
Noun features
In this case we need a few different elements to handle the nominal morphology:
  • an expansion for nouns declaring singular and plural forms (very similar to the one in ''tinytiny.ccg'', but without the semantic information for now)
  • feature declarations for ''sg'' and ''pl''
  • a revised declaration for ''tortuga''
def noun(Sing, Plur) {
  word Sing:N {
    *: sg;
    Plur: pl;
  }
}

noun(tortuga,tortugas)

Now that feature structures have been introduced, we need to revise the lexical families to incorporate feature structures.

If we parse ''tortuga'' and ''tortugas'' with '':feats'' on, we get the following representations:

tccg> :feats
tccg> tortugas
1 parse found.

Parse: n<1>{NUM=pl}
------------------------------
(lex)  tortugas :- n<1>{NUM=pl}

tccg> tortuga
1 parse found.

Parse: n<1>{NUM=sg}
------------------------------
(lex)  tortuga :- n<1>{NUM=sg}

Subject-verb agreement
To get subject-verb agreement right, we'll use both the ''num'' feature and a person feature, as well as expansions for verbal morphology. The solution shown below is still a little hacky, but you can see right away how much using the expansions streamlines the morphological component of the grammar.

def ar-verb(Stem) {
   word Stem:IntransV {
     Stem . o: 1st sg;
     Stem . as: 2nd sg;
     Stem . a: 3rd sg;
     Stem . amos: 1st pl;
     Stem . ais: 2nd pl;
     Stem . an: 3rd pl;
   }
}

ar-verb(bail)

Note that this expansion makes use of the ''.'' operator, which simply concatenates two strings. This is the simplest of the built-in expansion functions.
Now we recompile and see the additional features. We can write similar expansions for ''-er'' verbs and ''-ir'' verbs.
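To make the effect of the ''.'' operator concrete, here is a small Python sketch that builds the same six forms by concatenating the stem with each ending. It is not DotCCG and not how OpenCCG implements expansions; ''AR_ENDINGS'' and ''ar_verb'' are names invented for the illustration.

```python
# Endings paired with the person and number features they carry,
# in the same order as the ar-verb expansion above.
AR_ENDINGS = [
    ("o", "1st", "sg"),
    ("as", "2nd", "sg"),
    ("a", "3rd", "sg"),
    ("amos", "1st", "pl"),
    ("ais", "2nd", "pl"),
    ("an", "3rd", "pl"),
]

def ar_verb(stem, family="IntransV"):
    """Return one entry per inflected form, built as stem + ending."""
    return [
        {"form": stem + ending, "stem": stem, "family": family,
         "feats": {"PERS": pers, "NUM": num}}
        for ending, pers, num in AR_ENDINGS
    ]

for entry in ar_verb("bail"):
    print(entry["form"], entry["feats"])
```

A sketch for ''-er'' or ''-ir'' verbs would differ only in the table of endings.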

tccg> baila
1 parse found.

Parse: s\n<1>{NUM=sg, PERS=3rd}
------------------------------
(lex)  baila :- s\n<1>{NUM=sg, PERS=3rd}

tccg> bailan
1 parse found.

Parse 1: s\n<2>{NUM=pl, PERS=3rd}
------------------------------
(lex)  bailan :- s\n<2>{NUM=pl, PERS=3rd}

tccg>

One more detail: we want to be sure to rule out ungrammatical sentences like the following, so we need to include the person feature in the noun expansion:

*tortuga bailo (1st person singular verb form)
*tortuga bailas (2nd person singular verb form)
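The reason the noun needs a person feature can be sketched in a few lines of Python. The idea is that the features on the subject and the features the verb form expects of its subject must unify; if the noun says nothing about person, nothing clashes with a 1st or 2nd person verb form. This is a simplified model with flat feature structures, and ''unify'', ''NOUNS'', ''VERBS'', and ''agrees'' are hypothetical names rather than OpenCCG internals.

```python
def unify(a, b):
    """Unify two flat feature structures; return the merged dict, or None on a clash."""
    merged = dict(a)
    for name, value in b.items():
        if name in merged and merged[name] != value:
            return None
        merged[name] = value
    return merged

# Features carried by each noun form (with the person feature added).
NOUNS = {
    "tortuga": {"NUM": "sg", "PERS": "3rd"},
    "tortugas": {"NUM": "pl", "PERS": "3rd"},
}
# Features each verb form requires of its subject.
VERBS = {
    "bailo": {"NUM": "sg", "PERS": "1st"},
    "bailas": {"NUM": "sg", "PERS": "2nd"},
    "baila": {"NUM": "sg", "PERS": "3rd"},
    "bailan": {"NUM": "pl", "PERS": "3rd"},
}

def agrees(noun, verb):
    """True if the subject noun and the verb form have compatible features."""
    return unify(NOUNS[noun], VERBS[verb]) is not None

print(agrees("tortuga", "baila"), agrees("tortuga", "bailo"))
# Without PERS on the noun, the 1st person form would wrongly be accepted:
print(unify({"NUM": "sg"}, VERBS["bailo"]) is not None)
```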

This is a fairly simple version of the morphology. Built-in expansion functions can be used to implement very complex morphological analysis. If you're ready for some very complex morphology, check out the Arabic grammar.

You'll also want to be able to handle irregular word forms. You might do this by writing individual word declarations for irregular verbs, or you could use an intermediate-level expansion to pass the arguments to a word declaration.

def irr-verb(Family,Stem,FirstSg,SecSg,ThirdSg,FirstPl,SecPl,ThirdPl) {
    word Stem:Family {
       FirstSg: 1st sg;
       SecSg: 2nd sg;
       ThirdSg: 3rd sg;
       FirstPl: 1st pl;
       SecPl: 2nd pl;
       ThirdPl: 3rd pl;
    }
}

irr-verb(IntransV,estar,estoy,estas,esta,estamos,estais,estan)

The ''Family'' argument allows us to use the same ''irr-verb'' expansion for transitive, intransitive, or ditransitive verbs. See the Spanish grammar.


Semantics

At this point, our grammar parses simple intransitive sentences and handles subject-verb agreement accurately. Given a sentence, though, all it can tell us is whether or not that sentence is grammatical.
Wouldn't it be great to get a meaning representation too? One of the beauties of OpenCCG is that, given the semantic information, it produces a logical form in parallel with the parse.

Handling semantics involves a few different aspects of the grammar:
  • assignment of logical forms to categories
  • semantic variables and semantic classes
  • predicates
  • semantic features
Logical forms
Much of this material is adapted from the Rough Guide to OpenCCG.

To associate meanings with categories, we need to take care of two things: the structure of the meaning (logical form) itself and the relation between the category and that meaning. The latter usually comes down to specifying how the meanings of arguments are to be fit into the logical form (LF).

The simplest LF is of the form @XΦ, where Φ is a proposition. Following linguistic tradition, we use a word's stem to represent its meaning, unless another predicate is specified. In the representation shown below, ''*'' is used to represent the default proposition (i.e. the stem). The X in the form above can be interpreted as the discourse referent of the proposition. The @ is the satisfaction operator.

LFs are declared as part of lexical family entries. Let's look first at the generalized LF for nouns.

family N {
  entry: n<2> : X(*);
 }

The entry consists of the category (''n'') for the entry with a feature-structure id (''<2>''). The LF is separated from the category by a colon. Recall that the LF should show both a discourse referent and a proposition. Here the discourse referent is represented by the variable ''X'', and the ''*'' shows the proposition to default to the stem.

If we parse ''tortugas'' with semantics on ('':sem''), this is the result:

tccg> :sem
tccg> tortugas      
1 parse found.

Parse: n<1>{NUM=pl, PERS=3rd} :
  @t1(tortuga)
------------------------------
(lex)  tortugas :- n<1>{NUM=pl, PERS=3rd} : @X_0(tortuga)

In the grammar we use logical variables for nominals, but during parsing OpenCCG instantiates these variables dynamically. The resulting LF here is ''@X_0(tortuga)'' — satisfaction operator (''@''), discourse referent (''X_0''), and proposition (''tortuga'').

At this stage it's very important to ensure that the grammar is getting the dependencies right when parsing. One way to check that is to give it a sentence to parse and look closely at the resulting logical form. We can try this LF for intransitive verbs:

family IntransV(V) {
  entry: s<1> \ n<2>:
         E(* <Actor>X);
}

In this LF, the discourse referent is ''E'', and the proposition is ''(* <Actor>X)''. The proposition is composed of the predicate (here represented by the default ''*'') and any arguments to that predicate. Here we have one argument — the argument is given the index ''Actor'' and is represented by (or rather, its discourse referent is represented by) the variable ''X''.

The LF OpenCCG produces for ''tortugas bailan'' raises some flags — look at the last line of this representation:

tccg> tortugas bailan
1 parse found.

Parse: s<2> :
  (@b1(bail ^
      <Actor>x1) ^ @t1(tortuga))
------------------------------
(lex)  tortugas :- n<1>{NUM=pl, PERS=3rd} : @X_0(tortuga)
(lex)  bailan :- s<2>\n<3>{NUM=pl, PERS=3rd} : (@E_1(bail) ^ @E_1(<Actor>X_1))
(<)    tortugas bailan :- s<2> : (@E_1(bail) ^ @E_1(<Actor>X_1) ^ @X_0(tortuga))

This parse has three discourse referents: ''E_1'' (the dancing event), ''X_1'' (the actor participant in that event), and ''X_0'' (a free-floating turtle). To fix the dependencies, we need semantic variables!

Semantic variables
We use semantic variables to co-index elements of a lexical entry's category with discourse referents in the logical form. Generally (though you can change it if you want) the variable ''E'' is used as the main event variable, and ''X'', ''Y'', and ''Z'' are used to represent participants in the action/event/state.
The semantic variables are given as features of category elements, inside the square brackets used for specifying features. The lexical families are modified accordingly.

family N {
  entry: n<2> [X] : X(*);
 }

family IntransV(V) {
  entry: s<1> [E] \ n<2> [X] :
     E(* <Actor>X);
 }

This is one of the places where VisCCG is very useful — as you modify the lexical families, pay attention to the changes in the ''Lexicon'' view tab.

Now we get the right dependencies:

tccg> tortugas bailan
1 parse found.

Parse: s<2>{index=E_1} :
  @b1(bail ^
      <Actor>(t1 ^ tortuga))
------------------------------
(lex)  tortugas :- n<1>{NUM=pl, PERS=3rd, index=X_0} : @X_0(tortuga)
(lex)  bailan :- s<2>{index=E_1}\n<3>{NUM=pl, PERS=3rd, index=X_1} : (@E_1(bail) ^ @E_1(<Actor>X_1))
(<)    tortugas bailan :- s<2>{index=E_1} : (@E_1(bail) ^ @E_1(<Actor>X_0) ^ @X_0(tortuga))
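The difference between the two parses can be modeled in a few lines of Python. In this sketch an LF is a flat list of (referent, content) pairs, and combining subject and verb simply conjoins the two lists. With co-indexing, the verb's argument variable is first identified with the subject's discourse referent. This is a simplified picture, not OpenCCG's unification code, and ''combine'' and ''dangling'' are hypothetical helpers.

```python
def combine(subject, verb, coindexed):
    """Conjoin the verb's LF with the subject's LF.

    When `coindexed` is true, the verb's argument variable is replaced by the
    subject's index, which is what the [X] annotation on the category achieves.
    """
    binding = {verb["arg_var"]: subject["index"]} if coindexed else {}
    lf = [
        (binding.get(ref, ref), tuple(binding.get(t, t) for t in content))
        for ref, content in verb["lf"]
    ]
    return lf + subject["lf"]

def dangling(lf):
    """Referents used as arguments that never head a predication of their own."""
    heads = {ref for ref, _ in lf}
    args = {content[1] for _, content in lf if len(content) == 2}
    return args - heads

tortugas = {"index": "X_0", "lf": [("X_0", ("tortuga",))]}
bailan = {
    "index": "E_1",
    "arg_var": "X_1",
    "lf": [("E_1", ("bail",)), ("E_1", ("<Actor>", "X_1"))],
}

print(dangling(combine(tortugas, bailan, coindexed=False)))
print(dangling(combine(tortugas, bailan, coindexed=True)))
```

Without co-indexing the actor ''X_1'' is left dangling and the turtle floats free; with it, the actor is the turtle's own referent.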


Predicates
When you're working with a new language, you may prefer to see meaning representations in a familiar language, or you may want something other than just the stem to appear as the predicate. For example, using the verbal stem ''bail'' rather than the infinitive ''bailar'' is a little awkward.

The predicates which appear in the logical forms produced by the OpenCCG parser are drawn from word declarations.

word tortuga:N (pred=turtle): sg;

If no predicate is specified in the word declaration, the default will be used. Recall that the default is the stem or lemma — this is always the first thing following the word ''word'' in the declaration. When words are declared with expansions, the stem is usually passed in as an argument of the expansion (''Sing'' in the noun expansion below).

def noun(Sing, Plur) {
  word Sing:N {
    *: sg;
    Plur: pl;
  }
}

noun(tortuga,tortugas)

This expansion can be modified to include the specified predicate.

def noun(Sing, Plur, Pred) {
  word Sing:N (pred=Pred) {
    *: sg;
    Plur: pl;
  }
}
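The fallback behaviour described above (use the given predicate, otherwise the stem) can be sketched in Python as follows. Note one liberty taken for the illustration: the predicate argument is optional here so that the default is visible in a single function. ''noun'' is a Python stand-in, not the DotCCG expansion itself.

```python
def noun(sing, plur, pred=None):
    """Mimic the noun expansion; the predicate defaults to the stem (the singular)."""
    predicate = pred if pred is not None else sing
    return [
        {"form": sing, "stem": sing, "pred": predicate, "feats": {"NUM": "sg"}},
        {"form": plur, "stem": sing, "pred": predicate, "feats": {"NUM": "pl"}},
    ]

# No predicate given: the LF would show the Spanish stem.
print([e["pred"] for e in noun("tortuga", "tortugas")])
# Predicate given: the LF would show the English gloss instead.
print([e["pred"] for e in noun("tortuga", "tortugas", "turtle")])
```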

Once predicates have been supplied for noun and verb expansions in the Spanish grammar, the resulting parse includes a logical form with English predicates.

tccg> :sem
tccg> tortugas bailan
1 parse found.

Parse: s<2>{index=E_1:action} :
  @d1:action(dance ^
             <Actor>(t1:animate-being ^ turtle))
------------------------------
(lex)  tortugas :- n<1>{NUM=pl, PERS=3rd, index=X_0:sem-obj} : @X_0:sem-obj(turtle)
(lex)  bailan :- s<2>{index=E_1:action}\n<3>{NUM=pl, PERS=3rd, index=X_1:animate-being} : (@E_1:action(dance) ^ @E_1:action(<Actor>X_1:animate-being))
(<)    tortugas bailan :- s<2>{index=E_1:action} : (@E_1:action(dance) ^ @E_1:action(<Actor>X_1:animate-being) ^ @X_1:animate-being(turtle))

If you're interested in more detailed information, see the word declarations section on the advanced topics page.


What's missing in the logical form now are features like number and person.
Semantic features
Semantic features allow the user to add semantic information to the logical form. (We could compare these to the morphosyntactic features we've seen so far, which are generally used in parsing but do not carry over to the LF.)
Semantic features must be declared in the feature hierarchy. Compare the ''num'' and ''sem-num'' features shown below (the rest of the feature hierarchy has been omitted for clarity):

   NUM<2>: sg pl;
   SEM-NUM<X>: sg-X pl-X;

Where the ''num'' feature links to a feature-structure id (''<2>''), the ''sem-num'' feature links to a semantic variable (''<X>''). Because the features have the same possible values but link to different types of entities, it might seem desirable to give the same names to the possible values of these two closely-related features (so ''sem-num<X>: sg pl;''). The problem with that solution is that there would be no way to differentiate between the two in word declarations, which is where the semantic features are declared.

def noun(Sing, Plur, Pred) {
  word Sing:N (pred=Pred) {
    *: sg sg-X;
    Plur: pl pl-X;
  }
}

The resulting parse for ''tortugas'' looks like this:

tccg> tortugas
1 parse found.

Parse: n<1>{NUM=pl, PERS=3rd, index=X_0:sem-obj}:
  @t1:sem-obj(turtle ^
              <X>pl-X)
------------------------------
(lex)  tortugas :- n<1>{NUM=pl, PERS=3rd, index=X_0:sem-obj} : (@X_0:sem-obj(turtle) ^ @X_0:sem-obj(<X>pl-X))

The representation can be refined by specifying a name for the semantic feature. (In the current LF it's represented as ''<X>pl-X'', not the most informative name possible.) This is done in the feature hierarchy.

SEM-NUM<X:NUM>: sg-X pl-X;

and results in a more informative LF:

tccg> tortugas
1 parse found.

Parse: n<1>{NUM=pl, PERS=3rd, index=X_0:sem-obj}:
  @t1:sem-obj(turtle ^
              <NUM>pl-X)
------------------------------
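The split between the two kinds of features can be pictured with a short Python sketch: every value in a word declaration is looked up in the feature hierarchy, and only the values attached to a semantic variable end up in the LF. The lookup is keyed on the value name, which is also why the two features cannot share value names. The tables and functions below are hypothetical and only mirror the two declarations shown above.

```python
# Mirrors  NUM<2>: sg pl;  (values attach to feature-structure id 2)
SYNTACTIC = {"sg": "NUM", "pl": "NUM"}
# Mirrors  SEM-NUM<X:NUM>: sg-X pl-X;  (values attach to variable X, labelled NUM)
SEMANTIC = {"sg-X": "NUM", "pl-X": "NUM"}

def split_features(values):
    """Sort a word's feature values into category features and LF features."""
    category, lf = {}, {}
    for value in values:
        if value in SYNTACTIC:
            category[SYNTACTIC[value]] = value
        elif value in SEMANTIC:
            lf[SEMANTIC[value]] = value
        else:
            raise ValueError(f"undeclared feature value: {value}")
    return category, lf

def render_lf(referent, predicate, lf_features):
    """Render a one-referent LF with its semantic features, tccg-style."""
    parts = [predicate] + [f"<{label}>{value}" for label, value in lf_features.items()]
    return f"@{referent}(" + " ^ ".join(parts) + ")"

category, lf = split_features(["pl", "pl-X"])
print(category)
print(render_lf("t1", "turtle", lf))
```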

One last note about semantic features: before you go about incorporating semantic features into every last word declaration, think about what information you want to be represented in the LF. While you could put semantic number features on the verbs as well as on the nouns, for languages like Spanish this would result in redundant information and a cluttered logical form.

Because semantic features are linked to semantic variables, we can also use semantic features which are associated with the entire clause. An example of this is a tense feature, declared like this in the feature hierarchy:

TENSE<E>: past pres;

in the word declaration (this is an example from the tinytiny English grammar):

def verb(stem, props, 3sing, pasttense) {
  word stem:props {
    *: pres non-3rd sg;
    3sing: pres 3rd sg;
    *: pres pl;
    pasttense: past;
  }
}

verb(go, IntransV, goes, went)

and referenced in the lexical family entry:

family IntransV(V) {
  entry: s<1> [E]\n<2> [X]: E:action(* <Actor>X:animate-being);
}

with a resulting LF:

tccg> tortugas bailan
1 parse found.

Parse: s<2>{index=E_1:action}:
  @d1:action(dance ^
             <E>pres ^
             <Actor>(t1:animate-being ^ turtle ^
                     <NUM>pl-X))
------------------------------

Semantic types
If you want to use semantic types as declared in the ontology (again see the features section of the tutorial), these are specified as a property of the discourse referent in the LF.

family N {
  entry: n<2> [X] : X:sem-obj(*);
 }

Both nominals and events can have types (or sorts) associated with them. In the entry above, ''X'' can be any subtype of semantic object. It is common to use types such as ''animate-being'' or ''person'' to restrict particular verbal arguments — we see this sort of type in the LF for the main ''IntransV'' entry. In this entry we also see a semantic type associated with the event discourse referent of the clause.

family IntransV(V) {
  entry: s<1> [E]\n<2> [X]: E:action(* <Actor>X:animate-being);
}
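To see how such type restrictions might play out, here is a Python sketch of a tiny type hierarchy with a compatibility check. The type names come from the examples on this page, but the arrangement of the hierarchy is an assumption made for the illustration, as is the idea that combining two compatible types keeps the more specific one (which is at least consistent with the turtle going from ''sem-obj'' to ''animate-being'' in the parse shown earlier). ''PARENT'', ''is_subtype'', and ''unify_types'' are hypothetical names.

```python
# Assumed ontology fragment: each type points to its parent (None for a root).
PARENT = {
    "sem-obj": None,
    "animate-being": "sem-obj",
    "person": "animate-being",
    "thing": "sem-obj",
}

def is_subtype(candidate, required):
    """True if `candidate` is `required` or one of its descendants."""
    while candidate is not None:
        if candidate == required:
            return True
        candidate = PARENT[candidate]
    return False

def unify_types(a, b):
    """Return the more specific of two compatible types, or None if they clash."""
    if is_subtype(a, b):
        return a
    if is_subtype(b, a):
        return b
    return None

# A noun typed sem-obj can fill an <Actor> slot that wants an animate-being...
print(unify_types("sem-obj", "animate-being"))
# ...but a noun typed thing cannot.
print(unify_types("thing", "animate-being"))
```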


Expanding the grammar

Pronouns and other function words
Transitive verbs, determiners, adjectives
Adverbs, coordination
Noun-adjective agreement
Relativization