This page addresses the process of writing a new grammar with OpenCCG and VisCCG from the ground up. A number of tutorials are available to help you get familiar with the system before starting a new grammar; they can be found from the Main UT OpenCCG page.
Grammar writing is a complex process, and there is no single right procedure. Rather than being a step-by-step tutorial, this page aims to raise and discuss a number of the decisions and choices to be made during grammar development.
It is by no means exhaustive.
At the same time, it really doesn't take long to get a new grammar going with VisCCG. Just start small and build the grammar up as you go.
The running example on this page is a small Spanish grammar, built up in order to produce logical forms for simple sentences. (This description ignores many, many details...)
Here's an example of a logical form produced by the ''tiny'' grammar for the sentence ''she bought the policeman a flower'':
Detailed discussion of logical forms appears later.
OpenCCG uses hybrid logic dependency semantics (HLDS) in its logical forms. The logical form is produced in parallel with the parse, and the output is a very detailed meaning representation.
The first step in writing a computational grammar is to determine the scope of that grammar. Unless you're hopelessly unrealistic, you won't start by trying to write a broad-coverage grammar for your language. Rather, you'll start with a small set of phenomena and build out from that core grammar.
The grammar most likely will be for a single language, and you may narrow the scope further to discussion of a particular domain in a language (like the World Cup grammar in Baldridge 2002, for example).
One place to start is determining the sentences or strings to be accepted by the grammar. (The testbed is a convenient way of organizing your thoughts about this.)
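As a rough illustration of the idea, a testbed can be thought of as a list of strings paired with the number of parses each should receive, where 0 marks a string the grammar should reject. The Python sketch below is only a toy model, not OpenCCG's testbed format; `run_testbed`, `toy_count_parses`, and the example strings are hypothetical.

```python
# Toy testbed: each item pairs a string with the number of parses we expect.
# An expected count of 0 marks a string the grammar should reject.
TESTBED = [
    ("tortugas bailan", 1),
    ("tortugas tortugas", 0),
]


def run_testbed(count_parses, testbed=TESTBED):
    """Return (sentence, expected, actual) for every item that fails."""
    failures = []
    for sentence, expected in testbed:
        actual = count_parses(sentence)
        if actual != expected:
            failures.append((sentence, expected, actual))
    return failures


def toy_count_parses(sentence):
    """Stand-in parser for illustration: accepts exactly one known sentence."""
    return 1 if sentence == "tortugas bailan" else 0
```

Keeping rejected strings in the list alongside accepted ones makes overgeneration visible as soon as it creeps in.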
Next, think about what sorts of grammatical phenomena you want the grammar to cover. Do you want to handle transitive and intransitive verbs? Are those even relevant concepts for analyzing the language you're working with? Do you want to handle some particular phenomenon, like clitic climbing or long-distance extraction? Whether you're working on a language you speak or one you're encountering for the first time, you'll need to get a basic handle on the syntactic facts of the language.
For the Spanish grammar, we'll start with basic intransitive clauses and bare NPs. We don't want to handle agreement and inflection yet, but we do want to keep these things in mind. There are three basic regular verb types (''-ar'', ''-er'', ''-ir'') in Spanish, so I've chosen one verb of each type (''bailar'', ''comer'', and ''vivir'') for this first set of sentences. I'm using the plural form ''tortugas'' so that I don't have to deal with determiners yet.
The section on predicates below explains how to get English (or any other language, really) predicates into the representation.
Sentences are parsed with ''tccg''. Also check out some tips and tricks.
We want to get started parsing as soon as possible. In order to successfully parse these sentences, OpenCCG needs these things:
Rough Guide to OpenCCG
Traditionally, the lexicon for a categorial grammar specifies for each word its own category. In OpenCCG, categories are instead organized into lexical families, which are related to whole sets of words. This makes it possible to avoid giving the same specification over and over again in a lexicon.
The simplest way in which words can be related to families is through their parts of speech: for a word we have to specify its part of speech, and for a family we have to specify the part of speech a word must have for the family to be applicable.
In each family, we define one or more entries, using an ''entry'' statement. Each entry defines a category with its accompanying feature structure and logical form. An example of a family with multiple (ok, two) entries is the ditransitive verb in English: one entry would specify two NP complements, the other an NP and a ''for''-PP complement.
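The relation between words, parts of speech, and families can be modelled in a few lines of Python. This is only a sketch of the lookup idea, not OpenCCG's data model; the family names `Noun` and `IntransV` and the dictionary layout are assumptions made for illustration.

```python
# Words carry only a part of speech; categories live in the families.
WORDS = {
    "tortugas": "N",
    "bailan": "V",
}

# Each family names the part of speech it applies to and lists its entries.
# A family may have several entries; these two happen to have one each.
FAMILIES = {
    "Noun": {"pos": "N", "entries": ["n"]},
    "IntransV": {"pos": "V", "entries": ["s\\n"]},
}


def categories_for(word):
    """Collect the entries of every family whose part of speech matches."""
    pos = WORDS[word]
    return [
        category
        for family in FAMILIES.values()
        if family["pos"] == pos
        for category in family["entries"]
    ]
```

Adding a new verb then means adding one line to `WORDS`; the category is written once, in the family.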
To declare the lexical families, of course, we need a CCG analysis of the sentences. No surprises here; I'll use the category ''n'' for the noun, and ''s\n'' for the intransitive verb.
The grammar can now be loaded, and these three sentences can be parsed.
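To see why the categories ''n'' and ''s\n'' are enough for these sentences, here is a minimal Python sketch of backward application over a two-word string. It is a toy model under stated assumptions, not how OpenCCG parses: categories are plain strings, only one combinatory rule is implemented, and the verb forms ''comen'' and ''viven'' are assumed third-person plural forms of the verbs chosen above.

```python
# Toy lexicon: one category per word, written as a plain string.
LEXICON = {
    "tortugas": "n",
    "bailan": "s\\n",
    "comen": "s\\n",
    "viven": "s\\n",
}


def backward_apply(left, right):
    """Backward application: Y  X\\Y  =>  X. Returns None if it fails."""
    if "\\" in right:
        result, argument = right.split("\\", 1)
        if argument == left:
            return result
    return None


def parse(sentence):
    """Parse a two-word string; return the resulting category or None."""
    words = sentence.split()
    if len(words) != 2 or any(word not in LEXICON for word in words):
        return None
    return backward_apply(LEXICON[words[0]], LEXICON[words[1]])
```

With this lexicon, `parse("tortugas bailan")` yields `s`, while reversing the word order yields `None` because the noun category has no argument slot to fill.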
Now we've got the grammar working and parsing a (very) small set of sentences. But now I want to be able to talk about just one ''tortuga'' as well as multiple ''tortugas''. The brute-force way to do this would be to write a separate declaration:
Of course, this would overgenerate. We could also declare ''baila'', the singular form for ''dance'', but now all four of these sentences would be allowed by the grammar:
There are two problems here:
The VisCCG tutorial provides an introduction to complex word declarations and expansions. There's even more information in the advanced topics page on word declarations.
Now that feature structures have been introduced, we need to revise the lexical families to incorporate them.
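The way feature structures block overgeneration can be sketched with a very small unification function. This Python example is illustrative only: it uses flat dictionaries rather than OpenCCG's feature structures, and the `num` values attached to each form are assumptions about how the words would be declared.

```python
def unify(fs1, fs2):
    """Merge two flat feature structures; return None on a value clash."""
    merged = dict(fs1)
    for feature, value in fs2.items():
        if feature in merged and merged[feature] != value:
            return None
        merged[feature] = value
    return merged


# Assumed number features for the noun and verb forms discussed above.
NOUNS = {"tortuga": {"num": "sg"}, "tortugas": {"num": "pl"}}
VERBS = {"baila": {"num": "sg"}, "bailan": {"num": "pl"}}


def agrees(noun, verb):
    """A subject and verb combine only if their features unify."""
    return unify(NOUNS[noun], VERBS[verb]) is not None
```

Of the four noun and verb combinations, only the two with matching number survive unification.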
If we parse ''tortuga'' and ''tortugas'' with '':feats'' on, we get the following representations:
Note that this expansion makes use of the ''.'' operator, which simply concatenates two strings. This is the simplest of the built-in expansion functions.
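The effect of a concatenating expansion can be mimicked in Python. The sketch below is not the expansion syntax itself; `expand_ar_verb` is a hypothetical helper, and the ending table is an assumption covering the present tense of regular ''-ar'' verbs.

```python
# Present-tense endings for regular -ar verbs, keyed by (person, number).
AR_ENDINGS = {
    ("1st", "sg"): "o",
    ("2nd", "sg"): "as",
    ("3rd", "sg"): "a",
    ("1st", "pl"): "amos",
    ("2nd", "pl"): "áis",
    ("3rd", "pl"): "an",
}


def expand_ar_verb(stem):
    """Return {form: features}, building each form as stem + ending."""
    return {
        stem + ending: {"pers": pers, "num": num}
        for (pers, num), ending in AR_ENDINGS.items()
    }
```

A single call such as `expand_ar_verb("bail")` stands in for six separate word declarations, each carrying its own person and number features.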
Now we recompile and see the additional features. We can write similar expansions for ''-er'' verbs and ''-ir'' verbs.
One more detail: we want to be sure to rule out ungrammatical sentences like the following, so we need to include the person feature in the noun expansion:
This is a fairly simple version of the morphology. Built-in expansion functions can be used to implement very complex morphological analysis. If you're ready for some very complex morphology, check out the Arabic grammar.
You'll also want to be able to handle irregular word forms. You might do this by writing individual word declarations for irregular verbs, or you could use an intermediate-level expansion to pass the arguments to a word declaration.
The ''Family'' argument allows us to use the same ''irr-verb'' expansion for transitive, intransitive, or ditransitive verbs. See the Spanish grammar.
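The role of the ''Family'' argument can be pictured with a Python function that takes the family as a parameter and lists the irregular forms explicitly. This is a loose analogy, not the ''irr-verb'' expansion itself: the function signature, the dictionary layout, and the choice of ''ir'' as the example verb are all assumptions.

```python
def irr_verb(family, stem, forms):
    """Build one word declaration per listed form, sharing stem and family."""
    return [
        {"form": form, "stem": stem, "family": family, "pers": pers, "num": num}
        for (pers, num), form in forms.items()
    ]


# Two present-tense forms of the irregular verb "ir", listed by hand.
IR_ENTRIES = irr_verb(
    "IntransV",
    "ir",
    {("3rd", "sg"): "va", ("3rd", "pl"): "van"},
)
```

Because the family is just another argument, the same helper could be called with a transitive or ditransitive family name without any other change.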
Wouldn't it be great to get a meaning representation too? One of the beauties of OpenCCG is that, given the semantic information, it produces a logical form in parallel with the parse.
Handling semantics involves a few different aspects of the grammar:
To associate meanings with categories, we need to take care of two things: the structure of the meaning (logical form) itself and the relation between the category and that meaning. The latter usually comes down to specifying how the meanings of arguments are to be fit into the logical form (LF).
The simplest LF is of the form @XΦ, where Φ is a proposition. Following linguistic tradition, we use a word's stem to represent its meaning, unless another predicate is specified. In the representation shown below, ''*'' is used to represent the default proposition (i.e. the stem). The X in the form above can be interpreted as the discourse referent of the proposition. The @ is the satisfaction operator.
LFs are declared as part of lexical family entries. Let's look first at the generalized LF for nouns.
The entry consists of the category (''n'') for the entry with a feature-structure id (''<2>''). The LF is separated from the category by a colon. Recall that the LF should show both a discourse referent and a proposition. Here the discourse referent is represented by the variable ''X'', and the ''*'' shows the proposition to default to the stem.
If we parse ''tortugas'' with semantics on ('':sem''), this is the result:
In the grammar we use logical variables for nominals, but during parsing OpenCCG instantiates these variables dynamically. The resulting LF here is ''@X_0(tortuga)'' — satisfaction operator (''@''), discourse referent (''X_0''), and proposition (''tortuga'').
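The step from the variable in the grammar to the nominal in the parse can be sketched as follows. This Python fragment is a toy model: it assumes that the numeric suffix is the position of the word in the input (which is what the examples on this page suggest), and `instantiate_noun_lf` is a hypothetical helper rather than anything OpenCCG exposes.

```python
def instantiate_noun_lf(stem, index, var="X", prop="*"):
    """Turn the template  var(prop)  into a concrete LF string.

    The variable receives a numeric suffix, and "*" is replaced by the stem.
    """
    predicate = stem if prop == "*" else prop
    return f"@{var}_{index}({predicate})"
```

For the first word of the input, `instantiate_noun_lf("tortuga", 0)` reproduces the LF shown above.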
At this stage it's very important to ensure that the grammar is getting the dependencies right when parsing. One way to check that is to give it a sentence to parse and look closely at the resulting logical form. We can try this LF for intransitive verbs:
In this LF, the discourse referent is ''E'', and the proposition is ''(* <Actor>X)''. The proposition is composed of the predicate (here represented by the default ''*'') and any arguments to that predicate. Here we have one argument — the argument is given the index ''Actor'' and is represented by (or rather, its discourse referent is represented by) the variable ''X''.
The LF OpenCCG produces for ''tortugas bailan'' raises some flags — look at the last line of this representation:
This parse has three discourse referents: ''E_1'' (the dancing event), ''X_1'' (the actor participant in that event), and ''X_0'' (a free-floating turtle). To fix the dependencies, we need semantic variables!
The semantic variables are given as features of category elements, inside the square brackets used for specifying features. The lexical families are modified accordingly.
This is one of the places where VisCCG is very useful — as you modify the lexical families, pay attention to the changes in the ''Lexicon'' view tab.
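What the semantic variables buy us can be shown with a small Python model of the ''tortugas bailan'' example. It is a sketch only: LFs are reduced to dictionaries, the predicate `bailar` is an assumed stem, and the `share_variables` flag stands in for the presence or absence of semantic variables on the categories.

```python
NOUN = {"ref": "X_0", "pred": "tortuga"}
VERB = {"ref": "E_1", "pred": "bailar", "actor": "X_1"}


def combine(noun, verb, share_variables):
    """Return the elementary predications for a noun + intransitive verb.

    With shared variables, the verb's actor slot is identified with the
    noun's discourse referent; without them it keeps its own nominal.
    """
    actor = noun["ref"] if share_variables else verb["actor"]
    return [
        f"@{verb['ref']}({verb['pred']})",
        f"@{verb['ref']}(<Actor>{actor})",
        f"@{noun['ref']}({noun['pred']})",
    ]


def referents(noun, verb, share_variables):
    """Collect the distinct discourse referents in the combined LF."""
    actor = noun["ref"] if share_variables else verb["actor"]
    return {verb["ref"], actor, noun["ref"]}
```

Without sharing there are three referents and the turtle floats free; with sharing there are two, and the turtle is the actor of the dancing event.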
Now we get the right dependencies:
The predicates which appear in the logical forms produced by the OpenCCG parser are drawn from word declarations.
If no predicate is specified in the word declaration, the default will be used. Recall that the default is the stem or lemma — this is always the first thing following the word ''word'' in the declaration. When words are declared with expansions, the stem is usually passed in as an argument of the expansion (''Sing'' in the noun expansion below).
This expansion can be modified to include the specified predicate.
Once predicates have been supplied for noun and verb expansions in the Spanish grammar, the resulting parse includes a logical form with English predicates.
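The fallback from a specified predicate to the stem can be summarised in Python. The sketch is a simplified model of a noun expansion: the `noun` helper, its `pred` parameter, and the plural formed by appending ''s'' are assumptions for illustration, not the expansion used in the Spanish grammar.

```python
def noun(sing, pred=None):
    """Expand a singular form into singular and plural word declarations."""
    return [
        {"form": sing, "stem": sing, "pred": pred, "num": "sg"},
        {"form": sing + "s", "stem": sing, "pred": pred, "num": "pl"},
    ]


def predicate_for(word_decl):
    """Use the declared predicate if there is one, else default to the stem."""
    return word_decl["pred"] if word_decl["pred"] is not None else word_decl["stem"]
```

Calling `noun("tortuga")` gives declarations whose predicate is the stem, while `noun("tortuga", "turtle")` gives the same forms with an English predicate.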
If you're interested in more detailed information, see the word declarations section on the advanced topics page.
What's missing in the logical form now are features like number and person.
Semantic features must be declared in the feature hierarchy. Compare the ''num'' and ''sem-num'' features shown below (the rest of the feature hierarchy has been omitted for clarity):
Where the ''num'' feature links to a feature-structure id (''<2>''), the ''sem-num'' feature links to a semantic variable (''<X>''). Because the features have the same possible values but link to different types of entities, it might seem desirable to give the same names to the possible values of these two closely-related features (so ''sem-num<X>: sg pl;''). The problem with that solution is that there would be no way to differentiate between the two in word declarations, which is where the semantic features are declared.
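The naming problem can be made concrete with a small lookup. In this Python sketch a word declaration is assumed to mention feature values by bare name, so each value must belong to exactly one feature; the value names `sg-X` and `pl-X` follow the ''pl-X'' seen in the LF on this page, and `feature_of` is a hypothetical helper.

```python
def feature_of(value, features):
    """Return the single feature that owns `value`; raise if ambiguous."""
    owners = [name for name, values in features.items() if value in values]
    if len(owners) != 1:
        raise ValueError(f"{value!r} belongs to {len(owners)} features: {owners}")
    return owners[0]


# Distinct value names: every value resolves to one feature.
DISTINCT = {"num": ["sg", "pl"], "sem-num": ["sg-X", "pl-X"]}

# Shared value names: "pl" could mean either feature.
SHARED = {"num": ["sg", "pl"], "sem-num": ["sg", "pl"]}
```

With `DISTINCT`, both `pl` and `pl-X` resolve cleanly; with `SHARED`, looking up `pl` raises an error because nothing says which feature is meant.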
The resulting parse for ''tortugas'' looks like this:
The representation can be refined by specifying a name for the semantic feature. (In the current LF it's represented as ''<X>pl-X'', not the most informative name possible.) This is done in the feature hierarchy.
and results in a more informative LF:
One last note about semantic features: before you go about incorporating semantic features into every last word declaration, think about what information you want to be represented in the LF. While you could put semantic number features on the verbs as well as on the nouns, for languages like Spanish this would result in redundant information and a cluttered logical form.
Because semantic features are linked to semantic variables, we can also use semantic features which are associated with the entire clause. An example of this is a tense feature, declared like this in the feature hierarchy:
in the word declaration (this is an example from the tinytiny English grammar):
and referenced in the lexical family entry:
with a resulting LF:
features section of the tutorial), these are specified as a property of the discourse referent in the LF.
Both nominals and events can have types (or sorts) associated with them. In the entry above, ''X'' can be any subtype of semantic object. It is common to use types such as ''animate-being'' or ''person'' to restrict particular verbal arguments — we see this sort of type in the LF for the main ''IntransV'' entry. In this entry we also see a semantic type associated with the event discourse referent of the clause.
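Restricting an argument to a semantic type amounts to a subtype check against a hierarchy. The Python sketch below uses a hypothetical fragment of such a hierarchy: `animate-being` and `person` come from the text above, while `sem-obj`, `thing`, and the placement of `person` under `animate-being` are assumptions.

```python
# Child -> parent links in a small, assumed type hierarchy.
PARENT = {
    "animate-being": "sem-obj",
    "person": "animate-being",
    "thing": "sem-obj",
}


def is_subtype(sem_type, ancestor):
    """True if sem_type equals ancestor or lies below it in the hierarchy."""
    current = sem_type
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False
```

An argument slot typed `animate-being` would accept a `person` but not a `thing`, while a slot typed with the root accepts anything.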