TüNDRA

Tübingen aNnotated Data Retrieval Application

Web tool for treebank research

Detailed Guide to TüNDRA Query Syntax

Nodes and Attributes

Nodes are specified between square brackets. Queries for nodes matching a particular attribute value always take the form:

[Attribute=Value]

String values must be in quotes.

[cat="NX"]

[word="aber"]

[pos="NN"]

[edge="obl"]

Some sentences in some treebanks have a 'multiword' column, which can be used to search for multi-token words that do not appear in their original form in TüNDRA. For example, UD does not consider contractions (e.g. English "can't", German "im") to be single tokens, and breaks these up into two tokens. To find such words in their original form, the following query can be used:

[multiword="im"] .2 "Rahmen"

This will search for instances of sentences which contained the string ("im Rahmen") in their original form. The '2' is necessary here to skip over the second token since the 'multiword' column only has a value for the first token of a multi-token word.

Boolean operators

TüNDRA supports two boolean operations: & (AND) and | (OR). These can join relations and individual node specifications.

[word="Berlin"] | [word="Bonn"]

[word="Berlin"] & [cat="NX"]

[word="Berlin"] & [word="haben"]

These operations can also join attribute specifications inside of node specifications, as described below.

Order of operations

The AND operator (&) has priority over OR (|). Use parentheses to fix the order of operations. The following two queries are not equivalent:

[word="Berlin"] & [word="Bonn"] | [word="Kanada"]

[word="Berlin"] & ([word="Bonn"] | [word="Kanada"])

The same order of operations and parentheses apply to attributes inside of node specifications.
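The difference between the two queries can be checked with a small truth table. The Python sketch below is only an illustration and not part of TüNDRA: it assumes each node specification can be reduced to a true/false value (whether a sentence contains such a node), and the function names are hypothetical.

```python
from itertools import product

def without_parentheses(berlin, bonn, kanada):
    # [Berlin] & [Bonn] | [Kanada]: AND binds tighter than OR.
    return (berlin and bonn) or kanada

def with_parentheses(berlin, bonn, kanada):
    # [Berlin] & ([Bonn] | [Kanada])
    return berlin and (bonn or kanada)

# Collect every combination on which the two readings disagree.
differing = [
    combo
    for combo in product([False, True], repeat=3)
    if without_parentheses(*combo) != with_parentheses(*combo)
]
print(differing)
```

Under this simplification, the two readings disagree exactly when "Kanada" is present and "Berlin" is absent: the first query still matches, the second does not.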

Reserved and special characters in attributes

The quote character (") must be preceded by a backslash (\) in any attribute. To find the quote, use:

[word="\""]

To search for the backslash character, it must also be preceded by a backslash:

[word="\\"]

When doing simple searches with quotes, all other characters should match correctly without backslashes or other particular markers. Backslashes other than ones used before the quote or the backslash are automatically deleted from attribute queries. The following two queries are therefore identical to TüNDRA:

[word="US-\$"]

[word="US-$"]

The rules for escaping characters are different for regular expressions. See the section on regular expressions for details.
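When queries with quoted string values are generated by a script, the two escaping rules for string values can be applied mechanically. The following Python sketch shows one way to do this; `quote_value` and `word_query` are hypothetical helper names, not part of TüNDRA.

```python
def quote_value(text: str) -> str:
    """Wrap literal text in quotes, escaping backslashes first, then quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def word_query(text: str) -> str:
    # Build a [word="..."] node specification for a literal token.
    return "[word=" + quote_value(text) + "]"

print(word_query('"'))     # the quote character
print(word_query("\\"))    # a single backslash
print(word_query("US-$"))  # no escaping needed
```

Backslashes are escaped before quotes so that the backslash added in front of a quote is not itself doubled.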

NB: The attributes word and token are coded to be equivalent. The queries [word="foo"] and [token="foo"] have identical results.

Special simple queries

Entering a simple string between quotes matches any node with a token, lemma, category, or subcategory that matches that string.

So, the query

"Kanada"

is equivalent to:

[token="Kanada" | lemma="Kanada" | cat="Kanada" | subcat="Kanada"]
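The expansion can be written out programmatically, which may help when comparing a simple query with its long form. This Python sketch is an illustration only; the function name is hypothetical, and it assumes the string contains no quotes or backslashes that would need escaping.

```python
SIMPLE_QUERY_ATTRIBUTES = ("token", "lemma", "cat", "subcat")

def expand_simple_query(text: str) -> str:
    # Assumes text has no quotes or backslashes that would need escaping.
    alternatives = [f'{attr}="{text}"' for attr in SIMPLE_QUERY_ATTRIBUTES]
    return "[" + " | ".join(alternatives) + "]"

print(expand_simple_query("Kanada"))
```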

Unicode

TüNDRA uses Unicode at all levels, and Unicode characters can be used directly to match attributes in queries. Unicode characters can also be referenced numerically in both string values (between quotes) and in regular expressions. For example, to query for the en dash character (–), users can write &#8211; (decimal) or &#x2013; (hexadecimal).

Furthermore, the non-breaking space character can be queried using &nbsp;. Other HTML named entities are not currently supported.

Regular Expressions

Attribute values can also match regular expressions. Regular expressions are always between forward slashes (/). TüNDRA follows XQuery pattern matching and regular expression syntax, with some changes to enhance TIGERSearch compatibility.

All regular expressions must match the entire string, as if preceded with ^ and followed by $.

[cat=/F.+N/]

[word=/ge.+ß/]

[lemma=/[dD]eut.*/]

[word=/.*[aöüAÖÜ].*/]
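The whole-string requirement is the main difference from substring-style searching. The Python sketch below illustrates it with `re.fullmatch`; Python's `re` module is not the XQuery regular expression engine that TüNDRA follows, so this is only an analogy for the anchoring behaviour, and the helper name is hypothetical.

```python
import re

def matches_whole_value(pattern: str, value: str) -> bool:
    # re.fullmatch succeeds only if the pattern covers the entire string,
    # as if the pattern were wrapped in ^ and $.
    return re.fullmatch(pattern, value) is not None

print(matches_whole_value("ge.+ß", "gewiß"))      # whole token is covered
print(matches_whole_value("ge.+ß", "ungewiß"))    # leading "un" is not covered
print(re.search("ge.+ß", "ungewiß") is not None)  # substring search, for contrast
```

To match a pattern anywhere inside a value, pad it with .* on both sides, as in the last example query above.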

In general, TüNDRA supports the most common regular expression operations. Below is a summary of the most common operations supported in pattern matching:

`.`	unspecified character	`[word = /sag./]`
`*`	unrestricted repetition	`[lemma = /spiel.*/]`
`+`	repetition with minimum 1	`[word = /.+[0-9A-Z]+.*/]`
`?`	optionality	`[word = /(Leben)s?/]`
`[ ]`	character set	`[word = /.+[0-9A-Z]+.*/]`
`^`	negated character sets	`[word = /[^0-9A-Z].*/]`
`( )`	grouping	`[word = /([lmnp][aeiou])+/]`
`|`	disjunction	`[word = /[dD](as|er)/]`
`\`	escape for reserved characters	`[word = /.\-./]`

Reserved and special characters in regular expressions

The following characters are reserved in regular expression syntax and must be preceded by a backslash if they are to be matched:

( ) [ ] . + $ * ? | / \

For example, to search for any token containing the dollar sign ($), use:

[word=/.*\$.*/]

In addition, for compatibility with TIGERSearch, backslashes are removed from pattern matches if they precede the following characters:

! # , - : ; = > @ & " '

For example, to find all tokens with apostrophes, the following queries are equivalent:

[word=/.*\'.*/]

[word=/.*'.*/]

All other backslashes are preserved, even where they result in empty search results, because of the role they play in regular expression syntax. When a pattern match behaves unexpectedly, make sure any backslashes are placed according to XQuery regular expression syntax and this document.
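The backslash-removal rule can be pictured as a preprocessing step applied to a pattern before it is evaluated. The Python sketch below is an assumption about how such a step could work, not TüNDRA's actual implementation; in particular, treating an escaped backslash as a unit that is left untouched is a choice made here, and the names are hypothetical.

```python
# Characters after which a preceding backslash is dropped (from the list above).
STRIPPED_AFTER_BACKSLASH = "!#,-:;=>@&\"'"

def normalize_pattern(pattern: str) -> str:
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in STRIPPED_AFTER_BACKSLASH:
                out.append(nxt)       # drop the backslash, keep the character
            else:
                out.append(ch + nxt)  # keep the escape sequence unchanged
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(normalize_pattern(r".*\'.*"))  # backslash before ' is removed
print(normalize_pattern(r".*\$.*"))  # backslash before $ is preserved
```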

Multiple values

Attribute values can be specified by disjunctions of multiple values:

[cat=("VXFIN"|"VXINF")]

[lemma=("Hund"|"Katz"|"Vogel")]

[cat=("NX"|/V.*/)]

[edge=("csubj"|"nsubj")]

All disjunctions of attribute values must be between parentheses and separated by the bar character (|).

Negation

Attribute-value pairs can be negated by replacing the equals sign (=) with !=:

[cat!="NX"]

[lemma!=/.*[äöüßÄÖÜ].*/]

[rel!="nsubj"]

Reserved Characters

When a value contains a reserved character, it may have to be searched as a regular expression for certain queries in order to avoid syntax errors.

[pos = "NOUN"] > /acl:relcl/ [pos = "VERB"]

Multiple attributes

Nodes can be specified with more than one attribute:

[pos="NN" & lemma=/ge.*/]

[pos="NOUN" & lemma=/ge.*/]

[pos=/V.*/ | lemma!=/.*en/]

Both conjunctions (&) and disjunctions (|) can be combined. The default order of operations is conjunction before disjunction. The two node specifications below are equivalent:

[pos=/V.*/ | lemma=/.*en/ & word=/[A-Z].*/]

[pos=/V.*/ | (lemma=/.*en/ & word=/[A-Z].*/)]

But they are not equivalent to:

[(pos=/V.*/ | lemma=/.*en/) & word=/[A-Z].*/]

Attribute variables

Variables always start with # and can be followed by any sequence of letters, numbers, and underscores:

[lemma=#1:/a.*/]

[lemma=#a1:/a.*/]

[lemma=#starts_with_a:/a.*/]

Variables make it possible to query for nodes with the same values for multiple attributes:

[lemma=#1:/a.*/ & word=#1]

Variables also make it possible to query for different nodes that share attribute values. To search for a sentence with two identical words starting with the letter "B", either of the following will work:

[word=#1:/B.*/] .* [word=#1]

[#1:word=/B.*/] .* [#1]

Node Variables

Matched nodes may also be assigned to variables:

#city:[lemma=("Berlin"|"Hamburg"|"Stuttgart")]

#np1:[cat="NX"]

Node variables can be used in relations, so that whole tree structures are specified:

An example from a constituency treebank:

#1:[cat="NX"] >* [word="Berlin"] & #1 >* [lemma="Flüchtling"]

This query matches the structure below:

An example from a dependency treebank:

#1:[pos="VERB"] >* [word="dem"] & #1 >* [lemma="Land"]

This query matches the structure below:

Node Classes

TüNDRA supports queries on two general node classes in constituent trees: terminal and non-terminal.

Terminal nodes are nodes without descendants. They are queried with [T].
Non-terminal nodes are nodes with descendants. They are queried with [NT].

Node classes can be combined with attribute value queries:

[T & pos="NN"]

[NT & cat=/V.*/]

More commonly, these classes are used to query for arbitrary nodes in some relation to another:

Examples from a constituency treebank:

[NT] > [pos="NN"]

[cat="PX"] >* [T & pos!=/N.*/]

An example from a dependency treebank:

[NT & pos="NOUN"] > [pos!="NOUN"]

Relations

All of the relations supported by TüNDRA are binary: each holds between exactly two nodes. Querying structures of three or more nodes requires node variables.

Ancestry and descent

Immediate parent-child relations are indicated with the ">" operator. (right)

Constituency Treebank Examples:

[cat="NX"] > [lemma="Berlin"] (See figure above)

[cat="MF"] > [cat="NX"]

Dependency Treebank Examples:

[word="haben"] > [word="das"]

The ">*" operator is used to indicate ancestor-descent relations of any distance. (below)

Constituency Treebank Examples:

[cat="MF"] >* [cat="NX"]

[cat="NX"] >* [lemma="Berlin"]

Dependency Treebank Examples:

[word="haben"] >* [word="das"]

To indicate ancestor-descent at a fixed distance, add a number after the > operator:

Constituency Treebank Examples:

[cat="MF"] >2 [cat="NX"]

Dependency Treebank Examples:

[word="haben"] >2 [word="das"]

To indicate ancestor-descent at a bounded distance, add the bounding values separated by a comma after the ">" operator:

Constituency Treebank Examples:

[cat="MF"] >2,4 [cat="NX"]

Dependency Treebank Examples:

[word="haben"] >2,4 [word="das"]
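The four forms of the dominance operator differ only in which distances they accept. The Python sketch below models this on a hypothetical toy tree; it assumes that the bounds in ">2,4" are inclusive, which the text above does not state explicitly, and all names are illustrative.

```python
# Hypothetical toy tree, stored as child -> parent.
PARENT = {"MF": "SIMPX", "NX": "MF", "ART": "NX", "NN": "NX"}

def distance(ancestor, node):
    """Number of edges from ancestor down to node, or None if unrelated."""
    steps = 0
    while node in PARENT:
        node = PARENT[node]
        steps += 1
        if node == ancestor:
            return steps
    return None

def dominates(ancestor, node, low=1, high=1):
    """low/high bound the distance; high=None means no upper bound, as with >*."""
    d = distance(ancestor, node)
    if d is None:
        return False
    return low <= d and (high is None or d <= high)

print(dominates("SIMPX", "MF"))           # >    immediate child
print(dominates("SIMPX", "NX", 2, 2))     # >2   exactly two edges
print(dominates("SIMPX", "NN", 2, 4))     # >2,4 between two and four edges
print(dominates("SIMPX", "NN", 1, None))  # >*   any distance
```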

Edge labels

To query on edges between nodes, append the edge label to the ">" operator. (right)

Constituency Treebank Examples:

[cat="NX"] >HD [pos="NN"] (See figure above)

[cat="MF"] >OA [cat="NX"]

Dependency Treebank Examples:

[word="Tat"] >case [word="in"]

To query for lists of edges, put them in parentheses separated by the bar (|):

Constituency Treebank Examples:

[cat="MF"] >(OA|ON) [cat="NX"]

Edge labels can also be negated:

Constituency Treebank Examples:

[cat="MF"] >!OA [cat="NX"]

[cat="MF"] >!(OA|ON) [cat="NX"]

Labeled edges

In treebanks with a mixture of labeled and unlabeled edges, the relation ">%" finds nodes connected with labeled edges:

Constituency Treebank Examples:

[cat="NX"] >% [cat="NX"]

And to query for unlabeled edges, use ">!%":

Constituency Treebank Examples:

[cat="NX"] >!% [cat="NX"]

Positional parentage

To query on the leftmost and rightmost descendant of a non-terminal node, use ">@l" and ">@r":

Constituency Treebank Examples:

[cat="NX"] >@l [lemma="der"]

[cat="NX"] >@r [pos="NE"]

Dependency Treebank Examples:

[word="die"] >@l [word="Länder"]

[word="Tausende"] >@r [word="Vertriebenen"]

To negate positional parentage, "!>@l" is equivalent to ">!@l", and "!>@r" is equivalent to ">!@r". A double negation cancels out: "!>!@l" is equivalent to ">@l", and "!>!@r" is equivalent to ">@r".

Secondary edges

Secondary edges can be queried with the ">~" operator:

Constituency Treebank Examples:

[cat="PX"] >~ [cat="NX"]

Dependency Treebank Examples:

[word="haben"] >~ [word="das"]

Secondary edge labels can be queried by appending them to the ">~" operator:

Constituency Treebank Examples:

[cat="PX"] >~refmod [cat="NX"]

Dependency Treebank Examples:

[word="haben"] >~obj [word="das"]

Sequence

Adjacent nodes can be queried using the "." operator. (below)

Constituency Treebank Examples:

[cat="MF"] . [cat="VC"]

[word="der"] . [cat="NX"]

[word="der"] . [word="Mann"]

Dependency Treebank Examples:

[word="der"] . [word="Mann"]

The ".*" operator queries nodes that follow each other at any distance in the same sentence:

Constituency Treebank Examples:

[word="der"] .* [word="Mann"]

Dependency Treebank Examples:

[word="der"] .* [word="Mann"]

To indicate a sequence at a fixed distance, add a number after the "." operator:

Constituency Treebank Examples:

[cat="MF"] .2 [cat="NX"]

Dependency Treebank Examples:

[word="der"] .2 [word="Mann"]

To indicate a sequence at a bounded distance, add the bounding values separated by a comma after the "." operator:

Constituency Treebank Examples:

[cat="MF"] .2,4 [cat="NX"]

Dependency Treebank Examples:

[word="der"] .2,4 [word="Mann"]

Adjacent categories

Adjacency is defined for tokens with respect to their surface order. If two words appear next to each other in text, then they will match a query using ".", no matter what the tree structure is.

For non-token nodes, adjacency is determined with respect to the token nodes underneath them. In the example below, the word "Vorsitzende" is adjacent to the "NX" node, highlighted below, because the word next to "Vorsitzende" is the first word of the "NX" phrase.

[word="Vorsitzende"] . [cat="NX"]
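Adjacency for non-token nodes can be modelled with the set of token positions each node covers. The Python sketch below assumes that a node's extent runs from its first to its last token, so that a left-hand non-terminal is adjacent to whatever starts right after its last token; the text above only spells out the case of a token followed by a phrase. The positions and names are hypothetical.

```python
def adjacent(left_tokens, right_tokens):
    # Assumption: the left node ends at its last token and the right node
    # starts at its first token, both measured in surface order.
    return max(left_tokens) + 1 == min(right_tokens)

vorsitzende = {1}   # a single token at position 1
nx_phrase = {2, 3}  # hypothetical NX node covering tokens 2 and 3

print(adjacent(vorsitzende, nx_phrase))  # the NX phrase starts right after the token
print(adjacent(nx_phrase, vorsitzende))  # the reverse order does not hold
```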

Siblings

The "$" operator queries nodes that are siblings, regardless of order or surface distance:

Constituency Treebank Examples:

[word="Mann"] $ [word="der"]

Dependency Treebank Examples:

[word="dem"] $ [word="in"]

Sequential siblings

The "$.*" operator matches siblings in order from left to right, depending on their order in the tree, regardless of the surface distance between them:

Constituency Treebank Examples:

[word="der"] $.* [word="Mann"]

Dependency Treebank Examples:

[word="der"] $.* [word="große"]

Equality

To test if nodes are the same or different, use "=" and "!=". This can be used with queries that require matching more than one node with the same specification.

Constituency Treebank Examples:

#1:[cat="NX=PER"] > #2:[pos="NE"] & #1 >* #3:[pos="NE"] & #2 != #3

Dependency Treebank Examples:

#1:[pos="VERB"] > #2:[pos="VERB"] & #1 > #3:[pos="VERB"] & #2 != #3

Negated relations

All relations are negated by inserting "!" before the relation:

[cat="MF"] !> [cat="NX"]

[cat="MF"] !>* [cat="NX"]

[cat="SIMPX"] !>~ [cat="NX"]

[word="der"] !$.* [word="Mann"]

A negated relation matches the first node only when no node matching the second specification satisfies the relation. For example,

[word="der"] !.* [word="Mann"]

matches every instance of "der" that is not followed by "Mann" anywhere later in the same sentence.

Predicates

TüNDRA supports five predicates on tree nodes:

Root

Tests for nodes that are the root of a tree in the treebank:

Constituency Treebank Examples:

root(#r)

root([cat="VROOT"])

Dependency Treebank Examples:

root(#r)

root([cat="ROOT"])

Arity

Tests the number of immediate children a node has.

With just one numerical argument, it matches nodes that have exactly that number of children. With two numerical arguments, it matches nodes that have any number of children between the first and the second arguments:

Constituency Treebank Examples:

"Berlin" & arity([cat="MF"],2)

"Berlin" & arity([cat="MF"],2,5)

Dependency Treebank Examples:

"haben" & arity([word="haben"],2)

"haben" & arity([word="haben"],2,5)

Tokenarity

Tests the number of tokens underneath a node.

With just one numerical argument, it matches nodes that have exactly that number of token descendants. With two numerical arguments, it matches nodes that have any number of token descendants between the first and the second arguments.

Constituency Treebank Examples:

"gehen" & tokenarity([cat="MF"],2)

"gehen" & tokenarity([cat="MF"],2,5)

Dependency Treebank Examples:

"haben" & tokenarity([word="haben"],2)

"haben" & tokenarity([word="haben"],2,5)

Continuous

Tests if all the tokens beneath a node form a continuous sequence:

Constituency Treebank Examples:

"müssen" & continuous([cat="MF"])

Dependency Treebank Examples:

"sehen" & continuous([pos="NOUN"])

Discontinuous

Tests if the tokens beneath a node do not form a continuous sequence:

Constituency Treebank Examples:

"müssen" & discontinuous([cat="NX"])

Dependency Treebank Examples:

"sehen" & discontinuous([pos="VERB"])

Negation

Predicates can be negated with !:

!root(#1)

!arity([cat="NX"],2)

NB: Beware that by itself, a negated predicate matches any sentence in which the predicate holds for no node. To restrict the negation to particular nodes, bind them to a variable first. For example,

#1:[cat="NX"] & !arity(#1,2)

finds all nodes matching [cat="NX"] which do not have exactly 2 children. In contrast, !arity([cat="NX"],2) does not match any node, but matches all sentences that do not contain any match for [cat="NX"] with exactly 2 children.
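The predicates in this section, and the two kinds of negation, can be tried out on a toy model. The Python sketch below is an illustration under stated assumptions rather than TüNDRA's implementation: the sentences and node names are hypothetical, two-argument bounds are taken to be inclusive, and arity(..., 2) is read as "exactly two children", following the Arity section.

```python
# Hypothetical toy sentences: each maps a non-terminal to its children.
# Strings are non-terminals, integers are token positions.
SENTENCES = {
    "s1": {"S": ["NX1", 1, "NX2"], "NX1": [0, 2], "NX2": [3, 4, 5]},
    "s2": {"S": ["NX3", 1], "NX3": [0]},
    "s3": {"S": [0, 1]},  # contains no NX node
}

def tokens_below(tree, node):
    # Collect all token positions underneath a node, in surface order.
    if isinstance(node, int):
        return [node]
    found = []
    for child in tree[node]:
        found.extend(tokens_below(tree, child))
    return sorted(found)

def arity(tree, node, low, high=None):
    # One bound means "exactly low"; two bounds are treated as inclusive.
    high = low if high is None else high
    return low <= len(tree[node]) <= high

def tokenarity(tree, node, low, high=None):
    high = low if high is None else high
    return low <= len(tokens_below(tree, node)) <= high

def continuous(tree, node):
    # True if the token positions form an unbroken run.
    tokens = tokens_below(tree, node)
    return tokens == list(range(tokens[0], tokens[-1] + 1))

def nx_nodes(tree):
    return [node for node in tree if node.startswith("NX")]

# #1:[cat="NX"] & !arity(#1,2): NX nodes without exactly two children.
bound = sorted(
    node
    for tree in SENTENCES.values()
    for node in nx_nodes(tree)
    if not arity(tree, node, 2)
)

# !arity([cat="NX"],2): sentences in which no NX node has exactly two children.
unbound = sorted(
    name
    for name, tree in SENTENCES.items()
    if not any(arity(tree, node, 2) for node in nx_nodes(tree))
)

print(bound)
print(unbound)
```

Note that the sentence without any NX node appears in the unbound result: the negated predicate holds there because nothing contradicts it.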

TüNDRA is a free treebank research tool
supported by the CLARIN-D project and SfS (Seminar für Sprachwissenschaft)
at Eberhard Karls Universität Tübingen

v.beta-2
