Detailed Guide to TüNDRA Query Syntax
Nodes and Attributes
Nodes are specified between square brackets. Queries for nodes matching a particular attribute value always take the form:
[Attribute=Value]
String values must be in quotes.
[cat="NX"]
[word="aber"]
[pos="NN"]
[edge="obl"]
Some sentences in some treebanks have a 'multiword' column, which can be used to search for multi-token words that do not appear in their original form in TüNDRA. For example, UD does not consider contractions (e.g. English "can't", German "im") to be single tokens, and breaks these up into two tokens. To find such words in their original form, the following query can be used:
[multiword="im"] .2 "Rahmen"
Boolean operators
TüNDRA supports two boolean operations: & (AND) and | (OR). These can join relations and individual node specifications.
[word="Berlin"] | [word="Bonn"]
[word="Berlin"] & [cat="NX"]
[word="Berlin"] & [word="haben"]
These operations can also join attribute specifications inside of node specifications, as described below.
Order of operations
The AND operator (&) has priority over OR (|). Use parentheses to fix the order of operations. The following two queries are not equivalent:
[word="Berlin"] & [word="Bonn"] | [word="Kanada"]
[word="Berlin"] & ([word="Bonn"] | [word="Kanada"])
The same order of operations and parentheses apply to attributes inside of node specifications.
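Python's set operators happen to follow the same precedence (& binds more tightly than |), so the difference between the two queries above can be sketched with sets of sentence IDs. The IDs below are invented for illustration and do not come from any treebank; this is only an analogy, not how TüNDRA evaluates queries.

```python
# Hypothetical sentence IDs in which each word occurs (illustrative data only).
berlin = {1, 2, 3}
bonn = {2, 5}
kanada = {3, 7}

# Without parentheses, & is applied first: (berlin & bonn) | kanada
without_parens = berlin & bonn | kanada

# Parentheses force | to be applied first.
with_parens = berlin & (bonn | kanada)

print(sorted(without_parens))  # [2, 3, 7]
print(sorted(with_parens))     # [2, 3]
```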
Reserved and special characters in attributes
The quote character (") must be preceded by a backslash (\) in any attribute. To find the quote, use:
[word="\""]
To search for the backslash character, it must also be preceded by a backslash:
[word="\\"]
When doing simple searches with quotes, all other characters should match correctly without backslashes or other particular markers. Backslashes other than ones used before the quote or the backslash are automatically deleted from attribute queries. The following two queries are therefore identical to TüNDRA:
[word="US-\$"]
[word="US-$"]
The rules for escaping characters are different for regular expressions. See the section on regular expressions for details.
NB: The attributes word and token are coded to be equivalent. The queries [word="foo"] and [token="foo"] have identical results.
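The escaping rules for quoted string values described in this section can be sketched as a small function. The function name is hypothetical and the code is only a reading of the rules stated above (a backslash before a quote or a backslash yields that character, any other backslash is dropped); it is not TüNDRA's actual implementation.

```python
def unescape_string_value(raw: str) -> str:
    # Sketch of the string-value rules: a backslash is removed and the
    # character that follows it is kept literally.
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            out.append(raw[i + 1])  # keep the escaped character
            i += 2
        elif ch == "\\":
            i += 1  # trailing lone backslash: dropped (assumption of this sketch)
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# The two US-$ spellings come out identical.
print(unescape_string_value(r"US-\$") == unescape_string_value("US-$"))  # True
```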
Special simple queries
Entering a simple string between quotes matches any node with a token, lemma, category, or subcategory that matches that string.
So, the query
"Kanada"
is equivalent to:
[token="Kanada" | lemma="Kanada" | cat="Kanada" | subcat="Kanada"]
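The expansion can be written out mechanically. The helper below is a hypothetical illustration that builds the equivalent full query as text; it does not escape quotes or backslashes in the value.

```python
def expand_simple_query(value: str) -> str:
    # The four attributes a bare quoted string is compared against.
    attributes = ["token", "lemma", "cat", "subcat"]
    alternatives = " | ".join(f'{attr}="{value}"' for attr in attributes)
    return f"[{alternatives}]"

print(expand_simple_query("Kanada"))
# [token="Kanada" | lemma="Kanada" | cat="Kanada" | subcat="Kanada"]
```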
Unicode
TüNDRA uses Unicode at all levels, and Unicode characters can be used directly to match attributes in queries. Unicode characters can also be referenced numerically in both string values (between quotes - "") and in regular expressions. For example, to query for the en dash character (–), users can write &#8211; or &#x2013;.
Furthermore, the non-breaking space character can be queried using &nbsp;. Other HTML named entities are not currently supported.
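To look up the numeric code of a character, a short Python check can help. It assumes that numeric references use the HTML/XML form (&#8211; in decimal, &#x2013; in hexadecimal), and it uses Python's own entity decoder purely as a stand-in for whatever TüNDRA does internally.

```python
import html

en_dash = "\u2013"

# Decimal and hexadecimal references resolve to the same character.
assert html.unescape("&#8211;") == en_dash
assert html.unescape("&#x2013;") == en_dash

# The non-breaking space is U+00A0.
assert html.unescape("&nbsp;") == "\u00a0"

# Code point of the en dash in decimal and hexadecimal.
print(ord(en_dash), hex(ord(en_dash)))  # 8211 0x2013
```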
Regular Expressions
Attribute values can also match regular expressions. Regular expressions are always between forward slashes (/). TüNDRA follows XQuery pattern matching and regular expression syntax, with some changes to enhance TIGERSearch compatibility.
All regular expressions must match the entire string, as if preceded with ^ and followed by $.
[cat=/F.+N/]
[word=/ge.+ß/]
[lemma=/[dD]eut.*/]
[word=/.*[aöüAÖÜ].*/]
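The whole-string requirement is easy to overlook. The sketch below imitates it with Python's re.fullmatch; Python's regular expression dialect is not identical to XQuery's, so it only illustrates the anchoring behaviour, and the function name is hypothetical.

```python
import re

def matches_whole_string(pattern: str, value: str) -> bool:
    # fullmatch succeeds only if the pattern covers the entire string,
    # as if it were wrapped in ^ and $.
    return re.fullmatch(pattern, value) is not None

print(matches_whole_string(r"ge.+ß", "gemäß"))        # True
print(matches_whole_string(r"ge.+ß", "gemäßigt"))     # False: "igt" is left over
print(matches_whole_string(r".*ge.+ß.*", "gemäßigt")) # True: .* absorbs the rest
```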
In general, TüNDRA supports the most common regular expression operations. Below is a summary of the most common operations supported in pattern matching:
. | unspecified character | [word = /sag./] |
* | unrestricted repetition | [lemma = /spiel.*/] |
+ | repetition with minimum 1 | [word = /.+[0-9A-Z]+.*/] |
? | optionality | [word = /(Leben)s?/] |
[ ] | character set | [word = /.+[0-9A-Z]+.*/] |
^ | negated character sets | [word = /[^0-9A-Z].*/] |
( ) | grouping | [word = /([lmnp][aeiou])+/] |
| | disjunction | [word = /[dD](as|er)/] |
\ | escape for reserved characters | [word = /.*\-.*/] |
Reserved and special characters in regular expressions
The following characters are reserved in regular expression syntax and must be preceded by a backslash if they are to be matched:
( ) [ ] . + $ * ? | / \
For example, to search for any token containing the dollar sign ($), use:
[word=/.*\$.*/]
In addition, for compatibility with TIGERSearch, backslashes are removed from pattern matches if they precede the following characters:
! # , - : ; = > @ & " '
For example, to find all tokens with apostrophes, the following queries are equivalent:
[word=/.*\'.*/]
[word=/.*'.*/]
All other backslashes are preserved, even where they result in empty search results, because of the role they play in regular expression syntax. When a pattern match behaves unexpectedly, make sure any backslashes are placed according to XQuery regular expression syntax and this document.
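The compatibility rule can be sketched as a scan over the pattern that drops a backslash only when it precedes one of the listed characters. The function and constant names are hypothetical, and the code is one reading of the rule above rather than TüNDRA's implementation.

```python
# Characters before which a backslash is removed for TIGERSearch compatibility.
COMPAT_CHARS = set("!#,-:;=>@&\"'")

def strip_compat_backslashes(pattern: str) -> str:
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in COMPAT_CHARS:
                out.append(nxt)       # drop the backslash, keep the character
            else:
                out.append(ch + nxt)  # preserve every other escape pair
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(strip_compat_backslashes(r".*\'.*"))  # .*'.*
print(strip_compat_backslashes(r".*\$.*"))  # .*\$.*
```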
Multiple values
Attribute values can be specified by disjunctions of multiple values:
[cat=("VXFIN"|"VXINF")]
[lemma=("Hund"|"Katz"|"Vogel")]
[cat=("NX"|/V.*/)]
[edge=("csubj"|"nsubj")]
All disjunctions of attribute values must be between parentheses and separated by the bar character (|).
Negation
Attribute-value pairs can be negated by replacing the equals sign (=) with !=:
[cat!="NX"]
[lemma!=/.*[äöüßÄÖÜ].*/]
[rel!="nsubj"]
Reserved Characters
When a value contains a reserved character, it may have to be searched as a regular expression for certain queries in order to avoid syntax errors.
[pos = "NOUN"] > /acl:relcl/ [pos = "VERB"]
Multiple attributes
Nodes can be specified with more than one attribute:
[pos="NN" & lemma=/ge.*/]
[pos="NOUN" & lemma=/ge.*/]
[pos=/V.*/ | lemma!=/.*en/]
Both conjunctions (&) and disjunctions (|) can be combined. The default order of operations is conjunction before disjunction. The two node specifications below are equivalent:
[pos=/V.*/ | lemma=/.*en/ & word=/[A-Z].*/]
[pos=/V.*/ | (lemma=/.*en/ & word=/[A-Z].*/)]
But they are not equivalent to:
[(pos=/V.*/ | lemma=/.*en/) & word=/[A-Z].*/]
Attribute variables
Variables always start with # and can be followed by any sequence of letters and numbers; as the third example shows, underscores are accepted as well:
[lemma=#1:/a.*/]
[lemma=#a1:/a.*/]
[lemma=#starts_with_a:/a.*/]
Variables make it possible to query for nodes with the same values for multiple attributes:
[lemma=#1:/a.*/ & word=#1]
Variables also make it possible to query for nodes with some matching attributes. To search for a sentence with two identical words starting with the letter "B", either of the following will work:
[word=#1:/B.*/] .* [word=#1]
[#1:word=/B.*/] .* [#1]
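What the shared variable expresses can be sketched over a plain list of tokens: two positions, the first before the second, whose words are equal and match the pattern. The sentence and the function name are invented for illustration.

```python
import re
from itertools import combinations

def identical_words(tokens, pattern):
    # Index pairs (i, j) with i < j whose tokens are equal and fully match pattern.
    return [
        (i, j)
        for i, j in combinations(range(len(tokens)), 2)
        if tokens[i] == tokens[j] and re.fullmatch(pattern, tokens[i])
    ]

sentence = ["Bonn", "und", "Berlin", "oder", "Bonn"]
print(identical_words(sentence, r"B.*"))  # [(0, 4)]
```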
Node Variables
Matched nodes may also be assigned to variables:
#city:[lemma=("Berlin"|"Hamburg"|"Stuttgart")]
#np1:[cat="NX"]
Node variables can be used in relations, so that whole tree structures are specified:
An example from a constituency treebank:
#1:[cat="NX"] >* [word="Berlin"] & #1 >* [lemma="Flüchtling"]
This query matches the structure below:
An example from a dependency treebank:
#1:[pos="VERB"] >* [word="dem"] & #1 >* [lemma="Land"]
This query matches the structure below:
Node Classes
TüNDRA supports queries on two general node classes in constituent trees: terminal and non-terminal.
- Terminal nodes are nodes without descendants. They are queried with [T].
- Non-terminal nodes are nodes with descendants. They are queried with [NT].
Node classes can be combined with attribute value queries:
[T & pos="NN"]
[NT & cat=/V.*/]
More commonly, these classes are used to query for arbitrary nodes in some relation to another:
Examples from a constituency treebank:
[NT] > [pos="NN"]
[cat="PX"] >* [T & pos!=/N.*/]
An example from a dependency treebank:
[NT & pos="NOUN"] > [pos!="NOUN"]
Relations
All relations supported by TüNDRA are binary: each one holds between exactly two nodes. Querying structures of three or more nodes requires node variables.
Ancestry and descent
Immediate parent-child relations are indicated with the ">" operator. (right)
Constituency Treebank Examples:
[cat="NX"] > [lemma="Berlin"] (See figure above)
[cat="MF"] > [cat="NX"]
Dependency Treebank Examples:
[word="haben"] > [word="das"]
The ">*" operator is used to indicate ancestor-descendant relations of any distance. (below)
Constituency Treebank Examples:
[cat="MF"] >* [cat="NX"]
[cat="NX"] >* [lemma="Berlin"]
Dependency Treebank Examples:
[word="haben"] >* [word="das"]
To indicate an ancestor-descendant relation at a fixed distance, add a number after the > operator:
Constituency Treebank Examples:
[cat="MF"] >2 [cat="NX"]
Dependency Treebank Examples:
[word="haben"] >2 [word="das"]
To indicate an ancestor-descendant relation at a bounded distance, add the bounding values separated by a comma after the ">" operator:
Constituency Treebank Examples:
[cat="MF"] >2,4 [cat="NX"]
Dependency Treebank Examples:
[word="haben"] >2,4 [word="das"]
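The distance variants of ">" can be pictured as counting edges between an ancestor and a descendant. The sketch below uses a toy tree with hypothetical node names and assumes that the bounds are inclusive and that ">*" means a distance of at least one edge.

```python
# Toy tree as a child -> parent map (hypothetical node names).
PARENT = {"MF": "SIMPX", "NX1": "MF", "NX2": "NX1", "Berlin": "NX2"}

def depth_below(ancestor, node):
    # Number of edges from ancestor down to node, or None if there is no such path.
    distance = 0
    while node in PARENT:
        node = PARENT[node]
        distance += 1
        if node == ancestor:
            return distance
    return None

def dominates(ancestor, node, low=1, high=None):
    # high=None means "no upper bound", i.e. the >* case.
    d = depth_below(ancestor, node)
    if d is None:
        return False
    return d >= low and (high is None or d <= high)

print(dominates("MF", "NX1", 1, 1))     # True, like  MF >    NX1
print(dominates("MF", "NX2", 2, 2))     # True, like  MF >2   NX2
print(dominates("MF", "Berlin", 2, 4))  # True, like  MF >2,4 Berlin
print(dominates("MF", "NX1", 2, 4))     # False, NX1 is only one edge below MF
```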
Edge labels
To query on edges between nodes, append the edge label to the ">" operator. (right)
Constituency Treebank Examples:
[cat="NX"] >HD [pos="NN"] (See figure above)
[cat="MF"] >OA [cat="NX"]
Dependency Treebank Examples:
[word="Tat"] >case [word="in"]
To query for lists of edges, put them in parentheses separated by the bar (|):
Constituency Treebank Examples:
[cat="MF"] >(OA|ON) [cat="NX"]
Edge labels can also be negated:
Constituency Treebank Examples:
[cat="MF"] >!OA [cat="NX"]
[cat="MF"] >!(OA|ON) [cat="NX"]
Labeled edges
In treebanks with a mixture of labeled and unlabeled edges, the relation ">%" finds nodes connected with labeled edges:
Constituency Treebank Examples:
[cat="NX"] >% [cat="NX"]
And to query for unlabeled edges, use ">!%":
Constituency Treebank Examples:
[cat="NX"] >!% [cat="NX"]
Positional parentage
To query on the leftmost and rightmost descendant of a non-terminal node, use ">@l" and ">@r":
Constituency Treebank Examples:
[cat="NX"] >@l [lemma="der"]
[cat="NX"] >@r [pos="NE"]
Dependency Treebank Examples:
[word="die"] >@l [word="Länder"]
[word="Tausende"] >@r [word="Vertriebenen"]
Positional parentage can be negated in two equivalent ways: "!>@l" is equivalent to ">!@l", and "!>@r" is equivalent to ">!@r". Double negation cancels out: "!>!@l" is equivalent to ">@l", and "!>!@r" is equivalent to ">@r".
Secondary edges
Secondary edges can be queried with the ">~" operator:
Constituency Treebank Examples:
[cat="PX"] >~ [cat="NX"]
Dependency Treebank Examples:
[word="haben"] >~ [word="das"]
Secondary edge labels can be queried by appending them to the ">~" operator:
Constituency Treebank Examples:
[cat="PX"] >~refmod [cat="NX"]
Dependency Treebank Examples:
[word="haben"] >~obj [word="das"]
Sequence
Adjacent nodes can be queried using the "." operator. (below)
Constituency Treebank Examples:
[cat="MF"] . [cat="VC"]
[word="der"] . [cat="NX"]
[word="der"] . [word="Mann"]
Dependency Treebank Examples:
[word="der"] . [word="Mann"]
The ".*" operator queries nodes that follow each other at any distance in the same sentence:
Constituency Treebank Examples:
[word="der"] .* [word="Mann"]
Dependency Treebank Examples:
[word="der"] .* [word="Mann"]
To indicate a sequence at a fixed distance, add a number after the "." operator:
Constituency Treebank Examples:
[cat="MF"] .2 [cat="NX"]
Dependency Treebank Examples:
[word="der"] .2 [word="Mann"]
To indicate a sequence at a bounded distance, add the bounding values separated by a comma after the "." operator:
Constituency Treebank Examples:
[cat="MF"] .2,4 [cat="NX"]
Dependency Treebank Examples:
[word="der"] .2,4 [word="Mann"]
Adjacent categories
Adjacency is defined for tokens with respect to their surface order. If two words appear next to each other in text, then they will match a query using ".", no matter what the tree structure is.
For non-token nodes, adjacency is determined with respect to the token nodes underneath them. In the example below, the word "Vorsitzende" is adjacent to the "NX" node, highlighted below, because the word next to "Vorsitzende" is the first word of the "NX" phrase.
[word="Vorsitzende"] . [cat="NX"]
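The case described above, a token followed by a non-token node, can be sketched with token positions. The positions and the function name are hypothetical; the sketch covers only this one case and makes no claim about how adjacency is computed when the left-hand node is itself a phrase.

```python
def token_precedes_node(token_index, node_span):
    # node_span is (index of first token, index of last token) under the node.
    first_token, _last_token = node_span
    # Adjacent if the node's first token directly follows the given token.
    return token_index + 1 == first_token

vorsitzende = 1   # position of the word "Vorsitzende" (made-up index)
nx_span = (2, 4)  # the NX phrase covers tokens 2 to 4 (made-up span)

print(token_precedes_node(vorsitzende, nx_span))  # True
print(token_precedes_node(0, nx_span))            # False
```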
Siblings
The "$" operator queries nodes that are siblings, regardless of order or surface distance:
Constituency Treebank Examples:
[word="Mann"] $ [word="der"]
Dependency Treebank Examples:
[word="dem"] $ [word="in"]
Sequential siblings
The "$.*" operator matches siblings in order from left to right, depending on their order in the tree, regardless of the surface distance between them:
Constituency Treebank Examples:
[word="der"] $.* [word="Mann"]
Dependency Treebank Examples:
[word="der"] $.* [word="große"]
Equality
To test if nodes are the same or different, use "=" and "!=". This can be used with queries that require matching more than one node with the same specification.
Constituency Treebank Examples:
#1:[cat="NX=PER"] > #2:[pos="NE"] & #1 >* #3:[pos="NE"] & #2 != #3
Dependency Treebank Examples:
#1:[pos="VERB"] > #2:[pos="VERB"] & #1 > #3:[pos="VERB"] & #2 != #3
Negated relations
All relations are negated by inserting "!" before the relation:
[cat="MF"] !> [cat="NX"]
[cat="MF"] !>* [cat="NX"]
[cat="SIMPX"] !>~ [cat="NX"]
[word="der"] !$.* [word="Mann"]
A negated relation matches the first node only when no node matching the second specification satisfies the relation with it. For example,
[word="der"] !.* [word="Mann"]
matches:
Predicates
TüNDRA supports five predicates on tree nodes:
Root
Tests for nodes that are the root of a tree in the treebank:
Constituency Treebank Examples:
root(#r)
root([cat="VROOT"])
Dependency Treebank Examples:
root(#r)
root([cat="ROOT"])
Arity
Tests the number of immediate children a node has.
With just one numerical argument, it matches nodes that have exactly that number of children. With two numerical arguments, it matches nodes that have any number of children between the first and the second arguments:
Constituency Treebank Examples:
"Berlin" & arity([cat="MF"],2)
"Berlin" & ([cat="MF"],2,5)
Dependency Treebank Examples:
"haben" & arity([word="haben"],2)
"haben" & arity([word="haben"],2,5)
Tokenarity
Tests the number of tokens underneath a node.
With just one numerical argument, it matches nodes that have exactly that number of token descendants. With two numerical arguments, it matches nodes that have any number of token descendants between the first and the second arguments.
Constituency Treebank Examples:
"gehen" & tokenarity([cat="MF"],2)
"gehen" & tokenarity([cat="MF"],2,5)
Dependency Treebank Examples:
"haben" && tokenarity([word="haben"],2)
"haben" && tokenarity([word="haben"],2,5)
Continuous
Tests if all the tokens beneath a node form a continuous sequence:
Constituency Treebank Examples:
"müssen" & continuous([cat="MF"])
Dependency Treebank Examples:
"sehen" & continuous([pos="NOUN"])
Discontinuous
Tests if the tokens beneath a node do not form a continuous sequence:
Constituency Treebank Examples:
"müssen" & discontinuous([cat="NX"])
Dependency Treebank Examples:
"sehen" & discontinuous([pos="VERB"])
Negation
Predicates can be negated with !:
!root(#1)
!arity([cat="NX"],2)
NB: By itself, a negated predicate does not select nodes. It matches any sentence in which no node satisfies the predicate. For example,
#1:[cat="NX"] & !arity(#1,2)
finds all nodes matching [cat="NX"] which do not have exactly 2 children. In contrast, !arity([cat="NX"],2) does not match any node, but matches all sentences that do not contain any match for [cat="NX"] with exactly 2 children.