TüNDRA

Tübingen aNnotated Data Retrieval Application

Web tool for treebank research

Welcome to the TüNDRA Tutorial

1. Introduction

TüNDRA - the Tübingen aNnotated Data Retrieval Application - is a treebank search application based in part on the popular but no longer supported TIGERsearch program. Treebanks are corpora (collections of texts used for research) with annotations that indicate the relations between parts of the text. They are constructed either manually (by human experts) or automatically (by using special software). But where and why are these treebanks needed?

In linguistics, treebanks are a source of empirical evidence and are especially useful in syntax (grammar), lexicography, sociolinguistics, and historical linguistics.
Treebanks are important for constructing and training programs for natural language processing, like search engines (e.g. Google, Bing, DuckDuckGo), speech recognition programs (e.g. Google Cloud Speech-to-Text, Nuance speech recognition solutions, Microsoft Azure Speech to Text), OCR (e.g. ABBYY FineReader), spam filters, and question answering systems (e.g. Apple Siri, Microsoft Cortana, Google Assistant, Amazon Alexa).
For digital humanities, treebanks help to extract as much meaning as possible from as many texts as possible by using text where the words are already disambiguated and their relations are laid out in a computer-readable form.

Tree drawing in linguistics was invented in the 1950s by two different researchers working independently:

Noam Chomsky drew trees as hierarchies of phrases. In present-day linguistics, these are usually called constituency trees.
At roughly the same time, Lucien Tesnière was representing relations by drawing lines between words. This type of representation is usually called a dependency tree.

Constituency tree example

Dependency tree example

Both kinds of trees are used in treebanks, depending on who made them and why. Treebanks contain many sentences, with their annotations and relations encoded in a form that computers can read. You can use TüNDRA to search for and look at many different sentences in different ways, in order to study language in general, to study particular words and structures, or to analyse how specific authors or genres use language.

2. Getting Started with the Query Panel

Important! The German query examples will work with the TüBa-D/Z treebank starting from version 10; for the English ones, please use the UD English Web Treebank version 1.3. Some of the queries may not work for other treebanks, since each treebank is different and may not have the attributes (annotations) included in the queries. Knowledge of a treebank's tag set is required to construct a meaningful and syntactically correct query. Although in most cases the attribute names are identical, it is worth checking the tag set of a particular treebank. Information about the tag set can be found on a separate page or by studying the table in the Table view, displayed below the visualisation area in TüNDRA.

First you need to select and open the TüBa-D/Z (version 10 or higher) constituency treebank. You can find it in the table on the main page. The table is searchable and filterable. The TüBa-D/Z treebank is based on a manually annotated collection of German newspapers. More information is available here

To run a query you need to follow these steps:

Enter a query into the search form at the top of the page. For the sake of simplicity, we are going to look for sentences which contain the word Geld. If the query syntax is not correct, the system will report it immediately: an error message will be displayed below the form and the Run button will not be available. If the query is correct, the button will turn green, which means that the query can be run.
Syntax error in a query
Query panel with a correct query
Table view area
Run the query by clicking the Run button
The query might take a while (depending on its complexity and the size of the treebank). The query statistics can be viewed while the query is running. They are displayed in a table below the Table view.
In order to modify the statistics that are shown in the table, click Add/Remove Columns.
In general, selecting more attributes yields a larger number of more specific rows.
For more complex real-world examples, as well as more explanation about columns and node variables, see here.
Default statistics table
Adding and removing columns in the statistics table
Statistics table with more columns
Query matches will be displayed immediately once they are found
Running a query
Use the query navigation buttons (above the sentence text) to click through the sentence matches. To jump back and forth in the query results, click the current query match number: it opens a dialog box that lets you jump to a query match by number
Goto sentence/query match modal window
The query matches are marked in red. The path in a tree, as well as variable names (if they have been set previously) are highlighted.
By clicking the buttons in the top right corner you can zoom in (+) and out (-), pan the image, and save it in different formats
Visualisation controls
Move the mouse cursor over words or nodes in a tree in the visualisation area (the subtree color changes to blue) to see how these words are related, or to see the path between nodes (e.g. the children of a selected node, or the path from a terminal node to the root node)
Interactive visualisation
To stop a query while it is running, click the red Stop button below the search form

3. Constructing a Query

3.1. Simple Query

To find the sentences that include the word Deutschland, type "Deutschland", including the quotation marks, in the search box, and click on the Run button. TüNDRA will find and display all sentences including the word. This is a very basic construction that allows you to look for a surface form.

To find the sentences that include both of the words Deutschland and Land, type "Deutschland" & "Land", enclosing each word in quotation marks and placing the "&" operator in between. TüNDRA will find and display all sentences including both words. More words can be added in the same way.

3.2. Basic Word Form Search (Word - Lemma)

A simple query like "me" gives you the words that match exactly the given string. Often, we want to find all forms of a word, for example the forms "I" and "me" of the first person singular pronoun. If the treebank you are working on includes lemma annotations, one way to achieve this is to search for the lemma with a search expression:

[lemma = "I"]

Here, as well as the quotation marks, the square brackets and equals sign are treated as meta-characters by the query language. If you want to use any of these literally in your queries, you need to prefix them with a backslash, "\".

There are more of these meta-characters. You will see more in the rest of the tutorial. For a complete list, see: Complete Query Language Explanation

Another way to search for word forms is to use a full stop as a placeholder for an unspecified letter. To find forms of the verb "to sing" such as "sing" - "sang" - "sung", write the word between forward slashes and type a full stop in place of the unspecified letter (the same pattern also matches the noun "song"):

[word = /s.ng/]

The full stop is treated as a wildcard for unspecific letters by the query language.
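The wildcard behaviour can be tried outside TüNDRA. The following is a minimal sketch using Python's `re` module, assuming that a pattern must match the whole token (as `re.fullmatch` does) and that the full stop behaves as it does in Python. The word list is invented for illustration.

```python
import re

# Hypothetical word list; in TüNDRA the tokens come from the treebank.
words = ["sing", "sang", "sung", "song", "sting", "sings"]

# "." stands for exactly one arbitrary character.
pattern = re.compile(r"s.ng")

# fullmatch requires the whole word to match; this is an assumption
# about how a pattern is applied to a token.
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)
```

Words such as "sting" and "sings" are rejected because they contain more characters than the pattern allows.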

To search for a word with a certain prefix and/or suffix, type in the following search expression using a full stop followed by an asterisk (".*") before the suffix or after the prefix:

[word = /.*able/]

This will find the sentences including all words that end in "able", like "able" - "table" - "capable" - "suitable" - "available".

[word = /un.*/]

This will find the sentences including all words with the prefix "un-", like "unhappiness" - "undoable" - "unsolicited" - "unresolved".

The full stop is treated by the query language as a wildcard for a single unspecified character, and the asterisk as unrestricted repetition. A run of unspecified characters can be matched by following the full stop with either the "+" operator (one or more repetitions) or the "*" operator (zero or more repetitions). To find word forms matching "recognize" - "realize" - etc., type in the following search expression:

[word = /re.+ize/]

To find word forms matching "English" - "England" - etc., type in the following search expression (in some cases, capitalization needs to be considered):

[word = /Engl.*/]
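The difference between ".+" and ".*" can be illustrated with a short Python sketch. It assumes that these operators behave as in Python's `re` module and that a pattern must cover the whole token; the strings in the list are invented test cases, not treebank data.

```python
import re

# Invented test strings, including edge cases.
words = ["realize", "recognize", "reize", "size",
         "England", "English", "Engl", "england"]

# ".+" needs at least one character between "re" and "ize".
plus = [w for w in words if re.fullmatch(r"re.+ize", w)]

# ".*" also accepts zero characters after "Engl"; matching is case sensitive.
star = [w for w in words if re.fullmatch(r"Engl.*", w)]

print(plus)
print(star)
```

The string "reize" is rejected by ".+" because nothing stands between "re" and "ize", whereas "Engl" is accepted by ".*". The lower-case "england" is not matched, which is why capitalization may need to be considered.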

To search for derivational or compositional variants of a word form, put the fixed component of the word in parentheses and append the optional part followed by a question mark:

[word = /(newspaper)s?/]

TüNDRA will find the sentences including either "newspaper" or "newspapers" or both. The question mark is treated as optionality by the query language. Alternatives can also be given using parentheses around the alternative parts and a vertical bar for disjunction (square brackets would instead denote a set of single characters):

[word = /neighb(ou|o)r/]

TüNDRA will find the sentences including the alternative orthography "neighbour" and/or "neighbor".
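Optionality and alternation can be checked in the same way. In Python's `re` module, alternatives are grouped with parentheses, whereas square brackets denote a character class that matches exactly one character. Whether TüNDRA's pattern dialect behaves identically is an assumption of this sketch, and the word list is invented.

```python
import re

# Invented word list; "neighbur" is a deliberate misspelling.
words = ["newspaper", "newspapers", "neighbor", "neighbour", "neighbur"]

# "?" makes the preceding "s" optional.
optional_s = [w for w in words if re.fullmatch(r"(newspaper)s?", w)]

# Parentheses group the alternatives "ou" and "o".
group = [w for w in words if re.fullmatch(r"neighb(ou|o)r", w)]

# Square brackets form a character class: one character out of "o", "u", "|".
char_class = [w for w in words if re.fullmatch(r"neighb[ou|o]r", w)]

print(optional_s, group, char_class)
```

With the character class, "neighbour" is not matched because "ou" is two characters long, while the misspelling "neighbur" is.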

Often, we want to find multiple values or a word group. Multiple values can be added using a disjunction character between the values:

[lemma = ("cat" | "dog" | "bird")]

Additionally, single units of a search expression can be specified with more than one attribute. To find all nouns with the prefix "un-", type in the following search expression (case sensitive):

[pos = "NOUN" & lemma = /un.*/]

Both conjunction (&) and disjunction (|) can be combined. The default order of operations is conjunction before disjunction.
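The order of operations has a familiar analogue in Python, where `and` binds more tightly than `or`. The sketch below is only an analogy, not TüNDRA code: the tokens, their attribute values and the function names are invented for illustration.

```python
# Hypothetical tokens with "word", "pos" and "lemma" attributes.
tokens = [
    {"word": "unrest", "pos": "NOUN", "lemma": "unrest"},
    {"word": "undo",   "pos": "VERB", "lemma": "undo"},
    {"word": "fish",   "pos": "NOUN", "lemma": "fish"},
    {"word": "fished", "pos": "VERB", "lemma": "fish"},
]

def conjunction_first(t):
    # Read as: (pos is NOUN and lemma starts with "un") or lemma is "fish"
    return t["pos"] == "NOUN" and t["lemma"].startswith("un") or t["lemma"] == "fish"

def disjunction_first(t):
    # Parentheses force the disjunction to be evaluated first.
    return t["pos"] == "NOUN" and (t["lemma"].startswith("un") or t["lemma"] == "fish")

default_order = [t["word"] for t in tokens if conjunction_first(t)]
grouped = [t["word"] for t in tokens if disjunction_first(t)]
print(default_order, grouped)
```

Under the default order the verb "fished" is accepted, because the disjunct on the lemma applies regardless of the part of speech.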

Variables make it possible to refer to the same node, or to the same attribute value, in more than one place in a query. TüNDRA supports terminal [T] and non-terminal [NT] nodes. A node variable is written as a hash mark followed by a name and a colon, placed in front of the node description. To find the sentences including at least one of the three cities "London" - "Washington" - "Paris", and to label the matching node "city", type in the following search expression:

#city: [lemma = ("London" | "Washington" | "Paris")]

3.3. Matching Part-of-Speech (POS)

If the treebank you are working on includes POS tagging, you can search for POS tags as well. To find all sentences including at least one past participle, type in the following search expression:

[pos = "VVPP"]

The above query will work only with the TüBa-D/Z treebank, since most of the other treebanks do not have such an annotation. They may have coarse-grained POS tags instead:

[pos = "VERB"]

To find all sentences including at least one conjunction, type in the following search expression for TüBa-D/Z:

[pos = "KON"]

This query will most likely work for the majority of treebanks:

[pos = "CONJ"]

If you want to find a plural noun, e.g. "lights", but exclude the verbal form, type in the following search expression:

[word = "lights"] = [pos = "NOUN"]

The equals operator tells the query language to treat the two query parts as equivalent.

3.4. Lexico-grammatical patterns (Morph)

Often, we also want to further restrict our searches in terms of morphological information. If the treebank you are working on provides morphological information, you can query on morphological properties. To find all nouns which match the morphological properties "nominative-singular-feminine", type in the following search expression:

[morph = "nsf"] = [pos = "NN"]

When the equals operator is placed between two units of the search expression, the query language treats those units as equivalent.

3.5. Word Sequences and Proximity Queries

Word searches can be extended via the & operator. If you type in the following search expression, it will find all sentences including "the", "whole" and "thing", with the words in any order:

[word ="the"] & [word="whole"] & [word="thing"]

To restrict the nodes of a word sequence, special characters can be added. To find adjacent nodes, type in a full stop between the words you want to connect:

[word ="the"] . [word="whole"] . [word="thing"]

The "." operator queries nodes that directly follow each other in the same sentence.

To restrict the nodes of a word sequence to a fixed distance, make use of numbers after the full stop. For a fixed distance of two, type in the following search expression:

[word = "the"] .2 [word = "man"]

TüNDRA will find results with the article "the", one word in between, and the noun "man" in the same sentence.

To restrict the word sequence to a bounded distance, add the bounding values separated by a comma after the "." operator:

[word = "the"] .2,3 [word = "man"]

TüNDRA will find results with the article "the", one or two words in between, and the noun "man" in the same sentence.

To allow the nodes of a word sequence to be at any distance, add an asterisk after the full stop. The ".*" operator queries nodes that follow each other at any distance in the same sentence.

[word = "the"] .* [word = "man"]
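The distance operators can be pictured as constraints on token positions. The following Python sketch is not how TüNDRA is implemented. It assumes that a distance of n means the second token stands n positions after the first, so that "." corresponds to a distance of 1. The function name `precedes` and the example sentence are invented.

```python
def precedes(tokens, first, second, min_dist=1, max_dist=None):
    """Return (i, j) index pairs where tokens[i] == first, tokens[j] == second
    and min_dist <= j - i <= max_dist (no upper bound if max_dist is None)."""
    pairs = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for j in range(i + min_dist, len(tokens)):
            if max_dist is not None and j - i > max_dist:
                break
            if tokens[j] == second:
                pairs.append((i, j))
    return pairs

# Positions: the=0 old=1 man=2 saw=3 the=4 very=5 tall=6 man=7
sentence = "the old man saw the very tall man".split()

adjacent = precedes(sentence, "the", "man", 1, 1)      # like "."
exactly_two = precedes(sentence, "the", "man", 2, 2)   # like ".2"
two_to_three = precedes(sentence, "the", "man", 2, 3)  # like ".2,3"
any_distance = precedes(sentence, "the", "man")        # like ".*"
print(adjacent, exactly_two, two_to_three, any_distance)
```

Under these assumptions, "the old man" has a distance of 2 (one word in between) and "the very tall man" has a distance of 3 (two words in between).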

Word sequence restriction also works with lemmas, POS tags and morphological patterns:

To search for a word sequence including any article followed by an adjective followed by a noun, type in the following search expression:

[pos = "DET"] . [pos = "ADJ"] . [pos="NOUN"]

Some sentences in some treebanks have a 'multiword' column, which can be used to search for multi-token words which do not appear in their original form in TüNDRA. For example, UD does not consider contractions (e.g. English "can't", German "im") to be single tokens, and breaks these up into two tokens. To find such words in their original form, the following query can be used:

[multiword="im"].2"Rahmen"

This will search for instances of sentences which contained the string ("im Rahmen") in their original form. The '2' is necessary here to skip over the second token since the 'multiword' column only has a value for the first token of a multi-token word.

3.6. Siblings and Relations / Node and Edge Labels

If the treebank you are working on supports binary-branched constituency trees, you can use the following ancestry and descendant relations. To search for an immediate parent-child relation, type in the ">" operator, which indicates that the second part of the relation is the immediate child of the first part. If you want the lemma "Geld" to be a direct child of an "NX", type in the following search expression:

[cat = "NX"] > [lemma="Geld"]

If you want the lemma "Geld" to be a descendant of an "NX" at a greater distance, add a fixed number (for a fixed distance) or an asterisk (for an unrestricted distance) to the ">" operator:

[cat = "NX"] >2 [lemma = "Geld"]
[cat = "NX"] >* [lemma = "Geld"]
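The three dominance variants can be sketched on a toy tree in Python. This is an illustration of the idea only, under the assumption that ">n" means "exactly n edges below" and ">*" means "any number of edges below". The `Node` class, the helper functions and the tree itself are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                      # category for non-terminals, lemma for terminals
    children: list = field(default_factory=list)

def descendants_at(node, depth):
    """Nodes exactly `depth` edges below `node`."""
    if depth == 0:
        return [node]
    found = []
    for child in node.children:
        found.extend(descendants_at(child, depth - 1))
    return found

def all_descendants(node):
    """All nodes below `node`, at any depth."""
    found = []
    for child in node.children:
        found.append(child)
        found.extend(all_descendants(child))
    return found

# Hypothetical toy tree: an NX dominating "viel" and an inner NX with "Geld".
tree = Node("NX", [Node("viel"), Node("NX", [Node("Geld")])])

direct = [n.label for n in descendants_at(tree, 1)]   # like ">"
at_two = [n.label for n in descendants_at(tree, 2)]   # like ">2"
anywhere = [n.label for n in all_descendants(tree)]   # like ">*"
print(direct, at_two, anywhere)
```

In this toy tree the outer NX dominates "Geld" at a distance of two, so only the ">2" and ">*" style checks find it from the root.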

To query on edge labels (e.g. certain cases like ON, OA, OG, OD), append the edge label to the ">" operator. To find a category NX with the edge label "OA" (accusative object) in the middle field, type in the following search expression:

[cat = "MF"] >OA [cat ="NX"]

It is also possible to query for lists of edges with the disjunction operator "|". For example, to query for two or more cases, put them in parentheses separated by the bar:

[cat = "MF"] >(OA|ON) [cat = "NX"]

To query on the leftmost or rightmost descendant of a non-terminal node, use the ">@l" and ">@r" operators. To query on the leftmost article of an NX, type in the following search expression:

[cat = "NX"] >@l [lemma = "the"]

To query on the rightmost noun of an NX, type in the search expression:

[cat = "NX"] >@r [pos = "NE"]
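One way to picture the leftmost and rightmost descendants is to follow the first (or last) child of each node until a terminal is reached. The Python sketch below assumes that this is what ">@l" and ">@r" refer to. The tuple-based tree format, the function names and the example phrase are invented.

```python
def leftmost_terminal(node):
    # Follow the first child until a terminal (a plain string) is reached.
    while isinstance(node, tuple):
        node = node[1][0]
    return node

def rightmost_terminal(node):
    # Follow the last child until a terminal (a plain string) is reached.
    while isinstance(node, tuple):
        node = node[1][-1]
    return node

# Toy tree as (category, [children]); terminals are plain strings.
nx = ("NX", [("NX", ["the", "mayor"]),
             ("PX", ["of", ("NX", ["Berlin"])])])

left = leftmost_terminal(nx)
right = rightmost_terminal(nx)
print(left, right)
```

Here the outer NX has "the" as its leftmost and "Berlin" as its rightmost terminal, even though neither is a direct child of it.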

To query on secondary edges, use the ">~" operator. To find an "NX" as a phrase internal part modified by a "PX", type in the following search expression:

[cat = "PX"] >~ [cat = "NX"]

To query for secondary edge labels (refvc, refmod, refint, refcontr), append them to the ">~" operator:

[cat = "PX"] >~refmod [cat = "NX"]

The "$" operator queries nodes that are siblings, regardless of order or surface distance, but can also be restricted from left to right:

[word = "Geld"] $ [word = "das"]
[word = "das"] $.* [word = "Geld"]
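The sibling relation can be sketched as "sharing the same parent". In the Python example below, the tree format, the helper names and the sentence are invented, and the reading of "$.*" as "siblings in left-to-right order" is an assumption based on the description above.

```python
def label(node):
    # Non-terminals are (category, [children]) tuples; terminals are strings.
    return node[0] if isinstance(node, tuple) else node

def sibling_pairs(node):
    """Return (a, b) label pairs where a and b share a parent and a is left of b."""
    if not isinstance(node, tuple):
        return []
    _, children = node
    pairs = [(label(a), label(b))
             for i, a in enumerate(children)
             for b in children[i + 1:]]
    for child in children:
        pairs.extend(sibling_pairs(child))
    return pairs

tree = ("SIMPX", [("NX", ["das", "Geld"]), ("VXFIN", ["fehlt"])])
ordered = sibling_pairs(tree)

def are_siblings(a, b):
    # Like "$": the order of the two nodes does not matter.
    return (a, b) in ordered or (b, a) in ordered

print(ordered)
print(are_siblings("Geld", "das"), ("Geld", "das") in ordered)
```

The pair "Geld" and "das" counts as siblings in either order, but only ("das", "Geld") satisfies the left-to-right restriction. "das" and "fehlt" are not siblings because they have different parents.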

3.7. Negation

Negation needs particular care in the query language, because a negation that seems intuitive does not always do what is expected. In general, all relations are negated by inserting "!" before the relation.

To query on a category or an edge label in the middle field which is not "NX", type in the following search expression:

[cat = "MF"] !> [cat = "NX"]

Edge labels can also be negated. To query on an "NX" in the middle field, including the restriction that its edge label should be neither "OA" (accusative) nor "ON" (nominative), type in the following search expression:

[cat = "MF"] >!(OA|ON) [cat = "NX"]

In some cases the introduction of intuitive negation causes problems. The intuitive negation to search for the category "NX" with the restriction that it should not match the word "money" would be:

[cat = "NX"] !> [word = "money"]

However, this will only match a sentence that contains both a constituent with the category "NX" and a token with the value "money". Sentences that contain an "NX" but no token "money" at all are not found.

Similarly, to restrict the category "NX" to constituents that do not refer to persons ("PER"), the intuitive search expression below will only match a sentence that contains both a constituent with the category "NX" and a constituent with the category "PER":

[cat = "NX"] !> [cat = "PER"]
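The reason why the intuitive negation behaves this way can be sketched in Python. The sketch assumes, in line with the description above, that a negated relation only holds between two nodes that both exist in the sentence. The data format (each NX given as the set of token positions it directly dominates) and the function name are invented.

```python
def negated_dominance(sentence, word):
    """True if the sentence has an NX and a token `word` such that
    the NX does not directly dominate that token (both must exist)."""
    positions = [i for i, w in enumerate(sentence["tokens"]) if w == word]
    return any(pos not in nx
               for nx in sentence["nx"]
               for pos in positions)

# Toy sentences; "nx" lists the token positions directly under each NX.
money_inside = {"tokens": ["the", "money"], "nx": [{0, 1}]}
money_outside = {"tokens": ["the", "man", "wants", "money"], "nx": [{0, 1}]}
no_money = {"tokens": ["the", "man", "sleeps"], "nx": [{0, 1}]}

results = [negated_dominance(s, "money")
           for s in (money_inside, money_outside, no_money)]
print(results)
```

The third sentence contains an NX without "money", which is what the intuitive reading asks for, yet it is not matched because there is no "money" token for the relation to be checked against.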

4. Gathering Statistics: Further Explanations and Examples

Query variables not only provide a means of specifying complex structures and relations, but they also play a role in gathering statistics. Statistics are collected on all nodes that are identified by a node variable (e.g. #np:[cat="NX"]).

If no variables are present in a query, or if a variable name is applied to an attribute (e.g. [pos = "NN" & #1:morph=/[g|d].*/]), then node identifiers will be assigned by the system. However, relying on system names may make it more difficult to identify nodes when adding or removing columns in the statistics table. Therefore, it is recommended to assign variable names to nodes that should be viewed in the statistics table, even if the variable is not strictly needed for the query. For example:

[#1:word=/B.*/] .* [#1]

Adding variable names to the nodes will make it easier to identify them when adding columns:

#n1:[#1:word=/B.*/] .* #n2:[#1]

Example 1

In this example, a search is performed for direct questions that start with a question word, such as "wo", "wann", or "wer". The query specifies a wh-word (with part-of-speech label "PWAV") followed by a left bracket ("LK") that dominates a finite verb phrase ("VXFIN"):

#n1:[pos = "PWAV"]  .  #n2:[cat = "LK"]  >  [cat = "VXFIN"]

Initially, the statistics table just shows the number of matches:

Click 'Add/Remove Columns'. Variable #n1 is chosen to be displayed to obtain the relative frequencies for the fronted question words. Select only 'lemma' so that capitalized tokens are not treated separately.

Sorting by occurrences (click on the 'Occurrences' table header):

This query can be extended to also find the verb at the terminal node:

#n1:[pos = "PWAV"]  .  #n2:[cat = "LK"]  >  #n3:[cat = "VXFIN"] & #n3 >* #n4:[T]

In statistics, choose to display the lemmas of nodes #n1 and #n4 to obtain the relative frequencies for co-occurrences of fronted question words and verbs.

Sorting by occurrences:

Example 2

In this example, occurrences of auxiliary fronting in the verbal complex of subordinate clauses in German are found, e.g. 'wird lösen müssen', 'hat beantworten können', 'hätte fragen sollen'.

This query specifies a node with category VCE that dominates a finite auxiliary (VAFIN) followed by a sequence of a non-finite main verb (VVINF) and a non-finite modal verb (VMINF):

[cat = "VCE"]  >* #n1: [ pos = "VAFIN"]  . #n2:[pos = "VVINF"] .  #n3:[pos = "VMINF"]

The lemmas of nodes #n1 and #n3 are chosen for display in the statistics table.

This yields the co-occurrences of the lemma for the fronted auxiliary ("werden" or "haben") and the modal verb.

Example 3

In this example, statistics are calculated on the variation between the genitive and the dative case after the preposition "wegen", e.g. "wegen des Unfalls" (genitive case) versus "wegen dem Unfall" (dative case).

[cat = "PX"] > [word = "wegen"] . [cat = "NX"]  > #n4:[pos= "NN" & morph=/[g|d].*/]

In statistics mode the morph option for node #n4 is chosen:

This allows us to inspect the relative frequencies of all possible morph values, in particular the distinction between genitive and dative case.
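The kind of table that results can be sketched in Python with `collections.Counter`. The morph values below are invented and do not come from any treebank. Taking the first character of a morph value as the case is an assumption based on the /[g|d].*/ pattern in the query.

```python
from collections import Counter

# Hypothetical morph values of node #n4 across six query matches.
morph_values = ["gsm", "gsm", "dsm", "gsn", "dsm", "gsm"]

counts = Counter(morph_values)
total = sum(counts.values())

# Sort by occurrences (most frequent first) and add relative frequencies.
table = [(value, n, n / total) for value, n in counts.most_common()]
for value, n, share in table:
    print(f"{value}\t{n}\t{share:.1%}")

# Aggregate by case, assuming the first character encodes the case.
by_case = Counter(value[0] for value in morph_values)
print(by_case)
```

Grouping by the first character collapses the individual morph values into a genitive versus dative comparison.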

TüNDRA is a free treebank research tool
supported by the CLARIN-D project and SfS (Seminar für Sprachwissenschaft)
at Eberhard Karls Universität Tübingen

v.beta-2
