The TCF Format
From WebLichtWiki
Background on WebLicht and Motivation for TCF
WebLicht is a service orchestration and execution environment for incremental automatic annotation of text corpora, built upon Service Oriented Architecture principles. Language processing tools (tokenizers, taggers, parsers, etc.), provided as webservices, have been integrated into WebLicht.
A WebLicht webservice is simply a synchronous REST‐style web service. The service processes the input data synchronously and returns the result as output data, using the same HTTP connection. Each tool, implemented as a webservice, is hosted locally at the institution where it was developed.
Normally, a tool requires one or more annotation layers to be present in the input document, and those layers are used to generate further annotations. The WebLicht web application helps users to combine the available tools into valid processing chains. Running a processing chain means that tools are invoked one after the other, where the output of each tool is used as input for the next tool in the chain. Tool execution is performed in a distributed fashion by calls to the servers that host the tools in question.
Many of the tools that have been integrated into WebLicht are well known within the linguistic community and have been in use for many years. They were mainly developed as command‐line tools and each uses its own format for storing results. For this reason, it was not possible to combine such tools into processing chains easily. In order for the tools to “talk to each other”, it was necessary to create a highly efficient internal format for sharing data between services.
A common XML data exchange format, the Text Corpus Format (TCF), has been developed within the WebLicht architecture to facilitate efficient interoperability between the tools. TCF allows the various linguistic annotations produced by the tools within WebLicht to be stored in one document. It supports incremental enrichment of linguistic annotations at various levels of analysis in a stand-off XML-based format. Each tool may add one or more annotation layers to the data document, but tools are not permitted to remove or alter any existing layers within the document. This means that the document grows with each annotation step.
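The add-only rule described above can be sketched in a few lines of Python. This is an illustration, not part of WebLicht: the helper name `add_layer` is hypothetical, and the fragment omits the TCF namespaces for brevity.

```python
import copy
import xml.etree.ElementTree as ET

def add_layer(corpus, layer):
    """Return a copy of `corpus` with `layer` appended.

    Hypothetical helper: it refuses to touch a layer that already exists,
    mirroring the rule that tools may add layers but never alter or remove them.
    """
    if corpus.find(layer.tag) is not None:
        raise ValueError(f"layer <{layer.tag}> already exists")
    enriched = copy.deepcopy(corpus)  # leave the input document untouched
    enriched.append(layer)
    return enriched

# Namespace-free fragment, for illustration only.
corpus = ET.fromstring(
    "<TextCorpus lang='de'><text>Karin fliegt.</text>"
    "<tokens><token ID='t_0'>Karin</token></tokens></TextCorpus>"
)
lemmas = ET.fromstring(
    "<lemmas><lemma ID='le_0' tokenIDs='t_0'>Karin</lemma></lemmas>"
)

enriched = add_layer(corpus, lemmas)
layers_before = [child.tag for child in corpus]
layers_after = [child.tag for child in enriched]
```

After the call, `layers_after` holds the original layers in their original order followed by the new `lemmas` layer, while `corpus` itself is unchanged.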
TCF is fully compatible with the Linguistic Annotation Format (LAF) and Graph-based Format for Linguistic Annotations (GrAF) developed in the ISO/TC37/SC4 technical committee (Ide & Suderman, 2007). It needs to be stressed that TCF is not meant as “yet another standard”. Rather, TCF is used as an internal processing format to support efficient data sharing and web service execution. Regarding the relationship of the processing format TCF and well-established standard formats, it should also be noted that WebLicht contains a set of web services to allow automatic conversions between TCF and most of the other formats used in the linguistic community.
TCF Versions
TCF has been developed for the exchange of annotated corpus data between tools and was designed to be maximally flexible with regard to the addition of annotation layers. The development of the format goes together with the development of the tools, adapting to the needs of the tools. Therefore, new layers are evolving with the development of new tools.
No changes are made in the specification of existing layers to ensure backward compatibility with existing tools. This is done to avoid a situation where webservices need to be rewritten to accommodate a schema change.
This document describes the current version of TCF, v0.4.
TCF v0.4 Description and Examples
TCF stores linguistic annotations inside annotation layers which are organized in a stand-off manner. From an organizational point of view, the token layer can be seen as the central, atomic element to which other annotation layers refer. For example, the part-of-speech annotations refer to the token IDs in the token annotation layer via the attribute tokenIDs. Each annotation layer corresponds to a linguistically interesting aspect of the data, such as token boundaries, sentence boundaries, part-of-speech tags, phrase structure parsing, dependency relations, etc. In contrast to other stand-off formats, TCF is contained within one single document.
A TCF document begins with an XML header specifying the character encoding UTF-8, which is the required encoding for all TCF documents. The MetaData section is a placeholder for document-related metadata. The WebLicht web application uses it to store metadata about the resources and tools that were used to generate and process the document, ensuring conformity with the CLARIN Metadata Schema (CMDI). The TextCorpus section contains the linguistic data itself, as well as the language of the data, specified as a two-letter ISO 639-1 code. The linguistic annotations are divided into separate sub-sections for each annotation layer, according to the stand-off based architecture.
Simple Example
The following example shows an overview of the TCF structure:
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <MetaData xmlns="http://www.dspin.de/data/metadata">
    <source></source>
    <Services>
      <CMD xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns="http://www.clarin.eu/cmd/"
           CMDVersion="1.1"
           xsi:schemaLocation="http://www.clarin.eu/cmd/ http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1320657629623/xsd">
        ...
        <Components>
          <WebServiceToolChain>
            ...
            <Toolchain>
              <ToolInChain>
                <PID>11858/00-1778-0000-0004-BA56-7</PID>
                <Parameter name="version" value="0.4"></Parameter>
              </ToolInChain>
              ...
          </WebServiceToolChain>
        </Components>
      </CMD>
    </Services>
  </MetaData>
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <text>Karin fliegt nach New York. Sie will dort Urlaub machen.</text>
    <tokens>
      <token ID="t_0">Karin</token>
      <token ID="t_1">fliegt</token>
      <token ID="t_2">nach</token>
      <token ID="t_3">New</token>
      <token ID="t_4">York</token>
      <token ID="t_5">.</token>
      <token ID="t_6">Sie</token>
      <token ID="t_7">will</token>
      <token ID="t_8">dort</token>
      <token ID="t_9">Urlaub</token>
      <token ID="t_10">machen</token>
      <token ID="t_11">.</token>
    </tokens>
    <sentences>
      <sentence ID="s_0" tokenIDs="t_0 t_1 t_2 t_3 t_4 t_5"></sentence>
      <sentence ID="s_1" tokenIDs="t_6 t_7 t_8 t_9 t_10 t_11"></sentence>
    </sentences>
    <POStags tagset="stts">
      <tag ID="pt_0" tokenIDs="t_0">NE</tag>
      <tag ID="pt_1" tokenIDs="t_1">VVFIN</tag>
      <tag ID="pt_2" tokenIDs="t_2">APPR</tag>
      <tag ID="pt_3" tokenIDs="t_3">NE</tag>
      <tag ID="pt_4" tokenIDs="t_4">NE</tag>
      <tag ID="pt_5" tokenIDs="t_5">$.</tag>
      <tag ID="pt_6" tokenIDs="t_6">PPER</tag>
      <tag ID="pt_7" tokenIDs="t_7">VMFIN</tag>
      <tag ID="pt_8" tokenIDs="t_8">ADV</tag>
      <tag ID="pt_9" tokenIDs="t_9">NN</tag>
      <tag ID="pt_10" tokenIDs="t_10">VVINF</tag>
      <tag ID="pt_11" tokenIDs="t_11">$.</tag>
    </POStags>
    ...
  </TextCorpus>
</D-Spin>
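As a minimal sketch of how a client might read such a document, the following Python snippet uses only the standard library to resolve part-of-speech tags back to their tokens. The shortened sample document and the function name `read_pos` are illustrative assumptions. The namespace URI is the one shown in the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the TextCorpus section, as declared in the example document.
TC = {"tc": "http://www.dspin.de/data/textcorpus"}

# Shortened sample document (an assumption made to keep the sketch small).
TCF = """<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <MetaData xmlns="http://www.dspin.de/data/metadata"/>
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <text>Karin fliegt.</text>
    <tokens>
      <token ID="t_0">Karin</token>
      <token ID="t_1">fliegt</token>
      <token ID="t_2">.</token>
    </tokens>
    <POStags tagset="stts">
      <tag ID="pt_0" tokenIDs="t_0">NE</tag>
      <tag ID="pt_1" tokenIDs="t_1">VVFIN</tag>
      <tag ID="pt_2" tokenIDs="t_2">$.</tag>
    </POStags>
  </TextCorpus>
</D-Spin>"""

def read_pos(tcf_bytes):
    """Return the corpus language and a list of (token text, POS tag) pairs."""
    root = ET.fromstring(tcf_bytes)
    corpus = root.find("tc:TextCorpus", TC)
    # The token layer is the anchor: build an ID -> surface form lookup first.
    tokens = {t.get("ID"): t.text for t in corpus.iterfind("tc:tokens/tc:token", TC)}
    pairs = []
    for tag in corpus.iterfind("tc:POStags/tc:tag", TC):
        # tokenIDs is a whitespace-separated list of token identifiers.
        words = [tokens[i] for i in tag.get("tokenIDs").split()]
        pairs.append((" ".join(words), tag.text))
    return corpus.get("lang"), pairs

lang, pairs = read_pos(TCF.encode("utf-8"))
```

The same pattern (build the token lookup once, then resolve `tokenIDs`) applies to any layer that references tokens directly.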
Annotation Layers
The next example shows a TCF document containing all possible annotation layers with an explanation of each layer.
Text
The text layer contains a character string representing natural language text. This layer is commonly used by linguistic tools to produce token and sentence boundary annotations.
<text>Karin fliegt nach New York. Sie will dort Urlaub machen.</text>
Tokens
The tokens layer is composed of token elements, each having a unique identifier (ID) and token string value, as presented below. Optionally, each token can reference its start and end character offset position in relation to the character string in the text layer. The tokens layer is the main anchor layer among TextCorpus layers, i.e. all other layers (with the exception of the text layer) directly or indirectly (via other layers) reference tokens by referencing token identifiers.
<tokens>
  <token ID="t_0">Karin</token>
  <token ID="t_1">fliegt</token>
  <token ID="t_2">nach</token>
  <token ID="t_3">New</token>
  <token ID="t_4">York</token>
  <token ID="t_5">.</token>
  <token ID="t_6">Sie</token>
  <token ID="t_7">will</token>
  <token ID="t_8">dort</token>
  <token ID="t_9">Urlaub</token>
  <token ID="t_10">machen</token>
  <token ID="t_11">.</token>
</tokens>
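The optional character offsets can be derived by aligning the tokens against the text layer. The sketch below is one simple way to do this in Python. The function name is hypothetical, and the end-exclusive offset convention is an assumption (this document does not state whether TCF end offsets are inclusive).

```python
def token_offsets(text, tokens):
    """Map token ID -> (start, end) character offsets in `text`.

    Assumptions: tokens appear in text order, and `end` is exclusive
    (the usual Python slicing convention).
    """
    offsets, cursor = {}, 0
    for tok_id, form in tokens:
        start = text.index(form, cursor)  # raises ValueError if a token is missing
        cursor = start + len(form)
        offsets[tok_id] = (start, cursor)
    return offsets

text = "Karin fliegt nach New York."
tokens = [("t_0", "Karin"), ("t_1", "fliegt"), ("t_2", "nach"),
          ("t_3", "New"), ("t_4", "York"), ("t_5", ".")]
offsets = token_offsets(text, tokens)
```

With these offsets, `text[start:end]` reproduces each token's surface form.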
Sentences
The sentences layer represents sentence boundary annotations. Each sentence element enumerates token identifiers of tokens that belong to this sentence, as shown below. Optionally, a sentence can have start and end character offset positions in relation to the character string in the text layer.
<sentences>
  <sentence ID="s_0" tokenIDs="t_0 t_1 t_2 t_3 t_4 t_5"/>
  <sentence ID="s_1" tokenIDs="t_6 t_7 t_8 t_9 t_10 t_11"/>
</sentences>
Lemmas
The lemmas layer represents lemma annotations on tokens. Each lemma element references a token, or sequence of tokens, via token identifiers, and provides the lemma string value, as shown here:
<lemmas>
  <lemma ID="le_0" tokenIDs="t_0">Karin</lemma>
  <lemma ID="le_1" tokenIDs="t_1">fliegen</lemma>
  <lemma ID="le_2" tokenIDs="t_2">nach</lemma>
  <lemma ID="le_3" tokenIDs="t_3">--</lemma>
  <lemma ID="le_4" tokenIDs="t_4">York</lemma>
  <lemma ID="le_5" tokenIDs="t_5">--</lemma>
  <lemma ID="le_6" tokenIDs="t_6">sie</lemma>
  <lemma ID="le_7" tokenIDs="t_7">wollen</lemma>
  <lemma ID="le_8" tokenIDs="t_8">dort</lemma>
  <lemma ID="le_9" tokenIDs="t_9">Urlaub</lemma>
  <lemma ID="le_10" tokenIDs="t_10">machen</lemma>
  <lemma ID="le_11" tokenIDs="t_11">--</lemma>
</lemmas>
Part-of-Speech
The POStags layer annotates tokens with part-of-speech tags, as shown below. As in the case of lemma elements, each tag element references a token, or sequence of tokens, and provides the tag string value. Tag values usually belong to some predefined standard tagset. The layer specifies the name of the tagset via the tagset attribute.
<POStags tagset="stts">
  <tag ID="pt_0" tokenIDs="t_0">NE</tag>
  <tag ID="pt_1" tokenIDs="t_1">VVFIN</tag>
  <tag ID="pt_2" tokenIDs="t_2">APPR</tag>
  <tag ID="pt_3" tokenIDs="t_3">NE</tag>
  <tag ID="pt_4" tokenIDs="t_4">NE</tag>
  <tag ID="pt_5" tokenIDs="t_5">$.</tag>
  <tag ID="pt_6" tokenIDs="t_6">PPER</tag>
  <tag ID="pt_7" tokenIDs="t_7">VMFIN</tag>
  <tag ID="pt_8" tokenIDs="t_8">ADV</tag>
  <tag ID="pt_9" tokenIDs="t_9">NN</tag>
  <tag ID="pt_10" tokenIDs="t_10">VVINF</tag>
  <tag ID="pt_11" tokenIDs="t_11">$.</tag>
</POStags>
Constituent Parsing
The constituent parsing layer represents phrase structure parsing annotations on sentence tokens. The layer specifies the tagset used for phrase structure categories. The parsed structure is a tree, where the terminal nodes reference tokens, and non-terminal nodes are composed of other nodes. Optionally, the nodes can include incoming edge labels. Additionally, secondary edges can be specified by referencing target nodes. Here is an example of a constituent parsing layer:
<parsing tagset="tigertb">
  <parse>
    <constituent cat="Start" ID="c_15">
      <constituent cat="SIMPX" ID="c_13">
        <constituent cat="VF" ID="c_2">
          <constituent cat="NX" ID="c_1">
            <constituent cat="NE" ID="c_0" tokenIDs="t_0"/>
          </constituent>
        </constituent>
        <constituent cat="LK" ID="c_5">
          <constituent cat="VXFIN" ID="c_4">
            <constituent cat="VVFIN" ID="c_3" tokenIDs="t_1"/>
          </constituent>
        </constituent>
        <constituent cat="MF" ID="c_12">
          <constituent cat="PX" ID="c_11">
            <constituent cat="APPR" ID="c_6" tokenIDs="t_2"/>
            <constituent cat="EN" ID="c_10">
              <constituent cat="NX" ID="c_9">
                <constituent cat="NE" ID="c_7" tokenIDs="t_3"/>
                <constituent cat="NE" ID="c_8" tokenIDs="t_4"/>
              </constituent>
            </constituent>
          </constituent>
        </constituent>
      </constituent>
      <constituent cat="$." ID="c_14" tokenIDs="t_5"/>
    </constituent>
  </parse>
  <parse>
    <constituent cat="Start" ID="c_32">
      <constituent cat="SIMPX" ID="c_30">
        <constituent cat="VF" ID="c_18">
          <constituent cat="NX" ID="c_17">
            <constituent cat="PPER" ID="c_16" tokenIDs="t_6"/>
          </constituent>
        </constituent>
        <constituent cat="LK" ID="c_21">
          <constituent cat="VXFIN" ID="c_20">
            <constituent cat="VMFIN" ID="c_19" tokenIDs="t_7"/>
          </constituent>
        </constituent>
        <constituent cat="MF" ID="c_26">
          <constituent cat="ADVX" ID="c_23">
            <constituent cat="ADV" ID="c_22" tokenIDs="t_8"/>
          </constituent>
          <constituent cat="NX" ID="c_25">
            <constituent cat="NN" ID="c_24" tokenIDs="t_9"/>
          </constituent>
        </constituent>
        <constituent cat="VC" ID="c_29">
          <constituent cat="VXINF" ID="c_28">
            <constituent cat="VVINF" ID="c_27" tokenIDs="t_10"/>
          </constituent>
        </constituent>
      </constituent>
      <constituent cat="$." ID="c_31" tokenIDs="t_11"/>
    </constituent>
  </parse>
</parsing>
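Because terminals carry a tokenIDs attribute and non-terminals contain further constituent elements, the tree can be walked recursively. The following Python sketch renders a parse as a bracketed string. The function name `bracket` and the reduced sample fragment are illustrative assumptions, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Reduced fragment of the example parse (namespace omitted for brevity).
PARSE = """<parse>
  <constituent cat="NX" ID="c_9">
    <constituent cat="NE" ID="c_7" tokenIDs="t_3"/>
    <constituent cat="NE" ID="c_8" tokenIDs="t_4"/>
  </constituent>
</parse>"""
TOKENS = {"t_3": "New", "t_4": "York"}

def bracket(node, tokens):
    """Render a constituent subtree in bracketed (Penn-style) notation."""
    cat = node.get("cat")
    token_ids = node.get("tokenIDs")
    if token_ids is not None:
        # Terminal node: it references tokens instead of containing child nodes.
        words = " ".join(tokens[i] for i in token_ids.split())
        return f"({cat} {words})"
    children = " ".join(bracket(child, tokens) for child in node.findall("constituent"))
    return f"({cat} {children})"

tree = bracket(ET.fromstring(PARSE).find("constituent"), TOKENS)
```

Edge labels and secondary edges are not handled in this sketch.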
Dependency Parsing
The depparsing layer annotates dependency relations between tokens. Each dependency annotation contains a reference to a token, or sequence of tokens, that is in a dependent role in the given relation, as well as a reference to a token, or sequence of tokens, that is in the governor role in the given relation. Optionally, the function of the dependent-governor relation is specified. In some cases, such as in the case of root dependency, the governor can be omitted. Additionally, the dependency layer specifies:
• a tagset for dependency function tags,
• whether empty tokens can be inserted into the dependency parse,
• whether a dependent can have more than one governor in the parse.
A dependency layer example:
<depparsing tagset="tiger" emptytoks="false" multigovs="false">
  <parse ID="d_0">
    <dependency govIDs="t_1" depIDs="t_0" func="SB"/>
    <dependency depIDs="t_1" func="ROOT"/>
    <dependency govIDs="t_1" depIDs="t_2" func="MO"/>
    <dependency govIDs="t_4" depIDs="t_3" func="PNC"/>
    <dependency govIDs="t_2" depIDs="t_4" func="NK"/>
    <dependency govIDs="t_4" depIDs="t_5" func="--"/>
  </parse>
  <parse ID="d_1">
    <dependency govIDs="t_7" depIDs="t_6" func="SB"/>
    <dependency depIDs="t_7" func="ROOT"/>
    <dependency govIDs="t_10" depIDs="t_8" func="MO"/>
    <dependency govIDs="t_10" depIDs="t_9" func="OA"/>
    <dependency govIDs="t_7" depIDs="t_10" func="OC"/>
    <dependency govIDs="t_10" depIDs="t_11" func="--"/>
  </parse>
</depparsing>
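A consumer of this layer typically turns each parse into a dependent-to-governor map. The Python sketch below does this for a shortened parse. It assumes `multigovs="false"` (each dependent has at most one governor) and a single governor ID per dependency. The function name `heads` is hypothetical and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Shortened version of the first example parse (namespace omitted).
DEP = """<depparsing tagset="tiger" emptytoks="false" multigovs="false">
  <parse ID="d_0">
    <dependency govIDs="t_1" depIDs="t_0" func="SB"/>
    <dependency depIDs="t_1" func="ROOT"/>
    <dependency govIDs="t_1" depIDs="t_2" func="MO"/>
  </parse>
</depparsing>"""

def heads(parse):
    """Map each dependent token ID to (governor ID or None, function)."""
    result = {}
    for dep in parse.findall("dependency"):
        gov = dep.get("govIDs")  # None when the governor is omitted (root)
        for dep_id in dep.get("depIDs").split():
            result[dep_id] = (gov, dep.get("func"))
    return result

head_map = heads(ET.fromstring(DEP).find("parse"))
# The root is the dependent that has no governor.
roots = [tok for tok, (gov, _) in head_map.items() if gov is None]
```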
Morphology
The morphology layer specifies the morphological features of tokens. Each morphology annotation (analysis) contains a list of morphological features with either a feature name - feature value pair, or a feature name - feature sub-features pair. Additionally, the morphology layer specifies whether it contains optional segmentation annotations. Segmentation annotations specify morphological features of segments of a given token. An example of a morphological layer:
<morphology>
  <analysis tokenIDs="t_0">
    <tag>
      <fs>
        <f name="cat">proper name</f>
        <f name="gender">feminine</f>
        <f name="case">nominative</f>
        <f name="number">singular</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="proper name" type="stem">Karin</segment>
    </segmentation>
  </analysis>
  <analysis tokenIDs="t_1">
    <tag>
      <fs>
        <f name="cat">verb</f>
        <f name="person">3</f>
        <f name="number">singular</f>
        <f name="tense">present</f>
        <f name="indicative">true</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="verb" type="stem">fliegen</segment>
    </segmentation>
  </analysis>
  <analysis tokenIDs="t_2">
    <tag>
      <fs>
        <f name="cat">preposition</f>
        <f name="case">dative</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="preposition" type="stem">nach</segment>
    </segmentation>
  </analysis>
  <analysis tokenIDs="t_4">
    <tag>
      <fs>
        <f name="cat">proper name</f>
        <f name="gender">neuter</f>
        <f name="case">nominative</f>
        <f name="number">singular</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="proper name" type="stem">York</segment>
    </segmentation>
  </analysis>
  <analysis tokenIDs="t_5">
    <tag>
      <fs>
        <f name="cat">punctuation</f>
        <f name="punctuation">normal</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="punctuation" type="stem">.</segment>
    </segmentation>
  </analysis>
  <analysis tokenIDs="t_6">
    <tag>
      <fs>
        <f name="cat">personal pronoun</f>
        <f name="personal">true</f>
        <f name="person">3</f>
        <f name="number">singular</f>
        <f name="gender">undef</f>
        <f name="case">nominative</f>
        <f name="inflection">weak</f>
        <f name="capitalizedForm">true</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="personal pronoun" type="stem">sie</segment>
    </segmentation>
  </analysis>
  <analysis tokenIDs="t_7">
    <tag>
      <fs>
        <f name="cat">verb</f>
        <f name="person">3</f>
        <f name="number">singular</f>
        <f name="tense">present</f>
        <f name="indicative">true</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="verb" type="stem">wollen</segment>
    </segmentation>
  </analysis>
  <analysis tokenIDs="t_8">
    <tag>
      <fs>
        <f name="cat">adverb</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="adverb" type="stem">dort</segment>
    </segmentation>
  </analysis>
  <analysis tokenIDs="t_9">
    <tag>
      <fs>
        <f name="cat">noun</f>
        <f name="gender">masculine</f>
        <f name="case">accusative</f>
        <f name="number">singular</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="noun" type="stem">Urlaub</segment>
    </segmentation>
  </analysis>
  <analysis tokenIDs="t_10">
    <tag>
      <fs>
        <f name="cat">verb</f>
        <f name="infinitive">true</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="verb" type="stem">machen</segment>
    </segmentation>
  </analysis>
  <analysis tokenIDs="t_11">
    <tag>
      <fs>
        <f name="cat">punctuation</f>
        <f name="punctuation">normal</f>
      </fs>
    </tag>
    <segmentation>
      <segment cat="punctuation" type="stem">.</segment>
    </segmentation>
  </analysis>
</morphology>
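The feature structures map naturally onto nested dictionaries. The Python sketch below converts one analysis into a dict. The function name is hypothetical, and the handling of sub-features assumes that a nested `fs` element sits inside the `f` element, a case the example above does not show.

```python
import xml.etree.ElementTree as ET

# One analysis from the example layer (namespace omitted for brevity).
ANALYSIS = """<analysis tokenIDs="t_0">
  <tag>
    <fs>
      <f name="cat">proper name</f>
      <f name="gender">feminine</f>
      <f name="case">nominative</f>
      <f name="number">singular</f>
    </fs>
  </tag>
</analysis>"""

def features(fs):
    """Convert an <fs> element into a (possibly nested) dict of features."""
    out = {}
    for f in fs.findall("f"):
        sub = f.find("fs")
        # Assumption: sub-features, when present, are a nested <fs> inside <f>.
        out[f.get("name")] = features(sub) if sub is not None else f.text
    return out

analysis = ET.fromstring(ANALYSIS)
karin = features(analysis.find("tag/fs"))
```

Note that a dict keeps only one value per feature name, so repeated names within one feature structure would be collapsed.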
Named Entities
The namedEntities layer specifies named entity annotations on tokens. Each entity annotation references a token, or sequence of tokens, that represents this entity and specifies a named entity class (e.g. person, location, etc.). The layer specifies the tagset used for named entity class tags as the value of its type attribute. An example of a named entities layer:
<namedEntities type="CoNLL2002">
  <entity ID="ne_0" class="PER" tokenIDs="t_0"/>
  <entity ID="ne_1" class="LOC" tokenIDs="t_3 t_4"/>
</namedEntities>
References
The references layer represents annotations on tokens, or sequences of tokens, that refer to the same entities. The layer consists of a list of entity elements. Each entity element contains all the mentions (reference elements) that refer to this entity. Each reference element enumerates token identifiers, pointing to the sequence of tokens that represents this reference. Optionally, the minimum token sequence of the reference can be specified (e.g. when the reference is a long noun phrase, its minimum representation is the head of the phrase). The linguistic type of the reference can be specified (pronoun, nominal, name, demonstrative, zero pronoun; other or finer distinctions are possible). The relation to another reference (the target reference) can be specified as well (anaphoric, cataphoric, coreferential, etc.). Additionally, an external identifier of the entity can be provided (the URL of a Wikipedia article for the entity, the ID of the entity in a database, etc.). The references layer has optional attributes that specify the type tagset (when linguistic types of references are given), the relation tagset (when relations between references are given), and the name of the external reference source (when external references of entities are provided). An example of a references layer:
<references typetagset="BART" reltagset="TuebaDZ">
  <entity>
    <reference ID="rc_0" tokenIDs="t_0" mintokIDs="t_0" type="nam"/>
    <reference ID="rc_1" tokenIDs="t_6" mintokIDs="t_6" type="pro.per3" rel="anaphoric" target="rc_0"/>
  </entity>
  <entity>
    <reference ID="rc_2" tokenIDs="t_3 t_4" mintokIDs="t_3 t_4" type="nam"/>
    <reference ID="rc_3" tokenIDs="t_8" mintokIDs="t_8" type="adv" rel="anaphoric" target="rc_2"/>
  </entity>
</references>
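To make the entity and mention structure concrete, the following Python sketch lists the surface strings of all mentions per entity. The function name `chains` is hypothetical, the token layer is reduced to the tokens the references use, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Reduced corpus: only the referenced tokens are included (namespace omitted).
CORPUS = """<TextCorpus>
  <tokens>
    <token ID="t_0">Karin</token>
    <token ID="t_3">New</token>
    <token ID="t_4">York</token>
    <token ID="t_6">Sie</token>
    <token ID="t_8">dort</token>
  </tokens>
  <references typetagset="BART" reltagset="TuebaDZ">
    <entity>
      <reference ID="rc_0" tokenIDs="t_0" mintokIDs="t_0" type="nam"/>
      <reference ID="rc_1" tokenIDs="t_6" mintokIDs="t_6" type="pro.per3" rel="anaphoric" target="rc_0"/>
    </entity>
    <entity>
      <reference ID="rc_2" tokenIDs="t_3 t_4" mintokIDs="t_3 t_4" type="nam"/>
      <reference ID="rc_3" tokenIDs="t_8" mintokIDs="t_8" type="adv" rel="anaphoric" target="rc_2"/>
    </entity>
  </references>
</TextCorpus>"""

def chains(corpus):
    """Return one list of mention strings per entity, in document order."""
    tokens = {t.get("ID"): t.text for t in corpus.iterfind("tokens/token")}
    result = []
    for entity in corpus.iterfind("references/entity"):
        mentions = [
            " ".join(tokens[i] for i in ref.get("tokenIDs").split())
            for ref in entity.findall("reference")
        ]
        result.append(mentions)
    return result

coref = chains(ET.fromstring(CORPUS))
```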
Lexical-semantic Annotations
Lexical-semantic annotations are represented by four layers: synonymy, antonymy, hyponymy, hyperonymy. The layers enumerate orthform elements. Each orthform element has an orthform string (a word, a phrase, or a list of both) and references the lemmas that are in a synonymy/antonymy/hyponymy/hyperonymy relation with this orthform. An example of a synonymy layer:
<synonymy>
  <orthform lemmaRefs="le_1">flattern, flirren, hinausfliegen</orthform>
  <orthform lemmaRefs="le_9">Ferien</orthform>
  <orthform lemmaRefs="le_10">auflegen, ausmachen, betätigen, bilden, erschaffen, formen, hervorbringen, produzieren, schaffen, schöpfen</orthform>
</synonymy>
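Unlike most layers, the lexical-semantic layers point at lemmas rather than tokens, so a lookup goes through the lemmas layer. The Python sketch below builds a lemma-to-synonyms table. The function name is hypothetical, and splitting the orthform string on commas is an assumption based on the example rather than a documented rule.

```python
import xml.etree.ElementTree as ET

# Reduced lemma and synonymy layers (namespace omitted for brevity).
LAYERS = """<TextCorpus>
  <lemmas>
    <lemma ID="le_1" tokenIDs="t_1">fliegen</lemma>
    <lemma ID="le_9" tokenIDs="t_9">Urlaub</lemma>
  </lemmas>
  <synonymy>
    <orthform lemmaRefs="le_1">flattern, flirren, hinausfliegen</orthform>
    <orthform lemmaRefs="le_9">Ferien</orthform>
  </synonymy>
</TextCorpus>"""

def synonyms(corpus):
    """Map each referenced lemma string to its list of synonyms."""
    lemmas = {l.get("ID"): l.text for l in corpus.iterfind("lemmas/lemma")}
    table = {}
    for orth in corpus.iterfind("synonymy/orthform"):
        # Assumption: list entries are comma-separated, as in the example.
        forms = [part.strip() for part in orth.text.split(",")]
        for ref in orth.get("lemmaRefs").split():
            table.setdefault(lemmas[ref], []).extend(forms)
    return table

syn = synonyms(ET.fromstring(LAYERS))
```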
Additionally, the word senses layer (wsd) represents word sense annotations as lexical units from a word senses resource. The wsd layer specifies the resource from which the word sense annotations are generated, and lists word sense (ws) annotation elements. Each ws element has reference to the token/tokens that it annotates, and to the lexical unit/units from the word senses resource. It can have an optional comment. An example of a word senses layer:
<wsd src="GermaNet8.0">
  <ws tokenIDs="t_1" lexunits="75069 75197" comment="übertragen"/>
  <ws tokenIDs="t_9" lexunits="-1" comment="unbestimmbar"/>
  <ws tokenIDs="t_10" lexunits="82896"/>
</wsd>
Matches
The matches layer appears when the TCF data is generated by querying text corpus resources. The layer consists of the original query string, the query type (i.e. the query language), and corpus elements. Each corpus element has a name, a persistent identifier and a list of items. Each item references a token, or sequence of tokens, in the TCF tokens layer and a token, or sequence of tokens, in the original corpus. It specifies the target (if requested in the query), and categories (if additional information about the corresponding original resource can be given, such as document name, author, genre, etc.). An example of a matches layer:
<matches>
  <query type="cqp">@[word='Urlaub' | ne='PER']</query>
  <corpus pid="11858/00-1778-0000-0001-TCFEXAMPLE-A" name="TCF04-EXAMPLE">
    <item tokenIDs="t_0" srcIDs="0">
      <target value="t_0" name="0"/>
      <category value="karin" name="text_filename"/>
      <category value="SFS, Uni Tuebingen" name="text_author"/>
    </item>
    <item tokenIDs="t_9" srcIDs="9">
      <target value="t_9" name="0"/>
      <category value="karin" name="text_filename"/>
      <category value="SFS, Uni Tuebingen" name="text_author"/>
    </item>
  </corpus>
</matches>
Word Splittings
The WordSplittings layer annotates tokens with regard to character intervals the token can be split into. The type of the splitting is specified at the layer level. Each split references the token it annotates and character offsets of the splits (count of characters should start at 0). An example WordSplittings layer:
<WordSplittings type="syllables">
  <split tokID="t_0">2</split>
  <split tokID="t_9">2 4</split>
  <split tokID="t_10">2</split>
</WordSplittings>
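Applying the split offsets to a token is a matter of slicing. The Python sketch below assumes that each offset is the 0-based character position at which a new segment begins, which matches the syllable example above. The function name is hypothetical.

```python
def split_token(form, offsets):
    """Cut `form` at the given 0-based character offsets.

    Assumption: each offset marks the position where a new segment starts.
    """
    bounds = [0, *offsets, len(form)]
    return [form[a:b] for a, b in zip(bounds, bounds[1:])]

# Offsets as they appear in the example layer ("2", "2 4", "2").
karin = split_token("Karin", [int(x) for x in "2".split()])
urlaub = split_token("Urlaub", [int(x) for x in "2 4".split()])
machen = split_token("machen", [int(x) for x in "2".split()])
```

Under this reading, the example layer splits the three tokens into Ka-rin, Ur-la-ub and ma-chen.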
Geographical locations
The geo layer represents annotations for geographical locations. A token, or sequence of tokens, for which a geographical location can be identified, is annotated with longitude and latitude coordinates and, optionally, altitude, continent, country and capital. At the layer level, attributes specify the formats used for the coordinate, continent, country and capital values of a geographical point. An example geo layer:
<geo coordFormat="DegDec" continentFormat="name" countryFormat="ISO3166_A2" capitalFormat="name">
  <src>http://www.geonames.org/</src>
  <gpoint tokenIDs="t_3 t_4" alt="10" lat="40.714167" lon="-74.005833" continent="North America" country="US" capital="Washington"/>
</geo>
Discourse connectives
The discourseconnectives layer annotates discourse connectives. The type of each discourse connective can be specified. In that case, the tagset used for discourse connective types should be specified at the layer level. An example discourseconnectives layer:
<discourseconnectives tagset="TuebaDZ">
  <connective tokenIDs="t_3" type="expansion"/>
  <connective tokenIDs="t_5" type="temporal"/>
</discourseconnectives>
Phonetics
The Phonetics layer annotates tokens with their phonetic pronunciation. The transcription system is specified in the corresponding attribute of the layer. An example Phonetics layer:
<Phonetics transcription="IPA">
  <pron tokID="t_9">ˈuːɐ̯ˌlaʊ̯p</pron>
</Phonetics>
Text structure
The textstructure layer preserves the original structure of a written text. Within the layer, a token sequence can be annotated as belonging to some text structure element, such as a page, a paragraph, a line, a title, etc. A textspan element represents a text structure annotation on a token sequence. The token sequence is specified by a start token ID and an end token ID. The type of the text structure element is specified by the corresponding attribute. An example textstructure layer:
<textstructure>
  <textspan start="t_0" end="t_5" type="page"/>
  <textspan type="line"/>
  <textspan start="t_0" end="t_11" type="paragraph"/>
  <textspan start="t_0" end="t_1" type="line"/>
  <textspan start="t_2" end="t_5" type="line"/>
  <textspan start="t_6" end="t_11" type="page"/>
  <textspan start="t_6" end="t_8" type="line"/>
  <textspan start="t_9" end="t_11" type="line"/>
  <textspan type="line"/>
</textstructure>
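Since a textspan names only its first and last token, resolving it requires the order of the token layer. The Python sketch below expands spans into token ID lists. It treats the end token as inclusive and spans without start and end as empty (for instance an empty line). Both are assumptions read off the example. The function name is hypothetical and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Token IDs in document order, matching the example token layer.
ORDER = [f"t_{i}" for i in range(12)]

# Shortened version of the example layer (namespace omitted).
LAYER = """<textstructure>
  <textspan start="t_0" end="t_5" type="page"/>
  <textspan type="line"/>
  <textspan start="t_0" end="t_1" type="line"/>
  <textspan start="t_2" end="t_5" type="line"/>
</textstructure>"""

def spans(layer, order):
    """Return (type, [token IDs]) for every textspan in the layer."""
    position = {tok_id: i for i, tok_id in enumerate(order)}
    out = []
    for span in layer.findall("textspan"):
        start, end = span.get("start"), span.get("end")
        if start is None or end is None:
            ids = []  # assumption: a span without tokens, e.g. an empty line
        else:
            # Assumption: the end token is part of the span (inclusive).
            ids = order[position[start]: position[end] + 1]
        out.append((span.get("type"), ids))
    return out

structure = spans(ET.fromstring(LAYER), ORDER)
```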
Orthography
The orthography layer annotates tokens with their correct orthographic transcription. For each correction a correction operation is specified. An example orthography layer:
<orthography>
  <correction operation="replace" tokenIDs="t_0">Karina</correction>
</orthography>
Schema, Tutorials, Online Validator
Tutorials for working with TCF documents
TCF v0.4 Example: the file that contains the XML of the complete example used in this document.