The TCF Format

From WebLichtWiki

Revision as of 10:04, 23 May 2012 by Meh

Background on WebLicht and Motivation for TCF

WebLicht is a service orchestration and execution environment for incremental automatic annotation of text corpora, built upon Service Oriented Architecture principles. Language processing tools (tokenizers, taggers, parsers, etc.), provided as web services, have been integrated into WebLicht.

A WebLicht web service is simply a synchronous REST‐style web service: it processes the input data and returns the result as output data over the same HTTP connection. Each tool, implemented as a web service, is hosted locally at the institution where it was developed.
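The synchronous request/response pattern described above can be sketched from the client side as follows. This is only an illustration using the Python standard library: the endpoint URL, the content type, and the function names `build_request` and `call_service` are assumptions made for this example and are not taken from the WebLicht documentation.

```python
import urllib.request

# Hypothetical endpoint and content type, chosen only for illustration.
EXAMPLE_URL = "http://example.org/hypothetical-tokenizer"
EXAMPLE_CONTENT_TYPE = "text/xml"


def build_request(url: str, document: bytes,
                  content_type: str = EXAMPLE_CONTENT_TYPE) -> urllib.request.Request:
    """Prepare an HTTP POST whose body is the input document."""
    return urllib.request.Request(
        url,
        data=document,
        headers={"Content-Type": content_type},
        method="POST",
    )


def call_service(url: str, document: bytes) -> bytes:
    """Send the document and block until the annotated result comes back.

    The result arrives as the response body on the same HTTP connection,
    which is what makes the call synchronous.
    """
    with urllib.request.urlopen(build_request(url, document)) as response:
        return response.read()
```

Because the call blocks until the service answers, a client needs no polling or callback mechanism; the trade-off is that the connection stays open for as long as the tool runs.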

Normally, a tool requires one or more annotation layers to be present in the input document, and those layers are used to generate further annotations. The WebLicht web application helps users to combine the available tools into valid processing chains. Running a processing chain means that tools are invoked one after the other, where the output of each tool is used as input for the next tool in the chain. Tool execution is performed in a distributed fashion by calls to the servers that host the tools in question.
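The idea of a valid processing chain can be sketched as below. This is a simplified model, not WebLicht's actual implementation: the `Tool` class, the layer names, and the toy tokenizer and tagger are hypothetical, and the tools run as local functions rather than as remote web services.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, Iterable, List, Set

# Hypothetical document model: layer name -> layer content.
Document = Dict[str, Any]


@dataclass(frozen=True)
class Tool:
    name: str
    requires: FrozenSet[str]   # layers that must already be present
    produces: FrozenSet[str]   # layers this tool adds
    run: Callable[[Document], Document]


def validate_chain(initial_layers: Iterable[str], chain: List[Tool]) -> Set[str]:
    """Check that every tool finds the layers it requires; return the final layer set."""
    available = set(initial_layers)
    for tool in chain:
        missing = tool.requires - available
        if missing:
            raise ValueError(f"{tool.name} is missing layers: {sorted(missing)}")
        available |= tool.produces
    return available


def run_chain(document: Document, chain: List[Tool]) -> Document:
    """Invoke the tools one after the other, feeding each output to the next tool."""
    validate_chain(document.keys(), chain)
    for tool in chain:
        document = tool.run(document)
    return document


# Toy tools for illustration only.
tokenizer = Tool(
    name="tokenizer",
    requires=frozenset({"text"}),
    produces=frozenset({"tokens"}),
    run=lambda doc: {**doc, "tokens": doc["text"].split()},
)
tagger = Tool(
    name="tagger",
    requires=frozenset({"tokens"}),
    produces=frozenset({"postags"}),
    run=lambda doc: {**doc, "postags": ["X" for _ in doc["tokens"]]},
)

result = run_chain({"text": "Hello world"}, [tokenizer, tagger])
```

Placing the tagger before the tokenizer makes `validate_chain` raise a `ValueError`, because the `tokens` layer does not exist yet at that point in the chain.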

Many of the tools that have been integrated into WebLicht are well known within the linguistic community and have been in use for many years. They were mainly developed as command‐line tools and each uses its own format for storing results. For this reason, it was not possible to combine such tools into processing chains easily. In order for the tools to “talk to each other”, it was necessary to create a highly efficient internal format for sharing data between services.

A common XML data exchange format (Text Corpus Format, TCF) has been developed within the WebLicht architecture to facilitate efficient interoperability between the tools. The TCF format allows the various linguistic annotations produced by the tools within WebLicht to be stored in one document. It supports incremental enrichment of linguistic annotations at various levels of analysis in a stand-off XML‐based format. Each tool may add one or more annotation layers to the data document, but tools are not permitted to remove or alter any existing layers within the document. This means that the document grows with each annotation:


[Figure: TCF-Add-Layers.png]
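The add-only, stand-off layering described above can be illustrated with a small sketch. The element and attribute names used here (`corpus`, `text`, `tokens`, `token`, `tags`, `tag`, `id`, `tokenRef`) are simplified placeholders invented for this example; they are not the actual TCF schema. The helper `add_layer` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def add_layer(corpus: ET.Element, layer: ET.Element) -> None:
    """Append a new annotation layer; refuse to touch a layer that already exists."""
    if corpus.find(layer.tag) is not None:
        raise ValueError(f"layer '{layer.tag}' already exists and may not be altered")
    corpus.append(layer)


# Start with a document that holds only the primary text.
corpus = ET.Element("corpus")
text = ET.Element("text")
text.text = "Hello world"
add_layer(corpus, text)

# A tokenizer-like step adds a token layer with identifiers.
tokens = ET.Element("tokens")
for index, word in enumerate(text.text.split(), start=1):
    token = ET.SubElement(tokens, "token", {"id": f"t{index}"})
    token.text = word
add_layer(corpus, tokens)

# A tagger-like step adds a stand-off layer: each tag points at a token
# by identifier instead of wrapping or modifying the token itself.
tags = ET.Element("tags")
for token in tokens:
    tag = ET.SubElement(tags, "tag", {"tokenRef": token.get("id")})
    tag.text = "X"  # placeholder tag value
add_layer(corpus, tags)

layer_names = [child.tag for child in corpus]
serialized = ET.tostring(corpus, encoding="unicode")
```

After the three steps the document contains the layers `text`, `tokens`, and `tags` in the order they were added, and an attempt to add a second `tokens` layer is rejected. Earlier layers are never rewritten, so each tool only needs to understand the layers it reads.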


TCF is fully compatible with the Linguistic Annotation Format (LAF) and Graph-based Format for Linguistic Annotations (GrAF) developed in the ISO/TC37/SC4 technical committee (Ide & Suderman, 2007). It needs to be stressed that TCF is not meant as “yet another standard”. Rather, TCF is used as an internal processing format to support efficient data sharing and web service execution. As for the relationship between TCF and well-established standard formats, WebLicht contains a set of web services that convert automatically between TCF and most of the other formats used in the linguistic community.