Reading and writing TCF

From WebLichtWiki


Revision as of 17:31, 11 December 2013


Introduction

To use the output of a WebLicht tool or to participate in WebLicht's tool chaining, one must be able to read and/or write TCF. There are two options available:

  • The first option is to parse the TCF document and extract the necessary annotations. If you are building a service, you then process them to build a new annotation layer and add the new annotations to the document according to the TCF specification. For creating and parsing TCF, one can use a DOM parser or a StAX parser. A StAX parser is preferable in most cases: a DOM parser holds the entire document in memory, whereas with a StAX parser the developer decides which pieces of information are kept and how they are stored, which gives fine control over parsing efficiency and memory use.
  • The second option is to use the TCF binding library WLFXB, which we offer to developers and users of WebLicht services. It abstracts from the particular format of TCF and its parsing. The library binds linguistic data, represented by XML elements and attributes in TCF, to linguistic data represented by Java objects. These objects are accessed from a WLData object or from its components: a TextCorpus object (for text-corpus-based annotations) or a Lexicon object (for dictionary-based annotations). The library interface allows easy access to the linguistic annotations in TCF and easy addition of new ones. Thus, one can access or create linguistic annotations in a TextCorpus or Lexicon Java object directly, without dealing with XML; the TCF document is read from and written to XML automatically.
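
For comparison, the first option can be sketched with the StAX API that ships with the JDK (javax.xml.stream). The example below is only an illustration, not part of WLFXB: the class name TcfTokenReader, the method readTokens and the embedded sample are hypothetical, and the sample is a heavily trimmed, TCF-like snippet rather than a complete or validated TCF document. It shows how a streaming parser lets you keep only the pieces you need (here, the token strings) instead of holding the whole document in memory.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class TcfTokenReader {

    // Assumption: a simplified TCF-like sample, trimmed for illustration.
    // A real TCF document also carries metadata and further layers.
    static final String SAMPLE =
        "<D-Spin xmlns=\"http://www.dspin.de/data\" version=\"0.4\">"
      + "<TextCorpus xmlns=\"http://www.dspin.de/data/textcorpus\" lang=\"de\">"
      + "<text>Karin fliegt.</text>"
      + "<tokens>"
      + "<token ID=\"t1\">Karin</token>"
      + "<token ID=\"t2\">fliegt</token>"
      + "<token ID=\"t3\">.</token>"
      + "</tokens>"
      + "</TextCorpus>"
      + "</D-Spin>";

    // Streams through the document and keeps only the text of <token> elements.
    static List<String> readTokens(Reader input) throws XMLStreamException {
        List<String> tokens = new ArrayList<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTDs are not needed for this sketch, so switch them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(input);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "token".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    tokens.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return tokens;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Prints the token strings collected from the sample.
        System.out.println(readTokens(new StringReader(SAMPLE)));
    }
}
```

Everything other than the token text is skipped as the parser moves through the stream, which is the memory advantage over DOM described above. Writing new layers back into the document by hand requires the same level of care, and that is the work the binding library is meant to take over.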

This tutorial is devoted to the second option and explains how to use the WLFXB binding library.


Getting started with the WLFXB library

The WLFXB library is stored in the Clarin repository (http://catalog.clarin.eu/ds/nexus/content/repositories/Clarin/). You can use it in a Maven project. For example, in NetBeans, do the following:

  • TODO

Add the Clarin repository to the <repositories> section of pom.xml:

       <repository>
           <id>clarin</id>
           <url>http://catalog.clarin.eu/ds/nexus/content/repositories/Clarin/</url>
       </repository>

Add the wlfxb dependency to the <dependencies> section of pom.xml, replacing the version shown with the latest available version of wlfxb:

       <dependency>
           <groupId>eu.clarin.weblicht</groupId>
           <artifactId>wlfxb</artifactId>
           <version>1.2.7</version>
       </dependency>

Two use cases are described in this tutorial. Proceed to the one that better fits your needs:

  • Producing linguistic annotations in TCF from scratch: you want to create a new TCF document containing linguistic annotations. Within WebLicht, this scenario applies when you are implementing a converter from another format to TCF. It also applies when you want to make your own data or corpus available to users in TCF format.

Consuming and producing linguistic annotations from/into TCF

Producing linguistic annotations in TCF from scratch

General remarks on the WLFXB library usage