Reading and writing TCF

Revision as of 18:04, 11 December 2013

Introduction

To use the output of a WebLicht tool or to participate in WebLicht's tool-chaining, one must be able to read and/or write TCF. There are two options available:

  • The first option is to parse the TCF document, extract the necessary annotations and, if you are building a service, process them to build a new annotation layer and add the new annotations to the document according to the TCF specification. For creating and parsing TCF, one can use a DOM parser or a StAX parser. A StAX parser is preferable in most cases: a DOM parser holds the entire document in memory, whereas with a StAX parser the developer decides which pieces of information are kept and how they are stored, which gives fine control over parsing efficiency and memory use.
  • The second option is to use the TCF binding library WLFXB, which we offer to developers and users of WebLicht services. It abstracts from the particular format of TCF and its parsing. The library binds linguistic data, represented by XML elements and attributes in TCF, to linguistic data represented by Java objects accessed from a WLData object or from its components: a TextCorpus object (for text-corpus-based annotations) or a Lexicon object (for dictionary-based annotations). The library interface allows easy access to the linguistic annotations in TCF and easy addition of new ones. Thus, one can access or create linguistic annotations in a TextCorpus or Lexicon Java object directly, without dealing with XML; the TCF document is read from and written to XML automatically.
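
For comparison, the first option can be sketched with the StAX API from the Java standard library (javax.xml.stream). The sketch below streams through a document and keeps only the token strings, so the rest of the document is never held in memory. The sample input is a heavily simplified, hypothetical TCF-like fragment, and the element names it uses (D-Spin, TextCorpus, tokens, token) as well as the class and method names are assumptions made for illustration; check them against the TCF specification before relying on them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Sketch: pull token strings out of a TCF-like document with StAX. */
final class TcfTokenSketch {

    // Hypothetical, simplified TCF-like input (an assumption for this sketch);
    // real TCF documents carry more metadata and annotation layers.
    static final String SAMPLE =
          "<D-Spin xmlns=\"http://www.dspin.de/data\" version=\"0.4\">"
        + "<TextCorpus xmlns=\"http://www.dspin.de/data/textcorpus\" lang=\"de\">"
        + "<text>Karin fliegt.</text>"
        + "<tokens>"
        + "<token ID=\"t1\">Karin</token>"
        + "<token ID=\"t2\">fliegt</token>"
        + "<token ID=\"t3\">.</token>"
        + "</tokens>"
        + "</TextCorpus>"
        + "</D-Spin>";

    /** Returns the text content of every element whose local name is "token". */
    static List<String> readTokens(String xml) {
        List<String> tokens = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                // Only token elements are kept; all other events are skipped.
                if (event == XMLStreamConstants.START_ELEMENT
                        && "token".equals(reader.getLocalName())) {
                    // Reads the element's text and moves to its end tag.
                    tokens.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the input as XML", e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(readTokens(SAMPLE));
    }
}
```

Even this small reader has to know the element names and handle the XML events itself, and writing a new annotation layer back into the document would take considerably more code. This is the work that the binding library is meant to take over.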

This tutorial is devoted to the second option and explains how to use the WLFXB binding library.


Getting started with WLFXB library

The WLFXB library is stored in the Clarin repository (http://catalog.clarin.eu/ds/nexus/content/repositories/Clarin/). You can use it in a Maven project. For example, in NetBeans, do the following:

  • Choose File -> New Project -> Maven -> Java Application and click Next.
  • Fill in the name and location of the project and click Finish.
  • Add the Clarin repository to the project's pom.xml file, inside its <repositories> element:
       <repository>
           <id>clarin</id>
           <url>http://catalog.clarin.eu/ds/nexus/content/repositories/Clarin/</url>
       </repository>
  • Add the wlfxb dependency to the project's pom.xml file, inside its <dependencies> element (change the version to the latest available version of wlfxb):
       <dependency>
           <groupId>eu.clarin.weblicht</groupId>
           <artifactId>wlfxb</artifactId>
           <version>1.2.7</version>
       </dependency>

Two use cases are described in the tutorial. Proceed to the one closer to your own use case:

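  • Consuming and producing linguistic annotations from/into TCF: you have TCF input with linguistic annotations, and you want to process these annotations and add new linguistic annotations to the TCF document. This is the most common scenario for a WebLicht service.
  • Producing linguistic annotations in TCF from scratch: you want to produce data with linguistic annotations in TCF from scratch. Within WebLicht, this scenario applies when you are implementing a converter from another format to TCF. It also applies when you want to make your own data or corpus available to users in TCF.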

Consuming and producing linguistic annotations from/into TCF

Producing linguistic annotations in TCF from scratch

General remarks on the WLFXB library usage