Reading and writing TCF

From WebLichtWiki

Revision as of 18:29, 11 December 2013 by Yana (Talk | contribs)


Introduction

In order to use the output of a WebLicht tool, or to participate in WebLicht's tool chaining, one must be able to read and/or write TCF. There are two options available:

  • The first option is to parse the TCF document directly: extract the necessary annotations and, if building a service, process them to build a new annotation layer and add it to the document according to the TCF specification. For creating and parsing TCF, one can use a DOM parser or a StAX parser. A StAX parser is preferable in most cases: a DOM parser holds the entire document in memory, while with a StAX parser the developer decides which pieces of information are kept and how, giving fine control over parsing efficiency and memory use.
  • The second option is to use WLFXB, the TCF binding library that we offer to developers and users of WebLicht services. It abstracts away the concrete TCF format and its parsing. The library binds linguistic data represented by XML elements and attributes in TCF to linguistic data represented by Java objects, accessed from a WLData object or from its components: a TextCorpus object (for text-corpus-based annotations) or a Lexicon object (for dictionary-based annotations). The library interface allows easy access to the linguistic annotations in TCF and easy addition of new annotations. Thus, one can access or create linguistic annotations directly in a TextCorpus or Lexicon Java object without dealing with XML; the TCF document is read from and written to XML automatically.
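To illustrate the first option, here is a minimal StAX sketch that pulls token strings out of a TCF-like tokens fragment. The element name `token` follows the TCF schema, but the input string and class name are illustrative assumptions, not part of the TCF specification; a real service would read the full document from a stream and also track token IDs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class TcfTokenReader {

    // Collect the character content of every <token> element.
    public static List<String> readTokens(String xml) throws XMLStreamException {
        List<String> tokens = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        boolean inToken = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "token".equals(reader.getLocalName())) {
                inToken = true;
            } else if (event == XMLStreamConstants.CHARACTERS && inToken) {
                tokens.add(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "token".equals(reader.getLocalName())) {
                inToken = false;
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<tokens><token ID=\"t1\">Karin</token>"
                   + "<token ID=\"t2\">fliegt</token></tokens>";
        System.out.println(readTokens(xml)); // prints [Karin, fliegt]
    }
}
```

Because the reader processes one event at a time, only the tokens list itself accumulates in memory, which is the efficiency advantage over DOM mentioned above.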

This tutorial is devoted to the second option and explains how to use the WLFXB binding library.

Getting started with WLFXB library

The WLFXB library is stored in the Clarin repository [1]. You can use it in a Maven project. For example, in NetBeans, do the following:

  • File -> New Project -> Maven -> Java Application and click Next
  • fill in the name and location of the project and click Finish
  • add the Clarin repository to the project's pom.xml file:
       <repository>
           <id>clarin</id>
           <url>http://catalog.clarin.eu/ds/nexus/content/repositories/Clarin/</url>
       </repository>
  • add the wlfxb dependency to the project's pom.xml file (change the version to the latest available version of wlfxb):
       <dependency>
           <groupId>eu.clarin.weblicht</groupId>
           <artifactId>wlfxb</artifactId>
           <version>1.2.7</version>
       </dependency>

Two use cases are described in the tutorial. Proceed to the one more similar to your own use case:

  • Consuming and producing linguistic annotations from/into TCF: you want to read annotations from a TCF document, process them, and add new annotation layers to the same document. Within WebLicht this is the scenario for a service that participates in tool chaining.
  • Producing linguistic annotations in TCF from scratch: you want to produce data with linguistic annotations in TCF from scratch. Within WebLicht this scenario applies when you are implementing a converter from another format to TCF. It is also applicable when you want to make your own data or corpus available to users in TCF format.

Consuming and producing linguistic annotations from/into TCF

Producing linguistic annotations in TCF from scratch

General remarks on the WLFXB library usage

In a way similar to that shown in this tutorial, you can add or access any other annotation layers of the document. Simplified examples for all the layers are available as test cases that you can download from the Clarin repository [2]. The TextCorpusLayerTag and LexiconLayerTag enumerations represent all the linguistic layers available in TCF 0.4. If you want to develop a service that produces a layer other than those listed, please contact us with your suggestion.


In general, the approach of the wlfxb library is the following:

  • create/access layers from the corresponding TextCorpus/Lexicon object using the following implementations:
    • if you are creating the document from scratch, you would commonly use the TextCorpusStored/LexiconStored implementation and then write the document using WLDObjector and WLData
    • if you are reading all the linguistic data from the document, you would commonly use WLDObjector, get a TextCorpusStored/LexiconStored object from it, and then get the annotation layers from that object
    • if you are reading only particular annotation layers from the document, you would commonly use the TextCorpusStreamed/LexiconStreamed implementation, so that only the layers you request are loaded into memory
  • a layer's methods cannot be accessed unless you have read that layer from the input or created it yourself
  • no more than one layer of a given type can be created within a TCF document
  • create and add layer annotations using the methods of the corresponding layer object
  • access the annotations of a layer A via that layer A, e.g.:
       // get the first token
       Token token = textCorpus.getTokensLayer().getToken(0);
       // get the first sentence
       Sentence sentence = textCorpus.getSentencesLayer().getSentence(0);
       // get the first entry
       Entry entry = lexicon.getEntriesLayer().getEntry(0);
       // get the first frequency annotation
       Frequency freq = lexicon.getFrequenciesLayer().getFrequency(0);   
  • access the annotations of another layer B that are referenced from layer A via that same layer A, e.g.:
       //get all the tokens of the sentence
       Token[] tokens = textCorpus.getSentencesLayer().getTokens(sentence);
       //get tokens of the part-of-speech annotation
       Token[] tokens = textCorpus.getPosTagsLayer().getTokens(pos);
        //get entry of the frequency annotation
        Entry entry = lexicon.getFrequenciesLayer().getEntry(freq);
       //get entries of the part-of-speech annotation
       Entry[] entries = lexicon.getPosTagsLayer().getEntries(pos);
       
  • it is necessary to close the TextCorpusStreamed object after you finish reading/writing the required annotations
  • use WLDObjector to write the object in TCF format if you work with a TextCorpusStored/LexiconStored object; if you pass streams to the WLDObjector methods, you must close the streams yourself.
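Putting the points above together, a from-scratch workflow might look like the following sketch. It assumes the wlfxb API as used in the snippets above (TextCorpusStored, WLData, WLDObjector) plus creation methods such as createTextLayer(), createTokensLayer(), and addToken(); the package paths, the exact write() overload, and the sample text are assumptions here, so verify them against the Javadoc of your wlfxb version before relying on them.

```java
import eu.clarin.weblicht.wlfxb.io.WLDObjector;
import eu.clarin.weblicht.wlfxb.tc.api.TokensLayer;
import eu.clarin.weblicht.wlfxb.tc.xb.TextCorpusStored;
import eu.clarin.weblicht.wlfxb.xb.WLData;
import java.io.File;

public class TcfFromScratch {
    public static void main(String[] args) throws Exception {
        // create a corpus container for German text
        TextCorpusStored textCorpus = new TextCorpusStored("de");
        // the text layer holds the raw text
        textCorpus.createTextLayer().addText("Karin fliegt nach New York.");
        // the tokens layer: one Token object per token
        TokensLayer tokens = textCorpus.createTokensLayer();
        for (String t : "Karin fliegt nach New York .".split(" ")) {
            tokens.addToken(t);
        }
        // wrap the corpus in WLData and let WLDObjector serialize it as TCF
        WLData wlData = new WLData(textCorpus);
        WLDObjector.write(wlData, new File("out.xml"));
    }
}
```

Note that, per the rules above, each layer is created exactly once, and annotations are added only through that layer's own methods; since a File rather than a stream is passed to WLDObjector here, no manual closing is needed.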