= Reading and writing TCF =

= Introduction =

In order to use an output of a WebLicht tool or to participate in WebLicht's tool-chaining, one must be able to read and/or write [[The_TCF_Format|TCF]]. There are two options available:

*The first option is to parse the TCF document, extract the necessary annotations and, if building a service, process them to build a new annotation layer and add the new annotations to the document according to the TCF specification. For creating and parsing TCF, one can use a DOM parser or a StAX parser. A StAX parser is preferable in most cases: a DOM parser holds the entire document in memory, while with a StAX parser the developer can decide which pieces of information are saved and how, giving fine control over parsing efficiency and memory use.

*The second option is to use the TCF binding library WLFXB, which we offer to the developers and users of WebLicht services. It abstracts away the particular format of TCF and its parsing. The library binds linguistic data represented by XML elements and attributes in TCF to linguistic data represented by Java objects, accessed from a WLData object or from its components: a TextCorpus object (for text corpus based annotations) or a Lexicon object (for dictionary based annotations). The library interface allows easy access to the linguistic annotations in TCF and easy addition of new linguistic annotations. Thus, one can access/create linguistic annotations from/in a TextCorpus or Lexicon Java object directly, without dealing with XML; the TCF document is read from/written to XML automatically.

This tutorial is devoted to the second option and explains how to use the WLFXB binding library.

= Getting started with WLFXB library =

The WLFXB library is available on [https://github.com/weblicht/wlfxb GitHub] and as a dependency on [https://search.maven.org/artifact/eu.clarin.weblicht/wlfxb Maven]. To use it in a Maven project, for example in NetBeans, do the following:

* '''File''' -> '''New Project''' -> '''Maven''' -> '''Java Application''' and click '''Next'''

* fill in the name and location of the project and click '''Finish'''
[[File:Wlfxb-create-project.png]]

* add the wlfxb dependency to the project's pom.xml file (change the version to the latest version of wlfxb available at https://search.maven.org/artifact/eu.clarin.weblicht/wlfxb):

    <dependencies>
        <dependency>
            <groupId>eu.clarin.weblicht</groupId>
            <artifactId>wlfxb</artifactId>
            <version>1.4.3</version>
        </dependency>
    </dependencies>


Two use cases are described in this tutorial. Proceed to the one that is closer to your own use case:

*[[#Consuming_and_producing_linguistic_annotations_from.2Finto_TCF | Consuming and producing linguistic annotations from/into TCF]]: you have a TCF input with linguistic annotations, you want to process these annotations and produce new linguistic annotations into the TCF document. This is the most common scenario for a WebLicht service.

*[[#Producing_linguistic_annotations_in_TCF_from_scratch | Producing linguistic annotations in TCF from scratch]]: you want to produce data with linguistic annotations in TCF from scratch. Within WebLicht this scenario applies when you are implementing a converter from another format to TCF. It is also applicable when you want to make your own data/corpus available to users in TCF format.

= Consuming and producing linguistic annotations from/into TCF =

Let's consider the example of a part-of-speech tagger that processes tokens sentence by sentence and annotates them with part-of-speech tags. Our tagger will be very naive: it will annotate all tokens with the NN tag. A real tagger would be more intelligent and would assign tags based on the token string, the sentence context, and the previous tags. Since our tagger works sentence by sentence, the input TCF should contain at least the tokens and sentences layers:

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://de.clarin.eu/images/weblicht-tutorials/resources/tcf-04/schemas/latest/d-spin_0_4.rnc"
            type="application/relax-ng-compact-syntax"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
    <MetaData xmlns="http://www.dspin.de/data/metadata"></MetaData>
    <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
        <text>Karin fliegt nach New York. Sie will dort Urlaub machen.</text>
        <tokens>
            <token ID="t_0">Karin</token>
            <token ID="t_1">fliegt</token>
            <token ID="t_2">nach</token>
            <token ID="t_3">New</token>
            <token ID="t_4">York</token>
            <token ID="t_5">.</token>
            <token ID="t_6">Sie</token>
            <token ID="t_7">will</token>
            <token ID="t_8">dort</token>
            <token ID="t_9">Urlaub</token>
            <token ID="t_10">machen</token>
            <token ID="t_11">.</token>
        </tokens>
        <sentences>
            <sentence tokenIDs="t_0 t_1 t_2 t_3 t_4 t_5"></sentence>
            <sentence tokenIDs="t_6 t_7 t_8 t_9 t_10 t_11"></sentence>
        </sentences>
    </TextCorpus>
</D-Spin>
</pre>

In the project that you created in the [[#Getting_started_with_WLFXB_library | previous section]], create the class ''POSTaggerForTextCorpus'':

[[File:Wlfxb-create-postagger.png]]

<pre>
package clarind.my.project;

import eu.clarin.weblicht.wlfxb.io.TextCorpusStreamed;
import eu.clarin.weblicht.wlfxb.io.WLFormatException;
import eu.clarin.weblicht.wlfxb.tc.api.PosTagsLayer;
import eu.clarin.weblicht.wlfxb.tc.api.Sentence;
import eu.clarin.weblicht.wlfxb.tc.api.Token;
import eu.clarin.weblicht.wlfxb.tc.xb.TextCorpusLayerTag;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.EnumSet;

public class POSTaggerForTextCorpus {

    public void process(InputStream is, OutputStream os) throws WLFormatException {
        // specify layers to be read for processing
        EnumSet<TextCorpusLayerTag> layersToRead = EnumSet.of(TextCorpusLayerTag.TOKENS, TextCorpusLayerTag.SENTENCES);
        // create TextCorpus object that reads the specified layers into memory
        TextCorpusStreamed textCorpus = new TextCorpusStreamed(is, layersToRead, os);
        // create empty pos tags layer for the part-of-speech annotations to be added
        PosTagsLayer posLayer = textCorpus.createPosTagsLayer("STTS");
        // iterate over sentences
        for (int i = 0; i < textCorpus.getSentencesLayer().size(); i++) {
            // access each sentence
            Sentence sentence = textCorpus.getSentencesLayer().getSentence(i);
            // access tokens of each sentence
            Token[] tokens = textCorpus.getSentencesLayer().getTokens(sentence);
            for (int j = 0; j < tokens.length; j++) {
                // add part-of-speech annotation to each token
                posLayer.addTag("NN", tokens[j]);
            }
        }
        // write new annotations into the output and close the TCF streams
        textCorpus.close();
    }

    public static void main(String[] args) throws java.io.IOException, WLFormatException {
        if (args.length != 2) {
            System.out.println("Provide args:");
            System.out.println("PATH_TO_INPUT_TCF PATH_TO_OUTPUT_TCF");
            return;
        }
        FileInputStream fis = null;
        FileOutputStream fos = null;
        try {
            fis = new FileInputStream(args[0]);
            fos = new FileOutputStream(args[1]);
            POSTaggerForTextCorpus tagger = new POSTaggerForTextCorpus();
            tagger.process(fis, fos);
        } finally {
            if (fis != null) {
                fis.close();
            }
            if (fos != null) {
                fos.close();
            }
        }
    }
}
</pre>

Compile and run the ''POSTaggerForTextCorpus'' class, providing two arguments: the input and output TCF files. For example, in NetBeans, right-click on the project, select '''Properties''' -> '''Run''', specify the main class and the arguments, and click '''Finish''':

[[File:Wlfxb-run-postagger.png]]

Then right-click on the project and select '''Run''':

[[File:Wlfxb-run-project.png]]


Inspect the output file you provided as the second argument to the class. Running the project should produce valid TCF with the POStags layer added:

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://de.clarin.eu/images/weblicht-tutorials/resources/tcf-04/schemas/latest/d-spin_0_4.rnc"
            type="application/relax-ng-compact-syntax"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
    <MetaData xmlns="http://www.dspin.de/data/metadata"></MetaData>
    <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
        <text>Karin fliegt nach New York. Sie will dort Urlaub machen.</text>
        <tokens>
            <token ID="t_0">Karin</token>
            <token ID="t_1">fliegt</token>
            <token ID="t_2">nach</token>
            <token ID="t_3">New</token>
            <token ID="t_4">York</token>
            <token ID="t_5">.</token>
            <token ID="t_6">Sie</token>
            <token ID="t_7">will</token>
            <token ID="t_8">dort</token>
            <token ID="t_9">Urlaub</token>
            <token ID="t_10">machen</token>
            <token ID="t_11">.</token>
        </tokens>
        <sentences>
            <sentence tokenIDs="t_0 t_1 t_2 t_3 t_4 t_5"></sentence>
            <sentence tokenIDs="t_6 t_7 t_8 t_9 t_10 t_11"></sentence>
        </sentences>
        <POStags tagset="STTS">
            <tag tokenIDs="t_0">NN</tag>
            <tag tokenIDs="t_1">NN</tag>
            <tag tokenIDs="t_2">NN</tag>
            <tag tokenIDs="t_3">NN</tag>
            <tag tokenIDs="t_4">NN</tag>
            <tag tokenIDs="t_5">NN</tag>
            <tag tokenIDs="t_6">NN</tag>
            <tag tokenIDs="t_7">NN</tag>
            <tag tokenIDs="t_8">NN</tag>
            <tag tokenIDs="t_9">NN</tag>
            <tag tokenIDs="t_10">NN</tag>
            <tag tokenIDs="t_11">NN</tag>
        </POStags>
    </TextCorpus>
</D-Spin>
</pre>


Now let's see how the TCF input document is processed and the output document is produced. First, we create a ''TextCorpusStreamed'' object that automatically handles the TCF reading/writing. To create a ''TextCorpusStreamed'' object that consumes input TCF and produces output TCF, you need to specify the annotation layers you want to read from the TCF input.

In this example we read only the token and sentence annotations, in order to assign part-of-speech tags. Only the layers you have specified are read into memory; other annotation layers present in the input are skipped. For example, our sample input also contains a text layer, but it is not loaded into memory.

<pre>
        // specify layers to be read for processing
        EnumSet<TextCorpusLayerTag> layersToRead = EnumSet.of(TextCorpusLayerTag.TOKENS, TextCorpusLayerTag.SENTENCES);
        // create TextCorpus object that reads the specified layers into memory
        TextCorpusStreamed textCorpus = new TextCorpusStreamed(is, layersToRead, os);
</pre>

Then we create the part-of-speech tags layer (empty so far) and specify the part-of-speech tagset used:

<pre>
        // create empty pos tags layer for the part-of-speech annotations to be added
        PosTagsLayer posLayer = textCorpus.createPosTagsLayer("STTS");
</pre>

Then we can get the data from the layers that were specified for reading; in our case, these are only the sentences and tokens layers. If you try to get a layer that was not specified when constructing the ''TextCorpusStreamed'' object, or a layer that is not present in the input, you will get an exception.
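
For example, accessing the lemmas layer here would fail, since it was not requested (a hypothetical illustration; ''LemmasLayer'' is one of the other layer types in wlfxb):

<pre>
        // LEMMAS was not included in layersToRead, so this access
        // would result in an exception:
        // LemmasLayer lemmas = textCorpus.getLemmasLayer();
</pre>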

Then we iterate over the sentences and their tokens and assign a part-of-speech tag to each token, adding it to the TCF:

<pre>
        for (int i = 0; i < textCorpus.getSentencesLayer().size(); i++) {
            // access each sentence
            Sentence sentence = textCorpus.getSentencesLayer().getSentence(i);
            // access tokens of each sentence
            Token[] tokens = textCorpus.getSentencesLayer().getTokens(sentence);
            for (int j = 0; j < tokens.length; j++) {
                // add part-of-speech annotation to each token
                posLayer.addTag("NN", tokens[j]);
            }
        }
</pre>
        
After we have added all the annotations we want into the ''TextCorpusStreamed'' object, we should close it; the TCF document is then ready:

<pre>
        textCorpus.close();
</pre>

It is important to call the ''close()'' method, so that the new annotations are written into the output stream and both the underlying input and output streams are closed.
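
To guarantee that ''close()'' is called even when the processing code throws, you can wrap the processing in a try/finally block (a minimal sketch based on the example above):

<pre>
    public void process(InputStream is, OutputStream os) throws WLFormatException {
        EnumSet<TextCorpusLayerTag> layersToRead =
                EnumSet.of(TextCorpusLayerTag.TOKENS, TextCorpusLayerTag.SENTENCES);
        TextCorpusStreamed textCorpus = new TextCorpusStreamed(is, layersToRead, os);
        try {
            // ... read the layers and add new annotations here ...
        } finally {
            // writes the pending annotations and closes both streams
            textCorpus.close();
        }
    }
</pre>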

= Producing linguistic annotations in TCF from scratch =

Let's consider how to create a new TCF document when you have no TCF input to process. For simplicity, in this example we will hard-code the linguistic annotations, and create a new TCF document containing them.

The linguistic annotations we are going to add to the TCF are text, token and sentence annotations for the German language ("de"). In the project that you created in the [[#Getting_started_with_WLFXB_library | previous section]], create the class ''CreatorTextTokensSentencesInTextCorpus'':

<pre>
package clarind.my.project;

import eu.clarin.weblicht.wlfxb.io.WLDObjector;
import eu.clarin.weblicht.wlfxb.io.WLFormatException;
import eu.clarin.weblicht.wlfxb.tc.api.SentencesLayer;
import eu.clarin.weblicht.wlfxb.tc.api.Token;
import eu.clarin.weblicht.wlfxb.tc.api.TokensLayer;
import eu.clarin.weblicht.wlfxb.tc.xb.TextCorpusStored;
import eu.clarin.weblicht.wlfxb.xb.WLData;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

public class CreatorTextTokensSentencesInTextCorpus {

    private static final String myText = "Karin fliegt nach New York. Sie will dort Urlaub machen.";
    private static final List<String[]> mySentences = new ArrayList<String[]>();

    static {
        mySentences.add(new String[]{"Karin", "fliegt", "nach", "New", "York", "."});
        mySentences.add(new String[]{"Sie", "will", "dort", "Urlaub", "machen", "."});
    }

    public void process(OutputStream os) throws WLFormatException {
        // create TextCorpus for German language
        TextCorpusStored textCorpus = new TextCorpusStored("de");
        // create text layer and add text
        textCorpus.createTextLayer().addText(myText);
        // create tokens layer
        TokensLayer tokensLayer = textCorpus.createTokensLayer();
        // create sentences layer
        SentencesLayer sentencesLayer = textCorpus.createSentencesLayer();
        for (String[] tokenizedSentence : mySentences) {
            // prepare a temporary list to store this sentence's tokens
            List<Token> sentenceTokens = new ArrayList<Token>();
            // iterate token by token
            for (String tokenString : tokenizedSentence) {
                // create token annotation and add it into the tokens annotation layer
                Token token = tokensLayer.addToken(tokenString);
                // add it into the temporary list that stores this sentence's tokens
                sentenceTokens.add(token);
            }
            // create sentence annotation and add it into the sentences annotation layer
            sentencesLayer.addSentence(sentenceTokens);
        }
        // write the created object with all its annotations as XML output in proper TCF format
        WLData wlData = new WLData(textCorpus);
        WLDObjector.write(wlData, os);
    }

    public static void main(String[] args) throws java.io.IOException, WLFormatException {
        if (args.length != 1) {
            System.out.println("Provide arg:");
            System.out.println("PATH_TO_OUTPUT_TCF");
            return;
        }
        FileOutputStream fos = null;
        try {
            fos = new FileOutputStream(args[0]);
            CreatorTextTokensSentencesInTextCorpus creator = new CreatorTextTokensSentencesInTextCorpus();
            creator.process(fos);
        } finally {
            if (fos != null) {
                fos.close();
            }
        }
    }
}
</pre>

Compile and run the ''CreatorTextTokensSentencesInTextCorpus'' class, providing one argument: the output TCF file. For example, in NetBeans, right-click on the project, select '''Properties''' -> '''Run''', specify ''CreatorTextTokensSentencesInTextCorpus'' as the main class and the output file path as the argument, and click '''Finish'''.

[[File:Wlfxb-run-tccreator.png]]

Then right-click on the project and select '''Run''':

[[File:Wlfxb-run-project.png]]


Inspect the output file you provided as the argument to the class. Running the project should produce valid TCF output:

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://de.clarin.eu/images/weblicht-tutorials/resources/tcf-04/schemas/latest/d-spin_0_4.rnc"
            type="application/relax-ng-compact-syntax"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
<MetaData xmlns="http://www.dspin.de/data/metadata"/>
<TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <text>Karin fliegt nach New York. Sie will dort Urlaub machen.</text>
    <tokens>
        <token ID="t_0">Karin</token>
        <token ID="t_1">fliegt</token>
        <token ID="t_2">nach</token>
        <token ID="t_3">New</token>
        <token ID="t_4">York</token>
        <token ID="t_5">.</token>
        <token ID="t_6">Sie</token>
        <token ID="t_7">will</token>
        <token ID="t_8">dort</token>
        <token ID="t_9">Urlaub</token>
        <token ID="t_10">machen</token>
        <token ID="t_11">.</token>
    </tokens>
    <sentences>
        <sentence tokenIDs="t_0 t_1 t_2 t_3 t_4 t_5"/>
        <sentence tokenIDs="t_6 t_7 t_8 t_9 t_10 t_11"/>
    </sentences>
</TextCorpus>
</D-Spin>
</pre>


Now let's look at ''CreatorTextTokensSentencesInTextCorpus'' in more detail to see how the TCF document is created. First, we create a ''TextCorpusStored'' object that automatically handles TCF TextCorpus creation from scratch. We specify the language of the data as German (de):

<pre>
        // create TextCorpus for German language
        TextCorpusStored textCorpus = new TextCorpusStored("de");
</pre>

Now we can add annotation layers to the document. This is how to add a text layer to the TCF:

<pre>
        // create text layer and add text
        textCorpus.createTextLayer().addText(myText);
</pre>

In the case of more complex annotation layers, you first need to create the object that corresponds to the annotation layer, and then create/add the annotations themselves one by one:

<pre>
        TokensLayer tokensLayer = textCorpus.createTokensLayer();
        tokensLayer.addToken(tokenString1);
        tokensLayer.addToken(tokenString2);
</pre>

When all the annotations are added, we write the created object with all its annotations as XML output in proper TCF format:

<pre>
        WLData wlData = new WLData(textCorpus);
        WLDObjector.write(wlData, os);
</pre>

After that, the TCF document is ready.

= General remarks on the WLFXB library usage =

In the same way as shown in this tutorial, you can add or access any other annotation layers in the document. Simplified examples for all the layers are available as test cases in the source code repository on [https://github.com/weblicht/wlfxb/tree/master/src/test/resources/data GitHub]. The TextCorpusLayerTag and LexiconLayerTag enumerations represent all the linguistic layers available in TCF 0.4. If you want to develop a service that produces a layer other than the ones listed, please contact us with the suggestion.
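
For example, you can print the names of all text corpus layer types known to the library (a trivial sketch):

<pre>
        for (TextCorpusLayerTag layerTag : TextCorpusLayerTag.values()) {
            System.out.println(layerTag);
        }
</pre>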


In general, the approach of the wlfxb library is the following:

* create/access layers from the corresponding TextCorpus/Lexicon object using the following implementations:
**if you are creating the document from scratch, you would commonly use the TextCorpusStored/LexiconStored implementation and then write the document using WLDObjector and WLData
**if you are reading all the linguistic data from the document, you would commonly use WLDObjector, get a TextCorpusStored/LexiconStored object from it, and then get the annotation layers from it (see the sketch at the end of this list)
**if you are reading only particular annotation layers from the document, you would commonly use the TextCorpusStreamed/LexiconStreamed implementation, so that only the layers you request are loaded into memory

*a layer's methods cannot be accessed if you did not read the layer from the input or create the layer yourself

*no more than one layer of a given type can be created within a TCF document

*create and add layer annotations by using the methods from the corresponding layer object

*access the annotations of a layer A via that layer A, e.g.:

        // get the first token
        Token token = textCorpus.getTokensLayer().getToken(0);
        // get the first sentence
        Sentence sentence = textCorpus.getSentencesLayer().getSentence(0);
        // get the first entry
        Entry entry = lexicon.getEntriesLayer().getEntry(0);
        // get the first frequency annotation
        Frequency freq = lexicon.getFrequenciesLayer().getFrequency(0);   

*access the annotations of another layer B referenced from a layer A via that layer A as well, e.g.:

        //get all the tokens of the sentence
        Token[] tokens = textCorpus.getSentencesLayer().getTokens(sentence);
        //get tokens of the part-of-speech annotation
        Token[] tokens = textCorpus.getPosTagsLayer().getTokens(pos);
        //get entry of the frequency annotation
        Entry entry = lexicon.getFrequenciesLayer().getEntry(freq);
        //get entries of the part-of-speech annotation
        Entry[] entries = lexicon.getPosTagsLayer().getEntries(pos);
        
*it is necessary to close a TextCorpusStreamed object after you finish reading/writing the required annotations

*use WLDObjector for writing the object in TCF format if you work with a TextCorpusStored/LexiconStored object; if you pass the streams to the WLDObjector methods, you should close the streams yourself
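
As a closing illustration of the WLDObjector-based reading and writing mentioned in this list, the following sketch reads a whole TCF document into memory and writes it back (the exact method names ''read'' and ''getTextCorpus'' are assumptions based on the description above):

        // read the whole TCF document into memory
        WLData wlData = WLDObjector.read(is);
        TextCorpusStored textCorpus = wlData.getTextCorpus();
        // ... access or add annotation layers here ...
        // write the document back in TCF format
        WLDObjector.write(wlData, os);
        // when passing streams to WLDObjector, close them yourself
        is.close();
        os.close();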

= References Identifier Service =

=== Introduction ===

This tutorial presents a workflow for creating a webservice for TCF processing. It imitates a reference identifier service. The service accepts POST requests containing TCF data with tokens, part-of-speech and named entity annotation layers, and processes these annotations to produce reference annotations.


This webservice imitates the case where the processing tool requires a model for identifying the references. Since a model can consume a lot of memory and/or take a long time to load, the tool instance is created only once (and the corresponding model is loaded only once), when the application is created. The example shows the case where the tool is thread-safe: it can be shared among clients without any synchronization.

=== Prerequisites ===

The tutorial assumes you have the following software installed:

* [http://netbeans.org/ NetBeans IDE 7.2.1]
* wget or curl command line tool (optional)

=== Adding the Clarin Repository ===
The example WebLicht service is provided as a Maven archetype stored in the Clarin Repository. Therefore, you'll need to add the Clarin Repository to your list of Maven repositories. Skip this step if the Clarin Repository is already among your Maven repositories.


In NetBeans IDE, go to the list of Maven Repositories under the Services tab:

[[File:maven-repositories-service.png]]

Right-click on "Maven Repositories" and select the "Add Repository" option. Fill in the following information in the "Add Repository" window:

* Repository ID: clarin
* Repository Name: Clarin Repo
* Repository URL: http://catalog.clarin.eu/ds/nexus/content/repositories/Clarin/

Finish by pressing "Add".

[[File:adding-clarin-repository.png]]


=== Creating a Project from an Archetype ===

Once the Clarin Repository is accessible, we can start using the archetype.
Press the "New Project" button in the menu bar and select: Maven -> Project From Archetype

[[File:new-project-from-archetype.png]]

In the next screen, find and select "WebLicht References Webservice Archetype".

[[File:select-archetype-references.png]]

Provide a name for your project and a directory to store it in, as you would normally do with any NetBeans project. In addition, you can provide a group name for your Maven artifact and the package name you would like to use.

[[File:project-name-location.png]]

That's it! You have just created a WebLicht webservice.

[[File:references-project-created.png]]

=== Testing Webservices ===

To test the service, run it on your local server. Right-click on the project and select the "Run" option. In the next screen, select the Tomcat server and click the OK button.

[[File:localserver-deploy.png]]


The most straightforward way to test a webservice is to use the wget or curl command line tool. For example, to POST the TCF data from "input.xml" to the service and display the service output in the terminal window, run curl:


<nowiki> curl -H 'content-type: text/tcf+xml' --data-binary @input.xml -X POST "http://localhost:8080/mywlproject/annotate/"</nowiki>


Or wget:

<nowiki> wget --post-file=input.xml --header='Content-Type: text/tcf+xml' "http://localhost:8080/mywlproject/annotate/"</nowiki>


Make sure the current directory actually contains a file named "input.xml" in TCF 0.4 format with tokens, part-of-speech and named entity annotation layers. Such a file, provided for testing, is located under "Web Pages" in your project; just copy it to your current directory.

=== What's next? ===

Of course, you will probably want to customize the provided code. Let's take a look at the files in the project:


* ReferencesService.java


:Here the application definition resides; use it to define the path to your application and/or to add more resources. In this example, the ReferencesResource resource is added as a singleton resource, which means that only one instance of the resource is created for the application.
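
:A minimal sketch of what such an application class can look like (illustrative only; the archetype's generated code may register the resource differently):

<pre>
import java.util.HashSet;
import java.util.Set;
import javax.ws.rs.core.Application;

public class ReferencesService extends Application {

    @Override
    public Set<Object> getSingletons() {
        // one ReferencesResource instance is shared by all requests
        Set<Object> singletons = new HashSet<Object>();
        singletons.add(new ReferencesResource());
        return singletons;
    }
}
</pre>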


* ReferencesResource.java


:This is the definition of a resource; if more resources are required, you can use it as a template for further resources (don't forget to add them to ReferencesService.java). Since the resource is registered as a singleton resource, only one instance of it is created per application. The resource initializes the TextCorpusProcessor tool used for processing (in this case, a ReferencesTool object) in its constructor, so that only one instance of the tool is created per application as well. This is useful when the tool used for processing consumes a lot of memory and/or takes a long time to load. The resource method annotated with @POST processes client requests containing the TCF input and sends a response to the client with the TCF output. For that, it initializes a TextCorpusStreamed object requesting the layers of interest and uses the ReferencesTool object to identify the references and create reference annotations in TCF. It also takes care of catching exceptions and sending an HTTP error code with a short cause message in case an exception occurs during processing.
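
:The pattern looks roughly like this (a sketch, not the archetype's actual code; the layer tag constants and the tool's process() signature are assumptions):

<pre>
import java.io.InputStream;
import java.io.OutputStream;
import java.util.EnumSet;
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.Response;
import javax.ws.rs.core.StreamingOutput;
import eu.clarin.weblicht.wlfxb.io.TextCorpusStreamed;
import eu.clarin.weblicht.wlfxb.tc.xb.TextCorpusLayerTag;

@Path("annotate")
public class ReferencesResource {

    // created once per application, so the model is loaded only once
    private final ReferencesTool tool = new ReferencesTool();

    @POST
    @Consumes("text/tcf+xml")
    @Produces("text/tcf+xml")
    public StreamingOutput process(final InputStream input) {
        return new StreamingOutput() {
            @Override
            public void write(OutputStream output) {
                try {
                    // read only the layers the tool needs
                    EnumSet<TextCorpusLayerTag> layers = EnumSet.of(
                            TextCorpusLayerTag.TOKENS,
                            TextCorpusLayerTag.POSTAGS,
                            TextCorpusLayerTag.NAMED_ENTITIES);
                    TextCorpusStreamed textCorpus =
                            new TextCorpusStreamed(input, layers, output);
                    // the tool is thread-safe, so concurrent requests share it
                    tool.process(textCorpus);
                    textCorpus.close();
                } catch (Exception e) {
                    // send an HTTP error code with a short cause message
                    throw new WebApplicationException(
                            Response.status(Response.Status.BAD_REQUEST)
                                    .entity(e.getMessage()).build());
                }
            }
        };
    }
}
</pre>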


* ReferencesTool.java


:Here the actual implementation of the tool resides. This template provides an imitation of a reference detector. If you are writing a webservice wrapper for an already existing tool, this is where you would call your tool, translating the input/output data from/into the TCF format; the wlfxb library can be of help here, as it is in this resource implementation. In this example, the tool loads an imitation of a model in its constructor. The tool provides a process() method that takes a TCF document with the layers of interest, uses the loaded model to identify the references in the document, and adds the identified references as a new annotation layer to the TCF document. This example imitates a thread-safe implementation of the tool, which means that client requests can share the same tool object and no synchronization is required to call the tool's process() method.