Creating a WebLicht Web Service

This tutorial is designed to assist developers who would like to create an NLP tool and integrate it into WebLicht. The instructions below will guide you through the process: creating a WebLicht web service project, building it, and modifying it to create a real service. When the service is ready for production, it can be registered for use in WebLicht.

1 Introduction
2 Web Service Tutorial
3 Integration into WebLicht

Introduction

A WebLicht service is simply a synchronous REST-style web service. The client establishes an HTTP connection to the service and initiates a POST request containing the input data. Depending on the service, the client can also set various processing parameters using the query string in the URL. The service processes the input data synchronously and returns the result as output data, using the same HTTP connection.
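
The request/response cycle described above can be sketched from the client side with plain java.net classes. This is a minimal illustration, not part of the starter project: the class name TcfClientSketch and its methods are hypothetical, and the text/tcf+xml media type is an assumption that should be checked against the service being called.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

class TcfClientSketch {

    // Appends processing parameters to the service URL as a query string.
    static String withQuery(String serviceUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(serviceUrl);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator).append(encode(p.getKey()))
               .append('=').append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Sends the input as the POST body and reads the result from the same connection.
    static byte[] post(String url, byte[] tcfInput) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            // Assumed media type for TCF; verify what the service expects.
            conn.setRequestProperty("Content-Type", "text/tcf+xml");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(tcfInput);
            }
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Service returned HTTP " + status);
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream result = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    result.write(buffer, 0, n);
                }
                return result.toByteArray();
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: TcfClientSketch <service-url> <tcf-file>");
            return;
        }
        byte[] input = Files.readAllBytes(Paths.get(args[1]));
        System.out.write(post(args[0], input));
        System.out.flush();
    }
}
```

The client blocks in post() until the service has finished, which is what "synchronous" means here: there is no job identifier or polling step, only one request and one response.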

In order for WebLicht services to be interoperable with each other, they use a common machine-readable, XML-based exchange format called TCF (Text Corpus Format). A typical WebLicht service processes annotation layers in the input TCF to produce new linguistic annotations, and outputs TCF with these annotations added as new layers. The WebLicht tool-chaining architecture imposes a few restrictions on a service's TCF output. A WebLicht service is not permitted to change the linguistic annotations in the input TCF; it can only add new linguistic annotation layers to the document. A WebLicht service is also not allowed to add an annotation layer that already exists in the input document.
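
The two output restrictions can be pictured as a simple check on layer names. The sketch below is only an illustration: it models a document as a set of layer-name strings, which is a simplification, and the class LayerRuleSketch and the layer names used are hypothetical rather than taken from any TCF library.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class LayerRuleSketch {

    // A service may add a layer only if the input document does not already contain it.
    static boolean mayAdd(Set<String> inputLayers, String newLayer) {
        return !inputLayers.contains(newLayer);
    }

    // Returns the layers of the output document, or throws if the rule is violated.
    static Set<String> addLayer(Set<String> inputLayers, String newLayer) {
        if (!mayAdd(inputLayers, newLayer)) {
            throw new IllegalArgumentException("Layer already exists in input: " + newLayer);
        }
        // All input layers are carried over unchanged; only the new layer is appended.
        Set<String> outputLayers = new LinkedHashSet<String>(inputLayers);
        outputLayers.add(newLayer);
        return outputLayers;
    }

    public static void main(String[] args) {
        Set<String> input = new LinkedHashSet<String>(Arrays.asList("text", "tokens"));
        System.out.println(addLayer(input, "namedEntities"));
        System.out.println(mayAdd(input, "tokens"));
    }
}
```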

Web Service Tutorial

This tutorial will guide you in creating a WebLicht web service. You will create a project from an archetype, or template, and make the necessary changes to get it working for your tool. The project comes with some basic functionality that can easily be used as the basis for your own project.

Prerequisites

Java 6 or above
Maven 2.0.10 or above, for building the project and managing dependencies
wget or curl, for invoking the service from the command line
rpm, for building the service as a Linux package
- Maven, wget, curl, and rpm can all be installed with most package managers, such as brew (for Mac) or apt-get (for Linux). Package managers are also available for Windows.
An IDE such as IntelliJ or NetBeans for editing the project source code
Internet access for downloading the starter project and testing the service

Getting Started: Create a Project

The starter project is for a WebLicht web service that uses TCF for both input and output. The archetype is available from the Maven central repository. To create a project from the archetype, navigate on the command line to the directory where you want to store the project, then run the Maven command line tool (mvn) with the command shown below to download and set up the project. You can customize the following parameter values in the command:

groupId: the Java package to be created
artifactId: the name of the project

You will be asked a few questions during the creation of the project.

Use the default value for version (just hit return).
Use any value for serviceGroup and serviceUser (e.g. sdemo); these properties are needed to install the service on Linux.
Confirm your choices by hitting return at the Y: : prompt.

mvn archetype:generate -DarchetypeGroupId=eu.clarin.weblicht -DarchetypeArtifactId=weblicht-nentities-ws-dw-archetype -DarchetypeVersion=1.4 -DgroupId=my.org.weblicht -DartifactId=my-service

Open the newly created project in your IDE.

What Does the Demo Project Do?

The starter project contains three simple web services (tokenization/sentence splitting, references identifier, named entity recognizer), each demonstrating a different way of implementing a service. Which one you use as the basis for your project depends on the underlying tool and will be discussed later.

An endpoint of a service is the URL at which an operation is provided. In this case, the operations are NLP tools. This project exposes the following endpoints:

Tokenization and sentence splitting:

/toksentence/bytes
/toksentence/stream

References identifier:

/refidf/bytes
/refidf/stream

Named entity recognition:

/ner/bytes
/ner/stream

Build and Deploy the Project

Navigate to the root directory of your newly created project.

Build the project to make a runnable jar:

mvn clean package

You should now see that a directory called target was created, containing a file called my-service-1.0-SNAPSHOT.jar. If you used a different artifactId when creating the project, this will be reflected in the name of the jar file.

Deploy the new service locally:

java -jar target/my-service-1.0-SNAPSHOT.jar server

Once the application is started, it can be accessed using the following URL in your browser:

http://localhost:8080/

That page contains descriptions of the demo services. Follow the instructions there to test the endpoints of the services.

Choosing a Template Service

This demo covers three scenarios:

Resources or models are not required, or are small - use the tokenization example
Resources are required that are large or time-consuming to load
- The underlying tool is thread-safe - use the references example
- The underlying tool is not thread-safe - use the named entity example

In addition, each example also demonstrates two ways of returning the TCF output in the HTTP response: as a streamed output (recommended) and as an array of bytes. Streaming is only possible for output, not for input.

Using a byte array in the implementation is simple, but the entire output is held in memory at once, which could cause the server to run out of memory if the output is large.

Using streaming output in the implementation is slightly more complicated, but only part of the output is held in memory at any given time. This makes it possible to handle large output, and is the recommended way.
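
The memory trade-off between the two approaches can be sketched with plain java.io streams. This is an illustration only and does not reproduce the starter project's resource classes; the class OutputStrategySketch and the buffer sizes are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class OutputStrategySketch {

    // Byte-array style: the complete result is collected in memory before it is returned.
    static byte[] toByteArray(InputStream result) throws IOException {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        copy(result, all, 8192);
        return all.toByteArray();
    }

    // Streaming style: at most bufferSize bytes are held at a time, regardless of
    // how large the result is. Returns the number of bytes written.
    static long copy(InputStream result, OutputStream response, int bufferSize)
            throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = result.read(buffer)) != -1) {
            response.write(buffer, 0, n);
            total += n;
        }
        response.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<example/>".getBytes(StandardCharsets.UTF_8);
        long written = copy(new ByteArrayInputStream(data), System.out, 4);
        System.out.println();
        System.out.println(written + " bytes streamed");
    }
}
```

With toByteArray the memory needed grows with the size of the output, while copy needs only the fixed buffer, which is why the streaming variant scales to large documents.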

Exploring the Code

Open the project in your favorite IDE and have a look at the code. We'll start with a brief overview of each service and then we'll see what modifications need to be made.

Overview: Tokenization Example

This web service is an example of a simple tokenization and sentence boundary detection service. The service processes POST requests containing TCF data with text.

This web service demonstrates the case where the underlying processing tool object is not expensive to create and does not consume much memory. In such a case it is convenient to create the tool object for each client POST request, as shown in the service implementation. You don't need to worry about whether the underlying tool is thread-safe.

Overview: References Example

This web service is an example of a simple service that creates reference annotations based on previously annotated named entities. The service processes POST requests containing TCF data with tokens, PoS tags, and named entities.

This web service demonstrates the case where the underlying processing tool object requires a model for identifying the references. Since a model can consume a lot of memory and/or take a long time to load, the tool instance is created only once (the corresponding model is loaded only once), when the application is created. The example shows the case where the tool is thread-safe, so it can be shared among clients without any synchronization.
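
One common way for a tool to be thread-safe is to load its model once and never modify it afterwards. The sketch below assumes that design; the class ReferenceTool and its map-based "model" are hypothetical stand-ins and are not the demo project's implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SharedToolSketch {

    // Stand-in for a thread-safe tool: the model is copied once at construction
    // time and is read-only from then on.
    static final class ReferenceTool {
        private final Map<String, String> model;

        ReferenceTool(Map<String, String> loadedModel) {
            this.model = Collections.unmodifiableMap(new HashMap<String, String>(loadedModel));
        }

        // Reads only immutable state, so concurrent calls need no synchronization.
        String process(String entity) {
            String reference = model.get(entity);
            return reference == null ? "unknown" : reference;
        }
    }

    public static void main(String[] args) {
        Map<String, String> model = new HashMap<String, String>();
        model.put("Berlin", "city");
        // Created once (for example at application start-up) and shared by all requests.
        ReferenceTool tool = new ReferenceTool(model);
        System.out.println(tool.process("Berlin"));
        System.out.println(tool.process("somewhere else"));
    }
}
```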

Overview: Named Entity Recognition Example

This web service is an example of a simple named entity recognizer service. The service processes POST requests containing TCF data with tokens. It uses token annotations to produce named entity annotations.

It demonstrates the case where the tool must load a list of named entities, which it uses to identify them in the input. This is a common use case where a tool relies on resources such as a list, database, index, or machine learning model under the hood. Such resources can consume a lot of memory and/or take a long time to load. Therefore, it is better to create only one instance (or a limited number of instances) of the tool per application. In this example, the tool instance is created only once (i.e. the corresponding list resource is loaded only once), when the application is created. The tool is not thread-safe, so its process() method requires synchronization.
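
The effect of synchronizing process() on a shared, non-thread-safe tool can be sketched as follows. CountingTool is a hypothetical stand-in whose mutable counter plays the role of the internal state that makes a real tool unsafe to call concurrently; it is not the demo project's named entity tool.

```java
import java.util.ArrayList;
import java.util.List;

class SynchronizedToolSketch {

    // Stand-in for a tool that keeps mutable state between calls.
    static class CountingTool {
        private int processed = 0;

        // synchronized: only one request at a time may run inside the tool.
        synchronized int process(String input) {
            processed = processed + 1;
            return processed;
        }

        synchronized int processedCount() {
            return processed;
        }
    }

    // Simulates concurrent client requests against one shared tool instance.
    static int runConcurrently(CountingTool tool, int threads, int callsPerThread)
            throws InterruptedException {
        List<Thread> workers = new ArrayList<Thread>();
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(() -> {
                for (int call = 0; call < callsPerThread; call++) {
                    tool.process("token");
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return tool.processedCount();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runConcurrently(new CountingTool(), 8, 1000));
    }
}
```

The cost of this approach is that requests are processed one after another inside the tool, which is why a limited pool of instances can be worth considering for heavily used services.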

Code Organization

The Java source code is organized as follows:

default package

DemoApplication.java

- The main entry point of the application

DemoConfiguration.java

- Needed by the Dropwizard framework

core package

<Sample>Tool.java

- This is where the implementation of the tool takes place. The Tool class includes a getRequiredLayers method that returns the annotation layers that are required in the input data. The process method receives a TextCorpus object containing the required layers and adds annotation layers to it. The TextCorpus class comes from the wlfxb library for reading/writing TCF.

resources package

This is where the endpoints are declared and implemented. If a class is added here, it should also be registered in DemoApplication.java

IndexResource.java: - Implements logic for the landing page (as seen at, e.g. localhost:8080). There is not much going on here except serving up pages and files.
StreamingTempFileOutput.java: - A utility class for writing streamed output
<Sample>Resource.java: - Implements the service endpoint(s). For @POST requests, it reads the required TCF annotation layers from the input data into a TextCorpus object and passes it to the Tool's process method. If an error occurs, an HTTP error code is returned, along with a message indicating the cause.

src/main/resources

Resource files (models, lists, etc) for the service can be stored here.

index.html: - The service landing page, which should be replaced with the real service description and testing scenario.

assembly/conf/service-config.yaml: Service configuration file for parameters such as port number and log file location. During the testing phase, log files are written under the target directory. Use the Logger class to write messages to the log files.
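
The general shape of a Tool class, as described under the core package above, can be sketched with simplified stand-in types. Note that Corpus here is NOT the wlfxb TextCorpus class, the layer names are plain strings chosen for illustration, and WhitespaceTokenizerTool is hypothetical; only the method names getRequiredLayers and process come from the description above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class ToolShapeSketch {

    // Simplified stand-in for a TCF document with a text layer and a tokens layer.
    static class Corpus {
        final String text;
        final List<String> tokens = new ArrayList<String>();

        Corpus(String text) {
            this.text = text;
        }
    }

    // Hypothetical tool that requires the text layer and adds a tokens layer.
    static class WhitespaceTokenizerTool {

        // Declares which layers must be present in the input data.
        Set<String> getRequiredLayers() {
            return Collections.singleton("text");
        }

        // Reads the required layer and adds new annotations to the same corpus object.
        void process(Corpus corpus) {
            for (String token : corpus.text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    corpus.tokens.add(token);
                }
            }
        }
    }

    public static void main(String[] args) {
        Corpus corpus = new Corpus("A short example sentence.");
        WhitespaceTokenizerTool tool = new WhitespaceTokenizerTool();
        tool.process(corpus);
        System.out.println(tool.getRequiredLayers() + " -> " + corpus.tokens);
    }
}
```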

Tool Instances

A service can create a single instance of the <Sample>Tool object to be used over and over again for each POST request, or it can create a new <Sample>Tool object for each POST request.

To see how this is done, have a look at the run method in DemoApplication.java. One single instance of a NamedEntitiesResource is created and registered. In contrast, no instance of a TokSentencesResource is created. Instead, the class definition is registered, and a fresh instance is created each time.

NamedEntitiesResource namedEntitiesResource = new NamedEntitiesResource();
environment.jersey().register(namedEntitiesResource);

environment.jersey().register(TokSentencesResource.class);

Error Handling

Error handling must be done with care when streaming output. When the implementation uses StreamingOutput, it writes the output to a temporary file before writing to the streaming output. This step is necessary in order to catch errors that can occur while reading, processing, or writing TCF. Otherwise, if an error occurs while the server is already streaming data to the client, the server response is confusing: the HTTP status is sent as OK (because it is sent when streaming starts, before the error occurs), while the output itself contains incomplete, half-written TCF data. This breaks WebLicht chain processing. In case of errors, WebLicht needs to be notified with the corresponding HTTP error status code and status message. If, on the other hand, writing to the temporary file succeeds and no errors occur during processing, the streaming output can be used to return the data from the temporary file to the client.
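
The temporary-file technique can be sketched independently of any web framework. In the illustration below, the Processor interface, the respond method, and the plain integer status codes are hypothetical stand-ins for the real request handling; the sketch does not reproduce StreamingTempFileOutput from the starter project.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TempFileStreamingSketch {

    // Hypothetical processing step that writes its result to the given stream.
    interface Processor {
        void writeResult(OutputStream out) throws IOException;
    }

    // Returns the HTTP status that would be sent; the body is written to 'response'.
    static int respond(Processor processor, OutputStream response) throws IOException {
        Path temp = Files.createTempFile("tcf-output", ".xml");
        try {
            try (OutputStream out = Files.newOutputStream(temp)) {
                // Any failure surfaces here, before a single byte reaches the client.
                processor.writeResult(out);
            } catch (IOException | RuntimeException e) {
                String message = "Processing failed: " + e.getMessage();
                response.write(message.getBytes(StandardCharsets.UTF_8));
                return 500;
            }
            // Processing succeeded, so the complete result can now be streamed.
            Files.copy(temp, response);
            return 200;
        } finally {
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        int status = respond(
                out -> out.write("<example/>".getBytes(StandardCharsets.UTF_8)),
                System.out);
        System.out.println();
        System.out.println("status " + status);
    }
}
```

Because the status is decided only after processing has finished, a failing request yields an error status and message, and never an OK status followed by truncated data.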

Cleanup

Don't forget to clean up the code before putting it into production.

Delete all unused resources, e.g.:
- src/main/resources/input_ner.xml
- src/main/resources/input_ref.xml
- src/main/resources/input_tok.xml
Rename classes to reflect their purpose
Delete any classes defined for the demo that are no longer needed

Integration into WebLicht

Once a service has been tested and deployed on a public server, it can be made available for use in WebLicht. Metadata describing the service must be created and made available in one of the CLARIN center repositories.

More information on these topics can be found at:

Help can be found at: wladmin -at- sfs.uni-tuebingen.de