WebLicht in Detail
WebLicht is a service orchestration and execution environment for incremental automatic annotation of text corpora, built upon Service Oriented Architecture principles. Its main components are:
- A set of distributed services for data processing
- A repository containing metadata about the services
- A web application that offers a user-friendly graphical interface for building chains of services and executing them.
A WebLicht service is simply a synchronous REST-style web service. The client establishes an HTTP connection to the service and initiates a POST request containing the input data. Depending on the service, the client can also set various processing parameters using the query string in the URL. The service processes the input data synchronously and returns the result as output data, using the same HTTP connection.
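The request pattern described above can be sketched with the Python standard library. This is only an illustration: the service URL, the `lang` parameter and the default content type are assumptions chosen for the example, not documented WebLicht values.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(service_url, input_data, params=None,
                  content_type="text/tcf+xml"):
    """Prepare a POST request whose body carries the input data.

    Processing parameters, if any, go into the URL query string.
    """
    url = service_url
    if params:
        url = f"{service_url}?{urlencode(params)}"
    return Request(
        url,
        data=input_data,
        headers={"Content-Type": content_type},
        method="POST",
    )


def call_service(service_url, input_data, params=None, timeout=300):
    """Send the request and read the result from the same HTTP connection."""
    request = build_request(service_url, input_data, params)
    with urlopen(request, timeout=timeout) as response:
        return response.read()


# Hypothetical endpoint and parameter, used only to show the request shape.
request = build_request(
    "http://example.org/service/tokenizer", b"<data/>", {"lang": "en"}
)
```

Because the service is synchronous, `call_service` blocks until the whole result has been returned, so a generous timeout is a sensible default for long inputs.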
Currently the WebLicht ecosystem contains around 100 services hosted at various locations across Europe. The majority of WebLicht services use the TCF file format for input or output, an XML-based format designed for linguistic annotations. A set of web services allows conversions between TCF and most of the other formats used in the linguistic community.
The WebLicht database of services is spread over all the CLARIN-D repositories, currently nine in number, located at academic and research centers in Germany. Each one of these repositories stores metadata for its own services, using CMDI (http://www.clarin.eu/cmdi) as the designated metadata format. The repositories also publicly disseminate the CMDI metadata using the standard OAI-PMH protocol. This metadata is harvested and aggregated by the Tübingen repository service.
There is no central database of WebLicht web services. Any CLARIN-D repository can contain CMDI descriptions of any web service. The Tübingen repository service periodically harvests all the repositories via OAI-PMH and builds a snapshot of the ecosystem, used for serving subsequent requests; however, this is not a database per se.
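As a rough illustration of the harvesting step, the sketch below builds OAI-PMH `ListRecords` request URLs. The `verb`, `metadataPrefix` and `resumptionToken` arguments come from the OAI-PMH protocol; the repository base URL and the `cmdi` prefix value are assumptions, since each repository advertises its own supported prefixes.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="cmdi", resumption_token=None):
    """Build the URL for one page of an OAI-PMH ListRecords harvest."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: follow-up
        # requests carry only the verb and the token.
        query = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        query = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(query)}"


# Hypothetical repository endpoint.
first_page = list_records_url("http://repo.example.org/oai")
next_page = list_records_url("http://repo.example.org/oai",
                             resumption_token="abc")
```

A harvester would request the first page, read the resumption token from the response if one is present, and repeat until no token is returned.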
The Tübingen repository service, in itself a web service but not part of the WebLicht set, has two main functionalities:
- It delivers metadata for the available services on request. This metadata contains information about creators, access rights, development status, service description, and data input and output formats and specifications. The data I/O specification is used for ensuring the correctness of the service chains.
- It allows building of service chains in a type-safe manner. A service chain is a list of services that are executed sequentially, each one receiving as input the output of the previous one. Using the metadata specifications of the input and output data for each service, the system can guarantee the correctness of a service flow from a data perspective.
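Sequential execution of a chain can be pictured with the following minimal sketch, in which plain Python callables stand in for remote services. The stand-in functions are hypothetical; a real chain would send each intermediate result to the next service over HTTP.

```python
def run_chain(services, input_data):
    """Run the services in order, feeding each one the previous output."""
    data = input_data
    for service in services:
        data = service(data)
    return data


# Hypothetical stand-ins for two remote services.
def to_upper(text):
    return text.upper()


def add_marker(text):
    return text + "!"


result = run_chain([to_upper, add_marker], "abc")
```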
The WebLicht Web Application
The WebLicht web application is implemented and hosted by SfS Tübingen. The server side is coded in Java and deployed on an Apache Tomcat servlet container. The application allows a user to:
- Upload a text (plain text, MS Word, RTF or PDF files) or construct a text from various corpora
- Build a chain of services to be applied to the input text
- Execute the tool chain with the input text and present the results in various formats (text, table, tree, free-form HTML) depending on the output data types.
Orchestration Metadata and Chaining
Each WebLicht web service is described in the CLARIN-D repositories by a piece of metadata. This contains information about the service creators, a brief description, the service registration date, the URL, etc. In addition to this human-oriented metadata there is also the orchestration metadata. The purpose of the orchestration metadata is to describe the profile of the data that a web service accepts and the profile of its output, implicitly defining a type system. This I/O data specification is designed to be as straightforward and generic as possible. It is domain neutral and independent of the actual data format used by the services, but at the same time simple to understand, use and generate.
The orchestration metadata is a list consisting primarily of data features, each feature containing zero or more values. The input description specifies what properties input data must have in order to be correctly processed by the service. The output description specifies properties of the output data that is generated by the service.
For example, consider the simple case of a converter tool from a Microsoft Word document to the XML-based TCF file format, which is more convenient for linguistic processing. The data specification would assert that the input must be a Microsoft Word document and that the output will be a TCF file with a specific version, a specific language and a text layer (represented in the XML file as a special node). Using the WebLicht orchestration data language, the specification looks like this:
<input>
  <feature name="type">
    <value name="application/msword"/>
  </feature>
</input>
<output new="true">
  <feature name="type">
    <value name="text/tcf+xml"/>
  </feature>
  <feature name="version">
    <value name="0.3"/>
  </feature>
  <feature name="lang">
    <value name="en"/>
  </feature>
  <feature name="text"/>
</output>
The feature and value name semantics are not formally specified. However, consistency of the feature names is necessary for interoperability of web services within the same domain. The input specification in the example asserts that the "type" feature of the input data must be "application/msword". The "type" feature in this case designates the data MIME type.
The output specification declares that the web service output format is TCF (the value of its "type" feature is "text/tcf+xml"); that version 0.3 of this file format is used; that this TCF file will contain a text layer (the notion of layers is specific to TCF); and that the declared language of this text layer is English.
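One possible way to read such a specification into a data structure is sketched below, mapping each feature name to its set of values. The enclosing `<spec>` element and the function names are assumptions made so that the fragment parses as a single XML document; they are not part of the WebLicht format.

```python
import xml.etree.ElementTree as ET

# The example specification, wrapped in a hypothetical <spec> root element.
SPEC = """
<spec>
  <input>
    <feature name="type"><value name="application/msword"/></feature>
  </input>
  <output new="true">
    <feature name="type"><value name="text/tcf+xml"/></feature>
    <feature name="version"><value name="0.3"/></feature>
    <feature name="lang"><value name="en"/></feature>
    <feature name="text"/>
  </output>
</spec>
"""


def parse_features(element):
    """Map each feature name to the set of its values (possibly empty)."""
    return {
        feature.get("name"): {v.get("name") for v in feature.findall("value")}
        for feature in element.findall("feature")
    }


def parse_spec(xml_text):
    """Return the input features, output features and replacement flag."""
    root = ET.fromstring(xml_text)
    output = root.find("output")
    return {
        "input": parse_features(root.find("input")),
        "output": parse_features(output),
        "replace": output.get("new") == "true",
    }


spec = parse_spec(SPEC)
```

A feature without values, such as "text", maps to an empty set, which here simply records that the feature is present.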
The output specification can operate in two different modes: addition or replacement. The default mode is addition, meaning that the declared output features of the web service are added to those the input data already has. This mode is used when web services simply augment the original input data.
In the example above the replacement mode is used as declared by the attribute new="true". This indicates that the web service completely transforms the input data, so that the profile of the output data is determined solely by the declared output features of the service.
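The two modes can be sketched as a small function over feature dictionaries. How a clash between an existing feature and a declared output feature is resolved in addition mode is not stated here, so the sketch assumes the declared output wins; the "tokens" feature is likewise a hypothetical example.

```python
def output_profile(input_profile, declared_output, replace=False):
    """Compute the profile of the data a service produces.

    Profiles map feature names to sets of values.
    """
    if replace:
        # Replacement mode: only the declared output features survive.
        return {name: set(values) for name, values in declared_output.items()}
    # Addition mode: start from the input profile and add the declared
    # features. Assumption: a declared feature overrides an existing one.
    result = {name: set(values) for name, values in input_profile.items()}
    for name, values in declared_output.items():
        result[name] = set(values)
    return result


tcf_text = {"type": {"text/tcf+xml"}, "text": set()}

# Addition: a hypothetical annotator adds a "tokens" feature.
augmented = output_profile(tcf_text, {"tokens": set()})

# Replacement: a converter discards the input profile entirely.
converted = output_profile(
    {"type": {"application/msword"}},
    {"type": {"text/tcf+xml"}, "text": set()},
    replace=True,
)
```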
This system of data-describing assertions makes chaining possible (even before any services have actually been invoked). Chaining is the process of automatically finding appropriate web services when the characteristics of the data to be used as input are known. Each time a service is selected, the characteristics of the resulting data are recomputed and used to generate a new set of suitable services. The set of the output features in the present example is a description of the resulting piece of data: a version 0.3 TCF file with an English text layer. Automatically finding new services to be applied in the chain is now just a matter of filtering through the list of the available services.
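The filtering step might look like the sketch below. The matching rule is an assumption: a required feature with no values only has to be present in the data profile, while a required feature with values must share at least one value with it. The service names and the "tokens" feature are hypothetical.

```python
def accepts(required, profile):
    """Check whether a data profile satisfies a service's input description."""
    for name, allowed in required.items():
        if name not in profile:
            return False
        # An empty value set means the feature merely has to be present.
        if allowed and not (profile[name] & allowed):
            return False
    return True


def suitable_services(services, profile):
    """Return the names of services whose input description matches."""
    return [name for name, required in services.items()
            if accepts(required, profile)]


# Profile of the converter output in the running example.
profile = {
    "type": {"text/tcf+xml"},
    "version": {"0.3"},
    "lang": {"en"},
    "text": set(),
}

# Hypothetical registry: service name -> required input features.
services = {
    "tokenizer": {"type": {"text/tcf+xml"}, "text": set()},
    "tagger": {"type": {"text/tcf+xml"}, "tokens": set()},
    "german-parser": {"type": {"text/tcf+xml"}, "lang": {"de"}},
}

candidates = suitable_services(services, profile)
```

After a service is chosen, its output description would be applied to the profile and the filter run again to obtain the next set of candidates.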