The chaining algorithm
From WebLichtWiki
Each WebLicht web service is described in the CLARIN-D repositories by a piece of metadata. This contains information about the service creators, a brief description, the service registration date, the URL, etc. In addition to this human-oriented metadata there is also the orchestration metadata. The purpose of the orchestration metadata is to describe the profile of the data that a web service accepts and the profile of its output. This I/O data specification is designed to be as straightforward and generic as possible. It is domain-neutral and independent of the actual data format used by the services, while remaining simple to understand, use and generate.
The orchestration metadata is a list consisting primarily of data features, each feature containing zero or more values. The input description specifies what properties input data must have in order to be correctly processed by the service. The output description specifies properties of the output data that is generated by the service.
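As a rough illustration of this structure, a feature list can be modelled as a mapping from feature names to sets of values. The sketch below is not the WebLicht implementation: the `FeatureMap` type, the `accepts` function, the feature names `lang` and `text`, and the matching rule (every required value must be present; a feature with no values only has to exist) are all assumptions made for the example.

```python
from typing import Dict, Set

# Hypothetical representation: feature name -> set of values (possibly empty).
FeatureMap = Dict[str, Set[str]]


def accepts(input_spec: FeatureMap, profile: FeatureMap) -> bool:
    """Return True if the profile carries every required feature and value.

    Assumed rule: a required feature with no values only needs to be present;
    otherwise all of its required values must appear in the profile.
    """
    return all(
        name in profile and values <= profile[name]
        for name, values in input_spec.items()
    )


# Illustrative data; only the "type" feature name comes from the text.
pdf_profile: FeatureMap = {"type": {"application/pdf"}, "lang": {"en"}}
needs_pdf: FeatureMap = {"type": {"application/pdf"}}
needs_text_layer: FeatureMap = {"text": set()}

print(accepts(needs_pdf, pdf_profile))         # True
print(accepts(needs_text_layer, pdf_profile))  # False
```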
Before chaining can begin, the user data must be profiled, that is, assigned a feature map that describes it. Once a profile of the user data has been created, the chainer is executed to retrieve the metadata of all the web services capable of processing the given input. This process is illustrated below:
For example, a user uploads a PDF document to WebLicht. The profiler is executed and a profile is generated. In some cases a feature such as the language has to be specified manually, namely when it cannot be determined automatically from the input.
Given a profile of the user data, the chainer can be invoked. It returns the metadata of the matching web services, which is then presented to the user so that they can choose a service to run on the data. In the example below, one of the "chained" services is a tool that converts a PDF document to a format more convenient for linguistic processing, the XML-based TCF file format. The orchestration metadata would assert that the input must be a PDF document and that the output will be a TCF file with a specific version, a specific language and a text layer (represented as a feature with no values).
The semantics of feature and value names are not formally specified. However, consistent feature names are necessary for interoperability of web services within the same domain. The input specification in the example asserts that the "type" feature of the input data must be "application/pdf" and its language must be "en", i.e. English. The "type" feature in this case designates the MIME type of the data.
The output specification declares that the web service output format is TCF (the value of its "type" feature is "text/tcf+xml"); that version 0.4 of this file format is used; that this TCF file will contain a text layer (the notion of layers is specific to TCF); and that the declared language of this text layer is English.
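The input and output specifications just described could be written down roughly as follows. This is only a sketch of the content, not the actual CMDI serialization: the feature names `lang`, `version` and `text`, the dictionary layout and the `describe` helper are hypothetical; only the `type` feature name and the values themselves are taken from the text.

```python
# Hypothetical in-memory form of the converter's orchestration metadata.
pdf_to_tcf = {
    "input": {
        "type": ["application/pdf"],
        "lang": ["en"],
    },
    "output": {
        "type": ["text/tcf+xml"],
        "version": ["0.4"],
        "lang": ["en"],
        "text": [],  # a layer is a feature with no values
    },
}


def describe(spec):
    """Render each feature as a short human-readable assertion."""
    return [
        f"{name} = {', '.join(values)}" if values else f"{name} (no values)"
        for name, values in spec.items()
    ]


for line in describe(pdf_to_tcf["output"]):
    print(line)
# type = text/tcf+xml
# version = 0.4
# lang = en
# text (no values)
```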
The output specification can operate in two different modes: addition or replacement. The default mode is addition, meaning that the declared output features of the web service are added to those the input data already has. This mode is used when web services simply augment the original input data.
In the example above the replacement mode is used (it is declared by the replacesInput field in the CMDI metadata of a web service). This indicates that the web service completely transforms the input data, so that the profile of the output data is determined solely by the declared output features of the service.
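The two modes can be sketched as a single function that computes the profile of the output data. The function name `apply_output`, the `replaces_input` parameter (modelled on the replacesInput field), the feature names other than `type`, and the `tokens` layer are assumptions for illustration. The sketch also assumes that in addition mode the values of a feature present on both sides are merged, which the text does not spell out.

```python
from typing import Dict, Set

FeatureMap = Dict[str, Set[str]]


def apply_output(
    profile: FeatureMap, output_spec: FeatureMap, replaces_input: bool = False
) -> FeatureMap:
    """Compute the profile of the data produced by a service."""
    if replaces_input:
        # Replacement: only the declared output features survive.
        return {name: set(values) for name, values in output_spec.items()}
    # Addition: start from a copy of the input profile and add the declared features.
    result = {name: set(values) for name, values in profile.items()}
    for name, values in output_spec.items():
        result.setdefault(name, set()).update(values)
    return result


tcf_profile = {
    "type": {"text/tcf+xml"},
    "version": {"0.4"},
    "lang": {"en"},
    "text": set(),
}

# Addition: a hypothetical tokenizer adds a "tokens" layer to an existing TCF file.
augmented = apply_output(tcf_profile, {"tokens": set()})
print(sorted(augmented))  # ['lang', 'text', 'tokens', 'type', 'version']

# Replacement: the PDF converter discards the PDF profile entirely.
pdf_profile = {"type": {"application/pdf"}, "lang": {"en"}}
converted = apply_output(pdf_profile, tcf_profile, replaces_input=True)
print(converted == tcf_profile)  # True
```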
This system of data-describing assertions makes chaining possible (even before any services have actually been invoked). Chaining is the process of automatically finding appropriate web services when the characteristics of the data to be used as input are known. Each time a service is selected, the characteristics of the resulting data are recomputed and used to generate a new set of suitable services. The set of the output features in the present example is a description of the resulting piece of data: a version 0.4 TCF file with an English text layer. Automatically finding new services to be applied in the chain is now just a matter of filtering through the list of the available services.
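The filtering step can be sketched as a loop over a list of service descriptions, repeated after each selection. Everything here is illustrative rather than the actual chainer: the `Service` class, the service names, the `tokens` layer, the feature names other than `type`, and the matching rule (all required values must be present in the profile) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

FeatureMap = Dict[str, Set[str]]


@dataclass
class Service:
    name: str
    input: FeatureMap
    output: FeatureMap
    replaces_input: bool = False


def accepts(service: Service, profile: FeatureMap) -> bool:
    """Assumed rule: every required feature and value must be in the profile."""
    return all(
        name in profile and values <= profile[name]
        for name, values in service.input.items()
    )


def apply(service: Service, profile: FeatureMap) -> FeatureMap:
    """Recompute the data profile after the service has been selected."""
    base = {} if service.replaces_input else profile
    result = {name: set(values) for name, values in base.items()}
    for name, values in service.output.items():
        result.setdefault(name, set()).update(values)
    return result


def suitable(profile: FeatureMap, services: List[Service]) -> List[str]:
    """Filter the available services down to those that accept the profile."""
    return [s.name for s in services if accepts(s, profile)]


converter = Service(
    "pdf-to-tcf",
    {"type": {"application/pdf"}, "lang": {"en"}},
    {"type": {"text/tcf+xml"}, "version": {"0.4"}, "lang": {"en"}, "text": set()},
    replaces_input=True,
)
tokenizer = Service(
    "tokenizer",
    {"type": {"text/tcf+xml"}, "lang": {"en"}, "text": set()},
    {"tokens": set()},
)
registry = [converter, tokenizer]

profile = {"type": {"application/pdf"}, "lang": {"en"}}
print(suitable(profile, registry))  # ['pdf-to-tcf']

profile = apply(converter, profile)  # the user picks the converter
print(suitable(profile, registry))  # ['tokenizer']
```

No service is actually invoked in this sketch; only the profiles are recomputed, which is why the chain can be assembled before any data is processed.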