Work field 2: Establishment of the technical infrastructure

Participating partners: University of Tübingen, MPI Nijmegen, Institute for the German Language Mannheim, Berlin-Brandenburgische Akademie der Wissenschaften Berlin, University of Leipzig, University of Frankfurt, DFKI Saarbrücken, University of Stuttgart

In Germany, a number of centres are being set up that offer stable and persistent Language Resources and Technology (LRT) services of various kinds. On the one hand, these national services need to be well coordinated with those at the European level; on the other hand, German researchers need to be able to use all essential services within Germany.

Infrastructure

CLARIN is a technical infrastructure that makes the eScience paradigm possible. This subfield explains what CLARIN and D-SPIN mean by the term infrastructure. The idea can be illustrated by a comparison with ICE trains: developing the new ICE high-speed trains required new tracks, signalling systems and so on; in short, a new infrastructure. In this sense, infrastructure comprises everything that scientists need in order to tackle new problems using available resources and new applications.

Service centres of a new kind

A second chapter shows that service centres of a new kind are the basis of a stable, integrated and persistent infrastructure. These centres ensure high availability and long-term persistence of the necessary services, on the basis of clear institutional support from the federal government, the federal states and other institutions. Different services need to be distinguished, ranging from long-term storage to resource-based services such as encyclopaedias or ontologies. The document differentiates between "Language Resource Services", which offer data resources of all kinds; "Language Technology Services", which offer executable tools; "Infrastructure Services", which offer infrastructural services such as machine-readable registries and the translation of persistent identifiers into physical addresses; and finally advice from experts.
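The translation of persistent identifiers into physical addresses can be illustrated with a minimal sketch. It is not a description of any actual CLARIN or D-SPIN service: the class name `PidResolver`, the identifier syntax and the example URLs are all hypothetical and serve only to show why a resource stays citable after it has been moved.

```python
class PidResolver:
    """Toy in-memory resolver: persistent identifier -> current physical address.

    Hypothetical illustration only; a real service would be a shared,
    persistent system rather than a Python dictionary.
    """

    def __init__(self):
        self._table = {}

    def register(self, pid, url):
        """Record the physical address at which a resource is currently stored."""
        self._table[pid] = url

    def move(self, pid, new_url):
        """Update the address of an already registered resource; the PID stays the same."""
        if pid not in self._table:
            raise KeyError(pid)
        self._table[pid] = new_url

    def resolve(self, pid):
        """Return the current physical address for a persistent identifier."""
        return self._table[pid]


resolver = PidResolver()
# Identifier and URLs are invented for this example.
resolver.register("pid:example/0001", "https://centre-a.example.org/corpora/0001")
resolver.move("pid:example/0001", "https://centre-b.example.org/archive/0001")
print(resolver.resolve("pid:example/0001"))
```

A citation that uses the identifier rather than the URL remains valid after the move, which is the property a persistent infrastructure needs.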

Because of the enormous pressure to publish, scientists will rely on new methods built on integrated and interoperable resources only if these resources are sustainable, secure and easy to use. We therefore need service centres of a new kind that work on behalf of scientists.

Language resource federation

A federation is a framework defined by agreements that enables the virtual integration of resources. This is only possible if users can agree on a limited number of licence agreements. This federation of language resource providers needs to align its agreements with the Authentication and Authorization Infrastructure (AAI) of the DFN association and with the practice of libraries. We expect a national identity federation to develop in Germany, as has happened in Finland and Switzerland, into which every scientist will be integrated in the near future. The aim should be to bring every relevant German resource provider into a federation, so that service agreements can be signed with the DFN association.

Registries

D-SPIN, alongside CLARIN, is going to support a number of registry services. The most important is a registry of all resources in a machine-readable format, so that applications can use the information in future workflow systems. Building on long-standing experience with metadata infrastructures such as IMDI, OLAC, Dublin Core and the Natural Language Software Registry, which is integrated into LT-World, as well as with header attributes from TEI and component schemes such as the Lexical Markup Framework, D-SPIN is going to participate in creating a new component-based registry format that meets the requirements of various resource types and subdisciplines without sacrificing semantic interoperability. Interoperability can be guaranteed by using only concepts that are defined in concept registries. This flexible design allows broad and extensible coverage of resources; participants will therefore be obliged to register their resources. There will be a German registry, which will be exposed through standard protocols at the European level. It is crucial that this new, distributed registry is backed by an infrastructure that supports different methods of access, as has already been realised with the IMDI infrastructure. This includes the possibility of creating one's own virtual collections that span institutions, as well as support for different search and browse methods.
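The idea of a component-based description whose fields must all point to registered concepts can be sketched as follows. This is only an illustration under stated assumptions: the concept identifiers, the names `Component`, `ResourceDescription` and `undefined_concepts`, and the sample entries are hypothetical and do not reproduce the registry format that D-SPIN and CLARIN will define.

```python
from dataclasses import dataclass, field

# Hypothetical concept registry: concept identifier -> definition.
CONCEPT_REGISTRY = {
    "concept:title": "Name of the resource",
    "concept:language": "Language of the resource content",
    "concept:annotationLevel": "Linguistic level that has been annotated",
    "concept:entryCount": "Number of entries in a lexical resource",
}


@dataclass
class Component:
    """A reusable block of a description, e.g. 'general' or 'lexicon'."""
    name: str
    elements: dict  # element name -> (concept identifier, value)


@dataclass
class ResourceDescription:
    """A registry entry assembled from the components a resource type needs."""
    resource_id: str
    components: list = field(default_factory=list)


def undefined_concepts(description, registry=CONCEPT_REGISTRY):
    """List every element whose concept identifier is not in the registry."""
    missing = []
    for component in description.components:
        for element_name, (concept_id, _value) in component.elements.items():
            if concept_id not in registry:
                missing.append((component.name, element_name, concept_id))
    return missing


# A lexicon reuses the 'general' component and adds one of its own.
lexicon = ResourceDescription(
    "lexicon-1",
    [
        Component("general", {"title": ("concept:title", "Demo lexicon"),
                              "language": ("concept:language", "deu")}),
        Component("lexicon", {"entries": ("concept:entryCount", 1200),
                              "mood": ("concept:notRegistered", "n/a")}),
    ],
)
print(undefined_concepts(lexicon))
```

Different resource types combine different components, but an entry is accepted only when `undefined_concepts` returns an empty list, which is how flexibility and semantic interoperability can coexist in such a design.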

Web services

Another aspect of the infrastructure still to be realised is this: D-SPIN needs to show a path away from “download first” towards a proper cyber-infrastructure scenario in which the various components can be used via web services. Germany already has some experience here, thanks to the prominent contributions of the organisations participating in D-SPIN to ISO TC37 and to the LIRICS project. Despite this experience, the way towards a cyber-infrastructure scenario will not be easy, because many resources are not in an appropriate state and tools cannot easily be converted into web services. We will start with the resources and web services already offered by D-SPIN participants, such as the increasingly used service for the “German Wortschatz”, run in Leipzig (work field 5). It is crucial to develop standards and agreements for the domain of language resources on the basis of definitions from the W3C, with which we already cooperate. The aim in the initial stage is to develop and demonstrate workflows for some typical production lines in order to fully understand the complexity of the problem. In this area in particular, we aim to cooperate with the TextGrid project.

Basic services and applications

In the initial stage, we want to provide some basic services and applications that show the potential of a cyber-infrastructure scenario. Building on integrated resources, it is desirable to provide a combined metadata and content search that covers all data resources. Various models need to be investigated, especially with regard to licensing questions (work field 7).
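One possible model for such a combined search is sketched below, purely as an illustration. It assumes that every centre can answer a query consisting of a metadata filter plus a content term, and that a thin layer merges the answers; the classes `Resource` and `Centre`, the function `search_all` and the sample data are hypothetical, and licensing checks are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    resource_id: str
    metadata: dict   # e.g. {"language": "deu"}
    content: str     # the text of the resource


@dataclass
class Centre:
    """A centre that holds resources and answers queries locally."""
    name: str
    resources: list = field(default_factory=list)

    def search(self, metadata_filter, content_term):
        """Return ids of resources matching all metadata values and the term."""
        term = content_term.lower()
        return [
            r.resource_id
            for r in self.resources
            if all(r.metadata.get(k) == v for k, v in metadata_filter.items())
            and term in r.content.lower()
        ]


def search_all(centres, metadata_filter, content_term):
    """Send the same query to every centre and merge the hits."""
    hits = []
    for centre in centres:
        for resource_id in centre.search(metadata_filter, content_term):
            hits.append((centre.name, resource_id))
    return hits


centre_a = Centre("centre-a", [
    Resource("a1", {"language": "deu"}, "Der Zug fährt schnell."),
    Resource("a2", {"language": "eng"}, "The train is fast."),
])
centre_b = Centre("centre-b", [
    Resource("b1", {"language": "deu"}, "Ein schneller Zug."),
    Resource("b2", {"language": "deu"}, "Kein Treffer hier."),
])
print(search_all([centre_a, centre_b], {"language": "deu"}, "zug"))
```

In this model the data never leaves the centre that holds it; only identifiers of matching resources are returned, which is one reason the licensing question can be handled per centre.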

We cannot expect that all data can be transferred to a single centre, so we need to implement a search mechanism over distributed resources. Another scenario arose from a test recently conducted by the MPI and the DFKI: users who select one or more data resources should be shown, after a profile comparison, the software tools that are suited to helping them with their tasks. In the course of the project, further applications need to be discussed, some of which could probably be implemented in the initial stage already. The architecture and code of these applications need to be kept open so that they can be used in training sessions as examples of a new programming style.
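The profile comparison between selected resources and available tools might look roughly like the following sketch. It does not describe the test carried out by the MPI and the DFKI; the profile attributes, the names `ResourceProfile`, `ToolProfile`, `tool_accepts` and `suitable_tools`, and the sample tools are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceProfile:
    resource_id: str
    media_type: str    # e.g. "text" or "audio"
    language: str      # e.g. "deu"
    data_format: str   # e.g. "tei" or "wav"


@dataclass(frozen=True)
class ToolProfile:
    name: str
    media_types: frozenset
    languages: frozenset      # empty set means language independent
    input_formats: frozenset


def tool_accepts(tool, resource):
    """Compare one tool profile with one resource profile."""
    language_ok = not tool.languages or resource.language in tool.languages
    return (
        resource.media_type in tool.media_types
        and language_ok
        and resource.data_format in tool.input_formats
    )


def suitable_tools(tools, selected_resources):
    """Names of the tools that can process every selected resource."""
    return [
        tool.name
        for tool in tools
        if all(tool_accepts(tool, r) for r in selected_resources)
    ]


# Hypothetical tools and resources.
tokenizer = ToolProfile("tokenizer-demo", frozenset({"text"}),
                        frozenset(), frozenset({"plain", "tei"}))
aligner = ToolProfile("aligner-demo", frozenset({"audio"}),
                      frozenset({"deu"}), frozenset({"wav"}))
corpus = ResourceProfile("corpus-1", "text", "deu", "tei")
recording = ResourceProfile("rec-1", "audio", "deu", "wav")

print(suitable_tools([tokenizer, aligner], [corpus]))
```

Requiring a tool to accept every selected resource is only one possible policy; showing tools that fit at least one of the selected resources would be an equally plausible design.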