The New WebLicht-Lexicon Format

From WebLichtWiki

(Difference between revisions)
(Created page with "(under development) == Current state of affairs == === Lexicon providers in WebLicht === Currently (December 2012) there are two "lexicon providers" in WebLicht: * Wortsch...")
 
(Replaced content with "This page has been relocated to: http://de.clarin.eu/mwiki/index.php/The_New_WebLicht-Lexicon_Format")
 
Line 1: Line 1:
(under development)

== Current state of affairs ==
 
+
=== Lexicon providers in WebLicht ===
+
 
+
Currently (December 2012) there are two "lexicon providers" in WebLicht:
+
 
+
* Wortschatz Leipzig
+
* BBAW
+
 
+
Both take an XML document in the form of the current Lexicon.rnc as input. Such a document must contain a list of lemmas:
+
 
+
<pre>
+
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
+
  ...
+
  <Lexicon xmlns="http://www.dspin.de/data/lexicon" lang="de">
+
    <lemmas>
+
      <lemma ID="l1">lieben</lemma>
+
      <lemma ID="l2">rechnen</lemma>
+
    </lemmas>
+
  </Lexicon>
+
</D-Spin>
+
</pre>
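A minimal sketch of how a client might build such an input document with Python's standard library. The function name <code>build_lemma_document</code> is hypothetical, and the part elided as <code>...</code> in the example above is simply left out here. The serializer writes namespace prefixes instead of default namespaces, which is equivalent XML.

```python
import xml.etree.ElementTree as ET

DSPIN_NS = "http://www.dspin.de/data"
LEXICON_NS = "http://www.dspin.de/data/lexicon"


def build_lemma_document(lemmas, lang="de"):
    """Return a D-Spin document (as a string) that lists the given lemmas.

    Hypothetical helper: IDs are generated as l1, l2, ... in input order.
    """
    root = ET.Element(f"{{{DSPIN_NS}}}D-Spin", {"version": "0.4"})
    lexicon = ET.SubElement(root, f"{{{LEXICON_NS}}}Lexicon", {"lang": lang})
    container = ET.SubElement(lexicon, f"{{{LEXICON_NS}}}lemmas")
    for index, text in enumerate(lemmas, start=1):
        lemma = ET.SubElement(
            container, f"{{{LEXICON_NS}}}lemma", {"ID": f"l{index}"}
        )
        lemma.text = text
    return ET.tostring(root, encoding="unicode")


document = build_lemma_document(["lieben", "rechnen"])
```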
+
 
+
In the output, these services can add one or several of the following elements:
+
 
+
* frequencies
+
 
+
<pre>
+
    <frequencies>
+
      <frequency lemID="l1">25</frequency>
+
      <frequency lemID="l2">150</frequency>
+
    </frequencies>
+
</pre>
+
 
+
* POStags
+
 
+
<pre>
+
    <POStags tagset="foo">
+
      <tag lemID="l1">bar</tag>
+
      ...
+
    </POStags>
+
</pre>
+
 
+
* word-relations
+
 
+
<pre>
+
<word-relations>
+
      <word-relation type="cooccurrence" func="sentence" freq="-1">
+
        <sig measure="sig">9.4</sig>
+
        <term lemID="l1"/>
+
        <term>ich</term>
+
      </word-relation>
+
      ...
+
</word-relations>
+
</pre>
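The output elements refer back to the lemmas through the <code>lemID</code> attribute. The following sketch shows one way a client could resolve these references for <code><frequencies></code>. The sample document combines the lemma list and the frequency values from the examples above, and the function name <code>lemma_frequencies</code> is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"lex": "http://www.dspin.de/data/lexicon"}

SAMPLE = """
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <Lexicon xmlns="http://www.dspin.de/data/lexicon" lang="de">
    <lemmas>
      <lemma ID="l1">lieben</lemma>
      <lemma ID="l2">rechnen</lemma>
    </lemmas>
    <frequencies>
      <frequency lemID="l1">25</frequency>
      <frequency lemID="l2">150</frequency>
    </frequencies>
  </Lexicon>
</D-Spin>
"""


def lemma_frequencies(xml_text):
    """Map each lemma string to its frequency by following lemID references."""
    root = ET.fromstring(xml_text.strip())
    lemmas = {l.get("ID"): l.text for l in root.iterfind(".//lex:lemma", NS)}
    return {
        lemmas[f.get("lemID")]: int(f.text)
        for f in root.iterfind(".//lex:frequency", NS)
    }


print(lemma_frequencies(SAMPLE))  # {'lieben': 25, 'rechnen': 150}
```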
+
 
+
=== Current state of Lexicon.rnc ===
+
 
+
As described above, Lexicon.rnc in the current state allows for a list of lemmas and for the aforementioned properties (frequencies, POStags, word-relations).
+
 
+
Lexicon.rnc in its current version (0.4) can be obtained from:
+
 
+
http://clarin-d.de/images/weblicht-tutorials/resources/tcf-04/schemas/latest/d-spin_0_4.rnc
+
 
+
http://clarin-d.de/images/weblicht-tutorials/resources/tcf-04/schemas/latest/lexicon_0_4.rnc
+
 
+
=== Limitations ===
+
 
+
* "lemmas" is the only type of lexical entry
+
* properties are limited to frequencies, POStags and word-relations
+
 
+
== Motivation: Integrating larger lexical databases into WebLicht ==
+
 
+
=== dlexDB ===
+
 
+
http://dlexdb.de is a lexical database for psychological research (eye tracking in reading) and for general linguistics.
+
 
+
dlexDB offers properties for many different types of lexical entries:
+
 
+
; Lexical elements: Types, lemmas, downcased types (downcased for case normalization), annotated types (types with POS tags)
+
; Sublexical elements: Syllables, characters, character n-grams (n <= 3)
+
: Soon to come: morphemes
+
; Superlexical elements: Type n-grams (n <= 3), downcased type n-grams, annotated type n-grams
+
 
+
Example properties:
+
 
+
* various frequencies
+
* for types:
+
** syllabification
+
** soon to come: morphology
+
** measures of orthographical uniqueness
+
* for type n-grams:
+
** conditional probability
+
 
+
There are about '''120 different properties''' across all tables. Many properties, however, occur in more than one table (e.g., type trigrams can have all properties of a type on each of their three components). Counting each table-specific occurrence separately, the number of different properties is around '''750'''. Furthermore, since many numeric properties are available in different scalings, some of which must be precomputed server-side (e.g., frequency rank), the number of uniquely named properties is in the '''thousands'''. All of them can be fetched or referred to in filter conditions.
+
 
+
dlexDB currently has '''2 instances''' (databases/schemas) identical in structure, containing '''17 tables each'''. The two instances have been generated from different corpus bases. As a consequence, they differ with respect to lexical inventory as well as frequencies and other measures.
+
 
+
== The New D-Spin/Lexicon XML Format ==
+
 
+
The new format must provide for:
+
 
+
* different types of lexicons
+
** current version of Lexicon.rnc (0.4): only lemmas
+
** needed: types, type n-grams, syllables... (all dlexDB lexicon types, see above)
+
** '''''solution:''''' generic XML-Element <code><entries></code> with attribute <code>lexiconType</code>
+
* a large and extensible set of properties
+
** current version of Lexicon.rnc: only elements <code><frequencies></code>, <code><POStags></code> and <code><word-relations></code> (the latter is for encoding relations involving more than one lexical entry)
+
** needed: a generic XML encoding for all properties (except for the word-relations) such that services can offer new properties without updating the D-Spin/Lexicon format
+
** '''''solution:''''' generic XML-Element <code><properties></code> with attribute <code>name</code> denoting which property is being encoded
+
 
+
=== Entries ===
+
 
+
Instead of the <code><lemmas></code> element in the current version of Lexicon.rnc, we propose a generic <code><entries></code> element with attribute <code>lexiconType</code>.
+
 
+
Example entries of type <code>syl</code> (syllables):
+
 
+
<pre>
+
<entries lexiconType="syl">
+
      <entry ID="e_0">ge</entry>
+
      <entry ID="e_1">grü</entry>
+
      <entry ID="e_2">fen</entry>
+
</entries>
+
</pre>
+
 
+
dlexDB also has tables where entries are more complex. For example, in the '''annotated types''' table, frequencies are given specifically for each unique combination of an orthographical form, the POS tag assigned to it and the lemma assigned to it. Thus an entry in this table is a triple (type, POS tag, lemma):
+
 
+
<pre>
+
<entries lexiconType="typposlem">
+
      <entry ID="e_0">
+
              <typ>langen</typ>
+
              <pos>ADJA</pos>
+
              <lem>lang</lem>
+
      </entry>
+
      <entry ID="e_1">
+
              <typ>fiel</typ>
+
              <pos>VVFIN</pos>
+
              <lem>fallen</lem>
+
      </entry>
+
      <entry ID="e_2">
+
              <typ>erklärt</typ>
+
              <pos>VVPP</pos>
+
              <lem>erklären</lem>
+
      </entry>
+
</entries>
+
</pre>
+
 
+
For '''type n-grams''', an entry consists of several units:
+
 
+
<pre>
+
<entries lexiconType="tt">
+
      <entry ID="e_0">
+
              <unit>wie</unit>
+
              <unit>das</unit>
+
      </entry>
+
      <entry ID="e_1">
+
              <unit>weil</unit>
+
              <unit>er</unit>
+
      </entry>
+
      <entry ID="e_2">
+
              <unit>sich</unit>
+
              <unit>von</unit>
+
      </entry>
+
</entries>
+
</pre>
+
 
+
Units and (type, POS tag, lemma) triples can even be '''combined''':
+
 
+
<pre>
+
<entries lexiconType="bigram">
+
      <entry ID="e_0">
+
              <unit>
+
                      <typ>wie</typ>
+
                      <pos>kokom</pos>
+
                      <lem>wie</lem>
+
              </unit>
+
              <unit>
+
                      <typ>das</typ>
+
                      <pos>art</pos>
+
                      <lem>d</lem>
+
              </unit>
+
      </entry>
+
</entries>
+
</pre>
+
 
+
Our current proposal is that Lexicon.rnc should only declare that an <code><entry></code> element must consist of
+
 
+
* either only text for simple entries
+
* or only subelements like <code><unit></code>, <code><typ></code>, <code><pos></code>, <code><lem></code> above (with text embedded deeper inside of them),
+
 
+
but the actual choice of these subelements should not be determined by Lexicon.rnc; it should be left to the lexical services such as dlexDB and others.
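A consumer of the format could enforce this "either text or subelements" rule without knowing the subelement names in advance. The sketch below is only an illustration: the function name <code>read_entry</code> is hypothetical, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def read_entry(entry):
    """Return the text of a simple entry, or a list of (tag, content) pairs.

    Subelement names are not checked, since their choice is left to the
    lexical service. Text mixed with subelements is rejected.
    """
    children = list(entry)
    if not children:
        return (entry.text or "").strip()
    stray = [entry.text] + [child.tail for child in children]
    if any(text and text.strip() for text in stray):
        raise ValueError("entry mixes text and subelements")
    return [(child.tag, read_entry(child)) for child in children]


SIMPLE = ET.fromstring('<entry ID="e_0">ge</entry>')
COMPLEX = ET.fromstring(
    '<entry ID="e_0">'
    "<unit><typ>wie</typ><pos>kokom</pos><lem>wie</lem></unit>"
    "<unit><typ>das</typ><pos>art</pos><lem>d</lem></unit>"
    "</entry>"
)

print(read_entry(SIMPLE))   # ge
print(read_entry(COMPLEX))  # nested list of (tag, content) pairs
```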
+
 
+
=== Properties ===
+
 
+
Our proposal is that the two existing elements <code><frequencies></code> and <code><POStags></code> should be dropped or marked as deprecated in favor of a new generic element <code><properties></code> with a <code>name</code> attribute:
+
 
+
<pre>
+
<properties name="typ_freq_abs">
+
      <property entryID="e_0">17</property>
+
      <property entryID="e_1">23</property>
+
      <property entryID="e_2">4711</property>
+
</properties>
+
</pre>
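Because every property is encoded in the same way, a client can collect arbitrary properties with one generic routine. The following sketch assumes a document without namespaces. The entry texts in the sample are invented for illustration, the property values are taken from the example above, and <code>collect_properties</code> is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the entry texts are made up for illustration only.
SAMPLE = """
<Lexicon>
  <entries lexiconType="typ">
    <entry ID="e_0">Beispiel</entry>
    <entry ID="e_1">Wort</entry>
    <entry ID="e_2">und</entry>
  </entries>
  <properties name="typ_freq_abs">
    <property entryID="e_0">17</property>
    <property entryID="e_1">23</property>
    <property entryID="e_2">4711</property>
  </properties>
</Lexicon>
"""


def collect_properties(xml_text):
    """Map each entry ID to a dict of {property name: raw string value}."""
    root = ET.fromstring(xml_text.strip())
    table = {entry.get("ID"): {} for entry in root.iter("entry")}
    for group in root.iter("properties"):
        for prop in group.iter("property"):
            # setdefault tolerates properties that refer to unlisted entries.
            table.setdefault(prop.get("entryID"), {})[group.get("name")] = prop.text
    return table


print(collect_properties(SAMPLE)["e_2"])  # {'typ_freq_abs': '4711'}
```

Values are kept as strings here because the format does not state a data type per property.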
+
 
+
== dlexDB in a WebLicht chain ==
+
 
+
dlexDB has two query modes: filter query and list query. These two query modes fit in at different positions within a WebLicht chain:
+
 
+
* The filter query starts with no input document and can therefore only be the starting point of a WebLicht chain.
+
 
+
* The list query requires a D-Spin/Lexicon document with a list of <code><entries></code> already present, and adds one or several <code><properties></code> elements to the document on request. A list query could be applied several times consecutively if properties are to be fetched from different services.
+
 
+
=== Filter query ===
+
 
+
A filter query starts with '''no input document'''. The query points at one of dlexDB's '''tables''' and may optionally add '''filter conditions''', a '''sorting''' directive, a '''limit''' (maximum number of entries to return), an '''offset''' (e.g. for paging), and a '''list of properties''' to be returned together with the result set of entries. The response is a D-Spin/Lexicon document with a list of <code><entries></code> and, if requested, one or several <code><properties></code> elements (with different <code>name</code> attribute values).
+
 
+
==== API ====
+
 
+
preliminary URL example:
+
 
+
<pre>http://alpha.dlexdb.de/sr/wl_filter/kern/typ/?select=typ_freq_abs,typ_syls_cit&orderby=typ_freq_abs desc&top=20&skip=20
+
</pre>
+
 
+
; <nowiki>http://alpha.dlexdb.de/sr/wl_filter/</nowiki>: Base URL
+
 
+
; kern/: Database name (within dlexDB)
+
 
+
; typ/: Table/lexicon name
+
 
+
; <nowiki>?select=typ_freq_abs,typ_syls_cit</nowiki>: Properties selection (optional)
+
 
+
; &orderby=typ_freq_abs desc: Ordering (optional)
+
 
+
; &top=20: Limit (optional)
+
 
+
; &skip=20: Offset (optional)
+
 
+
This URL is meant to be '''POST'''ed to. The '''request body''' may be empty or may be of type '''text/plain''', containing a '''filter expression''' to restrict the result set, e.g.:
+
 
+
<pre>
+
typ_freq_abs ge 100 and typ_inf_abs ge 100
+
</pre>
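The following sketch assembles such a filter query with Python's standard library. It only builds the request object and does not send it, since the URL above is preliminary. The function name <code>build_filter_request</code> is hypothetical, and percent-encoding the space in the <code>orderby</code> value is an assumption (the example URL shows a literal space).

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://alpha.dlexdb.de/sr/wl_filter/"


def build_filter_request(database, table, select=(), orderby=None,
                         top=None, skip=None, filter_expression=""):
    """Build (but do not send) a POST request for a dlexDB filter query."""
    params = []
    if select:
        params.append(("select", ",".join(select)))
    if orderby:
        params.append(("orderby", orderby))
    if top is not None:
        params.append(("top", str(top)))
    if skip is not None:
        params.append(("skip", str(skip)))
    url = f"{BASE_URL}{database}/{table}/"
    if params:
        # Keep commas readable and encode spaces as %20 (assumption).
        url += "?" + urlencode(params, safe=",", quote_via=quote)
    return Request(
        url,
        data=filter_expression.encode("utf-8"),  # may be empty
        method="POST",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


request = build_filter_request(
    "kern", "typ",
    select=["typ_freq_abs", "typ_syls_cit"],
    orderby="typ_freq_abs desc",
    top=20,
    skip=20,
    filter_expression="typ_freq_abs ge 100 and typ_inf_abs ge 100",
)
print(request.full_url)
```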
+
 
+
The filter expression must follow the syntax suggested by the '''OData''' standard (http://www.odata.org/documentation/uri-conventions#FilterSystemQueryOption). The properties mentioned in the expression must relate to database columns in the dlexDB table that is being queried.
+
 
+
Note that OData does not expect filter conditions in a POST request body. Rather, an OData query is supposed to be a GET request with the filter expression submitted as the value of the <code>filter</code> URL query parameter. For dlexDB, we are in the process of developing a base API that is OData-compliant as far as possible. For WebLicht chaining, we will then provide a wrapper that accepts POST requests with the filter expression in the POST body instead of in a <code>filter</code> URL parameter. The reason for this is that in a WebLicht service description, all URL parameters must be listed together with a fixed set of possible values. Therefore, a <code>filter</code> URL parameter with arbitrary user input as its value is not possible within the WebLicht context.
+
 
+
==== Problem with the WebLicht Chaining ====
+
 
+
In the WebLicht context, a service description is supposed to describe
+
 
+
* what WebLicht features are required to be present in the input profile in order for the service to be a legal next step in the processing chain. For key=val features, a service may specify a set of legal choices for the value. In a profile, a key=val feature may only have one atomic value. Note that there are also valueless features such as <code>text</code> or <code>tokens</code>.
+
 
+
* what WebLicht features will be present in the output profile if processing succeeds. However, the service description allows dependencies between output and input features to be specified, of the following kind:
+
:: An output key=val feature can be specified to be present if and only if a certain input key=val feature is present.
+
: Thus, for a given profile and a service it should in principle be possible to determine which output features would be present if the service were applied to the profile. However, a service description can specify input key=val features to be added (or overridden???) via URL query parameters (webargs). (It is not clear to me whether such features/webargs are considered required or optional.)
+
:: This leads to the conclusion that the actual output features, at least those of the key=val form, can ultimately only be determined based on
+
:::* the input profile features '''and'''
+
:::* the actual user settings (webargs) sent to the service at query time.
+
 
+
Why are we discussing this? Because we are wondering how to describe which features will be output by the dlexDB filter query service for the WebLicht chaining.
+
 
+
We have introduced the generic <code><properties name="PROPERTY_NAME"></code> container for outputting any of the more than 1000 dlexDB properties (including scalings).
+
 
+
Together with Alex Kislev we have developed the idea that the presence of such an element
+
 
+
<pre><properties name="PROPERTY_NAME"></pre>
+
 
+
in a D-Spin/Lexicon document should correspond to a feature named
+
 
+
<pre>properties__PROPERTY_NAME</pre>
+
 
+
in the profile. (The feature name consists of the element name, <code>properties</code>, two underscores, and the value of the <code>name</code> attribute.)
+
 
+
This would allow a different service to have, e.g., <code>properties__typ_freq_abs</code> (non-normalized type frequency) as a required input feature in its service description.
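As a sketch of this naming convention, the following function derives the profile features from the <code><properties></code> elements found in a document. The function name <code>property_features</code> and the sample values are assumptions made for illustration; only the <code>properties__NAME</code> pattern comes from the proposal above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the property values are invented.
SAMPLE = """
<Lexicon xmlns="http://www.dspin.de/data/lexicon">
  <entries lexiconType="typ"><entry ID="e_0">wie</entry></entries>
  <properties name="typ_freq_abs"><property entryID="e_0">17</property></properties>
  <properties name="typ_syls_cit"><property entryID="e_0">wie</property></properties>
</Lexicon>
"""


def property_features(xml_text):
    """List the profile features implied by <properties name="..."> elements."""
    root = ET.fromstring(xml_text.strip())
    features = set()
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop any namespace part
        if local_name == "properties" and element.get("name"):
            # Element name, two underscores, value of the name attribute.
            features.add("properties__" + element.get("name"))
    return sorted(features)


print(property_features(SAMPLE))
# ['properties__typ_freq_abs', 'properties__typ_syls_cit']
```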
+
 
+
Within the current possibilities of a WebLicht service description, we don't know how to specify that our service is able to add '''''zero, one or several requested features (out of a set of more than 1000) to the document depending on a webarg specifying which features to add'''''.
+
 
+
Please note that only a subset (up to several hundred) of these properties is available for a given table, and that there are 17 tables per database and 2 databases. With the URL design shown above, where database and table selection are part of the URL path (<code>http://alpha.dlexdb.de/sr/wl_filter/kern/typ/</code>), that would make 2 × 17 = 34 separate services in the WebLicht context, since the fixed part of the URL (everything except for the query parameters) is always tied to one service in the service description.
+
 
+
=== List query ===
+
 
+
For a list query, a D-Spin/Lexicon document with <code><entries lexiconType="LEXICON_TYPE"></code> must already be present as input. Such a document may be the result of a dlexDB filter query or come from another source. The list query allows one to add one or several (additional) <code><properties name="PROPERTY_NAME"></code> elements to the document. A document may be enriched with <code><properties></code> elements in multiple iterations. However, this only makes sense if the different properties need to be fetched from different services (dlexDB and others). Otherwise, it would be more efficient to fetch all desired properties in a single request.
+
 
+
Please note that the <code><entries></code> element is supposed (required???) to have a <code>lexiconType</code> attribute, and that the service that you want to approach must be prepared to handle lexicon entries of the given type. In dlexDB, <code>lexiconType</code> corresponds to a table name, which is part of the service's URL.
+
 
+
Given the current WebLicht rules, an element
+
 
+
<pre>
+
<entries lexiconType="LEXICON_TYPE">
+
</pre>
+
 
+
corresponds to a WebLicht profile feature of the form
+
 
+
<pre>
+
entries.lexiconType=LEXICON_TYPE
+
</pre>
+
 
+
Using the <code>entries.lexiconType</code> feature and its value, a matching between a given profile and a service can be established.
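A minimal sketch of this matching step: the first function derives the <code>entries.lexiconType</code> feature from a document, the second checks it against the values a service accepts. Both function names and the representation of accepted values as a plain set are assumptions; WebLicht's actual service descriptions are not modelled here.

```python
import xml.etree.ElementTree as ET


def lexicon_type_feature(xml_text):
    """Return the (key, value) profile feature for the first <entries> element."""
    root = ET.fromstring(xml_text.strip())
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] == "entries":
            return ("entries.lexiconType", element.get("lexiconType"))
    return None  # no <entries> element, so no such feature


def service_matches(feature, accepted_values):
    """True if the profile feature carries a value the service accepts."""
    return feature is not None and feature[1] in accepted_values


DOCUMENT = (
    '<Lexicon><entries lexiconType="typ">'
    '<entry ID="e_0">wie</entry></entries></Lexicon>'
)
feature = lexicon_type_feature(DOCUMENT)
print(feature)                            # ('entries.lexiconType', 'typ')
print(service_matches(feature, {"typ"}))  # True
```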
+
 
+
==== API ====
+
 
+
preliminary URL example:
+
 
+
<pre>http://alpha.dlexdb.de/sr/wl_list/kern/typ/?select=typ_freq_abs,typ_syls_cit
+
</pre>
+
 
+
; <nowiki>http://alpha.dlexdb.de/sr/wl_list/</nowiki>: Base URL
+
 
+
; kern/: Database name (within dlexDB)
+
 
+
; typ/: Table/lexicon name
+
 
+
; <nowiki>?select=typ_freq_abs,typ_syls_cit</nowiki>: Properties selection (required)
+
 
+
This URL is meant to be '''POST'''ed to. The request body must be of type text/xml+lexicon and contain a D-Spin/Lexicon XML document with an <code><entries></code> element that has a <code>lexiconType</code> attribute.
+
 
+
The only allowed and required parameter is the <code>select</code>ion of which properties are supposed to be added to the document by the service.
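A list query request could be assembled as in the following sketch, which builds the request object without sending it. The function name <code>build_list_request</code> is hypothetical, and the property names are assumed to need no URL escaping.

```python
from urllib.request import Request

LIST_BASE_URL = "http://alpha.dlexdb.de/sr/wl_list/"


def build_list_request(database, table, select, document):
    """Build (but do not send) a POST request for a dlexDB list query."""
    if not select:
        raise ValueError("the select parameter is required for a list query")
    url = f"{LIST_BASE_URL}{database}/{table}/?select={','.join(select)}"
    return Request(
        url,
        data=document.encode("utf-8"),  # the D-Spin/Lexicon input document
        method="POST",
        headers={"Content-Type": "text/xml+lexicon"},
    )


INPUT_DOCUMENT = (
    '<Lexicon><entries lexiconType="typ">'
    '<entry ID="e_0">wie</entry></entries></Lexicon>'
)
list_request = build_list_request(
    "kern", "typ", ["typ_freq_abs", "typ_syls_cit"], INPUT_DOCUMENT
)
print(list_request.full_url)
```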
+
 
+
==== Problem with the WebLicht Chaining ====
+
 
+
The problem with the WebLicht chaining here is the same as in the case of the filter query, see above.
+

Latest revision as of 11:37, 13 December 2012

This page has been relocated to:

http://de.clarin.eu/mwiki/index.php/The_New_WebLicht-Lexicon_Format