Parser

From WebLichtWiki

Revision as of 16:17, 7 March 2012 by Kbeck (Talk | contribs)

In linguistics, syntax is the study of sentence structure, that is, the arrangement of the tokens (e.g. words) in a sentence. Syntactic theories differ in how they model syntactic structure. Each theory defines a set of principles known as a grammar. A grammar assigns syntactic categories to parts of a sentence (syntactic units) and specifies the relations between them.

Parsing is the process, either automated or manual, of performing syntactic analysis of a text: assigning syntactic categories to parts of a sentence and marking the relations between them. Automated parsers often require some pre-processing of the input text. The most common pre-processing steps are sentence detection, tokenization, and part-of-speech tagging. In the output, each sentence receives additional annotations that represent its syntactic structure. These annotations mark the boundaries of syntactic units, their categories, and their relations to other syntactic units.

Currently, two types of automated parsing tools are in broad use: constituent parsers and dependency parsers. Each assumes a different model of syntactic structure and is arguably more appropriate for a particular set of languages.

  • Constituent Parser

Constituent parsers are automated tools for syntactic analysis based on constituency grammars. Constituency grammars view sentence structure in terms of the constituency relation among its parts. Tokens form the basic constituents, from which larger constituents are built. The sentence as a whole is a single constituent that splits into smaller constituents, forming a hierarchical structure known as a tree. In this tree, each node is a constituent (i.e. a syntactic unit) with a syntactic category assigned to it according to the grammar used. The number of syntactic units (i.e. nodes in the tree) is always greater than or equal to the number of tokens in the sentence. In the output of a constituent parser, constituents are marked within the hierarchy and annotated with their syntactic categories.
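The tree structure described above can be sketched in a few lines of Python. This is only an illustration, not the output format of any particular parser: the class name Constituent, the helper functions, the example sentence, and the Penn Treebank-style category labels (S, NP, VP, DT, NN, VBZ) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Constituent:
    """One node of a constituency tree (hypothetical helper class)."""
    category: str                  # syntactic category assigned by the grammar
    token: Optional[str] = None    # set only for leaf constituents (tokens)
    children: List["Constituent"] = field(default_factory=list)


def count_nodes(node: Constituent) -> int:
    """Count all syntactic units (nodes) in the tree."""
    return 1 + sum(count_nodes(child) for child in node.children)


def leaf_tokens(node: Constituent) -> List[str]:
    """Collect the tokens at the leaves, left to right."""
    if node.token is not None:
        return [node.token]
    result: List[str] = []
    for child in node.children:
        result.extend(leaf_tokens(child))
    return result


def to_brackets(node: Constituent) -> str:
    """Render the tree in bracketed notation."""
    if node.token is not None:
        return f"({node.category} {node.token})"
    inner = " ".join(to_brackets(child) for child in node.children)
    return f"({node.category} {inner})"


# Illustrative sentence; the category labels are assumed, not prescribed.
sentence = Constituent("S", children=[
    Constituent("NP", children=[
        Constituent("DT", token="The"),
        Constituent("NN", token="dog"),
    ]),
    Constituent("VP", children=[
        Constituent("VBZ", token="barks"),
    ]),
])

print(to_brackets(sentence))
print(count_nodes(sentence), "nodes for", len(leaf_tokens(sentence)), "tokens")
```

In this sketch the three tokens yield six nodes, which illustrates why the number of syntactic units is never smaller than the number of tokens: every token is a node, and each larger constituent adds one more.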

  • Dependency Parser

Dependency parsers analyze syntactic structure according to dependency grammars. In contrast to constituency grammars, dependency grammars focus on syntactic relations between tokens, so only a token can constitute a syntactic unit. Syntactic units are connected by so-called dependency relations, in which each token is dependent on (i.e. connected to) another token, its head; the one exception is the root of the sentence, which has no head among the tokens. In this way, a hierarchical structure is formed in which the number of syntactic units is equal to the number of tokens in the sentence. In the output of a dependency parser, dependency relations of various types are marked between tokens.
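A dependency analysis can be sketched as a list of labelled links between token positions. The following is a minimal illustration under stated assumptions: the class name Dependency, the function is_well_formed, the relation labels (det, nsubj, root), and the convention of using index 0 as an artificial head for the root token are all choices made for this example, not the format of any specific parser.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Dependency:
    """One dependency relation (hypothetical helper class)."""
    dependent: int   # 1-based position of the dependent token
    head: int        # 1-based position of the head; 0 marks the root (assumed convention)
    relation: str    # type of the dependency relation


def is_well_formed(tokens: List[str], deps: List[Dependency]) -> bool:
    """Check that every token has exactly one head, there is one root, and no cycles."""
    heads: Dict[int, int] = {}
    for dep in deps:
        if dep.dependent in heads:          # a token with two heads
            return False
        heads[dep.dependent] = dep.head
    if sorted(heads) != list(range(1, len(tokens) + 1)):
        return False                        # some token has no head entry
    if any(h < 0 or h > len(tokens) for h in heads.values()):
        return False                        # head points outside the sentence
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False                        # exactly one root is expected
    for start in heads:                     # follow heads upward to detect cycles
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


# Illustrative sentence; the relation labels are assumed, not prescribed.
sentence_tokens = ["The", "dog", "barks"]
dependencies = [
    Dependency(dependent=1, head=2, relation="det"),    # The  <- dog
    Dependency(dependent=2, head=3, relation="nsubj"),  # dog  <- barks
    Dependency(dependent=3, head=0, relation="root"),   # barks is the root
]

print(is_well_formed(sentence_tokens, dependencies))
print(len(dependencies), "relations for", len(sentence_tokens), "tokens")
```

Because each token contributes exactly one entry (its link to a head), the number of syntactic units matches the number of tokens, in contrast to the constituency sketch, where additional nodes are introduced for larger constituents.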