POS Tagger

From WebLichtWiki

Parts of speech (POS) appear to occur in every natural language. The usual categories include noun, verb, article, adjective, pronoun, preposition, adverb, and conjunction. The term is sometimes also used to cover morphological and syntactic classes.

Part-of-speech tagging is the process of automatically assigning (or "tagging") each word in a tokenized text with its corresponding part of speech. These word class tags normally come from a list of part-of-speech tags (called a tag set) compiled for a particular language. Examples of widely used tag sets are STTS (Stuttgart-Tübingen Tag Set, for German) and the Penn Treebank tag set (for English). Part-of-speech taggers must take into account the meaning of a word, its position in a sentence, and its relation to surrounding words.

Some examples of part-of-speech taggers:

1. TreeTagger
   http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
   The input for this tagger should be text in which each token is on a separate line.

2. Bohnet-Tagger (part of the Mate-Tools)
   http://code.google.com/p/mate-tools/
   Both the input and the output of this tagger are in the CoNLL format (a structured text format).

3. moot
   http://www.ling.uni-potsdam.de/~moocow/projects/moot/
   This is a set of part-of-speech tagging utilities. The plain text to be tagged must first be tokenized (a tokenizer is included in the utilities). The output of the tokenizer and a trained model are then used as input to the tagger. This set of utilities uses well-defined formats (both structured text and XML) for representing the tokenization and tagging information.

4. Stanford POS-Tagger
   http://nlp.stanford.edu/software/tagger.shtml
   Both the input and the output of this tagger can be in either a structured text format or XML.
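
To make the idea of tagging a tokenized text more concrete, the following is a minimal sketch in Python. It is not how any of the taggers listed above work internally. It assumes a tiny hand-written lexicon (`TOY_LEXICON`, hypothetical) with Penn Treebank tags, and one hard-coded context rule that prefers a noun reading directly after a determiner. The function names `tag_tokens` and `to_token_per_line` are made up for this illustration, and the tab-separated output is only one plausible one-token-per-line layout.

```python
# Toy illustration only: a lexicon lookup plus one context rule.
# Real taggers learn such decisions from annotated training data.

# Hypothetical mini-lexicon; tags follow the Penn Treebank tag set.
# The first tag in each list is treated as the default reading.
TOY_LEXICON = {
    "the": ["DT"],
    "dog": ["NN"],
    "barks": ["VBZ", "NNS"],  # ambiguous: verb form or plural noun
    "loudly": ["RB"],
    ".": ["."],
}


def tag_tokens(tokens, lexicon=TOY_LEXICON, default_tag="NN"):
    """Assign one tag per token, using the previous tag as context."""
    tagged = []
    prev_tag = None
    for tok in tokens:
        # Unknown words fall back to a single default tag.
        candidates = lexicon.get(tok.lower(), [default_tag])
        tag = candidates[0]
        # Context rule: directly after a determiner, prefer a noun reading.
        if prev_tag == "DT":
            for cand in candidates:
                if cand.startswith("NN"):
                    tag = cand
                    break
        tagged.append((tok, tag))
        prev_tag = tag
    return tagged


def to_token_per_line(tagged):
    """Render tagged tokens one per line as 'token<TAB>tag'."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in tagged)


tokens = ["The", "dog", "barks", "loudly", "."]
tagged = tag_tokens(tokens)
print(to_token_per_line(tagged))
```

In this sketch the same word form "barks" receives a verb tag after "dog" but a noun tag directly after "The", which is the kind of dependence on surrounding words that real taggers model statistically rather than with a fixed rule.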