Tokenizer

A token is an individual unit within a sentence. Tokens are single words, numbers, punctuation marks, etc. Extracting words and sentences is a fundamental operation required for any further processing of a text. Tokenization is the process of segmenting a text into word tokens. In modern languages that use Latin, Cyrillic or Greek writing systems, word tokens are delimited by blanks or punctuation. Numbers, alphanumerics and special format expressions (dates, measures, abbreviations) are also recognized as tokens, traditionally by means of regular expressions. Tokenization in non-segmented languages, such as many East Asian languages, requires more sophisticated algorithms (lexical look-up of longest matching sequences, hidden Markov models, n-gram methods and other statistical techniques).
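The sketch below illustrates both strategies in Python: a regular-expression tokenizer for whitespace-delimited scripts, and a toy greedy longest-match segmenter of the kind used for non-segmented scripts. The token pattern, the function names (tokenize, tokenize_longest_match) and the toy lexicon are illustrative assumptions for this page, not part of any WebLicht service.

```python
import re

# Illustrative, not exhaustive: recognizes dates, numbers, abbreviations,
# words (including hyphenated ones) and punctuation, in that order of priority.
TOKEN_PATTERN = re.compile(
    r"""
    \d{1,2}\.\d{1,2}\.\d{2,4}     # dates such as 31.12.2024
    | \d+(?:[.,]\d+)?             # integers and decimals
    | (?:[A-Za-z]\.){2,}          # abbreviations such as e.g. or U.S.
    | \w+(?:-\w+)*                # words, including hyphenated ones
    | [^\w\s]                     # any remaining punctuation mark
    """,
    re.VERBOSE,
)

def tokenize(text: str) -> list[str]:
    """Split a text into word, number and punctuation tokens via regex."""
    return TOKEN_PATTERN.findall(text)

def tokenize_longest_match(text: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match segmentation for non-segmented scripts.

    A toy stand-in for the lexical look-up strategy mentioned above:
    at each position the longest lexicon entry is taken as the next token;
    characters not covered by the lexicon become single-character tokens.
    """
    tokens, i = [], 0
    max_len = max((len(w) for w in lexicon), default=1)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in lexicon:
                tokens.append(text[i:i + length])
                i += length
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

if __name__ == "__main__":
    print(tokenize("The U.S. team arrived on 31.12.2024 and paid 3.50 euros, didn't they?"))
    # Hypothetical toy lexicon for the longest-match example.
    print(tokenize_longest_match("我喜欢吃苹果", {"我", "喜欢", "吃", "苹果"}))
```

Real tokenizers extend the regular-expression pattern with many more special cases (URLs, clitics, language-specific abbreviation lists), and production segmenters for non-segmented languages combine the dictionary look-up shown here with the statistical techniques listed above.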