Sentence Splitter

From WebLichtWiki

Recognizing sentence boundaries is not an easy task for a computer. Punctuation marks that usually appear at the end of a sentence do not always mark one. For example, a period can end a sentence, but it can also be part of an abbreviation or acronym, an ellipsis, a decimal number, etc. A period can even mark the end of an abbreviation and the end of a sentence at the same time, as in the previous sentence of this text.

Tokenizers and sentence boundary detectors normally require plain text as input. The output can be plain text formatted so that the token and/or sentence boundaries are clear, for example by placing each token on a separate line. It is also common for these tools to generate output in a more sophisticated format that makes further processing easier.

An example of a sentence splitter is MX Terminator: ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
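To make the ambiguity of the period concrete, here is a minimal rule-based sketch in Python. It is not how MX Terminator or any other particular tool works; it only illustrates the cases described above. The function name split_sentences and the two small abbreviation lists are assumptions made up for this example, and the rules (titles never end a sentence; other abbreviations and ellipses end one only when the next word is capitalized) are simple heuristics that will fail on many real texts.

```python
import re

# Hypothetical, deliberately tiny word lists; a real tool would need far more.
NEVER_FINAL = {"dr", "mr", "mrs", "prof"}   # titles: assumed never to end a sentence
MAYBE_FINAL = {"etc", "e.g", "i.e", "vs"}   # abbreviations that may also end a sentence

# A candidate boundary is a run of ., ! or ? followed by whitespace.
# The period in a decimal number such as 3.50 is not followed by
# whitespace, so it is never considered.
CANDIDATE = re.compile(r"([.!?]+)\s+")


def split_sentences(text):
    """Split plain text into sentences using a few simple heuristics."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        next_char = text[match.end():match.end() + 1]
        punctuation = match.group(1)
        if punctuation.startswith("."):
            if last_word in NEVER_FINAL:
                continue  # "Dr. Smith": the period belongs to the title
            ambiguous = last_word in MAYBE_FINAL or len(punctuation) > 1
            if ambiguous and not next_char.isupper():
                continue  # abbreviation or ellipsis in mid-sentence
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = ("Dr. Smith paid 3.50 dollars for pens, paper, etc. "
              "He left early. Was it late?")
    # Plain-text output with one sentence per line.
    print("\n".join(split_sentences(sample)))
```

On the sample text, the period after "Dr" and the one inside "3.50" are skipped, while the period after "etc" is treated as ending both the abbreviation and the sentence because the next word is capitalized. Printing one sentence per line mirrors the plain-text output convention described above for tokens.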