Computational linguistic tools are programs that perform analyses of linguistic data, or assist in performing such analyses. This section will provide an introduction to the general classes of linguistic tools and what purposes they serve.

Many computational linguistic tools, especially the oldest and most widely used ones, are extensions of pre-computer techniques used to analyze language. Tokenization, part-of-speech tagging, parsing and word sense disambiguation, as well as many others, all have roots in the pre-computer world, some going back thousands of years. Computational tools automate these long-standing analytical techniques, often imperfectly but still productively.

Other tools, in contrast, are motivated exclusively by the requirements of computer processing of language. Sentence splitters, bilingual corpus aligners, and named entity recognizers, among others, only make sense in the context of computers. They have little immediate connection to general linguistics but may be very important for computer applications.

Linguistic tools also encompass programs designed to enhance access to digital language data. Some of these are extensions of pre-computer techniques like concordances and indexes; others are more recent developments like search engines. Textual information retrieval is a large field, and this document will discuss only search and retrieval tools specialized for linguistic analysis.

Hierarchies of Linguistic Tools

Linguistic tools are often interdependent and frequently incorporate some elements of linguistic theory. Modern linguistics draws on traditions of structuralism, a paradigm and school of thought in the humanities and social sciences dating to the early 20th century. Structuralism emphasized the study of phenomena as hierarchical systems of elements, organized into different levels of analysis, each with its own units, rules, and methodologies. Linguists, in general, organize their theories in ways that show structuralist influences, although many disclaim any attachment to structuralism. Linguists disagree about what levels of analysis exist, to what degree different languages might have differently organized systems of analysis, and how those levels interrelate. However, hierarchical systems of units and levels of analysis are a part of almost all linguistic theories and are very heavily reflected in the practices of computational linguists.

Linguistic tools are usually categorized by the level of analysis they perform, and different tools may operate at different levels and over different units. There are often hierarchical interdependencies between tools - a tool used to perform analysis at one level may require, as input, the results of an analysis at a lower level.


Figure 1 is a simplified hierarchy of linguistic units, subdisciplines and tools. It does not provide a complete picture of linguistics and it is not necessarily representative of any specific linguistic school. However, it provides an outline and a reference framework for understanding the way hierarchical dependencies between levels of analysis affect linguistic theories and linguistic tools.

The hierarchical relationship between levels of analysis in Figure 1 generally applies to linguistic tools. Higher levels of analysis generally depend on lower ones. Syntactic analysis like parsing usually requires words to be clearly delineated and part-of-speech tagging or morphological analysis to be performed first. This means, in practice, that texts must be tokenized, their sentences clearly separated from each other, and their morphological properties analyzed before parsing can begin. In the same way, semantic analysis is often dependent on identifying the syntactic relationships between words and other elements, and inputs to semantic analysis tools are often the outputs of parsers.
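The dependency between levels can be pictured as a simple check on the order in which tools are run. The following is a minimal sketch rather than a description of any real tool chain: the tool names, the layer names, and the specific requirements listed for each tool are all assumptions made for illustration.

```python
# Each hypothetical tool lists the annotation layers it needs and the one it adds.
# The requirements below are illustrative assumptions, not those of real tools.
TOOLS = {
    "tokenizer": {"requires": set(), "produces": "tokens"},
    "sentence_splitter": {"requires": {"tokens"}, "produces": "sentences"},
    "pos_tagger": {"requires": {"tokens", "sentences"}, "produces": "pos"},
    "parser": {"requires": {"tokens", "sentences", "pos"}, "produces": "parse"},
}

def check_chain(chain):
    """Return the first (tool, missing layers) problem in a chain, or None if it is valid."""
    available = set()
    for name in chain:
        missing = TOOLS[name]["requires"] - available
        if missing:
            return name, missing
        available.add(TOOLS[name]["produces"])
    return None

print(check_chain(["tokenizer", "sentence_splitter", "pos_tagger", "parser"]))
print(check_chain(["tokenizer", "parser"]))
```

The first chain is accepted, while the second is rejected because the hypothetical parser is asked to run before sentence boundaries and part-of-speech tags are available.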

However, this simplistic picture has many important exceptions. Lower level phenomena often have dependencies on higher level ones. Correctly identifying the parts-of-speech, lemmas, and morphological categories of words may depend on a syntactic analysis. Phonetic and phonological analysis can interact with morphological and syntactic analysis. Even speech recognition - one of the lowest level tasks - depends strongly on knowledge of the semantic and pragmatic context of speech.

Furthermore, there is no level of analysis for which all linguists agree on a single standard set of units of analysis or annotation scheme. Different tools will have radically different inputs and outputs depending on the theoretical traditions and commitments of their authors.

Most tools are also language specific. There are few functional generalizations between languages that can be used to develop single tools that apply to multiple languages. Different countries with different languages often have very different indigenous traditions of linguistic analysis, and different linguistic theories are popular in different places, so it is not possible to assume that tools doing the same task for different languages will necessarily be very similar in inputs or outputs.

Corpus and computational linguists often work with written texts, and therefore usually avoid doing any kind of phonetic analysis. For that reason, this chapter will not discuss speech recognition and phonetic analysis tools, although some of the multimedia tools discussed in this chapter can involve some phonetic analysis. At the highest levels of analysis, tools are very specialized and standardization is rare, so few classes of very high linguistic level tools are discussed here.

Automatic and Manual Analysis Tools

Although some tools exist to just provide linguists with interfaces to stored language data, many are used to produce an annotated resource. Annotation involves the addition of detailed information to a linguistic resource about its contents. For example, a corpus in which each word is accompanied by a part-of-speech tag, or a phonetic transcription, or in which all named entities are clearly marked, is an annotated corpus. Linguistic data in which the syntactic relationships between words are marked is usually called a treebank. The addition of annotations increases the accessibility and usability of resources for linguists and may be required for further computer processing.

Automatic annotation tools add detailed information to language data on the basis of procedures written into the software, without human intervention other than to run the program. Automatic annotation is sometimes performed by following rules set out by programmers and linguists, but most often, annotation programs are at least partly based on machine learning algorithms that are trained using manually annotated examples.

Automated annotation processes almost always have some rate of error, and where possible, researchers prefer manually annotated resources with fewer errors. These resources are, naturally, more expensive to construct, rarer and smaller in size than automatically annotated data, but they are essential for the development of automated resources and necessary whenever the desired annotation either has not yet been automated or cannot be automated. Various tools exist to make it easier for people to annotate language data.

Technical Issues in Linguistic Tool Management

Many of the most vexing technical issues in using linguistic tools are common problems in computer application development. Data can be stored and transmitted in an array of incompatible formats, and either there are no standards, or too many standards, or standards compliance is poor. Linguistic tool designers are rarely concerned with those kinds of issues or specialized in resolving them.

Many tools accept and produce only plain text data, sometimes with specified delimiters and internal structures accessible to ordinary text editors. For example, some tools require each word to be on a separate line, others require each sentence on its own line. Some will use comma- or tab-delimited text files to encode annotation in their output, or require them as input. These encodings are often historically rooted in the data storage formats of early language corpora. A growing number of tools use XML for either input or output format.
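As an illustration of such plain text formats, the sketch below reads a one-token-per-line file in which a tab separates each token from its annotation and a blank line ends a sentence. The exact layout, the function name `read_vertical`, and the tags in the sample are assumptions for the example; real tools each define their own conventions.

```python
def read_vertical(text):
    """Parse one-token-per-line text with two tab-separated columns.

    A blank line ends a sentence. Returns a list of sentences, each a list
    of (token, tag) pairs.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

# Illustrative sample data with made-up tags.
sample = "The\tDET\ndog\tNOUN\nbarked\tVERB\n.\tPUNCT\n\nIt\tPRON\nslept\tVERB\n.\tPUNCT\n"
for sentence in read_vertical(sample):
    print(sentence)
```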

One of the major axes of difference between annotation programs is inline or stand-off annotation. Inline annotation mixes the data and annotation in a single file or data structure. Stand-off annotation means storing annotations separately, either in a different file, or in some other way apart from language data, with a reference scheme to connect the two.
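The difference between the two approaches can be made concrete with a small example. In this sketch the tag names, the dictionary layout of the stand-off records, and the helper `to_inline` are all hypothetical; character offsets are only one of several possible reference schemes.

```python
text = "Ada Lovelace visited Paris."

# Inline: the annotation is mixed into the text itself (the tag names are made up).
inline = "<PER>Ada Lovelace</PER> visited <LOC>Paris</LOC>."

# Stand-off: the text stays untouched; annotations point into it by character offset.
standoff = [
    {"start": 0, "end": 12, "label": "PER"},
    {"start": 21, "end": 26, "label": "LOC"},
]

def to_inline(text, annotations):
    """Render non-overlapping stand-off annotations as inline tags."""
    out, pos = [], 0
    for ann in sorted(annotations, key=lambda a: a["start"]):
        out.append(text[pos:ann["start"]])
        span = text[ann["start"]:ann["end"]]
        out.append("<{0}>{1}</{0}>".format(ann["label"], span))
        pos = ann["end"]
    out.append(text[pos:])
    return "".join(out)

print(to_inline(text, standoff))
```

One practical consequence of stand-off storage is that several independent annotation layers can refer to the same unmodified text, whereas inline markup changes the text that later tools see.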

Character encoding formats are an important technical issue when using linguistic tools. Most languages use some characters other than 7-bit ASCII and there are different standards for characters in different languages and operating systems. Unicode and UTF-8 are increasingly widely used for written language data to minimize these problems, but not all tools support Unicode. Full Unicode compatibility has only recently become available on most operating systems and programming frameworks, and many incompatible linguistic tools are still in use. Furthermore, the Unicode standard supports so many different characters that simple assumptions about texts - what characters constitute punctuation and spaces between words - may differ between Unicode texts in ways incompatible with some tools.
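The point about spaces and punctuation can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration; the helper name `describe` and the selection of characters are choices made for this example.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode general category, str.isspace()) for a character."""
    return "U+{:04X}".format(ord(ch)), unicodedata.category(ch), ch.isspace()

# Characters that act as spaces or sentence punctuation but are not all ASCII.
for ch in [" ", "\u00a0", "\u3000", "\u200b", ".", "\u3002", "\u061f"]:
    print(describe(ch), unicodedata.name(ch, "UNNAMED"))

# A no-break space splits like ordinary whitespace, a zero width space does not.
print("a\u00a0b".split())
print("a\u200bb".split())
```

A tool that assumes words are separated by the ASCII space, or that sentences end with the ASCII period, will handle some of these characters differently from a tool that consults the Unicode character properties.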

Automatic Annotation Tools

Sentence Splitters

Sentence splitters, sometimes called sentence segmenters, split text up into individual sentences with unambiguous delimiters.

Recognizing sentence boundaries in texts sounds very easy, but it can be a complex problem in practice. Sentences are not clearly defined in general linguistics, and sentence-splitting programs are driven by the punctuation of texts and the practical concerns of computational linguistics, not by linguistic theory.

Punctuation dates back a very long time - at least to the 9th century BCE - but until modern times not all written languages used it. The sentence itself, as a linguistic unit delimited by punctuation, is an invention of 16th century Italian printers and did not reach some parts of the world until the mid-20th century.

In many languages - including most European languages - sentence delimiting punctuation has multiple functions other than just marking sentences. The period (“.”) often marks abbreviations and acronyms as well as being used to write numbers. Sentences can also end with a wide variety of punctuation other than the period. Question marks, exclamation marks, ellipses, colons, semi-colons and a variety of other markers must have their purpose in specific contexts correctly identified before they can be confidently considered sentence delimiters. Additional problems arise with quotes, URLs and proper nouns that incorporate non-standard punctuation. Furthermore, most texts contain errors and inconsistencies of punctuation that simple algorithms cannot easily identify or correct.
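A minimal sketch of a punctuation-driven splitter shows how quickly these ambiguities appear. It is not how any particular tool works: the function name, the tiny abbreviation list, and the rule of splitting only where punctuation is followed by whitespace are all assumptions for illustration.

```python
import re

# A deliberately tiny, hypothetical abbreviation list; real tools use far larger ones.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split after '.', '!' or '?' plus whitespace, unless the word is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+\s+", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith paid $3.50 for tea. Was it good? Yes!"))
```

The period in "$3.50" is skipped only because no whitespace follows it, and "Dr." only because it is on the list. A sentence that genuinely ends in a listed abbreviation such as "etc." would be wrongly merged with the next one, which is one reason practical splitters rely on more context than this.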

Sentence splitters are often integrated into tokenizers, but some separate tools are available. CLARIN-D and WebLicht offer integrated tokenizer/sentence-splitter tools, but no separate sentence-splitting tool at this time.


Tokenizers

A token is a unit of language akin to a word but not quite the same. In computational linguistics it is often more practical to discuss tokens instead of words, since a token encompasses many linguistically anomalous elements found in actual texts - numbers, abbreviations, punctuation, among other things - and avoids many of the complex theoretical considerations involved in talking about words.

Tokenization is usually understood as a way of segmenting texts rather than transforming them or adding feature information. Each token corresponds to a particular sequence of characters that forms, for the purposes of further research or processing, a single unit. Identifying tokens from digital texts can be complicated, depending on the language of the text and the linguistic considerations that go into processing it.

In modern times, most languages have writing systems derived from the scripts of ancient languages like Phoenician and Aramaic, which were used by traders in the Middle East and on the Mediterranean Sea starting about 3000 years ago. The Latin, Greek, Cyrillic, Hebrew and Arabic alphabets are all directly derived from a common ancient source, and most historians think the writing systems of India and Central Asia come from the same origin. See [Fischer2005] and [SchmandtBesserat1992] for fuller histories of writing.

The first languages to systematically use letters to represent sounds usually separated words from each other with a consistent mark - generally a bar (“|”) or a double-dot mark much like a colon (“:”). However, many languages with alphabets stopped using explicit word delimiters over time - Latin, Greek, Hebrew and the languages of India were not written with any consistent word marker for many centuries. Whitespace between words was introduced in western Europe in the 12th century, probably invented by monks in Britain or Ireland, and spread slowly to other countries and languages. [Saenger1997] Since the late 19th century, most major languages have been written with regular spaces between words.

For those languages, much but not all of the work of tokenizing digital texts is performed by whitespace characters and punctuation. The simplest tokenizers just split the text up by looking for whitespace, and then separate punctuation from the ends and beginnings of words.
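The simplest approach described above can be sketched in a few lines. The function name `simple_tokenize` is made up for this example, and `string.punctuation` covers only ASCII punctuation, so this is a rough baseline rather than a usable tokenizer.

```python
import string

def simple_tokenize(text):
    """Split on whitespace, then peel punctuation off the start and end of each chunk."""
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        while chunk and chunk[0] in string.punctuation:
            leading.append(chunk[0])
            chunk = chunk[1:]
        while chunk and chunk[-1] in string.punctuation:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens

print(simple_tokenize('He said, "Don\'t go (yet)!"'))
print(simple_tokenize("She lives in the U.S."))
```

Punctuation inside a chunk, such as the apostrophe in "Don't", is left alone, while the final period of "U.S." is split off as if it ended a sentence. Both behaviors show why whitespace and punctuation do much, but not all, of the work.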

Of major modern languages, only Chinese, Japanese and Korean have writing systems not thought to be derived from a common Middle Eastern ancestor. Chinese and Japanese do not systematically mark words in ordinary texts, and tokenization in these languages is a more complex process that can involve large dictionaries and sophisticated machine learning procedures. Modern Korean is written with spaces, but the spaced units often combine a word with attached particles or endings, so further segmentation is usually needed. Vietnamese - which is written with a version of the Latin alphabet today, but used to be written like Chinese - places spaces between every syllable, so that even though it uses spaces, they are of little value in tokenization.
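To give a sense of what a dictionary-based approach involves, the sketch below implements forward maximum matching, a simple and well-known baseline for segmenting text written without spaces. It is named here as an illustrative technique, not as the method of any tool discussed in this chapter; the function name, the `max_len` parameter, and the four-entry dictionary are assumptions.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest dictionary entry.

    Characters not covered by any entry are emitted as single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Toy dictionary, for illustration only.
dictionary = {"我们", "喜欢", "语言学", "语言"}
print(max_match("我们喜欢语言学", dictionary))
```

Because the algorithm always prefers the longest match, it selects the three-character entry over the shorter two-character entry that starts at the same position. Greedy matching of this kind makes systematic errors on ambiguous strings, which is where statistical and machine learning methods come in.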

However, even in alphabetic languages with clear word delimiters, tokenization can be very complicated. Usable tokens do not always match the locations of spaces.

Compound words exist in many languages and require more complex processing. The German word Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft is one famous case, but English has linguistically similar compounds like low-budget and first-class. Tokenizers are often expected to split such compounds up.

In other cases, something best treated as a single token may appear in text as multiple words with spaces, like New York. There are also compounds whose written form varies, appearing sometimes as separate words and sometimes not. Egg beater, egg-beater and eggbeater are all possible in English and mean the same thing.

Short phrases that are composed of multiple words separated by spaces may also sometimes be best analyzed as a single word, like the phrase by and large or pain in the neck in English. These are called multi-word terms and may overlap with what people usually call idioms.

Contractions like I'm and don't also pose problems, since many higher level analytical tools like part-of-speech taggers and parsers may require them to be broken up, and many linguistic theories treat them as more than one word for grammatical purposes. Phrasal verbs in English like to walk out and separable verbs in German like auffahren are another category of problem for tokenizers, since these are often best treated as single words, but are separated into parts that may not appear near to each other in texts.
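One common way of breaking up English contractions is to separate the clitic from its host, in the style of the Penn Treebank tokenization conventions (so that don't becomes do and n't). The sketch below illustrates the idea under simplifying assumptions: the function name is made up, only straight apostrophes are handled, and the list of clitics is not complete.

```python
import re

def split_contraction(token):
    """Split an English contraction into host and clitic; return other tokens whole."""
    # Negative contractions first: "don't" -> "do" + "n't".
    match = re.fullmatch(r"(?i)(.+)(n't)", token)
    if match is None:
        # Then pronoun and auxiliary clitics: "I'm" -> "I" + "'m".
        match = re.fullmatch(r"(?i)(.+)('(?:m|s|re|ve|ll|d))", token)
    return list(match.groups()) if match else [token]

for word in ["don't", "I'm", "they've", "O'Neill", "dog"]:
    print(word, split_contraction(word))
```

Note that the sketch cannot tell a possessive 's from a contracted is, and that names containing apostrophes are left intact only because they do not end in one of the listed clitics.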

Consistent tokenization is generally related to identifying lexical entities that can be looked up in some lexical resource, and this can require very complex processing for ordinary texts. Its purpose is to simplify and standardize the data for the benefit of further processing. Since even digitized texts are very inconsistent in the way they are written, tokenization is the first or nearly the first thing done for any linguistic processing task.

WebLicht currently provides access to a number of tokenizers for different languages. As mentioned in the previous section, tokenization is often combined with sentence-splitting in a single tool.

Part-of-Speech (PoS) Taggers

Part-of-speech taggers are programs that take tokenized texts as input and associate a part-of-speech (PoS) tag with each token. Each PoS tagger uses a specific, closed set of parts-of-speech, usually called a tagset in computational linguistics, and different taggers will routinely have different, sometimes radically different, tagsets.

A part-of-speech is a category that abstracts some of the properties of words or tokens. For example, in the sentence “The dog ate dinner” there are other words we can substitute for dog and still have a correct sentence, words like cat, or man. Those words have some common properties and belong to a common category of words. PoS schemes are designed to capture those kinds of similarities. Words with the same PoS are in some sense similar in their use, meaning, or function.

Parts-of-speech have been independently invented at least three times in the distant past. The 2nd century BCE Greek grammar text The Art of Grammar outlined a system of eight PoS categories that became very influential in European languages: nouns, verbs, participles, articles, pronouns, prepositions, adverbs, and conjunctions, with proper nouns as a subcategory of nouns. Most PoS systems in use today have been influenced by that scheme.

Modern linguists no longer think of parts-of-speech as a fixed, short list of categories that is the same for all languages. They do not agree about whether or not any of those categories are universal, or about which categories apply to which specific words, contexts and languages. Different linguistic theories, different languages, and different approaches to annotation use different PoS schemes.

PoS tagsets also differ in the level of detail they provide. A modern corpus tagset, like the CLAWS tagset used for the British National Corpus, can go far beyond the classical parts-of-speech and make dozens of fine distinctions. CLAWS version 7 has 22 different parts-of-speech for common nouns alone! Complex tagsets are usually organized hierarchically, to reflect commonalities between different classes of words.

Examples of widely used tagsets include STTS (Stuttgart-Tübingen Tagset for German), and the Penn Treebank Tagset (for English). Most PoS tagsets were devised for specific corpora, and are often inspired by older corpora and PoS schemes.

PoS taggers are usually machine learning applications that have been trained with data from a particular corpus that has had PoS tags added by human annotators. PoS taggers are often flexible enough to be retrained for new languages. Given a large manually tagged corpus in a particular language, taggers can be retrained to produce similar output from tokenized data.
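The idea of training on a manually tagged corpus can be illustrated with the simplest possible learner, one that remembers the most frequent tag for each word form. This is a baseline sketch, far simpler than the statistical models real taggers use; the function name, the tags, and the three-sentence corpus are made up for the example.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Learn the most frequent tag for each word form in a manually tagged corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

# Toy training data with made-up tags; a real corpus would be vastly larger.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("runs", "NOUN"), ("ended", "VERB")],
    [("she", "PRON"), ("runs", "VERB")],
]
model = train(corpus)
print(model["runs"])
```

Retraining for another language or tagset amounts to calling the same procedure on a different annotated corpus, which is the sense in which such taggers are flexible.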

PoS taggers almost always expect tokenized texts as input, and it is important that the tokens in texts match the ones the PoS tagger was trained to recognize. As a result, it is important to make sure that the tokenizer used to preprocess texts matches the one used to create the training data for the PoS tagger.

One of the more important factors to consider in evaluating a PoS tagger is its handling of out-of-vocabulary words. A significant number of tokens in any large text will not be recognized by the tagger, no matter how large a dictionary it has or how much training data was used. PoS taggers may simply output a special “unknown” tag, or may guess what the right PoS should be given the remaining context. For some languages, especially those with complex systems of prefixes and suffixes for words, PoS taggers may use morphological analyses to try to find the right tag.
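Two of these strategies, emitting an "unknown" tag and guessing from word endings, can be contrasted in a short sketch. The function name, the `guess` switch, the tags, and the English suffix rules are hypothetical and far cruder than a real morphological analysis; guessing from the surrounding context is not shown.

```python
def tag_tokens(tokens, lexicon, guess=True):
    """Tag tokens from a lexicon; for unknown words guess from the suffix or emit 'UNK'."""
    # Hypothetical English suffix heuristics, checked in order.
    suffix_rules = [("ing", "VERB"), ("ed", "VERB"), ("ly", "ADV"), ("s", "NOUN")]
    tagged = []
    for token in tokens:
        tag = lexicon.get(token.lower())
        if tag is None:
            tag = "UNK"
            if guess:
                for suffix, guessed in suffix_rules:
                    if token.lower().endswith(suffix):
                        tag = guessed
                        break
        tagged.append((token, tag))
    return tagged

# Toy lexicon; "blorfed" is an invented word standing in for any unseen token.
lexicon = {"the": "DET", "dog": "NOUN"}
tokens = ["The", "dog", "blorfed", "quickly"]
print(tag_tokens(tokens, lexicon))
print(tag_tokens(tokens, lexicon, guess=False))
```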