Tools in Detail

Introduction

Computational linguistic tools are programs that perform analyses of linguistic data, or assist in performing such analyses. This section will provide an introduction to the general classes of linguistic tools and what purposes they serve.

Many computational linguistic tools, especially the oldest and most widely-used ones, are extensions of pre-computer techniques used to analyze language. Tokenization, part-of-speech tagging, parsing and word sense disambiguation, as well as many others, all have roots in the pre-computer world, some going back thousands of years. Computational tools automate these long-standing analytical techniques, often imperfectly but still productively.

Other tools, in contrast, are exclusively motivated by the requirements of computer processing of language. Sentence splitters, bilingual corpus aligners, and named entity recognizers, among others, only make sense in the context of computers; they have little immediate connection to general linguistics but may be very important for computer applications.

Linguistic tools can also encompass programs designed to enhance access to digital language data. Some of these are extensions of pre-computer techniques such as concordances and indexes; others are more recent developments such as search engines. Textual information retrieval is a large field, and this document will discuss only search and retrieval tools specialized for linguistic analysis.


Hierarchies of Linguistic Tools

Linguistic tools are often interdependent and frequently incorporate some elements of linguistic theory. Modern linguistics draws on traditions of structuralism, a paradigm and school of thought in the humanities and social sciences dating to the early 20th century. Structuralism emphasized the study of phenomena as hierarchical systems of elements, organized into different levels of analysis, each with their own units, rules, and methodologies. Linguists, in general, organize their theories in ways that show structuralist influences, although many disclaim any attachment to structuralism. Linguists disagree about what levels of analysis exist, to what degree different languages might have differently organized systems of analysis, and how those levels interrelate. However, hierarchical systems of units and levels of analysis are a part of almost all linguistic theories and are very heavily reflected in the practices of computational linguists.

Linguistic tools are usually categorized by the level of analysis they perform, and different tools may operate at different levels and over different units. There are often hierarchical interdependencies between tools - a tool used to perform analysis at one level may require, as input, the results of an analysis at a lower level.

[[File:Tools-LinguisticToolsLevels.png|x400px|frame|Figure 1: A simplified hierarchy of linguistic units, subdisciplines and tools]]

Figure 1 is a simplified hierarchy of linguistic units, subdisciplines and tools. It does not provide a complete picture of linguistics and it is not necessarily representative of any specific linguistic school. However, it provides an outline and a reference framework for understanding the way hierarchical dependencies between levels of analysis affect linguistic theories and linguistic tools.

The hierarchical relationship between levels of analysis in Figure 1 generally applies to linguistic tools. Higher levels of analysis generally depend on lower ones. Syntactic analysis like parsing usually requires words to be clearly delineated and part-of-speech tagging or morphological analysis to be performed first. This means, in practice, that texts must be tokenized, their sentences clearly separated from each other, and their morphological properties analyzed before parsing can begin. In the same way, semantic analysis is often dependent on identifying the syntactic relationships between words and other elements, and inputs to semantic analysis tools are often the outputs of parsers. In short, higher-level analyses tend to depend on lower-level ones.
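
As a concrete illustration of these dependencies, the following minimal sketch in Python uses the NLTK library to run sentence splitting and tokenization before part-of-speech tagging; a parser would then consume the tagged output. The choice of NLTK is an assumption for the example, and it presumes NLTK's standard English sentence and tagger models are installed.

<pre>
# Minimal sketch of a lower-to-higher analysis pipeline (Python, NLTK assumed).
import nltk

text = "Linguistic tools depend on each other. Parsers need tagged tokens."

# Lower level: sentence splitting.
sentences = nltk.sent_tokenize(text)

# Next level: tokenization, producing one token list per sentence.
tokens = [nltk.word_tokenize(s) for s in sentences]

# Higher level: part-of-speech tagging, which requires tokenized input.
tagged = [nltk.pos_tag(t) for t in tokens]

# A syntactic parser would now take 'tagged' as its input.
print(tagged[0])   # e.g. [('Linguistic', 'JJ'), ('tools', 'NNS'), ...]
</pre>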

However, this simplistic picture has many important exceptions. Lower level phenomena often have dependencies on higher level ones. Correctly identifying the part-of-speech, lemmas, and morphological categories of words may depend on a syntactic analysis. Phonetic and phonological analysis can interact with morphological and syntactic analysis. Even speech recognition - one of the lowest level tasks - depends strongly on knowledge of the semantic and pragmatic context of speech.

Furthermore, there is no level of analysis for which all linguists agree on a single standard set of units of analysis or annotation scheme. Different tools will have radically different inputs and outputs depending on the theoretical traditions and commitments of their authors.

Most tools are also language specific. There are few functional generalizations between languages that can be used to develop single tools that apply to multiple languages. Different countries with different languages often have very different indigenous traditions of linguistic analysis, and different linguistic theories are popular in different places, so it is not possible to assume that tools doing the same task for different languages will necessarily be very similar in inputs or outputs.

Corpus and computational linguists often work with written texts, and therefore usually avoid doing any kind of phonetic analysis. For that reason, this chapter will not discuss speech recognition and phonetic analysis tools, although some of the multimedia tools discussed in this chapter can involve some phonetic analysis. At the highest levels of analysis, tools are very specialized and standardization is rare, so few classes of tools operating at very high linguistic levels are discussed here.


Automatic and Manual Analysis Tools

Although some tools exist to just provide linguists with interfaces to stored language data, many are used to produce an annotated resource. Annotation involves the addition of detailed information to a linguistic resource about its contents. For example, a corpus in which each word is accompanied by a part-of-speech tag, or a phonetic transcription, or in which all named entities are clearly marked, is an annotated corpus. Linguistic data in which the syntactic relationships between words are marked is usually called a treebank. The addition of annotations increases the accessibility and usability of resources for linguists and may be required for further computer processing.

Automatic annotation tools add detailed information to language data on the basis of procedures written into the software, without human intervention other than to run the program. Automatic annotation is sometimes performed by following rules set out by programmers and linguists, but most often, annotation programs are at least partly based on machine learning algorithms that are trained using manually annotated examples.
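
As a small sketch of this idea in Python with NLTK: the manually tagged sentences and tag names below are invented for illustration, and a simple unigram tagger stands in for the far more sophisticated learning algorithms used by real annotation tools.

<pre>
# Sketch: training an automatic annotation tool from manually annotated examples.
from nltk.tag import UnigramTagger

# Manually annotated data: each sentence is a list of (word, tag) pairs.
# The tags are invented for this example.
train_sents = [
    [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")],
    [("A", "DET"), ("dog", "NOUN"), ("barks", "VERB"), (".", "PUNCT")],
]

# A unigram tagger simply learns the most frequent tag for each word form.
tagger = UnigramTagger(train_sents)

# Automatic annotation of new, untagged text.
print(tagger.tag(["The", "dog", "sleeps", "."]))
# [('The', 'DET'), ('dog', 'NOUN'), ('sleeps', 'VERB'), ('.', 'PUNCT')]
</pre>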

Automated annotation processes almost always have some rate of error, and where possible, researchers prefer manually annotated resources with fewer errors. These resources are, naturally, more expensive to construct, rarer and smaller in size than automatically annotated data, but they are essential for the development of automated resources and necessary whenever the desired annotation either has not yet been automated or cannot be automated. Various tools exist to make it easier for people to annotate language data.


Technical issues in linguistic tool management

Many of the most vexing technical issues in using linguistic tools are common problems in computer application development. Data can be stored and transmitted in an array of incompatible formats, and either there are no standards, or too many standards, or standards compliance is poor. Linguistic tool designers are rarely concerned with those kinds of issues or specialized in resolving them.

Many tools accept and produce only plain text data, sometimes with specified delimiters and internal structures accessible to ordinary text editors. For example, some tools require each word to be on a separate line; others require one sentence per line. Some use comma- or tab-delimited text files to encode annotation in their output, or require them as input. These encodings are often historically rooted in the data storage formats of early language corpora. A growing number of tools use XML as their input or output format.
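
As an illustration, the following Python sketch reads a tab-delimited, one-token-per-line format of the kind described above. The two-column word-and-tag layout and the blank-line sentence boundaries are assumptions made for the example; real tools define their own columns and delimiters.

<pre>
# Sketch: reading a tab-separated, one-token-per-line annotation format.
# The column layout is assumed for illustration only.
sample = """The\tDET
cat\tNOUN
sleeps\tVERB
.\tPUNCT

It\tPRON
purrs\tVERB
.\tPUNCT
"""

sentences, current = [], []
for line in sample.splitlines():
    if not line.strip():             # blank line marks a sentence boundary
        if current:
            sentences.append(current)
            current = []
        continue
    word, tag = line.split("\t")     # one token and its annotation per line
    current.append((word, tag))
if current:
    sentences.append(current)

print(len(sentences), sentences[0])
</pre>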

One of the major axes of difference between annotation programs is inline or stand-off annotation. Inline annotation mixes the data and annotation in a single file or data structure. Stand-off annotation means storing annotations separately, either in a different file, or in some other way apart from language data, with a reference scheme to connect the two.
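
The following Python sketch shows the same invented part-of-speech annotation expressed both ways; the tag names and the character-offset reference scheme are illustrative only, since real stand-off formats define their own anchoring schemes.

<pre>
# Sketch: the same annotation expressed inline and stand-off.
text = "Cats sleep."

# Inline: language data and annotation are mixed in one string.
inline = "Cats/NOUN sleep/VERB ./PUNCT"

# Stand-off: the raw text is left untouched; annotations are stored separately
# and point back into the text via character offsets.
standoff = [
    {"start": 0,  "end": 4,  "tag": "NOUN"},   # "Cats"
    {"start": 5,  "end": 10, "tag": "VERB"},   # "sleep"
    {"start": 10, "end": 11, "tag": "PUNCT"},  # "."
]

for ann in standoff:
    print(text[ann["start"]:ann["end"]], ann["tag"])
</pre>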

Character encoding formats are an important technical issue when using linguistic tools. Most languages use some characters other than 7-bit ASCII, and there are different standards for characters in different languages and operating systems. Unicode and UTF-8 are increasingly widely used for written language data to minimize these problems, but not all tools support Unicode. Full Unicode compatibility has only recently become available on most operating systems and programming frameworks, and many linguistic tools without Unicode support are still in use. Furthermore, the Unicode standard supports so many different characters that simple assumptions about texts - which characters constitute punctuation and which separate words - may differ between Unicode texts in ways incompatible with some tools.
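
The following Python sketch illustrates both points: decoding the same UTF-8 bytes under a wrong encoding assumption produces garbled text, and Unicode defines many space and punctuation characters beyond the ASCII ones, so a tool that only checks for ' ' or '.' may mishandle some texts. The example characters are chosen for illustration only.

<pre>
# Sketch: why encoding and Unicode assumptions matter.
import unicodedata

# The same UTF-8 bytes must be decoded with the right codec.
raw = "Straße".encode("utf-8")
print(raw.decode("utf-8"))     # Straße
print(raw.decode("latin-1"))   # garbled ("mojibake") from a wrong assumption

# Unicode has many separator and punctuation characters beyond ASCII.
for ch in [" ", "\u00a0", "-", "\u2013", ".", "\u3002"]:
    print(repr(ch), unicodedata.category(ch))
# Zs = space separator, Pd = dash punctuation, Po = other punctuation
</pre>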