
Corpus Linguistics

Chu-Ren Huang and Yao Yao, Hong Kong Polytechnic University, Hong Kong
© 2015 Elsevier Ltd. All rights reserved.
This article is a revision of the previous edition article by G. Kennedy, volume 4, pp. 2816–2820, © 2001, Elsevier Ltd.

Abstract

This article introduces basic concepts of the modern linguistic corpus and of corpus linguistics. A corpus is defined as a collection
of examples of language in use that are selected and compiled in a principled way, and corpus linguistics as the linguistic study
of such corpora. We explicate the classification of corpora and the basic procedures of data collection, construction, and annotation.
Representative research areas and applications in which corpora and corpus-based analysis play crucial roles are also introduced.
Finally, trends and future directions in the development of corpus linguistics are discussed.

What Is a Corpus?

A corpus is a collection of examples of language in use that are selected and compiled in a principled way. The term corpus refers to the intention for it to be a representative body of evidence for the study of language and language use. In the most general terms, the purpose of a corpus is to document a language. Hence a corpus is often an essential component of language documentation or language archives. Corpora are often used in the language sciences, including linguistics, computational linguistics (CL), language education, psycholinguistics, sociolinguistics, and translation studies, as well as other studies where a collection of texts or other language uses is crucial, such as anthropology, communication studies, literary studies, or political science.

What Is Corpus Linguistics?

The term corpus linguistics refers to corpus-based linguistic studies in general (Biber et al., 1998; Tognini-Bonelli, 2001, among others). Archetypical corpus work existed well before the modern digital era, as exemplified by the early attempts at word indexing and concordancing of the Christian Bible in the thirteenth century. However, the emergence of corpus linguistics as an academic discipline is closely linked to the availability of digital tools to record, store, and examine corpora, heralded by the rapid development of computational technology as well as the increasing availability of digital content since the second half of the twentieth century. Although computational and statistical tools now play crucial roles in corpus linguistics, it is important to note that the two essential elements in corpus linguistics are the study of linguistic issues and the use of corpora. Thus 'computer-aided armchair linguistics,' the term Charles Fillmore fondly coined in 1992 (Fillmore, 1992), still applies to corpus linguistics to the extent that both linguistic theories and argumentation are required to lay the foundation of corpus-based processing and analysis.

The first modern corpus is the Brown University Standard Corpus of Present-day American English ('the Brown corpus'; Francis and Kucera, 1979; Kucera and Francis, 1967), compiled at Brown University in the 1960s. With about one million words of text collected from a balanced, wide variety of sources, the Brown corpus represented a breakthrough in terms of both corpus size and corpus design. In subsequent years, however, the standard of corpus size increased rapidly. Thanks to the enhanced storage and processing power of modern computers, it is now not uncommon to encounter a corpus that contains hundreds or thousands of millions of words. What came hand in hand with the increase in scale was the expansion in genre. In addition to the so-called 'balanced' corpora (see below for the definition of balanced) such as the Brown corpus, corpora have also been developed for highly specialized domains and with unconventional data types (e.g., multimodal language data, text collected from the World Wide Web (WWW)).

Building a Corpus

Design Criteria

The construction of a corpus starts with decisions on design criteria. Corpus design criteria are mainly driven by the purpose of the corpus, but may also be affected by meta-theoretical concerns such as evaluation methods, reusability, and interoperability. Pragmatic concerns such as data availability and sharability, as well as budgetary and technical constraints, may also play a role. In setting up a set of design criteria, it is important to bear in mind that a corpus is meant to be a representative body of linguistic data as evidence; hence the question of whether the design criteria will yield the right sample of data to address the questions posed must be asked. The design criteria will determine what kind of data to collect, how much and how to collect it, what kind of information will be annotated on the data, and by what method.

Data Collection

Corpus construction begins with the collection of raw language data. As stated earlier, corpus developers need to first decide on the purpose of the corpus (i.e., language use in which domain(s) of human life will be represented in this corpus?), based on which they may decide further where to collect data. Data collection in the earlier stage of corpus linguistics involved both selection and digitization of text. With the advance of the WWW and the accessibility of digital content, data collection for modern corpora typically involves manual or automatic selection or extraction of electronic texts.
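The arithmetic behind such design criteria can be sketched in a few lines of Python. The domain labels and target proportions below are hypothetical illustrations, not the design of any actual corpus.

```python
# Sketch: turning design criteria into a sampling plan.
# The domains and proportions are invented for illustration.
TARGET_PROPORTIONS = {
    "press": 0.30,
    "fiction": 0.25,
    "academic": 0.25,
    "spoken": 0.20,
}

def sampling_plan(total_words: int, proportions: dict) -> dict:
    """Allocate a total word budget across domains by target proportion."""
    assert abs(sum(proportions.values()) - 1.0) < 1e-9, "proportions must sum to 1"
    return {domain: round(total_words * p) for domain, p in proportions.items()}

# A Brown-sized budget of one million words:
plan = sampling_plan(1_000_000, TARGET_PROPORTIONS)
print(plan["press"])  # 300000
```

In practice the proportions themselves would be justified by an external measure, such as the number of publications per source in a given period, as described above.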

International Encyclopedia of the Social & Behavioral Sciences, 2nd edition, Volume 4 http://dx.doi.org/10.1016/B978-0-08-097086-8.52004-2 949

Both methods, however, require an elaborate and challenging effort to clear copyrights to the acquired texts, unless a specific text is already in the public domain. The recent emergence of the Web as Corpus (WAC; see Baroni et al., 2009) approach, which treats the WWW as a source of data, has the potential to side-step some, but not all, copyright issues. Under the WAC approach, computer programs (which are called robots) are designed to crawl the web and collect data according to a set of predetermined criteria. Robots are usually designed to extract only a small portion of text from each website in order to meet the fair-use requirement. Another WAC method to circumvent the rights issue is to collect as much data as needed but to share with users only the extracted generalizations, never the original data in pure text form. The two best known corpora taking this approach are Google's n-gram corpus, which gives users access only to the collection and statistics of n-grams (n consecutive words) but not the original text, and Liverpool's WebCorp, where concordance lines are dynamically extracted and not saved. The discussion of corpora and corpus construction in what follows will focus on the typical corpus as a collection of text.

Balanced versus Specialized

A balanced corpus (also referred to as a 'general corpus') is designed to represent the use of a certain language in all domains, with consideration of the relative frequency of language use in each domain. Therefore, the data of a balanced corpus are representative of all topics, genres, styles, and registers, coming from a potpourri of sources which often include (but are not limited to) books, magazines, newspapers, academic journals, and TV programs, with a controlled proportion of each data source and data type. The proportions are decided by the corpus developers in some principled way, e.g., according to the number of publications in each source/type in a certain time period. Most existing balanced corpora are of the English language, with the Brown corpus setting the standard for early corpora, and the British National Corpus (BNC) setting the standard for more recent work. Balanced corpora of other languages are also available, such as the Academia Sinica Balanced Corpus of Modern Chinese (Sinica Corpus) and the Balanced Corpus of Contemporary Written Japanese (KOTONOHA or BCCWJ). A balanced corpus can also be constructed to represent variations of the same language, such as the International Corpus of English.

A specialized corpus, as the name suggests, represents only the use of language in a specific domain or for a specific topic. The range of specialized corpora varies widely, from single-author corpora (e.g., a Shakespeare corpus) and single-genre corpora (e.g., a Twitter corpus) to single-domain corpora (often a corpus of language for specific purposes, e.g., a corpus of engineering English), and so on. A specialized corpus can also be compiled for the purpose of a specific research topic area, such as a learner's corpus for the study of second language learning, an acquisition corpus for the study of first language acquisition (e.g., CHILDES), an emotion corpus, a language disorder corpus, or a metaphor corpus.

Synchronic versus Diachronic

A synchronic corpus contains language data that are produced in roughly the same time period, whereas a diachronic corpus contains data from different time periods, which may or may not be consecutive. For instance, a corpus of Shakespeare's works is synchronic, but a corpus of English literature in the seventeenth–twentieth centuries would be diachronic. Diachronic corpora are often used in historical linguistics and the study of language change; thus the time span must be long enough for changes to be observed. The emergent field of evolutionary linguistics also relies heavily on diachronic corpora for modeling and simulation. Most balanced corpora are synchronic, simply because compiling a balanced corpus with enough data to cover an extensive time span would be too overwhelming a task.

Spoken, Written, or Mixed

A spoken corpus contains transcribed speech, whereas a written corpus contains only written text. Of course it is possible to have both spoken data and written data in a corpus, resulting in a mixed corpus. An example of a written corpus is the Brown corpus, which included only published written works. Some more recent balanced corpora, such as the BNC, included both written works and transcribed spoken data, in recognition of the vast differences between written and spoken language (Chafe and Tannen, 1987; Halliday, 1994). The past three decades have witnessed the emergence of a number of corpora with only spoken data, such as the SWITCHBOARD corpus of telephone speech and the Buckeye corpus of conversational speech. Since these corpora usually contain speech recordings and phonetic transcription alongside word transcription, they are often used for studying phonetics and phonology, as well as speech recognition and processing.

Single Language versus Multilanguage

As the names suggest, a single-language (or monolingual) corpus contains data from only one language, whereas a multilanguage (or multilingual) corpus contains data from more than one language. A multilanguage corpus can be either a parallel corpus or a comparable corpus (see below for the differences between the two). Earlier multilanguage corpora are mostly parallel corpora, while comparable corpora have become more popular recently with the availability of original electronic data from more languages on the web. Multilanguage corpora are widely used in translation studies, computer-assisted translation, machine translation, and other language-related computer technologies. The first widely used multilanguage corpus is the Hansard Corpus of English and Canadian French from Canadian parliamentary proceedings, which contains texts in both English and Canadian French.

Parallel versus Comparable

A parallel corpus contains texts of the same content in different languages (e.g., an original text and its translation(s) in one or more other languages). Hence directionality is crucial for a parallel corpus. For example, a parallel English–French corpus will be different from a parallel French–English corpus, even if they are both based on the same text. The parallel relation between the source and target also varies according to the nature and purpose of translation. Parallel corpora containing texts with legal or documentary functions, such as parliamentary proceedings, tend to maintain lexico-syntactic correspondences, while those containing texts that are informative or expressive tend to maintain semantic-pragmatic correspondences.
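The 'share statistics, not the text' strategy described above for n-gram collections can be sketched as follows. The sample sentence is invented, and the code is a toy illustration rather than any provider's actual pipeline.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive words.

    Only these counts, not the underlying text, would be shared with users,
    side-stepping redistribution of the original documents.
    """
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()  # stand-in for crawled text
bigrams = ngram_counts(tokens, 2)
print(bigrams[("the", "cat")])  # 1
print(sum(bigrams.values()))    # 5 bigrams from 6 tokens
```

Aggregating such counts over many crawled pages yields a distributable statistical resource even when the source texts themselves cannot be released.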

One of the most important steps in processing a parallel corpus is alignment, which can be performed at the lexical, phrasal, sentential, or paragraph level, depending on the nature of the corpus as well as the purpose of the study. Comparable corpora, on the other hand, are multilanguage corpora containing a collection of texts from two or more languages collected under the same set of criteria. Comparable corpora are less costly to construct, since they do not rely on translators to create the target texts. They also allow comparative study of two languages in similar native-language production contexts, in contrast to parallel corpora, where the target language text is more restricted and may contain unnatural texts. It is also possible to construct a large comparable corpus without having to translate all the texts from the source language. Hence the comparable corpus approach opens up the possibility of constructing a multilanguage corpus for language pairs without qualified translators or sufficient pre-existing translated texts.

Annotation and Storage

As mentioned earlier, modern corpora are digitally stored. In particular, to facilitate future processing by computer programs, corpora must be stored in a 'machine-readable' electronic format. To this end, a popular approach is to use the Extensible Markup Language (XML) for encoding and storing corpus files. Raw language data may require some preprocessing, e.g., converting text on paper to electronic text, transcribing speech, and adjusting electronic text format, before they can be stored in an XML format.

An XML file contains two types of elements: content and markup. In a corpus XML file, content mainly consists of the linguistic data collected for the corpus. Markup, on the other hand, is a major place for adding annotation to the linguistic data, which may be any extra information about the text. Corpus annotation has become a critical step in corpus construction (see Stede and Huang, 2012), in spite of the fact that annotation is not a mandatory component of a corpus and that corpus annotation is typically costly and labor-intensive. A simple reason for the convention is that annotation enriches a corpus and facilitates the discovery of generalizations and knowledge which is difficult, if not impossible, to obtain from a raw corpus. For example, the annotation of author gender will be crucial for analyzing any gender-related phenomenon in language use. To take another example, many English words may be used with more than one part of speech (POS) (e.g., 'table,' 'convict'). If someone wants to research the use of 'table' as a verb, it will be much more convenient to use a corpus with POS tags, which allows word searches to be restricted by POS and therefore prevents the search results from being overwhelmed by instances of 'table' used as a noun. In addition, POS tagging also facilitates various natural language processing (NLP) applications such as information retrieval and machine translation.

A corpus can be annotated at three different levels. Most corpora are at least encoded with some basic metadata (such as author, publication date, genre), which are attributes of the text as a whole. There are also corpora in which the texts are tagged with grammatical information, phonetic transcription, or other linguistic or extralinguistic information for a specific purpose of study. Overall, the Text Encoding Initiative (TEI) provided the XML schema for encoding all levels of corpus annotation. TEI aims to ensure that members of its community of practice will be able to share their annotated digital output. Even though TEI is designed for the wider community of all digitized texts, it has been adopted as the standard format for corpus annotation. First, for the encoding of metadata, Dublin Core is the de facto standard of the open archives community, and has been adapted by the Open Language Archives Community for language resources, including corpora. Second, the tagging of linguistic information, which is the most common and most versatile level of annotation, can be applied to all kinds of corpora and can easily be reused. POS tagging is the most commonly applied annotation. When the corpus linguistics literature refers to a tagset, it means by default the set of POS tags used in a corpus. All currently available balanced corpora are POS-tagged, and a corpus that is annotated with other information typically has a name that explicitly refers to that annotation level, e.g., a prosody corpus (a syntactically annotated corpus, however, is conventionally called a treebank). It is also important to note that although the most commonly assumed unit in a corpus is a word, it is possible to annotate units smaller than words. For instance, a corpus of speech data can be segmented into phonemic or phonetic units, which are labeled with segmental phonetic transcriptions, although the same corpus could also be annotated with suprasegmental features that span multiple segments (or even words). The General Ontology of Linguistic Description is a community-of-practice effort to define and standardize all linguistic features that can be annotated on a language resource. A parallel effort to provide a uniform and sharable linguistic annotation scheme is the Lexical Markup Framework group (Francopoulo, 2013), which works in conjunction with ISO TC37 SC4 to produce an ISO standard for linguistic annotation: ISO 24613:2008. Finally, a corpus can also be annotated with extralinguistic features to enable it to support specific research. For instance, in a learner's corpus, the common annotation scheme, on top of POS tagging, identifies the errors and error types made by a language learner. An emotion corpus can be annotated with both the emotion type and emotion-related event structure.

The standard procedure of corpus annotation starts with the drafting of annotation guidelines based on linguistic or other theoretical criteria. Annotators are trained and their performance is evaluated before they are given the actual task. Annotation tasks are usually carried out in a double-blind manner, meaning that a pair (or more) of annotators annotate the same text without input from each other. After the annotation is completed, the annotations on which both (or all) annotators agree are assumed to be correct, while the annotations on which there are disagreements are presented to an adjudicator, often the person who drafted the guidelines, to decide upon. Hence, the measurement of interannotator agreement is often taken to be a good indicator of the quality of corpus annotation. Although the earliest corpora started with manual annotation, modern corpus annotation is typically carried out electronically with an annotation tool. The tool may also have the capacity to provisionally predict the annotation based on either statistical or heuristic rules. The annotator's main role is then to correct wrong predictions or to supply annotation when the program fails to provide any annotation.
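The double-blind procedure and agreement check described above can be sketched as follows. The tokens and POS labels are invented, and real projects typically report chance-corrected measures such as Cohen's kappa rather than the raw observed agreement shown here.

```python
def observed_agreement(ann_a, ann_b):
    """Fraction of items on which two annotators assign the same label."""
    assert len(ann_a) == len(ann_b), "annotators must label the same items"
    matches = sum(a == b for a, b in zip(ann_a, ann_b))
    return matches / len(ann_a)

def adjudication_queue(items, ann_a, ann_b):
    """Items with disagreeing labels are sent to an adjudicator."""
    return [item for item, a, b in zip(items, ann_a, ann_b) if a != b]

tokens = ["table", "the", "convict"]   # hypothetical text being POS-tagged
ann_a = ["VERB", "DET", "NOUN"]        # annotator A, working blind
ann_b = ["NOUN", "DET", "NOUN"]        # annotator B, working blind
print(observed_agreement(ann_a, ann_b))          # 2 of 3 labels match
print(adjudication_queue(tokens, ann_a, ann_b))  # ['table'] goes to the adjudicator
```

Only the disputed item reaches the adjudicator; the agreement score itself doubles as a quality indicator for the annotation round.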

Fully automatic annotation can be employed when the corpus scale is too big for human annotation, such as with the WAC approach. In this case, a tolerance of error rate is predetermined, and the program is tested on a sample corpus with gold standard (i.e., correct) annotation to make sure that it does not make excessive errors before the program is deployed. Such automatic annotation programs typically employ either machine learning algorithms or stochastic models; they are sometimes also supplemented by heuristic or hand-crafted linguistic rules.

Crowdsourcing is a recent innovation in corpus annotation which has the potential to generate a paradigm shift. Crowdsourcing relies on web users to perform tasks that require human intelligence, typically the kind of tasks in which a computer program has limitations. Corpus annotation falls nicely into the category of human intelligence tasks. The typical procedure for crowdsourced annotation also starts with the design of annotation guidelines, especially with presenting the guidelines with examples to ensure the least ambiguity for workers who take on the task. The corpus owner then publishes this task on the web through a crowdsourcing platform, with optional monetary compensation. The workers who accept the task must typically pass a pre-test to show that they have the knowledge or skill to perform the task. Since the compensation is low and the number of potential workers on the web is large, crowdsourcing allows a corpus to be annotated by multiple annotators at a fraction of the cost of dedicated annotators. The larger number of annotators gives the corpus owner a wider range of more reliable data with which to determine which annotation is more reliable. The most popular crowdsourcing platforms for language-related studies are Amazon Mechanical Turk and CrowdFlower.

Corpus Linguistics: Research and Applications

The topics in corpus linguistics research are not different from those of (computational) linguistic research. The only differences are in the approaches to how data are collected and how generalizations are arrived at. Hence, we will focus on research topics generated by and solved with corpus linguistics.

Corpus Processing Tools

The design and improvement of corpus processing tools is an ongoing issue in corpus linguistics. The Natural Language Toolkit gives a good overview of what kinds of tools and resources are available or required for corpus processing. First of all, a tokenizer or lemmatizer is often required as the first step of corpus processing. Not all languages conventionally mark the boundaries of words, and even among those that do, not all boundaries are consistently marked (e.g., 'the White House' is a single word); hence tokenization is an important first step for languages without conventionalized word boundary markers. For example, the tokenization task for Mandarin Chinese, more specifically referred to as Chinese word segmentation, remains an active research topic (see Huang and Xue, 2012). A lemmatizer, on the other hand, can play the same function as a tokenizer in languages with rich morphology. At the word level, it enables words which undergo regular or irregular morphology to be recognized as tokens of the same word type (e.g., 'buy' and 'bought'). It also enables generalizations about the behavior of affixes. Second, a POS tagger depends on a well-designed tagset and a comprehensive lexicon, as well as strategic information from the distributional properties of each POS. Finally, a parser can assign syntactic structure to a corpus, which is not only crucial for the construction of treebanks but also allows event structure to be defined and annotated on a corpus. All these processing tools lay the foundation for more advanced corpus analysis such as word frequency counting or automatic identification of grammatical relations (see below for more discussion on corpus analysis).

Applications of Corpus

The first and best known application of corpus is in lexicography. The COBUILD team demonstrated that high-quality dictionaries and grammars can be produced based on corpus data (Sinclair, 1987). Today, all major dictionaries/lexica in English are corpus based, and some with a very powerful user interface are able to provide linguistic generalizations automatically. Since the first application in lexicography, corpora have played an increasingly important role in general linguistic research. By definition, a corpus contains a large amount of naturally occurring language data and therefore becomes an ideal data source for investigating language and language use. It is not uncommon now for a study of syntax or semantics to cite example sentences collected from natural corpora. For this purpose, the most often used corpus analyses are word frequency counting, concordance, and keyword in context, all of which are standard functions available in most corpus websites and corpus analysis software. A more advanced tool is Sketch Engine (Kilgarriff et al., 2004), which automatically extracts grammatical relations based on statistical patterns in the corpus. In addition to syntactic and semantic research, the use of corpora is especially important in discourse analysis and studies of language variation. While discourse analysis usually employs qualitative analysis aided by the commonly used corpus analysis tools listed earlier, a corpus-based study of language variation typically uses quantitative analysis, with statistical models built on large datasets extracted from the corpus for the examination of variation patterns.

Today, the corpus has also become an indispensable component in CL research. Currently almost all major CL applications (e.g., information retrieval, sentiment and emotion analysis, machine translation, speech recognition and synthesis) rely heavily on corpus data and the use of corpus-based approaches. Since the core of a CL application often involves recognizing some patterns or features of the language data, the application typically makes use of two sets of corpus data: a training set and a testing set. The training set contains data that have already been tagged (often manually) with gold standard annotation, which are then used to train the CL program to calculate the heuristic and statistical rules for automatic annotation. Goodness of the program may be evaluated by comparing the annotation results with the gold standard annotation on the same dataset. Later, the trained CL program is applied to the testing dataset in order to obtain an automatic evaluation.

Obviously, the more similar the training set and the testing set are, the more likely it is for the trained CL program to succeed with the unseen data in the testing set. In reality, it is often recommended that the two datasets be sampled from the same corpus (or corpora of a similar type) in order to ensure high data similarity. The earlier description also indicates a close relationship between corpus linguistics and CL. While corpora provide the essential resources for CL research, the latter also contributes novel and efficient ways to process and annotate corpus data.

Trends and Future Directions of Corpus Linguistics

There are a number of noteworthy trends in the development of corpora and corpus linguistics. First of all, as mentioned earlier, a general trend is that corpora are becoming increasingly larger than before, thanks to the development of computers and corpus technology. Second, the establishment of various corpus construction and annotation standards promises the emergence of more 'standardized' corpora with high reusability and interoperability. In the past decade, we have also noted the fast development of multimodal corpora, which contain not only textual data but also nontextual audio or video data, making it possible to study language use in a broader cognitive and communicative context. Such corpora have proved highly useful in the study of child language development and nonverbal communication. Finally, there are also some emerging trends related to the development of the WWW. As noted before, novel technology such as crowdsourcing has been employed in the process of corpus construction, making annotation more efficient and economical. Another trend is the use of language data crawled from the web, i.e., the WAC movement, which easily produces gigantic corpora of naturally produced text, although questions about data quality also arise.

Overall, we are highly optimistic about the future of corpus linguistics. Given the current achievements and the emerging trends, we look forward to a future where corpora make even greater contributions to linguistic research and applications, the language professions, and the broader fields of the social and behavioral sciences.

See also: Areal Linguistics; Dialectology; Etymology; Indo-European Languages; Information Structure in Linguistics; Language and Society; Linguistic Typology; Pidgin and Creole Language Varieties; Sapir–Whorf Hypothesis.

Bibliography

Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E., 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43 (3), 209–226.
Biber, D., Conrad, S., Reppen, R., 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press, Cambridge.
Chafe, W., Tannen, D., 1987. The relation between written and spoken language. Annual Review of Anthropology 16, 383–407.
Fillmore, C., 1992. 'Corpus linguistics' or 'computer-aided armchair linguistics'. In: Svartvik, J. (Ed.), Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82 (Trends in Linguistics Studies and Monographs 65). Mouton de Gruyter, Berlin, pp. 35–60.
Francis, W.N., Kucera, H., 1979. Brown Corpus Manual (Revised and Amplified). Department of Linguistics, Brown University, Providence, RI.
Francopoulo, G. (Ed.), 2013. LMF: Lexical Markup Framework, Theory and Practice. ISTE, London.
Halliday, M.A.K., 1994. Spoken and written modes of meaning. In: Graddol, D., Barret, O.B. (Eds.), Media Texts: Authors and Readers: A Reader. Multilingual Matters, Clevedon/Philadelphia, pp. 51–73.
Huang, C.-R., Xue, N., 2012. Words without boundaries: computational approaches to Chinese word segmentation. Language and Linguistics Compass 6 (8), 494–505.
Kilgarriff, A., Rychlý, P., Smrz, P., Tugwell, D., 2004. The sketch engine. In: Proceedings of the 11th EURALEX Congress, Lorient, France.
Kucera, H., Francis, W.N., 1967. Computational Analysis of Present-Day American English. Brown University Press, Providence, RI.
Sinclair, J.M. (Ed.), 1987. Looking Up: An Account of the COBUILD Project in Lexical Computing. Collins ELT, London and Glasgow.
Stede, M., Huang, C.-R. (Eds.), 2012. Special Issue: Linguistic Annotation. Language Resources and Evaluation 46 (1).
Tognini-Bonelli, E., 2001. Corpus Linguistics at Work. John Benjamins, Amsterdam and Philadelphia.

Relevant Websites

http://icame.uib.no/brown/bcm.html – Brown Corpus Manual.
http://www.natcorp.ox.ac.uk/ – British National Corpus (BNC).
http://crowdflower.com/ – CrowdFlower Crowdsourcing Platform.
http://www.elda.org/ – Evaluations and Language Resources Distribution Agency (ELDA).
http://linguistics-ontology.org/ – General Ontology for Linguistic Description (GOLD).
http://www.ldc.upenn.edu/ – Linguistic Data Consortium (LDC).
http://www.lexicalmarkupframework.org/ – Lexical Markup Framework (LMF).
http://nltk.org/ – Natural Language Toolkit (NLTK).
http://www.sinica.edu.tw/SinicaCorpus/ – Sinica Corpus.
http://wacky.sslmit.unibo.it/ – Web-As-Corpus Kool Yinitiative (WaCky).
