Research Guides: Linguistics Research Guide: Corpora


Introduction

The Libraries are dedicated to furthering the fields of computational linguistics and natural language processing at Columbia. This guide serves as an entry point for researchers in those fields. Because these fields are closely related to text and data mining, we encourage researchers to also consult our research guide on text mining for more resources that might help answer computational, language-related research questions.

The Columbia University Libraries' Research Data Services (RDS) team provides expert advice on identifying and using corpora for many different types of scholarly projects, including in the field of linguistics. RDS also provides information on corpora licensed by the Libraries for the Columbia community.

TL;DR: Email Research Data Services (data@library.columbia.edu) to set up a consultation for your text mining research project.

Available Licensed Corpora

This list will grow over time as the Columbia research community builds more bridges to data sources for computational linguistics and natural language processing.

Linguistic Data Consortium: The LDC is a consortium of researchers who consolidate various data sets at the LDC’s website at Penn. Columbia's membership is paid for by the Department of Computer Science, and access to the LDC is regulated by the Department.
Full-Text Corpus Data: The Libraries have acquired several English (and Portuguese and Spanish) corpora colloquially known as the “Davies Corpora.” These include the Corpus of Historical American English and the Corpus of Contemporary American English.

Other Linguistic Corpora and Resources

Archive of the Indigenous Languages of Latin America (AILLA)
AILLA is a digital language archive of recordings, texts, and other multimedia materials in and about the Indigenous languages of Latin America. AILLA's mission is to preserve these materials and make them available to Indigenous Peoples, researchers, and other friends of these languages now and for generations to come.

Dictionary of Old English Web Corpus
A corpus of all the extant works of Old English, comprising over 3,000 different texts and constructed from a version supplied by the Dictionary of Old English Project in 1998.

Linguist List
Clearinghouse and forum for a wide variety of linguistics topics.
Linguistic Data Consortium Catalog
The LDC's catalog of linguistic corpora.
Open Language Archives Community (OLAC)
A worldwide virtual library of language resources. Use the search engine to find relevant collections.
PHOIBLE Online
"PHOIBLE is a repository of cross-linguistic phonological inventory data, which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. Release 2.0 from 2019 includes 3020 inventories that contain 3183 segment types found in 2186 distinct languages." [from website]
The Speech Accent Archive
"The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph and are carefully transcribed. The archive is used by people who wish to compare and analyze the accents of different English speakers." [from website]
WebCorp Linguist's Search Engine
"WebCorp Live lets you access the Web as a corpus."
WordNet: A Lexical Database for English
A large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

Tools

Natural Language Toolkit (NLTK)
A platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.
spaCy
An open-source Python library for natural language processing.
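
As a rough illustration of the tokenization and stemming that NLTK provides, the following minimal sketch counts word stems in a short passage. It assumes NLTK is installed (for example, via pip) and uses only components that do not require downloading corpus data. The function name stem_frequencies and the sample sentence are hypothetical and chosen only for illustration; they are not drawn from any corpus listed in this guide.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize


def stem_frequencies(text):
    """Return a frequency distribution of Porter stems for the words in text.

    Hypothetical helper for illustration only.
    """
    stemmer = PorterStemmer()
    # wordpunct_tokenize is regex-based, so no NLTK data download is needed.
    # Keep alphabetic tokens only, which drops punctuation such as "," and ".".
    tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]
    return FreqDist(stemmer.stem(t) for t in tokens)


# Made-up sample sentence, not taken from a real corpus.
sample = "The linguists tagged the texts, and the taggers parsed each tagged text."
freqs = stem_frequencies(sample)

# Show the three most frequent stems with their counts.
print(freqs.most_common(3))
```

For a real project, the sample string would be replaced with text read from a corpus file. Researchers working with licensed corpora should check the terms of use with Research Data Services before processing them.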