Linguistics Research Guide: Corpora

Introduction

The Libraries are dedicated to furthering the fields of computational linguistics and natural language processing at Columbia. This guide serves as an entry point for researchers in those fields. Because these fields are closely related to text and data mining, we encourage researchers to also consult our research guide on text mining for further resources that may help answer computational, language-related research questions.

The Columbia University Libraries' Research Data Services (RDS) team provides expert advice on identifying and using corpora for many different types of scholarly projects, including in the field of linguistics. RDS also provides information on corpora licensed by the Libraries for the Columbia community.

TL;DR: Email Research Data Services (data@library.columbia.edu) to set up a consultation for your text mining research project.

Available Licensed Corpora

This list will grow over time as the Columbia research community builds more bridges to various data sources for computational linguistics and natural language processing.

  • Linguistic Data Consortium: With a membership paid for by the Department of Computer Science, the LDC is a consortium of researchers who consolidate various data sets at the LDC’s website at Penn. Access to the LDC is regulated by the Department.
  • Full-Text Corpus Data: The Libraries have acquired several English (and Portuguese and Spanish) corpora colloquially known as the “Davies Corpora.” These include the Corpus of Historical American English and the Corpus of Contemporary American English.

Other Linguistic Corpora and Resources

Tools