Columbia University Libraries has licensed the use of several corpora in English, Spanish, and Portuguese, often colloquially called “the Davies corpora.” This includes the Corpus of Contemporary American English (COCA) and Corpus of Historical American English (COHA).
This information comes from english-corpora.org, the home of the web interface to the following corpora. Columbia makes all of these corpora available for download for researchers, as indicated below.
Corpus | Overview | Size | Dialect | Time period | Genre(s) |
---|---|---|---|---|---|
News on the Web (NOW) | Overview | 20.3 billion+ | 20 countries | 2010–yesterday | Web: News |
iWeb: The Intelligent Web-based Corpus | Overview | 14 billion | 6 countries | 2017 | Web |
Global Web-Based English (GloWbE) | Overview | 1.9 billion | 20 countries | 2012–13 | Web (incl blogs) |
Wikipedia Corpus | Overview | 1.9 billion | (Various) | 2014 | Wikipedia |
Coronavirus Corpus | Overview | 1.5 billion | 20 countries | 2020–2023 | Web: News |
Corpus of Contemporary American English (COCA) | Overview | 1.0 billion | American | 1990–2019 | Balanced |
Corpus of Historical American English (COHA) | Overview | 475 million | American | 1820–2019 | Balanced |
The TV Corpus | Overview | 325 million | 6 countries | 1950–2018 | TV shows |
The Movie Corpus | Overview | 200 million | 6 countries | 1930–2018 | Movies |
Corpus of American Soap Operas | Overview | 100 million | American | 2001–2012 | TV shows |
Spanish (Corpus del Español):

Corpus | Overview | Size | Created |
---|---|---|---|
Web / Dialects | Overview | 2 billion words | 2016 |

Portuguese (Corpus do Português):

Corpus | Overview | Size | Created |
---|---|---|---|
Web / Dialects | Overview | 1 billion words | 2016 |
The Libraries make the above corpora available for download. Each corpus is made up of at least five files (original documentation):
File | Format | Columns |
---|---|---|
<corpus>-db.tar | SQL-style database | textID, ID (1 - n), wordID (link to lexicon) |
<corpus>-wlp.tar | Word, lemma, and part of speech in vertical format | textID, ID (1 - n), word, lemma, PoS |
<corpus>-text.tar | Each entire text on its own line | textID, text |
<corpus>-lexicon.zip | Lexicon table for use with db | wordID (link to database), word, lemma, PoS |
<corpus>-sources.zip | Corpus-specific table of sources | textID, #words, and other columns |
Each tar or zip file will typically contain several text files. In the table above, each unique text has a textID, which is identified in the sources file and linked to from one of the corpus files. The wordID identifies each unique word, along with its lemma and part of speech, and is linked to from the db and wlp files.
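The link between the db and lexicon files can be sketched in Python as follows. This is a minimal illustration, not a description of the actual file layout: it assumes tab-separated rows with the columns in the order listed in the table above and no reliable header row, which may not hold for every corpus, so check the original documentation first. The function names and the sample rows are hypothetical and invented for the example.

```python
import csv
import io


def load_lexicon(handle):
    """Build a wordID -> (word, lemma, PoS) lookup from a lexicon table.

    Assumes tab-separated columns in the order wordID, word, lemma, PoS.
    """
    lexicon = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 4 or not row[0].isdigit():
            continue  # skip header lines or malformed rows
        word_id, word, lemma, pos = row[:4]
        lexicon[int(word_id)] = (word, lemma, pos)
    return lexicon


def resolve_tokens(db_handle, lexicon):
    """Yield (textID, ID, word, lemma, PoS) for each row of a db file.

    Assumes tab-separated columns in the order textID, ID, wordID.
    """
    for row in csv.reader(db_handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3 or not row[2].isdigit():
            continue  # skip header lines or malformed rows
        text_id, token_id, word_id = row[:3]
        word, lemma, pos = lexicon.get(int(word_id), ("<unknown>", "", ""))
        yield text_id, token_id, word, lemma, pos


# Invented sample rows standing in for real lexicon and db files.
lexicon_sample = "1\tThe\tthe\tat\n2\tcats\tcat\tnn2\n3\tsleep\tsleep\tvv0\n"
db_sample = "101\t1\t1\n101\t2\t2\n101\t3\t3\n"

lexicon = load_lexicon(io.StringIO(lexicon_sample))
tokens = list(resolve_tokens(io.StringIO(db_sample), lexicon))
print(" ".join(word for _, _, word, _, _ in tokens))
```

With real data, the `io.StringIO` objects would be replaced by file handles opened on the extracted text files, with whatever encoding the corpus documentation specifies.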
To request access, researchers must contact data@library.columbia.edu and then read and agree to the restrictions on the data. For the time being, Research Data Services will then make the files available to the requesting researcher for download from Google Drive.
The full-text corpus data comes with a series of restrictions, provided by Mark Davies. Researchers must acknowledge having read the restrictions and agree to abide by them. The restrictions include: