Linguistics Research Guide: Full-Text Corpus Data

Intro

Columbia University Libraries has licensed the use of several corpora in English, Spanish, and Portuguese, colloquially known as “the Davies corpora.” These include the Corpus of Contemporary American English (COCA) and the Corpus of Historical American English (COHA).

The Corpora

English Corpora

This information comes from english-corpora.org, the home of the web interface to the following corpora. Columbia makes all of these corpora available for download to researchers, as indicated below.

| Corpus | Size | Dialect | Time period | Genre(s) |
|---|---|---|---|---|
| News on the Web (NOW) | 20.3 billion+ | 20 countries | 2010–yesterday | Web: News |
| iWeb: The Intelligent Web-based Corpus | 14 billion | 6 countries | 2017 | Web |
| Global Web-Based English (GloWbE) | 1.9 billion | 20 countries | 2012–13 | Web (incl. blogs) |
| Wikipedia Corpus | 1.9 billion | (Various) | 2014 | Wikipedia |
| Coronavirus Corpus | 1.5 billion | 20 countries | 2020–2023 | Web: News |
| Corpus of Contemporary American English (COCA) | 1.0 billion | American | 1990–2019 | Balanced |
| Corpus of Historical American English (COHA) | 475 million | American | 1820–2019 | Balanced |
| The TV Corpus | 325 million | 6 countries | 1950–2018 | TV shows |
| The Movie Corpus | 200 million | 6 countries | 1930–2018 | Movies |
| Corpus of American Soap Operas | 100 million | American | 2001–2012 | TV shows |

El corpus del español

| Corpus | Size | Created |
|---|---|---|
| Web / Dialects | 2 billion words | 2016 |

O corpus do português

| Corpus | Size | Created |
|---|---|---|
| Web / Dialects | 1 billion words | 2016 |

Formats

File Formats

The Libraries make the above corpora available for download. Each corpus is made up of at least five files (original documentation):

| File | Format | Columns |
|---|---|---|
| <corpus>-db.tar | SQL-style database | textID, ID (1–n), wordID (link to lexicon) |
| <corpus>-wlp.tar | Word, lemma, and part of speech in vertical format | textID, ID (1–n), word, lemma, PoS |
| <corpus>-text.tar | Each entire text on its own line | textID, text |
| <corpus>-lexicon.zip | Lexicon table for use with the db | wordID (link to database), word, lemma, PoS |
| <corpus>-sources.zip | Corpus-specific table of sources | textID, #words, and other columns |
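Once unpacked, the wlp files can be read as ordinary delimited text. A minimal sketch in Python, using invented sample rows and assuming tab-delimited columns (the actual delimiter and PoS tagset vary by corpus, so check the original documentation):

```python
import csv
import io

# Hypothetical sample of the "wlp" vertical format: one token per line with
# textID, ID, word, lemma, and PoS columns. The rows below are invented.
sample = "1001\t1\tThe\tthe\tat\n1001\t2\tcorpora\tcorpus\tnn2\n"

tokens = []
for text_id, token_id, word, lemma, pos in csv.reader(io.StringIO(sample),
                                                      delimiter="\t"):
    tokens.append({"textID": int(text_id), "ID": int(token_id),
                   "word": word, "lemma": lemma, "pos": pos})

print(tokens[1]["lemma"])  # -> corpus
```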

Each tar or zip file typically contains several text files. Each unique text has a textID, which is identified in the sources file and linked to from the corpus files. The wordID identifies each unique word, with its lemma and part of speech, and is linked to from the db and wlp files.
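The db-to-lexicon link can be illustrated with a toy example; every ID and row below is invented:

```python
# The db file stores only numeric wordIDs; the lexicon maps each wordID
# back to its word, lemma, and part of speech.
lexicon = {
    501: {"word": "The", "lemma": "the", "pos": "at"},
    502: {"word": "corpora", "lemma": "corpus", "pos": "nn2"},
}
db_rows = [(1001, 1, 501), (1001, 2, 502)]  # (textID, ID, wordID)

# Reassemble the running text for textID 1001 by resolving each wordID.
text = " ".join(lexicon[word_id]["word"] for _, _, word_id in db_rows)
print(text)  # -> The corpora
```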

Access

Accessing the Data

Researchers must contact data@library.columbia.edu, then read and agree to the restrictions on the data. For the time being, Research Data Services will make the files available to the requesting researcher for download from Google Drive.

Restrictions

The full-text corpus data comes with a series of restrictions, provided by Mark Davies. Researchers must acknowledge having read the restrictions and agree to abide by them. The restrictions include:

  1. In no case can substantial amounts of the full-text data (typically, a total of 50,000 words or more) be distributed outside the organization listed on the license agreement [Columbia University]. For example, you cannot create a large word list or set of n-grams and distribute it to others, and you cannot copy 70,000 words from different texts and place them on a website where users from outside your organization would have access to the data.
  2. The data cannot be placed on a network (including the Internet), unless access to the data is limited (via restricted login, password, etc.) just to those from the organization listed on the license agreement. For example, it cannot be placed on another corpus site that indexes the data and then makes it available to end users, because that other corpus site would then have access to the data.
  3. In addition to the full-text data itself, #2 also applies to derived frequency, collocates, n-grams, concordance, and similar data that is based on the corpus.
  4. If portions of the derived data are made available to others, they cannot include substantial portions of the raw frequency of words (e.g. the word occurs 3,403 times in the corpus) or the rank order (e.g. it is the 304th most common word). (Note: it is acceptable to use the frequency data to place words and phrases in “frequency bands,” e.g. words 1–1000, 1001–3000, 3001–10,000, etc. However, there should not be more than about 20 frequency bands in your application.)
  5. Academic licenses: valid for only one campus. So if you are part of a research group, for example, with members at universities X, Y, and Z, each university needs to purchase the data separately.
  6. Academic licenses: you cannot use the data to create software or products that will be sold to others.
  7. Academic licenses: students in your undergraduate classes cannot have access to substantial portions of the data (e.g. 50,000 words or more). Graduate students can have access to the data for work on theses and dissertations. The data is primarily intended for use in research, not teaching. If you need corpus data for undergraduate classes, please use the standard web interface for the corpora.
  8. Academic and Commercial licenses: supervisors will make best efforts to ensure that other employees or students who have access to the data are aware of these restrictions.
  9. [Not Applicable to Columbia] Commercial license: large companies with employees at several different sites (especially different countries) may need to contact us for a special license.
  10. There are no refunds, unless you find that the samples that you have examined are not representative of the “full” data that you download (which they are).
  11. Any publications or products that are based on the data should contain a reference to the source of the data: https://www.corpusdata.org. (See also: https://www.english-corpora.org/faq.asp#cite).
  12. Note that a small, unique change will be made to each set of data, and this will serve as a “fingerprint” to identify you as the unique source of the datasets that you download. Automated Google searches are run daily to find copies of the data on the Web. If we find the data online and it is the data that was sent to you (and we will be able to determine that is the case), then you will be required to contact the administrators for that website, to have the data removed.