Text Mining

An introduction to text mining resources at Columbia University

Text Mining, also called "text data mining (TDM)" or "textual data mining," joins numerical and geospatial data analysis among the data-intensive modes of inquiry that power cutting-edge scholarship. 

Getting started in text mining is difficult, and many projects quickly encounter licensing or copyright limitations. Just because it is easy to write a script that scrapes the contents of websites does not mean that doing so falls within acceptable use, and abusing Columbia licenses to scrape licensed content could get the researcher into trouble.
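As a small illustration of the care scraping requires, here is a minimal Python sketch (the URLs are placeholders, not real targets) that consults a site's robots.txt before fetching a page:

    import urllib.robotparser

    # Ask the site's robots.txt whether crawling this page is permitted.
    # The URLs below are placeholders for illustration only.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    page = "https://example.com/articles/some-article.html"
    if rp.can_fetch("*", page):
        print("robots.txt permits fetching this page")
    else:
        print("robots.txt disallows fetching this page")

Keep in mind that passing a robots.txt check is necessary but not sufficient: the site's terms of use and Columbia's license agreements still govern what may be collected.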

Roughly speaking, text mining is made up of three large steps:

  1. Gathering the text to be mined, or building the corpus
  2. Mining the text, or analyzing the corpus
  3. Interpreting, publishing, and sharing the results of the analysis

This guide will give suggestions for approaching each of these steps in turn.

TL;DR: Email Research Data Services (data@library.columbia.edu) to set up a consultation for your text mining research project.

Building a Corpus (Gathering Text Data)

Oftentimes building a corpus occurs in tandem with research design. After all, "I want to mine the contents of Company X's internal emails to see how their organizational structure evolved" is only viable as a research program with access to Company X's internal emails. Additionally, though certain content, like all of the articles ever published in Journal X, might appear publicly available, downloading all of those articles to build a corpus may run afoul of acceptable use policies.

We recommend involving the team at Research Data Services in research design. We can help you find what sorts of corpora are already available, and we can also help with determining what is acceptable for use in a corpus and what is not. Email us at data@library.columbia.edu to set up a consultation.

Freely Available Corpora

Data scientists will often publish their corpora for scholarly re-use by researchers like you. Similarly, texts in the public domain often make for tantalizing corpora for text mining. Here are a few resources for acquiring corpora:
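One easy starting point is the small sample of public-domain Project Gutenberg texts that ships with the NLTK library. The sketch below, which assumes NLTK is installed, downloads that sample and inspects it:

    import nltk

    # Fetch NLTK's bundled public-domain sample (a small Project Gutenberg selection).
    nltk.download("gutenberg")

    from nltk.corpus import gutenberg

    # List the texts in the sample along with their sizes.
    for fileid in gutenberg.fileids():
        print(fileid, len(gutenberg.words(fileid)), "words")

A sample this small is only good for experimenting with techniques, but it lets you try an analysis pipeline end to end before committing to a full-scale corpus.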

Licensed Corpora

Often researchers are interested in corpora that don't exist yet. We get frequent requests from researchers who want to mine a specific publication in a specific way that has never been done. In these cases, the corpus is often built by the researcher, relying on licensed materials. 

While the Libraries will not pay to acquire a corpus for a specific research project, publishers may be willing to provide the desired text in some format, and the fees can be incorporated into grant funding requests. Alternatively, researchers can make use of specific text mining tools some vendors are now making available to mine their licensed content. This is a rapidly changing field, and the information below may be out of date. Contact RDS (data@library.columbia.edu) for a consultation.

  • TDMStudio: Brought to us by ProQuest, TDMStudio allows researchers to develop massive datasets using ProQuest's holdings of contemporary and historical newspapers (with a bias towards American publications), literature, dissertations, and other academic publications. Contact RDS to see about gaining access to a TDMStudio workbench.
  • Eighteenth Century Collections Online (ECCO): The Libraries have access to the full ECCO set of scanned documents and OCR'ed text available for researchers pursuing TDM. Contact RDS for more information.
  • CrowdTangle: Provided by Facebook, CrowdTangle allows researchers to amass large corpora of posts to Facebook pages and groups as well as posts from Instagram and Reddit. Contact RDS for more information.

Analyzing a Corpus (Mining the Text)

Once you have a corpus, of course, you want to "mine," or analyze, it. Typically, research questions in TDM boil down to one of two questions about classification:

  1. "If I feed the computer a random document, based on my corpus, will the computer guess that it is of type A or type B (or type C...)?" In this case, the various types can be anything from different genres, eras in history, specific authors, or whatever is salient to the researcher. Because this method relies on training the computer ahead of time on a portion of the corpus, it is called supervised learning.
  2. "If I feed my whole corpus into the computer and ask it to split the corpus up into n groups, what do those groups look like?" In this case, the computer does not know ahead of time what the categories inherent in the corpus are. The researcher simply asks it to generate n categories, where can be as small as 2 or 3 or as large as 50 or more. Because this method relies on the computer working on the corpus with no prior knowledge of it, it is called unsupervised learning

The appropriate mode of analysis depends on the research question, and within each branch of learning there are many different choices to make along the way. The sketch below illustrates both branches in miniature.
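Here is a minimal Python sketch using scikit-learn, with a toy corpus and labels invented purely for illustration: a Naive Bayes classifier stands in for the supervised case, and k-means clustering for the unsupervised case.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.cluster import KMeans

    # Toy documents and labels invented for illustration; real corpora are far larger.
    docs = [
        "the ship sailed across the stormy sea",
        "the captain charted a course through the waves",
        "the court heard arguments from both lawyers",
        "the judge issued a ruling on the appeal",
    ]
    labels = ["maritime", "maritime", "legal", "legal"]

    # Turn raw text into numeric TF-IDF features.
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    # Supervised: train on labeled documents, then classify a new one.
    clf = MultinomialNB().fit(X, labels)
    print("predicted type:", clf.predict(vec.transform(["the jury reached a verdict"]))[0])

    # Unsupervised: ask for n = 2 groups with no labels at all.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster assignments:", km.labels_)

In practice, the choice of features, model, and number of groups all depend on the corpus and the research question; this is exactly the kind of decision RDS can help think through.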

Tools for Analysis

Unlike corpora, which can be guarded by paywalls or other subscription models, most analysis tools are built upon (or are themselves) freely available, open-source software. Most researchers who come to RDS with questions about TDM anticipate doing their analysis in either a Python or R programming environment, with Python the clear leader. However, these are not the only options. For example:

  • Voyant is a web-based software package that presents many familiar text analysis techniques in a relatively straightforward, graphically rich environment. 

Similarly, companies that manage access to corpora are beginning to provide their own in-house environments to TDM researchers who want to use their tools. These may involve extra cost to the researcher. For example:

  • ProQuest TDM Studio provides access to the various ProQuest databases Columbia subscribes to and then offers a web-based Python or R environment for researchers to mine the content without running afoul of licensing issues. It is free to all current researchers affiliated with Columbia, but requires contacting RDS first.

Interpreting, Publishing, and Sharing the Results of the Analysis

Once the computer is done mining the corpus, the researcher has results to interpret and fit into scholarly discussion. Often, these results can take a graphical form as charts, graphs, maps, or other visualizations. RDS does not provide help with interpretation per se, but we can help with various tools for visualization, like:

  • Voyant, mentioned above, provides various visualizations as part of the process of analysis, making it simple for researchers to create word clouds, maps, or charts that demonstrate various features of the corpus.
  • Matplotlib is the standard Python visualization library (see the sketch after this list).
  • ggplot2 is the standard R visualization library.
  • For interactive, web-based visualizations, RDS can help researchers using JavaScript libraries like D3.
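As a small example of the Matplotlib route, here is a sketch that plots the most frequent words in a corpus as a bar chart (the counts are invented for illustration):

    from collections import Counter
    import matplotlib.pyplot as plt

    # Invented counts standing in for frequencies computed from a real corpus.
    counts = Counter({"ship": 120, "sea": 95, "captain": 80, "court": 60, "judge": 45})

    words, freqs = zip(*counts.most_common())
    plt.bar(words, freqs)
    plt.xlabel("word")
    plt.ylabel("frequency")
    plt.title("Most frequent words in the corpus")
    plt.tight_layout()
    plt.show()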

For publishing and sharing, RDS defers to our colleagues in Scholarly Communications at the Libraries.
