Text mining, also called "text data mining (TDM)" or "textual data mining," joins numerical and geospatial data analysis among the data-intensive modes of inquiry that power cutting-edge scholarship.
Getting started in text mining is difficult, and many projects quickly run into licensing or copyright limitations. A script that scrapes the contents of a website may be easy to write, but that does not mean running it falls within acceptable use, and abusing Columbia licenses to scrape licensed content could get the researcher into trouble.
Roughly speaking, text mining is made up of three large steps: building a corpus, analyzing that corpus, and interpreting and sharing the results.
This guide will give suggestions for approaching each of these steps in turn.
TL;DR: Email Research Data Services (data@library.columbia.edu) to set up a consultation for your text mining research project.
Oftentimes building a corpus occurs in tandem with research design. After all, "I want to mine the contents of Company X's internal emails to see how their organizational structure evolved" is only viable as a research program with access to Company X's internal emails. Additionally, though certain content, like all of the articles ever published in Journal X, might appear publicly available, grabbing all of the articles for building a corpus may run against acceptable use policies.
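One machine-readable signal that a site publishes about automated access is its robots.txt file. The sketch below, in Python, shows how a script might check such rules before requesting a page. It is only an illustration under stated assumptions: the rules, the `journal.example` URLs, and the `allowed` helper are hypothetical, and passing a robots.txt check does not replace a license agreement or an acceptable use policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed from a string so the example runs offline.
SAMPLE_RULES = """\
User-agent: *
Disallow: /articles/
Allow: /about/
"""


def allowed(rules_text: str, url: str, agent: str = "my-research-bot") -> bool:
    """Return True if the given robots.txt rules permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on a placeholder domain.
    for url in (
        "https://journal.example/articles/123",
        "https://journal.example/about/",
    ):
        print(url, "allowed" if allowed(SAMPLE_RULES, url) else "disallowed")
```

Even when a check like this passes, the terms of the license or subscription still govern what may be collected.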
We recommend involving the team at Research Data Services in research design. We can help you find what sorts of corpora are already available, and we can also help with determining what is acceptable for use in a corpus and what is not. Email us at data@library.columbia.edu to set up a consultation.
Data scientists often publish their corpora for scholarly re-use by researchers like you. Similarly, texts in the public domain often make for tantalizing corpora for text mining. Here are a few links to resources for acquiring corpora:
Often researchers are interested in corpora that don't exist yet. We get frequent requests from researchers who want to mine a specific publication in a specific way that has never been done. In these cases, the corpus is often built by the researcher, relying on licensed materials.
While the Libraries will not pay to acquire a corpus for a specific research project, publishers may be willing to provide the desired text in some format, and the fees can be incorporated into grant funding requests. Alternatively, researchers can make use of specific text mining tools some vendors are now making available to mine their licensed content. This is a rapidly changing field, and the information below may be out of date. Contact RDS (data@library.columbia.edu) for a consultation.
Once you have a corpus, of course, you want to "mine" it or analyze it. Typically research questions regarding TDM boil down to one of two questions about classification:
The appropriate mode of analysis depends on the research question. Then, within each branch of learning, there are many different choices to make along the way.
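Whichever mode of analysis is chosen, a common first step is turning text into numbers that can be compared. The following is a minimal sketch in Python using only the standard library: it counts terms in each document and compares documents by cosine similarity. The three-sentence corpus and the function names are hypothetical, and a real project would likely add steps such as stop-word removal and use a dedicated library.

```python
import math
import re
from collections import Counter

# Hypothetical three-document corpus standing in for real texts.
CORPUS = {
    "doc1": "The committee approved the budget.",
    "doc2": "The budget was rejected by the committee chair.",
    "doc3": "Rain fell on the harbor all night.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(text: str) -> Counter:
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    counts = {name: term_counts(text) for name, text in CORPUS.items()}
    for other in ("doc2", "doc3"):
        score = cosine_similarity(counts["doc1"], counts[other])
        print(f"doc1 vs {other}: {score:.3f}")
```

In this toy corpus, the two documents about the committee and the budget score as more similar to each other than either does to the unrelated third document, which is the kind of signal that classification methods build on.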
Unlike corpora, which can be guarded by paywalls or other subscription models, most analysis tools are built upon (or are themselves) freely available, open-source software. Most people who come to RDS with questions about TDM anticipate doing their analysis with software that relies on either a Python or an R programming environment, with Python the clear leader. However, these are not the only two options for researchers. For example:
Similarly, companies that manage access to corpora are beginning to provide their own in-house environments to TDM researchers who want to use their tools. These may involve extra cost to the researcher. For example:
Once the computer is done mining the corpus, the researcher has results to interpret and fit into scholarly discussion. Often, these results can take a graphical form as charts, graphs, maps, or other visualizations. RDS does not provide help with interpretation per se, but we can help with various tools for visualization, like:
For publishing and sharing, RDS defers to the expertise of our colleagues in Scholarly Communications at the Libraries.