Text Mining: ProQuest TDM Studio

An introduction to text mining resources at Columbia University

ProQuest TDM Studio

Columbia University Libraries holds a license that gives current affiliates free access to ProQuest's TDM Studio, a web-based platform for text and data mining (TDM) research on ProQuest's many full-text databases. TDM Studio comes in two versions:

  1. Visualization is available to any Columbia affiliate and requires only registration with ProQuest. Appropriate for newcomers to TDM and programming, Visualization lets the researcher build a corpus of relevant ProQuest documents, such as dissertations, newspaper articles, or books, and then offers two ways to visualize the dataset:
    1. Maps that show the results of named-entity recognition of the toponyms in the corpus
    2. Tables that cluster the corpus into various "topics," determined algorithmically and featuring keywords in common, a process called "topic modelling."
  2. Workbench is available to any Columbia affiliate but requires first meeting with Research Data Services (RDS; email data@library.columbia.edu to set up an appointment) and prior programming knowledge in either Python or R. As with Visualization, Workbench lets the researcher build a corpus of relevant ProQuest documents, but it then provides a Jupyter Notebook environment where they can pursue nearly any mode of TDM analysis possible in either programming language.

Creating a Dataset in TDM Studio

In TDM Studio, corpora are referred to as "datasets," and constructing one is the first step in using TDM Studio.

Important Limitations:

  1. A dataset can contain no more than two million documents (newspaper articles, dissertations, books, etc.).
  2. A Workbench can hold only ten datasets at any given time (five for Visualization), although once a dataset has been imported into the Jupyter Notebook environment, it can safely be removed from the list to free up a slot.

Similar to using regular ProQuest search, in TDM Studio, the researcher:

  • Selects the relevant ProQuest databases or publication titles (more than one can be selected),
  • Refines the initial dataset by filtering on type, date, or other metadata, and/or by using specific search terms, optionally combined with ProQuest's field codes, and
  • Creates the dataset.

The dataset will take a few minutes to be generated, but once it is available, the researcher can access it from their TDM Studio Jupyter environment.

Using the TDM Studio Jupyter Notebook Environment

Once a researcher has created a dataset, after a few minutes they can set up a Jupyter Notebook environment that has unique access to their dataset(s).

Important Limitations:

  1. The virtual machine provisioned by ProQuest has 16 GB of RAM, 100 GB of hard drive space, and 4 CPUs.
  2. The virtual machine allows only 15 MB of downloads per week, to prevent researchers from downloading whole datasets.
  3. The environment can crash if the researcher uses the Jupyter Notebook file browser to browse the dataset. They should use the embedded terminal instead.

ProQuest is somewhat flexible on these limitations, but relaxing them requires having RDS mediate between the researcher and ProQuest.
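Because of the weekly download cap listed above, it can help to check how large a set of results files is before requesting an export. The following is a minimal sketch, not a ProQuest tool: the function name and the folder it is given are hypothetical, and it assumes the cap is counted as 15 * 1024 * 1024 bytes, which the limitation as stated does not specify.

```python
from pathlib import Path

# Assumption: "15 MB" is treated as 15 * 1024 * 1024 bytes. How ProQuest
# actually measures the weekly cap is not documented here.
WEEKLY_LIMIT_BYTES = 15 * 1024 * 1024


def export_size_report(folder, limit_bytes=WEEKLY_LIMIT_BYTES):
    """Sum the sizes of all files under `folder` and compare to the cap."""
    total = sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())
    return {
        "total_bytes": total,
        "limit_bytes": limit_bytes,
        "fits": total <= limit_bytes,
    }
```

Calling export_size_report on a results folder returns the total size in bytes and whether it fits under the assumed limit. Aggregated outputs such as frequency tables or model summaries are usually far smaller than the source documents they were derived from.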

The default Jupyter Notebook environment includes helpful tutorials from ProQuest about accessing TDM Studio datasets. Technically, the process involves copying the dataset to an AWS S3 bucket to which the Jupyter Notebook has access. Once the dataset has been copied to the bucket, it can be deleted from the TDM Studio Workbench Dashboard, freeing up one of the ten dataset slots.

The dataset takes the form of a folder that contains all of the documents, each an XML file with rich metadata.
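To illustrate working with a folder of XML documents, here is a minimal sketch that parses only the first few files and pulls out selected fields. It uses Python's standard library rather than ProQuest's tutorial code, and the function name, the dataset path, and the element names passed in are all assumptions; open one file in the terminal first to see which tags the documents actually use. Reading the folder lazily with os.scandir avoids building a full listing of a very large dataset.

```python
import os
import xml.etree.ElementTree as ET
from itertools import islice


def peek_documents(dataset_dir, fields, limit=5):
    """Parse up to `limit` XML files in `dataset_dir`, reading lazily.

    `fields` is a list of element names to extract. These names are
    hypothetical placeholders; check a real file for the actual tags.
    """
    records = []
    with os.scandir(dataset_dir) as entries:
        # Generator: entries are examined one at a time, never all at once.
        xml_entries = (
            e for e in entries if e.is_file() and e.name.endswith(".xml")
        )
        for entry in islice(xml_entries, limit):
            root = ET.parse(entry.path).getroot()
            record = {"file": entry.name}
            for tag in fields:
                # findtext returns None when no such element exists.
                record[tag] = root.findtext(f".//{tag}")
            records.append(record)
    return records
```

A call might look like peek_documents("data/my_dataset", ["Title", "NumericDate"]), where both the path and the tag names are illustrative. The files come back in whatever order the file system returns them, so the sample is not sorted.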

For Python users, the environment has access to all of the packages that are available for install via conda. Custom libraries can also be uploaded into the environment.
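Before starting an analysis, it may be useful to check which packages are already importable in the environment. The sketch below uses only the standard library; the function name is hypothetical, and the package names in the usage note are examples rather than a statement of what the environment ships with.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level import names in `names` that cannot be found."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is None]
```

For example, missing_packages(["pandas", "nltk", "gensim"]) returns whichever of those names cannot be imported. Anything missing can then be installed from the embedded terminal with conda install followed by the package name, assuming the package is available through conda.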
