Text Mining: TDM Studio HOWTOs

An introduction to text mining resources at Columbia University

Introduction

Now that you’ve got your TDM Studio workbench set up, you may want some guidance on how to use the service. Luckily, the documentation provided by ProQuest is pretty good. Still, below we list a few tips we picked up while using TDM Studio in our own research. Some are covered in the documentation and some are not, but their shared goal is to help you, the researcher, get used to working in a Jupyter environment you cannot control.

Opening the Workbench Refresher

After building a dataset at https://tdmstudio.proquest.com/workbenchdashboard, you will see in the top right corner a grayed-out button “Open Jupyter Notebook” and, below it, a toggle set to “Off.” Flip that toggle, wait up to ten minutes, and the “Open Jupyter Notebook” button will become clickable, giving you access to your workbench.

Environment Refresher

The TDM Studio workbench is a virtual Linux machine deployed and controlled by ProQuest. The datasets you build at https://tdmstudio.proquest.com/workbenchdashboard are loaded into that environment automatically. Your home directory in the context of the shell is /home/ec2-user/, but in the context of the Jupyter environment, it’s /home/ec2-user/SageMaker/. Hence, to reach your data folder in the terminal, you use cd SageMaker/data/, while data appears as a root-level folder in Jupyter. Do not open the data folder in Jupyter; always use the terminal.
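
For example, to orient yourself from inside a notebook (a minimal sketch; the folder names under data/ will be whatever you named your datasets):

import os

# Notebooks start in /home/ec2-user/SageMaker/, one level below the shell's home directory.
print(os.getcwd())

# Each dataset you built appears as its own folder under SageMaker/data/.
print(os.listdir("/home/ec2-user/SageMaker/data/"))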

Within Jupyter, you are in an Anaconda regime, meaning you use conda install and not pip install.

The environment is fond of kicking you out, either by closing the web interface or by shutting down the machine entirely. The web interface closes rather quickly, but any notebooks you have running will keep running. Eventually (after 48 hours), however, the virtual machine will shut itself down too, stopping any scripts you had running. We address this below.

Finally, you should assume that the virtual machine is disconnected from the Internet. There are a very small number of actions you can execute that rely on the Internet (or give the impression thereof), but generally you are out of luck. You cannot use requests to download data to the VM from GitHub, for example. We address this, in part, below as well.

Opening the Terminal Refresher

Since we rely on the terminal in what follows, recall that you open it by selecting “New > Terminal” from the dropdown in the top right of the Jupyter home screen:

A dropdown menu showing how to open a terminal in Jupyter.

And now, on to the tips.

Create Your Own Kernel ASAP

The TDM Studio environment comes packed with a bunch of Python and R kernels, including several default kernels provided by Amazon. Changes to these kernels do not persist, and though they may come loaded with most of the libraries you need, they may be relying on outdated versions of those libraries. We found particular issues around a version of pandas not new enough to support parquet files, but when you read this, the list of fussy libraries may be different. Luckily, it’s possible to build your own kernel that persists across reboots of your TDM Studio Workbench. To create a kernel:

  1. Open a terminal inside the Jupyter environment (see above)
  2. Run cenv python my_kernel, replacing my_kernel with whatever name you want to give your kernel. Alternatively, use R instead of python to create an R kernel.
  3. After refreshing the Jupyter home page, your new kernel should appear in the “New” dropdown menu when you create a new notebook.
  4. Add new libraries with %conda install <library> inside a Jupyter notebook, not in the terminal.

A good first cell for your first notebook with your new kernel:

%conda update -n base -c conda-forge conda
%conda update --all
%conda install lxml
%conda install beautifulsoup4
%conda install pyarrow
%conda install tqdm

This cell:

  1. Updates conda to the newest version.
  2. Updates all packages to their newest versions.
  3. Installs lxml, an XML parser that is used in ProQuest’s demo notebooks.
  4. Installs BeautifulSoup, which pulls data from XML files and is also used in ProQuest’s demo notebooks.
  5. Installs pyarrow, for working with parquet files (perhaps superfluously, because it is already installed as part of the pandas upgrade in step 2).
  6. Installs tqdm, which provides access to tqdm.notebook, for helpful progress bars in your notebooks.

Once these are all installed, they will be available in any future notebook with your new kernel, obviating the need to run these commands with every restart of the VM.

Think in Batches

Because the VM will shut itself off after two days no matter what, you have to think of your work in terms of batches. If you are processing millions of documents, it makes sense to break the analysis into 10,000-document batches, each saved as its own parquet file (this is one reason we are adamant about updating pandas to use parquet files). A reasonable workflow, with code, could look like this:

Build a Gold Standard List of Filenames to Consult When Iterating

  1. Get a list of your documents.

    import os
    
    dataset_name = "YOUR_DATASET_NAME" # Use the dataset name you gave when you created it with your search.
    corpus_path = f"/home/ec2-user/SageMaker/data/{dataset_name}/" 
    files = os.listdir(corpus_path)
    
  2. Convert the list of files to a dataframe limited to files ending in “.xml.”

    import pandas as pd
    
    xml_files = [file for file in files if file.endswith('.xml')]
    df = pd.DataFrame(xml_files, columns=['Filename'])
    
  3. Shuffle the filenames so that, as you iterate through the files, you are effectively sampling uniformly from the corpus.

    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    df = df.reset_index()
    

    The reset_index() call also creates an explicit index column, as the pandas index itself is not exported when saving to parquet below.

  4. Save the shuffled file list as a parquet file you will consult when iterating over millions of documents.

    df.to_parquet("shuffled_filenames_with_indices.parquet", index=False)
    

    What we end up with is a dataframe with two columns, “index” and “Filename.” The former will act as a cursor in the future, and the latter indicates which file we have to read next.
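
In a later session (for example, after the VM has shut itself down), reload this saved list instead of regenerating it; re-running the shuffle would reorder the filenames and throw off the checkpoints described next.

import pandas as pd

# Reload the gold standard file list; do not re-shuffle, or the batch
# boundaries below will no longer line up with your saved checkpoint.
df = pd.read_parquet("shuffled_filenames_with_indices.parquet")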

Use a Checkpoint System to Handle the VM’s Auto-Shutoff

  1. Set some constants.

    batch_dir = "batches"
    batch_size = 10000 # How many documents to process at once
    checkpoint_path = "checkpoint.txt" # The file holding the checkpoint information
    
  2. See if the batches/ directory already exists and, if not, create it.

    os.makedirs(batch_dir, exist_ok=True)
    
  3. See if there already is a checkpoint set, else start iterating at 0.

    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r") as f:
            start_batch = int(f.read().strip())
    else:
        start_batch = 0
    
  4. Using start_batch and batch_size, iterate over the file list data frame.

    from tqdm.notebook import tqdm
    
    for start in range(start_batch * batch_size, len(df), batch_size):
        end = start + batch_size
        batch = df.iloc[start:end].copy()
        batch_number = start // batch_size + 1
        
        print(f"Processing batch {batch_number}: rows {start} to {end - 1}")
        
        try:
            for index, row in tqdm(batch.iterrows(), total=len(batch)):
                filename = row['Filename']
                
                # Process the file (see the sketch after this code block)
    
            batch_file = os.path.join(batch_dir, f"analyzed_batch_{batch_number}.parquet") 
            batch.to_parquet(batch_file, index=False)
            with open(checkpoint_path, "w") as f:
                f.write(str(batch_number))
        
        except Exception as e:
            print(f"Error processing batch {batch_number}: {e}")
            break
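
The # Process the file placeholder above is where your per-document analysis goes. As a hedged sketch only: assuming you want to pull the plain text out of each ProQuest XML record with BeautifulSoup and lxml (both installed in the kernel setup above), a helper like the hypothetical extract_text() below could be called inside the loop, with the result stored in an illustrative word_count column of the batch dataframe. The exact fields worth extracting depend on your dataset; ProQuest’s demo notebooks show the element names available in their records.

from bs4 import BeautifulSoup

def extract_text(xml_path):
    """Return the plain text of one ProQuest XML record (sketch only)."""
    with open(xml_path, "r", encoding="utf-8") as xml_file:
        soup = BeautifulSoup(xml_file, "lxml")
    # get_text() strips all markup; swap in soup.find(...) calls if you need
    # specific fields such as the title or publication date.
    return soup.get_text(separator=" ", strip=True)

# Inside the batch loop, in place of the placeholder:
#     text = extract_text(os.path.join(corpus_path, filename))
#     batch.loc[index, "word_count"] = len(text.split())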
    

Work on a Concatenated DataFrame

Read all the batch files and concatenate them into one dataframe for further analysis or export.

batch_prefix = "analyzed_batch_"

# Sort numerically by batch number; a plain lexicographic sort would put
# analyzed_batch_10.parquet before analyzed_batch_2.parquet.
batch_files = sorted(
    [
        os.path.join(batch_dir, f)
        for f in os.listdir(batch_dir)
        if f.startswith(batch_prefix) and f.endswith(".parquet")
    ],
    key=lambda path: int(os.path.basename(path)[len(batch_prefix):-len(".parquet")]),
)
    
if not batch_files:
    raise ValueError(f"No batch files found in {batch_dir} with prefix '{batch_prefix}'")
    
all_batches = [pd.read_parquet(f) for f in batch_files]
concatenated_df = pd.concat(all_batches, ignore_index=True)
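
If you plan to revisit the combined results, it can be worth writing them out once as a single file as well, so a later session only has to read one parquet file (the filename here is just an example):

concatenated_df.to_parquet("combined_results.parquet", index=False)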

Hugging Face without the Internet

Many data-intensive libraries, like NLTK or Hugging Face’s Transformers, rely on Internet connections to download datasets or models. This makes them difficult to use on systems like TDM Studio, which provides a virtual machine effectively disconnected from the Internet. NLTK’s Punkt tokenizer, for example, is already loaded on the TDM Studio VM, so you can use it without calling the nltk.download() function, which relies on an Internet connection.
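
For instance, NLTK’s sentence tokenizer, which would normally require a call to nltk.download('punkt') first, should work as-is on the VM (a minimal sketch; the sample text is arbitrary):

from nltk.tokenize import sent_tokenize

# Punkt is preloaded on the TDM Studio VM, so no nltk.download() call is needed.
print(sent_tokenize("The VM has no Internet access. The Punkt models are already on disk."))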

Hugging Face’s Transformers library, however, downloads models from the Internet and then saves them with obscure names in a cache directory—when the Internet is even working. So on TDM Studio, the process is a bit more complex and requires initial work on a computer with the Internet. For the purposes of this guide, let’s assume you are working with Google’s BERT Base Model (Uncased).

  1. On your local computer, create a Python script that, first, loads the Hugging Face model. Use the code shown when you click on “Use this Model > Transformers” on the Hugging Face page for your model of choice.

    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")
    
  2. Next, have the script save the model and tokenizer locally on your computer.

    tokenizer.save_pretrained("bert-base-uncased")
    model.save_pretrained("bert-base-uncased")
    

    This saves the pertinent files to the folder bert-base-uncased.

  3. Finally, in the shell, bundle up the folder to prepare it for uploading into TDM Studio.

    tar cf bert-base-uncased.tar bert-base-uncased
  4. Once you have transferred the tarball to TDM Studio, unpack it in the terminal with tar xf bert-base-uncased.tar and then reference the unpacked folder in your notebooks like this.

    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
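
    As a quick check that the offline load worked (a sketch; the example sentence and variable name are arbitrary), you can wrap the locally loaded model and tokenizer in a fill-mask pipeline:

    from transformers import pipeline

    # No network access is needed here, since the model and tokenizer objects
    # were loaded from the unpacked local folder above.
    fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    print(fill_mask("Text mining in TDM Studio is [MASK]."))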
    
