Digital Humanities Guide: Capture

Most digital humanities work draws on preexisting digital or analog resources. The strategies for identifying and capturing those resources vary with their format.


DISCOVERING: FINDING AND RETRIEVING MATERIAL

Columbia Libraries' website brings together a uniquely strong collection of resources for humanities research, supplementing the vast array of publicly accessible material on the Internet. The links below can help you search both and retrieve the information and sources you need efficiently.

  • Identifying the Databases and Datasets You Need on Columbia Libraries' website
  • Searching across many databases on Columbia Libraries' website
  • Best Practices
  • Tools for Searching
  • Databases ready for DH and ones requiring further extraction work

GATHERING: LARGE-SCALE OR AUTOMATED RETRIEVAL

For large-scale projects, the standard retrieval interfaces may be too slow. We are happy to help you explore other options for mass mining and capture, including ones that scrape and harvest content from the web automatically.

  • APIs
  • Specialized environments, such as the HathiTrust Research Lab
  • Using Python or other programming languages for scraping
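
As a rough illustration of the scraping option, the sketch below collects links from a page of HTML using only Python's standard library. The class name LinkCollector, the helper names, the user-agent string, and the example.org URLs are hypothetical placeholders, not part of any Columbia service. Before harvesting a real site, check its terms of use and ask whether an API is available.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html, base_url):
    """Return every link found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, user_agent="dh-capture-sketch/0.1 (hypothetical)"):
    """Download a page as text; the user-agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A small inline sample, so the sketch runs without network access.
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="/texts/1.xml">First  text</a></li>'
    '<li><a href="https://example.org/2">Second</a></li>'
    '<li><a name="x">no href</a></li>'
    '</ul>'
)

if __name__ == "__main__":
    for url, text in extract_links(SAMPLE_HTML, "https://example.org/catalog/"):
        print(url, text)
```

For a live page, the two helpers can be combined as extract_links(fetch(url), url). Pausing between requests keeps the load on the remote server low.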

IMAGING: SCANNING AND OCR

At the moment our scanners are not available to the public. You can access OCR software virtually through CUIT.

RECORDING: DIGITAL CAPTURE OF SOUND OR MOTION

The DHC does not currently provide equipment for the recording and digitization of audio and video content, although we are exploring options for such a service. (Undergraduate film majors wanting to reserve equipment for their projects can make arrangements with the Film Department.) In the meantime, we can offer some recommendations.

TRANSCRIPTION: INPUTTING TEXT

While manuscripts and poorly printed material can be scanned, they are unlikely to yield useful OCR. (Note, however, that large bodies of OCRed text containing many errors can sometimes still lend themselves to certain types of text mining and analysis. Consult with staff to learn more.) If very accurate digital text is required, you will need to use a transcription tool. Reliable tools for turning speech into digital text are currently offered only as services by private companies outside Columbia.
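
To suggest why error-laden OCR can still support some kinds of analysis, here is a minimal sketch of a word-frequency count. It assumes English text in plain ASCII letters. The function names, the three-letter minimum, and the sample sentence are our own illustrative choices, not a recommended method.

```python
import re
from collections import Counter


def tokenize(text, min_length=3):
    """Lowercase the text and keep alphabetic runs of at least min_length letters.

    Digits and punctuation split words, so an OCR error such as "wha1e"
    breaks into fragments instead of being counted as "whale".
    """
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= min_length]


def term_counts(text):
    """Count how often each surviving token occurs."""
    return Counter(tokenize(text))


# Invented sample with two simulated OCR errors: "wha1e" and "tbe".
NOISY = "The whale, the wha1e! The white whale swam; tbe whale sank."

if __name__ == "__main__":
    for term, count in term_counts(NOISY).most_common(3):
        print(term, count)
```

The damaged words turn into rare fragments, while the correctly recognized occurrences of frequent words still add up. Aggregate measures of this kind tolerate noise better than tasks that depend on every word being right, such as quotation or close reading.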

CONVERSION: CHANGING FROM ONE FORMAT TO ANOTHER

At Columbia Libraries we support a number of programs for converting text, image, audio, and video files from one format to another. Some of the most powerful are noted below. Texts created or saved in standard markup formats, such as XML or Markdown, can easily be output in a variety of other formats. We can also consult on custom scripted solutions for your needs.

  • Text: Pandoc
  • Image: Adobe Photoshop, GIMP, ImageMagick
  • Audio: Audacity, Pro Tools
  • Video: Compressor (a component of Final Cut), VLC
  • Unique data conversions: Python, Ruby
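
The point about marked-up text can be sketched in a few lines of Python: once a document's structure is explicit, the same content can be written out in more than one format. The XML vocabulary here (document, title, para) and the function names are hypothetical, chosen only for illustration. Real projects would more likely use an established schema and a tool such as Pandoc.

```python
import xml.etree.ElementTree as ET
from html import escape

# A tiny sample document in a made-up XML vocabulary.
SAMPLE_XML = """<document>
  <title>Field Notes</title>
  <para>First paragraph.</para>
  <para>Second &amp; last paragraph.</para>
</document>"""


def parse_document(xml_text):
    """Return the title and the list of paragraph strings."""
    root = ET.fromstring(xml_text)
    title = (root.findtext("title") or "").strip()
    paragraphs = [(p.text or "").strip() for p in root.findall("para")]
    return title, paragraphs


def to_markdown(title, paragraphs):
    """Render as Markdown: a level-one heading, then paragraphs."""
    return "\n\n".join([f"# {title}"] + paragraphs) + "\n"


def to_html(title, paragraphs):
    """Render as an HTML fragment, escaping special characters."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return f"<h1>{escape(title)}</h1>\n{body}\n"


if __name__ == "__main__":
    title, paragraphs = parse_document(SAMPLE_XML)
    print(to_markdown(title, paragraphs))
    print(to_html(title, paragraphs))
```

Because parsing is kept separate from rendering, adding another output format means writing one more small function, and the source file stays untouched.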