Digital Humanities Guide: Enrichment

Before you can make the fullest use of text you have captured or created, you may need to correct errors, enhance quality, add markup to identify structural elements, and add description or commentary in the form of metadata or annotations. These processes are addressed below under the headings of cleanup, editing, and annotation.

CLEANUP

The data you retrieve from digital collections at Columbia or elsewhere on the web, or produce by photography, recording, scanning, or OCR, will often not be of the quality you need for study or presentation, particularly if you are interested in using tools for textual or other content analysis. There may be many errors in the text and layout; images or media files may not be in the optimal form for presentation or may need to be combined with one another; or the material may not be divided into the units you need for your work. A variety of tools are available for cleaning up text, numeric, image, and media files.

  • Text
    • Abbyy FineReader creates a collection of OCR'ed text linked to images of the pages on which it occurs, and can assist in cleanup thanks to its highlighting of uncertain readings, its ability to zoom in on a particular section of the page image, and its relatively robust set of editing and search tools. The software also provides a range of tools for editing images, manually creating zones for reading, training OCR on unfamiliar typefaces or languages, formatting output, and applying templates that can result in more accurate textual output.
    • Word processors, text editors, and Python: If you need to identify or apply specific text formatting (bold, italics, font size, etc.), Microsoft Word (Mac and PC) may actually provide the best solution. For large replace and cleanup tasks, you may want to use a plain text editor such as Visual Studio Code, or Python scripts. For all of these, a working knowledge of regular expressions is especially valuable.
  • Datasets: For regularizing and correcting data in tabular or spreadsheet form, the best option is probably OpenRefine, a free, open source tool originally developed at Google as Google Refine.
  • Audio files: In addition to the tools listed below for Audio in the section on Editing, you may also want to explore what else may be available at the Digital Music Lab.
  • Video files: Some basic refinement of video files can be performed with Adobe Premiere, available at Butler 305 workstations.
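
The kind of large replace and cleanup task mentioned under Text above can be sketched in Python with regular expressions. This is only a minimal illustration, not a recommended pipeline: the function name clean_ocr_text, the three rules, and the sample string are all assumptions made up for this example, and real OCR output will usually need rules tailored to the source.

```python
import re

def clean_ocr_text(text):
    """Apply a few illustrative regex cleanups to OCR output (hypothetical rules)."""
    # Rejoin words split by a hyphen at a line break: "manu-\nscript" -> "manuscript".
    # Caution: this also joins genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn a single line break inside a paragraph into a space,
    # leaving blank lines (paragraph breaks) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

# Invented sample resembling raw OCR output
sample = "The  manu-\nscript was found\nin 1892.\n\nIt is   now in New York."
print(clean_ocr_text(sample))
```

The same patterns can usually be pasted into the find-and-replace dialog of a text editor with regular expression search enabled, which is often a quicker way to test a rule before scripting it.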

EDITING

  • Text: Visual Studio Code, Microsoft Word, Abbyy FineReader
  • Image: Photoshop, GIMP
  • Audio: Audacity (PC), Pro Tools (Mac), and Soundtrack Pro (Mac)
  • Video: Final Cut, Premiere, Avid, and related tools

ANNOTATION

  • XML, or Extensible Markup Language, a more recent iteration of an earlier standard known as SGML (Standard Generalized Markup Language), is a set of rules for creating elements that describe the structure and content of files (as well as for performing other tasks with them) by marking them up with sets of tags. It is employed in web pages and in many scholarly and commercial electronic texts. The best-known markup language in this family is HTML, the tag set that enables the Web to function. Another very important standard is TEI (Text Encoding Initiative), a set of elements designed for marking up texts for scholars in the humanities, including history. Presentation of XML texts can be handled either by XSL (Extensible Stylesheet Language) or CSS (Cascading Style Sheets), while XSLT (Extensible Stylesheet Language Transformations) is a programming language that can manipulate XML texts in many ways to extract information or to output the text in various formats. XML and XSL tags can
  • TeX, a powerful open source typesetting system favored by many in the sciences and engineering. It is particularly useful for texts containing complex layout, such as formulas.
  • Markdown, a simple markup system designed to enable plain text to be quickly and easily converted to other formats, including HTML and PDF.
  • Hypothes.is, a tool used to annotate web content directly in a browser.
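
To make the idea of extracting information from marked-up text more concrete, the sketch below uses Python's standard xml.etree.ElementTree module rather than XSLT. The fragment borrows TEI-style element names (persName, placeName) and the TEI namespace, but it is an invented sample and not a complete or valid TEI document; the function name tagged_names is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented fragment with TEI-style element names; not a full TEI document.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <p><persName>Jane Doe</persName> wrote from <placeName>Rome</placeName>.</p>
    <p><persName>John Roe</persName> replied from <placeName>Paris</placeName>.</p>
  </body>
</text>"""

# Prefix-to-namespace mapping used when searching for elements
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tagged_names(xml_string, element):
    """Return the text of every occurrence of the given element, in document order."""
    root = ET.fromstring(xml_string)
    return [el.text for el in root.findall(f".//tei:{element}", NS)]

print(tagged_names(SAMPLE, "persName"))
print(tagged_names(SAMPLE, "placeName"))
```

Because the people and places are tagged rather than merely typed, they can be listed, counted, or indexed without guessing from capitalization or context, which is the main payoff of this kind of annotation.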