Text Mining: Text Sources

An introduction to text mining resources at Columbia University

Text Sources

Below you will find a list of sources of texts that Columbia affiliates may use for text mining purposes. Every source has different rules associated with it, so when in doubt, please reach out to Research Data Services (data@library.columbia.edu) for more information.

Publishers also differ in the formats in which they make their datasets available.

Please remember that this is an emerging, fast-moving field, so information on this page may fall out of date rather quickly.

The information presented here was largely gathered by Colleen Major, Head of Electronic Resources Management: Operations and Analysis.

  • Adam Matthew
    Primary sources in humanities and social sciences. (list of databases in Clio)

    Adam Matthew allows for TDM via an API. Contact RDS.

  • American Chemical Society
    Journals in chemistry.

    ACS allows for fulltext TDM. Contact RDS.

  • BioOne
    Non-profit scientific research.

    BioOne allows for TDM. Contact RDS.

  • Cambridge University Press
    Academic journals and books.

    Cambridge University Press allows for TDM. Contact RDS.

  • EBSCOHost
    Academic journals and other scholarly publications.

    No TDM capabilities for full text databases.

  • Elsevier / Scopus
    Citation metadata for publications in health, physical, and social sciences.

    Elsevier allows for TDM via an API key tied to Columbia’s license. See Elsevier’s TDM policy and contact RDS with questions. Full text is available only for articles published via ScienceDirect.

  • Gale
    Newspapers, magazines, and religious, historical, and social scientific materials. (list of databases in Clio, search for primary sources)

    Gale will share fulltext content for TDM with Columbia Libraries upon request. Contact RDS.

  • HathiTrust Research Center
    The HathiTrust Digital Library

    HTRC allows for web-based analysis of the HathiTrust Digital Library.

  • IEEE Xplore
    Scientific and technical content published by the IEEE (Institute of Electrical and Electronics Engineers) and its publishing partners.

    IEEE Xplore has an API that Columbia researchers can access. See their Quick Start Guide for more information.

  • JSTOR
    Journals and books in the humanities and social sciences.

    JSTOR’s fulltext TDM is supported both by the Constellate platform and individual requests for data transfers. Contact RDS for details.

  • ProQuest
    Newspapers, magazines, journals, and books in the humanities and social sciences. (list of databases in Clio)

    ProQuest allows for fulltext TDM via ProQuest TDM Studio, a virtual environment. Contact RDS for details.

  • Oxford University Press
    Journals and books in the humanities and social sciences.

    OUP allows for TDM. Contact RDS.

  • SAGE
    Journals and books in business, humanities, social sciences, science, technology, and medicine.

    SAGE allows for TDM that follows their guidelines. They also have datasets available via Data Planet. Contact RDS with questions.

  • Scopus
    See Elsevier, above.

  • Springer Nature
    Journals in science.

    Springer Nature provides a limited API for TDM and allows fulltext TDM via an API key belonging to Columbia. Contact RDS.

  • Taylor & Francis
    Scholarly journals.

    Taylor & Francis allows for TDM. Contact RDS.

  • University of Chicago Press Journals
    Journals and books.

    UofC Press allows for TDM. Contact RDS with questions.

  • Web of Science
    Citation metadata for articles in the sciences. Includes citation counts but no full text.

    Web of Science allows for TDM via an API. Contact RDS with questions.

  • Wiley
    Journals and books in science, technology, medicine, professional development, and higher education.

    Wiley allows for fulltext TDM via an API. Contact RDS with questions.
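
Several of the sources above, Elsevier among them, provide access through an API key. As a rough illustration of what that can look like in practice, the Python sketch below builds (but does not send) a request for the full text of a single ScienceDirect article. The endpoint URL, the `X-ELS-APIKey` header name, the function name `build_fulltext_request`, and the example DOI and key are all assumptions made for illustration; please check them against Elsevier's current developer documentation and TDM policy, and contact RDS about obtaining a key.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumed base URL for Elsevier's article retrieval endpoint; verify it
# against Elsevier's current developer documentation before relying on it.
ELSEVIER_ARTICLE_URL = "https://api.elsevier.com/content/article/doi/"


def build_fulltext_request(doi: str, api_key: str) -> Request:
    """Build, without sending, a request for one article identified by DOI."""
    if not api_key:
        raise ValueError("an API key is required")
    # Percent-encode the DOI but keep the slash between prefix and suffix.
    url = ELSEVIER_ARTICLE_URL + quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the API key
        "Accept": "text/plain",   # ask for plain text rather than XML
    }
    return Request(url, headers=headers)


# Hypothetical DOI and key, shown only to illustrate the call.
request = build_fulltext_request("10.1016/j.example.2020.01.001", "HYPOTHETICAL-KEY")
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would actually send the request. Whether full text comes back depends on the key and on the terms of Columbia's license, so a failed request is a good reason to reach out to RDS.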