Text Mining: Text Sources

An introduction to text mining resources at Columbia University

Text Sources

Below you will find a list of sources of texts that Columbia affiliates may use for text mining purposes. Every source has different rules associated with it, so when in doubt, please reach out to Research Data Services (data@library.columbia.edu) for more information.

Rules and regulations vary by publisher, as do the formats in which datasets are made available.

Please remember that this is an emerging, fast-moving field, so information on this page may fall out of date rather quickly.

The information presented here was largely gathered by Colleen Major, Head of Electronic Resources Management: Operations and Analysis.

  • Adam Matthew
    Primary sources in humanities and social sciences. (list of databases in Clio)

    Adam Matthew allows for TDM via an API. Contact RDS.

  • American Chemical Society
    Journals in chemistry.

    ACS allows for fulltext TDM. Contact RDS.

  • BioOne
    Non-profit scientific research.

    BioOne allows for TDM. Contact RDS.

  • Cambridge University Press
    Academic journals and books.

    Cambridge University Press allows for TDM. Contact RDS.

  • Elsevier
    Citation metadata for publications in health, physical, and social sciences.

    Elsevier allows for TDM via an API key tied to Columbia’s license. See Elsevier’s TDM policy and contact RDS with questions.

  • Gale
    Newspapers, magazines, and religious, historical, and social scientific materials. (list of databases in Clio, search for primary sources)

    Gale will share fulltext content for TDM with Columbia Libraries upon request. Contact RDS.

  • JSTOR
    Journals and books in the humanities and social sciences.

    JSTOR’s fulltext TDM is now supported by the Constellate platform. Contact RDS for details.

  • ProQuest
    Newspapers, magazines, journals, and books in the humanities and social sciences. (list of databases in Clio)

    ProQuest allows for fulltext TDM via ProQuest TDM Studio, a virtual environment. Contact RDS for details.

  • Oxford University Press
    Journals and books in the humanities and social sciences.

    OUP allows for TDM. Contact RDS.

  • SAGE
    Journals and books in business, humanities, social sciences, science, technology, and medicine.

    SAGE allows for TDM that follows its guidelines. It also has datasets available via Data Planet. Contact RDS with questions.

  • Springer Nature
    Journals in science.

    Springer Nature provides a limited API for TDM; fulltext TDM is available via an API key belonging to Columbia. Contact RDS.

  • Taylor & Francis
    Scholarly journals.

    Taylor & Francis allows for TDM. Contact RDS.

  • Web of Science
    Citation metadata for articles in the sciences.

    Web of Science allows for TDM via an API. Contact RDS with questions.

  • Wiley
    Journals and books in science, technology, medicine, professional development, and higher education.

    Wiley allows for fulltext TDM via an API. Contact RDS with questions.
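
Several of the sources above provide TDM access through an API, in some cases with a key tied to Columbia's license. As a rough illustration of what such access can look like, the sketch below builds (but does not send) a request for one article's full text. Everything specific in it is an assumption made for illustration: the base URL, the header name, the DOI, and the `build_tdm_request` helper are hypothetical placeholders, not any publisher's actual API. Take the real endpoint and header name from the publisher's own TDM documentation, and the key from RDS.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical placeholders: replace these with the endpoint and header name
# given in the publisher's TDM documentation.
BASE_URL = "https://api.publisher.example/fulltext"
API_KEY_HEADER = "X-API-Key"


def build_tdm_request(doi: str, api_key: str) -> Request:
    """Build, without sending, a full-text request for a single DOI."""
    # DOIs contain "/", so percent-encode the whole identifier.
    url = f"{BASE_URL}/{quote(doi, safe='')}"
    headers = {API_KEY_HEADER: api_key, "Accept": "text/plain"}
    return Request(url, headers=headers)


# Example DOI and key are made up for illustration.
request = build_tdm_request("10.1234/example.5678", api_key="YOUR-KEY-HERE")
print(request.full_url)
```

To actually retrieve text, the request would be passed to `urllib.request.urlopen`. Check each publisher's rate limits and terms before running bulk downloads, since the rules differ from source to source.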