Text Mining: Social Media

An introduction to text mining resources at Columbia University

Social Media

Text mining social media presents special challenges to researchers. At the same time, the social media platforms lure researchers in with their breadth and access to the current Zeitgeist.

It is never too early to begin building a corpus of social media. Posts can be deleted and pages can have their permissions changed. Simply put, the very ephemerality that makes social media so appealing also makes it hard to capture. As if timing were not enough of a challenge, every platform has different requirements and rules. What follows is an incomplete list of platforms, with guidance on building corpora from each.

Social media researchers may also benefit from keeping SOMAR (the Social Media Archive, hosted by ICPSR) in mind. It hosts existing social media datasets and accepts new ones for sharing, so your corpus may already exist, and you may have a place to deposit your own corpus once you have completed your research.

Facebook

Facebook explicitly prohibits automated data collection, both in its robots.txt file and in its terms of service on site scraping. Researchers, however, can apply for permission to access the Meta Content Library and the Content Library API. These two related products replace CrowdTangle. Meta Content Library is a web-based tool that does not permit data to be exported. Researchers eager to build corpora should use the Content Library API, which is accessed through a secure virtual data enclave. Applications to both products are independently reviewed by SOMAR/ICPSR, and Meta has provided documentation for both products. We librarians do not yet have experience with either product.
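For readers unfamiliar with robots.txt, the following minimal Python sketch shows how a script might check such a file before collecting anything. It uses the standard library's urllib.robotparser. The sample rules, the user-agent name, and the example.org URLs are made up for illustration and are not Facebook's actual file. Note that passing a robots.txt check does not replace reading a platform's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (not any real site's file).
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

# A file like this one tells every crawler to stay away from the whole site.
BLOCK_ALL_RULES = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file that is already in memory.
    # For a live site you could instead call parser.set_url(...) and parser.read().
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "my-research-bot"  # hypothetical user-agent name
    for url in ("https://example.org/about", "https://example.org/private/data"):
        print(url, is_allowed(SAMPLE_RULES, agent, url))
```

If a site's robots.txt disallows everything, as in BLOCK_ALL_RULES, the function returns False for every URL, which is a signal to look for an approved route such as a research API.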

 

Instagram

As with Facebook, researchers can apply for permission to access the Meta Content Library and the Content Library API. These two related products replace CrowdTangle. Meta Content Library is a web-based tool that does not permit data to be exported. Researchers eager to build corpora should use the Content Library API, which is accessed through a secure virtual data enclave. Applications to both products are independently reviewed by SOMAR/ICPSR, and Meta has provided documentation for both products. We librarians do not yet have experience with either product.

 

Reddit

Reddit does allow use of its data for non-commercial research purposes. The rules changed somewhat in 2023; nevertheless, you can contact Reddit to request research access.

 

Twitter/X

Twitter/X has what I consider to be an antagonistic relationship with academic research, which is a newer phenomenon. In the past, it was reasonably straightforward for researchers to gain access to large chunks of the archive. Today, however, any kind of programmatic access for pulling tweets (as opposed to just posting new ones) requires a paid subscription. Additionally, Twitter's search, as they explain clearly, focuses on relevance rather than completeness and relies on a sample of all recent tweets. As a result, building a comprehensive, historical archive of tweets on any particular topic is practically impossible without access to the top subscription tier. The popular Python library Tweepy works with the new API, but it does not provide a way around having to pay for API access.
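To give a sense of what working with Tweepy looks like once you do have paid access, here is a minimal sketch rather than a tested workflow. It assumes Tweepy version 4 or later and a bearer token from a subscription tier that includes search. The helper names build_query and fetch_recent are hypothetical, and the recent-search endpoint used here covers only roughly the past week, so it does not solve the historical-archive problem described above.

```python
def build_query(keywords, lang="en", include_retweets=False):
    """Combine keywords into a single search query string.

    Multi-word keywords are quoted so they are matched as phrases.
    """
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    parts = ["(" + " OR ".join(terms) + ")"]
    if lang:
        parts.append(f"lang:{lang}")
    if not include_retweets:
        parts.append("-is:retweet")
    return " ".join(parts)


def fetch_recent(bearer_token, query, limit=100):
    """Collect up to `limit` recent tweets matching `query` (needs paid API access)."""
    import tweepy  # third-party library: pip install tweepy

    client = tweepy.Client(bearer_token=bearer_token, wait_on_rate_limit=True)
    pages = tweepy.Paginator(
        client.search_recent_tweets,
        query=query,
        tweet_fields=["created_at", "lang"],
        max_results=100,  # per-request page size
    )
    # flatten() walks the result pages and yields individual tweets.
    return [
        {"id": t.id, "created_at": t.created_at, "text": t.text}
        for t in pages.flatten(limit=limit)
    ]


if __name__ == "__main__":
    print(build_query(["climate", "global warming"]))
    # With a real token (placeholder shown here), one might then call:
    # tweets = fetch_recent("YOUR_BEARER_TOKEN", build_query(["climate"]))
```

Keeping the query builder separate from the network call makes it possible to draft and review search queries before spending any of a paid request quota.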

 

YouTube

YouTube provides a public API that researchers can use to build corpora of videos and comments. Furthermore, YouTube has a GitHub repository with code samples using the API in a variety of languages, including Python (very detailed), Ruby, and JavaScript. Alex, from CuriousGnu, has put together a blog post that indicates some of what's possible with the public API.
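As a rough illustration of building a comment corpus with the public API, the following Python sketch requests top-level comments for one video through the commentThreads endpoint of the Data API (version 3). It uses only the standard library. It assumes you have created your own API key; the function names, the video ID, and the page limit are placeholders chosen for this example, and the official code samples remain the better starting point for real projects.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def build_url(video_id, api_key, page_token=None):
    """Build the request URL for one page of comment threads."""
    params = {
        "part": "snippet",
        "videoId": video_id,
        "key": api_key,
        "maxResults": 100,  # largest page size this endpoint accepts
        "textFormat": "plainText",
    }
    if page_token:
        params["pageToken"] = page_token
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_comments(page):
    """Pull the text of each top-level comment out of one response page."""
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in page.get("items", [])
    ]


def fetch_comments(video_id, api_key, max_pages=5):
    """Follow nextPageToken links for up to max_pages pages of comments."""
    comments, token = [], None
    for _ in range(max_pages):
        with urllib.request.urlopen(build_url(video_id, api_key, token)) as resp:
            page = json.load(resp)
        comments.extend(extract_comments(page))
        token = page.get("nextPageToken")
        if not token:
            break
    return comments


if __name__ == "__main__":
    # Placeholder values: substitute a real video ID and your own API key
    # before calling fetch_comments("VIDEO_ID", "YOUR_API_KEY").
    print(build_url("VIDEO_ID", "YOUR_API_KEY"))
```

Each request counts against a daily quota tied to the API key, so capping the number of pages, as max_pages does here, is a simple way to avoid exhausting it while testing.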