Text mining social media presents special challenges to researchers. At the same time, the social media platforms lure researchers in with their breadth and access to the current Zeitgeist.
It is never too early to begin building a social media corpus. Posts can get deleted and pages can have their permissions changed. Simply put, the very ephemerality that makes social media so appealing also makes it hard to capture. As if timing were not enough of a challenge, every platform has different requirements and rules. What follows is an incomplete list of platforms, with guidance on building corpora from each.
Social media researchers may also benefit from keeping SOMAR (the Social Media Archive, hosted by ICPSR) in mind. It hosts existing social media datasets and accepts new ones for sharing, meaning your corpus may already exist, or you may have a place to deposit your corpus once you have completed your research.
Facebook explicitly prohibits automated data collection, both in its robots.txt file and in its terms of service on site scraping. Researchers, however, can apply for permission to access the Meta Content Library and Content Library API. These two related products replace CrowdTangle. Meta Content Library is a web-based tool that does not permit data to be exported. Researchers eager to build corpora should use the Content Library API, which is accessed through a secure virtual data enclave. Applications for both products are independently reviewed by SOMAR/ICPSR, and Meta has provided documentation for both products. We librarians do not yet have experience with either product.
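Before scraping any site, it can help to check what its robots.txt allows. The sketch below uses Python's standard `urllib.robotparser` module to test a URL against a set of rules. The rules shown and the user-agent name `my-research-bot` are made up for illustration and are not a copy of Facebook's actual file. A passing check also says nothing about a site's terms of service, which must be read separately.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; this is NOT Facebook's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A blanket "Disallow: /" blocks every path for every crawler.
blocked = is_allowed(SAMPLE_ROBOTS_TXT, "my-research-bot", "https://example.com/some/page")
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()`, which download the file before the same `can_fetch()` call.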
As with Facebook, researchers can apply for permission to access the Meta Content Library and Content Library API. These two related products replace CrowdTangle. Meta Content Library is a web-based tool that does not permit data to be exported. Researchers eager to build corpora should use the Content Library API, which is accessed through a secure virtual data enclave. Applications for both products are independently reviewed by SOMAR/ICPSR, and Meta has provided documentation for both products. We librarians do not yet have experience with either product.
Reddit allows use of its data for non-commercial research purposes. Its rules changed somewhat in 2023, but you can still contact Reddit to request research access.
Twitter/X has what I consider to be an antagonistic relationship with academic research, which is a newer phenomenon. In the past, it was reasonably straightforward for researchers to gain access to large chunks of the archive. Today, however, any kind of programmatic access for retrieving tweets (as opposed to just posting new ones) requires a paid subscription. Additionally, Twitter’s search, as they explain clearly, focuses on relevance rather than completeness and relies on a sample of all recent tweets, meaning that building a comprehensive, historical archive of tweets on any particular topic is practically impossible without access to the top subscription tier. The popular Python library Tweepy works with the new API, but it does not provide a way around having to pay for API access.
The boilerplate response to questions about research access in this Developer’s Forum is:
- For general academic research, please enroll in one of our X API access tiers.
- For qualified research under the Article 40 of the Digital Services Act you can apply for X API access here.
This is accompanied by a link to X’s “Do Research” page that has no outgoing links. As I wrote, this is an antagonistic relationship to research. The Digital Services Act is an EU regulation passed in 2022, so any Columbia researchers would need to partner with an EU institution as subordinates.
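For researchers who do have a paid tier, a Tweepy request for recent tweets might look roughly like the sketch below. The environment variable name `X_BEARER_TOKEN`, the helper names, and the choice of query operators are assumptions made for illustration. Which operators and result volumes are available depends on the subscription tier, and the recent search endpoint only reaches back a short window, so this is not a route to a historical archive.

```python
import os


def build_query(topic: str, lang: str = "en") -> str:
    """Compose a search query that skips retweets and filters by language."""
    return f"({topic}) -is:retweet lang:{lang}"


def fetch_recent(topic: str, limit: int = 10) -> list[dict]:
    """Fetch recent tweets on a topic. Requires a bearer token from a paid tier."""
    # Third-party library; imported here so build_query works without it installed.
    import tweepy

    # X_BEARER_TOKEN is a hypothetical variable name; store the token however you prefer.
    client = tweepy.Client(bearer_token=os.environ["X_BEARER_TOKEN"])
    response = client.search_recent_tweets(
        query=build_query(topic),
        max_results=max(10, min(limit, 100)),  # the endpoint accepts 10 to 100
        tweet_fields=["created_at", "lang"],
    )
    # response.data is None when nothing matched the query.
    return [
        {"id": tweet.id, "text": tweet.text, "created_at": str(tweet.created_at)}
        for tweet in (response.data or [])
    ]
```

Saving each result with the date it was collected is worthwhile, since the same query run later may return a different sample.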
YouTube provides a public API that researchers can use to build corpora of videos and comments. Furthermore, YouTube has a GitHub repository with code samples using the API in a variety of languages, including Python (very detailed), Ruby, and JavaScript. Alex, from CuriousGnu, has put together a blog post that indicates some of what's possible with the public API.