Text mining social media presents special challenges to researchers. At the same time, the social media platforms lure researchers in with their breadth and access to the current Zeitgeist.
It is never too early to begin building a corpus of social media. Posts can get deleted and pages can have their permissions changed. Simply put, the very ephemerality that makes social media so appealing also makes it difficult to capture. As if timing were not enough of a challenge, every platform has different requirements and rules. What follows is an incomplete list of platforms, with guidance on building corpora from each.
Facebook explicitly prohibits automated collection of data, both in its robots.txt file and in its terms of service for site scraping. Researchers, however, can apply for permission to collect information from public Facebook groups and pages using the application CrowdTangle. Researchers at Columbia have successfully applied for access to CrowdTangle in the past, and Research Data Services can help with the application process. CrowdTangle can also help with research on Instagram and Reddit.
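As a rough illustration of how a robots.txt file can be checked before any automated collection, the sketch below uses Python's standard urllib.robotparser module. The robots.txt text, the "research-bot" user agent, and the example.org URLs are hypothetical stand-ins, not any platform's actual rules. A permissive robots.txt is only one signal; a platform's terms of service still apply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Report whether the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
# "research-bot" is an assumed user-agent name for illustration only.
allowed_public = is_allowed(parser, "research-bot", "https://example.org/public/page")
allowed_private = is_allowed(parser, "research-bot", "https://example.org/private/page")
```

For a live site, the same parser can be pointed at the site's robots.txt with set_url() followed by read() instead of parsing inline text.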
Unlike its parent company, Facebook, Instagram has no explicit restriction on site scraping in its robots.txt. There is also an undocumented __a=1 API that can return information about hashtags or public users' posts, but Instagram will quickly start rejecting such requests. From this we can infer that Instagram disapproves of automated collection of data. As with Facebook, however, researchers can apply for permission to collect information from Instagram using the application CrowdTangle. Researchers at Columbia have successfully applied for access to CrowdTangle in the past, and Research Data Services can help with the application process. CrowdTangle can also help with research on both Facebook broadly and Reddit.
Reddit has a public API that researchers can use to compile corpora of subreddits. Researchers must first read the terms of service and apply for access to the API. Researchers can also apply for permission to collect some information from Reddit using the application CrowdTangle. Researchers at Columbia have successfully applied for access to CrowdTangle in the past, and Research Data Services can help with the application process. CrowdTangle can also help with research on both Facebook and Instagram.
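To give a sense of what compiling a subreddit corpus can involve once API access has been granted, here is a minimal sketch that flattens a Reddit "listing" response into simple corpus records. It assumes the listing shape as we understand it from Reddit's API documentation (a "data" object holding "children", with posts marked by kind "t3"). The sample payload and the function name posts_from_listing are invented for illustration, and authentication and the request itself are deliberately left out.

```python
from typing import Any

# Hypothetical sample payload, shaped like a Reddit listing response.
SAMPLE_LISTING = {
    "kind": "Listing",
    "data": {
        "after": None,
        "children": [
            {"kind": "t3", "data": {"id": "abc123", "title": "First post",
                                    "selftext": "Body text", "created_utc": 1600000000.0}},
            {"kind": "t1", "data": {"id": "def456", "body": "A comment, skipped here"}},
        ],
    },
}


def posts_from_listing(listing: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep only posts (kind 't3') and reduce each to a few text-mining fields."""
    records = []
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t3":
            continue  # skip comments and other non-post items
        post = child.get("data", {})
        records.append({
            "id": post.get("id"),
            "title": post.get("title", ""),
            "text": post.get("selftext", ""),
            "created_utc": post.get("created_utc"),
        })
    return records


records = posts_from_listing(SAMPLE_LISTING)
```

Which fields to keep is a research design decision. The four above are only a plausible starting point for text mining.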
Twitter offers an academic research API for researchers who need full access to Twitter for their research. Access to this API is available only by application. Twitter also offers a standard API that is more limited. In our experience, the standard API only accesses a sample of the Twitter data, so using it to find specific tweets may not work. Additionally, there are open source tools from Documenting the Now that can help researchers build an archive of tweets in real time as they appear on Twitter (for example, by collecting tweets that use a specific hashtag). Given the limitations of the standard API, we recommend researchers start collecting tweets as soon as possible.
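Because collection usually happens in repeated runs over days or weeks, it helps to store posts in a way that tolerates overlap between runs. The sketch below is one possible approach, not a feature of any Twitter tool: it appends posts to a JSON Lines file and skips any whose "id" is already in the archive. The record shape (a dict with an "id" key) and the function names are assumptions made for illustration.

```python
import json
from pathlib import Path


def load_seen_ids(archive: Path) -> set[str]:
    """Return the IDs already stored in the JSON Lines archive, if it exists."""
    if not archive.exists():
        return set()
    with archive.open(encoding="utf-8") as handle:
        return {str(json.loads(line)["id"]) for line in handle if line.strip()}


def append_new_posts(archive: Path, posts: list[dict]) -> int:
    """Append posts whose IDs are not yet archived; return how many were written."""
    seen = load_seen_ids(archive)
    written = 0
    with archive.open("a", encoding="utf-8") as handle:
        for post in posts:
            post_id = str(post["id"])  # assumes every record carries an "id"
            if post_id in seen:
                continue  # already collected in this or an earlier run
            handle.write(json.dumps(post, ensure_ascii=False) + "\n")
            seen.add(post_id)
            written += 1
    return written
```

Calling append_new_posts(Path("corpus.jsonl"), batch) after each collection run keeps a single growing file, so a run that overlaps with an earlier one does not create duplicates.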
YouTube provides a public API that researchers can use to build corpora of videos and comments. Furthermore, YouTube has a GitHub repository with code samples using the API in a variety of languages, including Python (very detailed), Ruby, and JavaScript. Alex, from CuriousGnu, has put together a blog post that indicates some of what's possible with the public API.
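As a small illustration of collecting comments, the sketch below builds a request URL for the commentThreads endpoint of the YouTube Data API (v3) and pulls the comment text out of a response. The endpoint, parameter names, and response nesting follow our reading of the API reference and should be checked against the current documentation. The video ID, API key, sample response, and function names are placeholders, and no network request is made here.

```python
from typing import Any, Optional
from urllib.parse import urlencode

COMMENT_THREADS_ENDPOINT = "https://www.googleapis.com/youtube/v3/commentThreads"


def comment_threads_url(video_id: str, api_key: str,
                        page_token: Optional[str] = None,
                        max_results: int = 100) -> str:
    """Build the URL for one page of top-level comment threads on a video."""
    params = {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": max_results,
        "textFormat": "plainText",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token  # taken from a previous page's nextPageToken
    return f"{COMMENT_THREADS_ENDPOINT}?{urlencode(params)}"


def comment_texts(response: dict[str, Any]) -> list[str]:
    """Extract the text of each top-level comment from a parsed JSON response."""
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response.get("items", [])
    ]


# Hypothetical response fragment, trimmed to the fields used above.
sample_response = {
    "nextPageToken": "TOKEN",
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "First comment"}}}},
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "Second comment"}}}},
    ],
}

url = comment_threads_url("VIDEO_ID_PLACEHOLDER", "API_KEY_PLACEHOLDER")
texts = comment_texts(sample_response)
```

To collect every comment on a video, the request would be repeated with each response's nextPageToken until no token is returned.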