Complete Public Reddit Comments Corpus (2007-2015)
Complete dataset of public comments posted to Reddit (http://www.reddit.com) from October 2007 to May 2015.
Repository of televised news broadcasts, many of them with captions and rough content statistics
Unstructured dataset of open-source media articles
Dataset of 8 million annotated YouTube videos, including a variety of audio and visual features.
Dataset of internal newsletters from the Signals Intelligence Directorate of the U.S. National Security Agency (NSA), dated 2003-2012. The dataset is being released slowly, in small batches.
Various Twitter datasets collected for academic studies (largely focusing on news)
Dataset of timestamped tweets and corresponding demographic information about authors (i.e., gender and location)
List of datasets from various government agencies and initiatives
Repository of data released by San Francisco city and county
Various government datasets from the United Kingdom