Complete Public Reddit Comments Corpus (2007-2015)
Complete dataset of public comments posted to Reddit (http://www.reddit.com) from October 2007 to May 2015.
Repository of televised news broadcasts, many of which include captions and rough content statistics
Unstructured dataset of open-source media articles
Dataset of 8 million annotated YouTube videos, including a variety of audio and visual features.
Transcripts of British speeches (1895-2015), categorized by date, speaker, party, and title
Dataset of internal newsletters from the Signals Intelligence Directorate of the U.S. National Security Agency (NSA), dating from 2003 to 2012. The dataset is being released gradually in small batches.
Repository of speech data
Dataset of English speech (with accompanying demographic data about each speaker) recorded using a standardized elicitation paragraph