The dataset consists of 12,673 YouTube videos that were crawled as part of the PuppyIR project in early 2010. Metadata and comment streams (4.7 million comments in total) are available for each video. For copyright reasons, we do not distribute the actual audio-visual content. For a subset of 1,000 videos, we collected manual annotations of child suitability made by a childcare professional.
The corpus is delivered as a MySQL script for easy import and processing. It is available for download at: http://www.carsten-eickhoff.com/files/corpora/puppyir-youtube.tar.gz
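As a rough sketch of the first processing step, the snippet below unpacks a downloaded `.tar.gz` archive and lists any `.sql` scripts it contains. The layout and file names inside the archive are not documented here, so the helper `extract_sql_scripts` (a hypothetical name) simply searches for files ending in `.sql` rather than assuming a particular script name.

```python
import tarfile
from pathlib import Path


def extract_sql_scripts(archive_path, dest_dir):
    """Extract a .tar.gz archive and return the paths of any .sql files inside.

    Assumption: the corpus script uses a .sql extension; adjust the pattern
    if the archive turns out to name it differently.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()

    with tarfile.open(archive_path, "r:gz") as archive:
        # Refuse members that would be written outside dest_dir.
        for member in archive.getmembers():
            target = (dest / member.name).resolve()
            if target != root and root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)

    # Collect every extracted SQL script, in a stable order.
    return sorted(p for p in dest.rglob("*.sql") if p.is_file())
```

Once the archive has been downloaded, calling `extract_sql_scripts("puppyir-youtube.tar.gz", "puppyir")` should return the script path(s). A script found this way can then typically be loaded with the standard MySQL client, for example `mysql -u USER -p DATABASE < SCRIPT.sql`, where the user, database, and script names are placeholders to fill in.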
To refer to the dataset, cite the publication in which it was originally described and used: