Dataset: HuggingFaceM4/ActivitiyNet_Captions
Sub-tasks: closed-domain-qa
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: crowdsourced
Annotations creators: expert-generated
Source datasets: original
Preprint: arxiv:1705.00754
License: other

The ActivityNet Captions dataset connects videos to a series of temporally annotated sentence descriptions. Each sentence covers a unique segment of the video, describing multiple events that occur. These events may occur over very long or short periods of time and are not limited in any capacity, allowing them to co-occur. On average, each of the 20k videos contains 3.65 temporally localized sentences, resulting in a total of 100k sentences. The number of sentences per video follows a relatively normal distribution, and it increases with video duration. Each sentence has an average length of 13.48 words, which is also normally distributed. You can find more details of the dataset under the ActivityNet Captions Dataset section and in the supplementary materials of the paper.
The captions in the dataset are in English.
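As a sketch of how the temporally annotated captions might be inspected, the snippet below loads the dataset with the Hugging Face `datasets` library. The feature names used here (`video_id`, `captions_starts`, `captions_ends`, `en_captions`) are assumptions made for illustration and may differ from the dataset's actual schema.

```python
from datasets import load_dataset

# Sketch: load ActivityNet Captions and print the temporally localized
# captions of one video. The feature names below are assumptions and may
# need to be adapted to the actual schema.
dataset = load_dataset("HuggingFaceM4/ActivitiyNet_Captions")

example = dataset["train"][0]
print(example["video_id"])

# Each caption is paired with the start/end time (in seconds) of the video
# segment it describes.
for start, end, caption in zip(
    example["captions_starts"], example["captions_ends"], example["en_captions"]
):
    print(f"[{start:.1f}s - {end:.1f}s] {caption}")
```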
|  | train | validation | test | Overall |
|---|---|---|---|---|
| # of videos | 10,009 | 4,917 | 4,885 | 19,811 |
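The split sizes reported above can also be checked programmatically; this is a minimal sketch that assumes the splits are named `train`, `validation`, and `test`, with one row per video.

```python
from datasets import load_dataset

# Sketch: print the number of videos per split, assuming the standard
# train/validation/test split names.
dataset = load_dataset("HuggingFaceM4/ActivitiyNet_Captions")
for split_name in ("train", "validation", "test"):
    print(split_name, dataset[split_name].num_rows)
```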
Quoting the ActivityNet Captions paper: "Each annotation task was divided into two steps: (1) Writing a paragraph describing all major events happening in the videos in a paragraph, with each sentence of the paragraph describing one event, and (2) Labeling the start and end time in the video in which each sentence in the paragraph event occurred."
Amazon Mechanical Turk annotators
Nothing specifically mentioned in the paper.
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
@inproceedings{krishna2017dense,
  title={Dense-Captioning Events in Videos},
  author={Krishna, Ranjay and Hata, Kenji and Ren, Frederic and Fei-Fei, Li and Niebles, Juan Carlos},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2017}
}
Thanks to @leot13 for adding this dataset.