Dataset:

Antreas/TALI

Task:

Zero-shot classification

Size:

1M<n<10M

Other:

video audio text

License:

cc-by-4.0

Dataset Card for "TALI-large"

Dataset Description

Abstract

TALI is a large-scale, tetramodal dataset designed to facilitate a shift from unimodal and duomodal to tetramodal research in deep learning. It aligns text, video, images, and audio, providing a rich resource for innovative self-supervised learning tasks and multimodal research. TALI enables exploration of how different modalities and data/model scaling affect downstream performance, with the aim of inspiring diverse research ideas and enhancing understanding of model capabilities and robustness in deep learning.

Brief Description

TALI (Temporally and semantically Aligned Audio, Language and Images) is a dataset that uses the captions and article titles of the Wikipedia-based Image Text (WiT) dataset to search YouTube for videos that match the captions. The video, audio, and subtitles are then downloaded from these videos. The result is a rich multimodal dataset with multiple caption types related to both the WiT images and the YouTube videos. This enables learning between temporally or semantically aligned text, images, audio, and video.

Dataset Information

Modalities

The TALI dataset consists of the following modalities:

Image

Wikipedia caption image

Randomly sampled image from YouTube video

Text

Wikipedia Caption Text

Wikipedia Title Text

Wikipedia Main Body Text

YouTube Subtitle Text

YouTube Description Text

YouTube Title Text

Audio

YouTube Content Audio

Video

YouTube Content Video
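
As a rough illustration, the modality list above can be written down as a small lookup table. The key names below are hypothetical labels chosen for this sketch; they are not confirmed to match the column names used in the released dataset.

```python
# Hypothetical labels for the modalities listed above.
# These names are assumptions for illustration, not the dataset's schema.
MODALITIES = {
    "image": [
        "wikipedia_caption_image",
        "youtube_random_frame",
    ],
    "text": [
        "wikipedia_caption_text",
        "wikipedia_title_text",
        "wikipedia_main_body_text",
        "youtube_subtitle_text",
        "youtube_description_text",
        "youtube_title_text",
    ],
    "audio": ["youtube_content_audio"],
    "video": ["youtube_content_video"],
}


def sources_from(origin):
    """Return every sub-modality label that starts with the given origin."""
    return [
        name
        for names in MODALITIES.values()
        for name in names
        if name.startswith(origin)
    ]


wikipedia_sources = sources_from("wikipedia")
youtube_sources = sources_from("youtube")
```

Grouping by origin in this way separates the semantically aligned Wikipedia sources from the temporally aligned YouTube sources.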

Dataset Variants

The TALI dataset comes in three variants that differ in the training set size:

TALI-small: Contains about 1.3 million 30-second video clips, aligned with 120K WiT entries.
TALI-base: Contains about 6.5 million 30-second video clips, aligned with 120K WiT entries.
TALI-big: Contains about 13 million 30-second video clips, aligned with 120K WiT entries.

The validation and test sets are the same across all three variants: each contains about 80K videos aligned to 8K Wikipedia entries (10 subclips per Wikipedia entry).
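
The sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch: the counts are the approximate figures from this card, and the constant and function names are made up for the example.

```python
# Approximate training-set sizes (30-second clips) as quoted in this card.
TRAIN_CLIPS = {
    "TALI-small": 1_300_000,
    "TALI-base": 6_500_000,
    "TALI-big": 13_000_000,
}

# Validation and test sets are each about 80K clips over 8K Wikipedia entries.
EVAL_CLIPS = 80_000
EVAL_WIKIPEDIA_ENTRIES = 8_000

# Subclips per Wikipedia entry in each evaluation split.
eval_clips_per_entry = EVAL_CLIPS // EVAL_WIKIPEDIA_ENTRIES


def train_size_ratio(variant, baseline="TALI-small"):
    """How many times larger one variant's training set is than another's."""
    return TRAIN_CLIPS[variant] / TRAIN_CLIPS[baseline]
```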

Dataset Statistics

TBA

Dataset Creation

The TALI dataset was created by starting from the WiT dataset and using either the context_page_description or page_title as a source query to search YouTube for videos that were Creative Commons opted-in and not age-restricted. The top 100 result titles were returned and compared with the source query using the text embeddings of the largest CLIP model available. The video whose title ranked first under CLIP was chosen and downloaded. The video was broken into 30-second segments, and the top 10 segments for each video were chosen based on the distance between the CLIP image embedding of the first image of each segment and the video’s title text. The image, audio, and subtitle frames were extracted from these segments. At sampling time, one of these 10 segments is randomly selected, and a 10-second segment is chosen out of the 30-second clip. The result is 200 video frames (spread throughout the 10-second segment) and 160,000 audio frames (10 seconds).
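
The ranking and sampling steps described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the curator's pipeline: cosine similarity stands in for the unspecified CLIP "distance", the 10-second window start is assumed to be drawn uniformly, the embeddings are ordinary lists of floats rather than real CLIP outputs, and all function and constant names are hypothetical.

```python
import math
import random

CLIP_SECONDS = 30          # length of each stored segment
WINDOW_SECONDS = 10        # length of the window sampled at training time
SEGMENTS_PER_VIDEO = 10    # top-10 segments kept per video
VIDEO_FRAMES = 200         # frames returned per sampled window
AUDIO_FRAMES = 160_000     # audio samples returned per sampled window


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_similarity(query, candidates, k):
    """Indices of the k candidate embeddings most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def sample_window(rng):
    """Pick one of the stored segments and a 10-second window inside it."""
    segment_index = rng.randrange(SEGMENTS_PER_VIDEO)
    # Assumption: the window start is uniform over the valid range.
    start = rng.uniform(0.0, CLIP_SECONDS - WINDOW_SECONDS)
    return segment_index, start, start + WINDOW_SECONDS


# Rates implied by the frame counts quoted above.
video_fps = VIDEO_FRAMES / WINDOW_SECONDS      # average video frames per second
audio_rate = AUDIO_FRAMES / WINDOW_SECONDS     # audio samples per second
```

The same `top_k_by_similarity` helper covers both ranking steps: with `k=1` over title embeddings it mirrors the choice of the video, and with `k=10` over first-frame embeddings it mirrors the choice of segments. The implied rates work out to an average of 20 video frames per second and 16,000 audio samples per second.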

Dataset Use

TALI is designed for use in a wide range of multimodal research tasks, including but not limited to:

Multimodal understanding and reasoning
Self-supervised learning
Multimodal alignment and translation
Multimodal summarization
Multimodal question answering

Dataset Curators: Antreas Antoniou

Citation Information: TBA

Contributions: Thanks to all contributors including data curators, annotators, and software developers.

Author:

Antreas

Dataset size:

1.22 TB