TALI is a large-scale, tetramodal dataset designed to facilitate a shift from unimodal and duomodal to tetramodal research in deep learning. It aligns text, video, images, and audio, providing a rich resource for innovative self-supervised learning tasks and multimodal research. TALI enables exploration of how different modalities and data/model scaling affect downstream performance, with the aim of inspiring diverse research ideas and enhancing understanding of model capabilities and robustness in deep learning.
TALI (Temporally and semantically Aligned Audio, Language and Images) is a dataset that uses the captions and article titles from the Wikipedia-based Image Text (WIT) dataset to search YouTube for videos that match the captions. It then downloads the video, audio, and subtitles of these videos. The result is a rich multimodal dataset with multiple caption types related to both the WIT images and the YouTube videos. This enables learning between temporally or semantically aligned text, images, audio, and video.
The TALI dataset consists of the following modalities:
The TALI dataset comes in three variants that differ in the training set size:
The validation and test sets are the same across all three variants. Each contains about 80K videos aligned to 8K Wikipedia entries (10 subclips per Wikipedia entry).
TBA
The TALI dataset was created by starting from the WIT dataset and using either the context_page_description or the page_title of each entry as a source query to search YouTube for videos that were Creative Commons opted-in and not age-restricted. The titles of the top 100 results were returned and compared with the source query using the text embeddings of the largest CLIP model available. The video whose title ranked first under this CLIP ranking was chosen and downloaded. The video was broken into 30-second segments, and the top 10 segments for each video were chosen based on the distance between the CLIP image embedding of the first frame of each segment and the embedding of the video’s title text. The image, audio, and subtitle frames were extracted from these segments. At sampling time, one of these 10 segments is randomly selected, and a 10-second window is chosen from within the 30-second clip. The result is 200 video frames (spread throughout the 10-second window) and 160000 audio frames (10 seconds).
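The selection and sampling steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used to build TALI. All function names are hypothetical, plain lists of floats stand in for real CLIP embeddings, and several details are assumptions: cosine similarity as the ranking measure, non-overlapping 30-second segments, uniformly spaced video frames, and a 16 kHz audio rate (inferred from 160000 audio frames over 10 seconds).

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, candidate_embeddings, top_k):
    """Return the indices of the top_k candidates most similar to the query.

    With top_k=1 this mirrors picking one video title out of the search
    results; with top_k=10 it mirrors picking segments for a video.
    """
    scores = [cosine_similarity(query_embedding, c) for c in candidate_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


def segment_starts(video_duration_s, segment_s=30):
    """Start times of the full segments in a video (assumed non-overlapping)."""
    return [i * segment_s for i in range(int(video_duration_s // segment_s))]


def sample_window(num_segments=10, segment_s=30, window_s=10,
                  num_video_frames=200, audio_rate_hz=16_000, rng=random):
    """Pick one stored segment and a 10-second window inside it."""
    segment_index = rng.randrange(num_segments)
    window_start = rng.uniform(0, segment_s - window_s)
    # Assumption: video frames are spaced uniformly across the window.
    step = window_s / num_video_frames
    frame_times = [window_start + i * step for i in range(num_video_frames)]
    num_audio_samples = window_s * audio_rate_hz  # 10 s * 16 kHz = 160000
    return segment_index, window_start, frame_times, num_audio_samples


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" in place of real CLIP outputs.
    query = [1.0, 0.0]
    titles = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
    print("best title index:", rank_by_similarity(query, titles, top_k=1))
    print("segment starts for a 95 s video:", segment_starts(95))
    seg, start, frames, n_audio = sample_window(rng=random.Random(0))
    print("segment", seg, "frames", len(frames), "audio samples", n_audio)
```

In an actual pipeline the toy vectors would be replaced by embeddings from a CLIP text and image encoder; the ranking and sampling logic would stay the same under the assumptions stated above.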
TALI is designed for use in a wide range of multimodal research tasks, including but not limited to:
Citation Information: TBA
Contributions: Thanks to all contributors, including data curators, annotators, and software developers.