CLIP-Kinetics700 is a compressed version of the Kinetics700 dataset, created by encoding video frames with OpenAI's CLIP model.

The original dataset is ~700 GB, which makes it difficult to use and to hold in memory on a single machine. By downsampling each video to 1 FPS and encoding the frames with CLIP, we were able to compress the dataset to ~8 GB, making it memory-friendly and easy to use.
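Conceptually, the compression step looks like the following (a minimal sketch, assuming OpenAI's `clip` package and OpenCV for decoding; the `ViT-B/32` variant and the frame-sampling logic are illustrative assumptions, not the exact pipeline used for this dataset):

```python
import cv2
import numpy as np
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_video_at_1fps(path):
    """Return a (num_seconds, embed_dim) array of CLIP frame embeddings."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if FPS metadata is missing
    step = max(1, int(round(fps)))         # keep ~1 frame per second
    embeddings, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            with torch.no_grad():
                emb = model.encode_image(preprocess(img).unsqueeze(0).to(device))
            embeddings.append(emb.squeeze(0).float().cpu().numpy())
        idx += 1
    cap.release()
    return np.stack(embeddings)
```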
clip-video-encode is a tool for easily and efficiently computing CLIP embeddings from video frames; we used it to generate the embeddings for this dataset.
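Generating the embeddings with the tool is roughly a one-liner (a sketch based on the clip-video-encode README; argument names are assumptions, so check the repo for the current signature):

```python
import glob

from clip_video_encode import clip_video_encode

VIDS = glob.glob("path/to/kinetics700/*.mp4")  # placeholder path

# Write one .npy embedding sequence per video; take_every_nth controls
# frame sampling (e.g. every 25th frame of a 25 FPS video is ~1 FPS).
clip_video_encode(VIDS, "path/to/embeddings", take_every_nth=25)
```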
We formatted this as a WebDataset for better data-loading performance when training models. Each split contains a list of tar files, each with 10,000 data samples. This format can be read and used easily with the EmbeddingWebDatasetReader from clip-video-encode; see the loading sketch after the directory tree below.
```
CLIP-Kinetics700
├── splits.csv
├── ds_00000.tar
|   ├── vid_00000.npy
|   ├── vid_00000.txt
|   ├── vid_00000.json
|   ├── vid_00001.npy
|   ├── vid_00001.txt
|   ├── vid_00001.json
|   ├── ...
|   ├── vid_10000.npy
|   ├── vid_10000.txt
|   └── vid_10000.json
├── ds_00001.tar
|   ├── vid_10001.npy
|   ├── vid_10001.txt
|   ├── vid_10001.json
│   ...
...
```
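Loading the dataset then looks roughly like this (a sketch; the constructor arguments below are assumptions mirroring EmbeddingWebDatasetReader at the time of writing, and the brace-expanded tar range is a placeholder, so check the clip-video-encode repo if anything has changed):

```python
from clip_video_encode.dataset import EmbeddingWebDatasetReader

# Placeholder shard pattern; adjust the range to the tars in your split.
urls = "CLIP-Kinetics700/train/ds_{00000..00001}.tar"

reader = EmbeddingWebDatasetReader(
    urls,
    standard_seq_len=-1,   # don't pad/truncate the frame-embedding sequences
    batch_size=1,
    num_prepro_workers=2,
    to_tensor=False,
    enable_text=True,      # yield the vid_*.txt label
    enable_meta=True,      # yield the vid_*.json metadata
)

for batch in reader:
    ...  # batch holds per-frame CLIP embeddings plus label/metadata
```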
Data was sourced from DeepMind's Kinetics700 dataset and downloaded using this convenient repository.
Using this repository, we evaluate CLIP-Kinetics700 with the following two simple methods:
Zero-shot evaluation:

| | Accuracy |
|---|---|
| Top-1 | 0.31 |
| Top-5 | 0.56 |
| mean(Top1, Top5) | 0.44 |
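A zero-shot result in this style can be reproduced along the following lines (a minimal sketch: the prompt template, the ViT-B/32 variant, and mean-pooling over frames are assumptions, not necessarily the exact setup used for the numbers above):

```python
import numpy as np
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def zero_shot_topk(frame_embeddings, class_names, k=5):
    """frame_embeddings: (num_frames, dim) array from a vid_*.npy file."""
    # Mean-pool the per-frame embeddings into one video embedding.
    video_emb = torch.from_numpy(frame_embeddings).float().mean(0, keepdim=True)
    video_emb /= video_emb.norm(dim=-1, keepdim=True)

    # Embed one prompt per class; the template is an assumption.
    tokens = clip.tokenize([f"a video of {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_emb = model.encode_text(tokens).float().cpu()
    text_emb /= text_emb.norm(dim=-1, keepdim=True)

    # Rank classes by cosine similarity and return the top-k indices.
    sims = (video_emb @ text_emb.T).squeeze(0)
    return sims.topk(k).indices.tolist()
```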
Linear-probe evaluation:

| | Accuracy |
|---|---|
| Top-1 | 0.41 |
| Top-5 | 0.65 |
| mean(Top1, Top5) | 0.53 |
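A linear probe of this kind is a standard logistic regression fit on the frozen embeddings (a sketch; the regularization and solver settings here are illustrative, not the values behind the numbers above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_top1(train_X, train_y, val_X, val_y):
    """Fit a linear classifier on mean-pooled CLIP video embeddings.

    train_X/val_X: (num_videos, dim) float arrays; train_y/val_y: int labels.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_X, train_y)
    return clf.score(val_X, val_y)  # top-1 accuracy on the val split
```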