Dataset:
DISCOX/DISCO-10M
You can download the dataset using the Hugging Face datasets library:
from datasets import load_dataset
ds = load_dataset("DISCOX/DISCO-10M")
The dataset contains the following features:
{
 'video_url_youtube',
 'video_title_youtube',
 'track_name_spotify',
 'video_duration_youtube_sec',
 'preview_url_spotify',
 'video_view_count_youtube',
 'video_thumbnail_url_youtube',
 'search_query_youtube',
 'video_description_youtube',
 'track_id_spotify',
 'album_id_spotify',
 'artist_id_spotify',
 'track_duration_spotify_ms',
 'primary_artist_name_spotify',
 'track_release_date_spotify',
 'explicit_content_spotify',
 'similarity_duration',
 'similarity_query_video_title',
 'similarity_query_description',
 'similarity_audio',
 'audio_embedding_spotify',
 'audio_embedding_youtube',
}
DISCO-10M is a music dataset created to democratize research on large-scale machine learning models for music.
The dataset contains no music due to copyright laws. The audio embedding features were computed using LAION-CLAP and can be used instead of the raw audio for many downstream tasks. If the raw audio is needed, it can be downloaded from the provided Spotify preview URL or via the YouTube link.
DISCO-10M was created by collecting a list of 400,000 artist IDs and 2.6M track IDs from Spotify, then collecting YouTube video links that match the track duration, artist name, and track name. These matches were computed using the following three similarity metrics:
For DISCO-10M, we keep only samples for which the following condition is true: duration_similarity > 0.25 and (description_similarity > 0.65 or title_similarity > 0.65) and audio_similarity > 0.4
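The filtering rule above can be sketched as a small Python predicate. This is only an illustration, not the authors' actual filtering code: the function name keep_sample and its keyword arguments are hypothetical, and it assumes that the names in the rule correspond to the feature columns similarity_duration, similarity_query_description, similarity_query_video_title, and similarity_audio listed earlier.

```python
def keep_sample(row, min_duration=0.25, min_text=0.65, min_audio=0.4):
    """Return True if a row passes the DISCO-10M thresholds.

    `row` is a dict-like record. The column names below are an assumed
    mapping from the names used in the rule to the dataset's features.
    """
    # Either the video description or the video title must match the query.
    text_match = (
        row["similarity_query_description"] > min_text
        or row["similarity_query_video_title"] > min_text
    )
    return (
        row["similarity_duration"] > min_duration
        and text_match
        and row["similarity_audio"] > min_audio
    )


# Hypothetical rows, made up purely to exercise the predicate.
good_row = {
    "similarity_duration": 0.9,
    "similarity_query_description": 0.3,
    "similarity_query_video_title": 0.8,
    "similarity_audio": 0.7,
}
bad_row = dict(good_row, similarity_audio=0.2)

print(keep_sample(good_row))
print(keep_sample(bad_row))
```

Because the published dataset is already filtered with these thresholds, a predicate like this is mainly useful for applying stricter cut-offs, for example by passing a larger min_audio to the filter method of the datasets library.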
We offer three subsets based on DISCO-10M:
To cite our work, please refer to our paper here.