In CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval, we introduce WikiMusicText (WikiMT), a new dataset for the evaluation of semantic search and music classification. It includes 1010 lead sheets in ABC notation sourced from Wikifonia.org, each accompanied by a title, artist, genre, and description. The title and artist information is extracted from the score, whereas the genre labels are obtained by matching keywords from the corresponding Wikipedia entries and assigning each piece to one of 8 classes (Jazz, Country, Folk, R&B, Pop, Rock, Dance, and Latin) that loosely mimic the GTZAN genres. The description is generated by using BART-large to summarize and clean the corresponding Wikipedia entry. Additionally, the natural language information within the ABC notation is removed.
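The two text-processing steps described above (keyword-based genre assignment and removal of natural language from the ABC notation) can be illustrated with a minimal sketch. This is not the pipeline used to build WikiMT: the function names `match_genre` and `strip_abc_text`, the rule of picking the most frequently mentioned genre, and the choice of which ABC fields to drop (`T:` title, `C:` composer, `w:`/`W:` lyrics) are all assumptions made for illustration.

```python
import re

# The 8 genre classes listed in the dataset description.
GENRES = ["Jazz", "Country", "Folk", "R&B", "Pop", "Rock", "Dance", "Latin"]

# ABC fields that carry natural language. Which fields the authors
# actually removed is not stated; this tuple is an assumption.
ABC_TEXT_FIELDS = ("T:", "C:", "w:", "W:")


def match_genre(wiki_text):
    """Return the genre mentioned most often as a whole word, or None.

    Hypothetical rule: ties are broken by the order of GENRES.
    """
    counts = {}
    for genre in GENRES:
        # Lookarounds keep "pop" from matching inside "popular".
        pattern = r"(?<!\w)" + re.escape(genre) + r"(?!\w)"
        counts[genre] = len(re.findall(pattern, wiki_text, flags=re.IGNORECASE))
    best = max(GENRES, key=lambda g: counts[g])
    return best if counts[best] > 0 else None


def strip_abc_text(abc):
    """Drop ABC lines that start with one of the assumed text fields."""
    kept = [line for line in abc.splitlines()
            if not line.startswith(ABC_TEXT_FIELDS)]
    return "\n".join(kept)


if __name__ == "__main__":
    tune = "X:1\nT:Example Tune\nC:Someone\nM:4/4\nK:C\nCDEF|\nw:la la la la"
    print(match_genre("The song is a jazz standard."))
    print(strip_abc_text(tune))
```

Simple keyword matching of this kind can mislabel entries whose Wikipedia text mentions a genre only in passing, so labels produced this way should be treated as approximate.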
WikiMT is a unique resource to support the evaluation of semantic search and music classification. However, because the dataset was curated from publicly available sources, the genre and description information may be inaccurate or incomplete. Further research is needed to explore the potential biases and limitations of the dataset and to develop strategies to address them. To support such investigations, we also provide the source files of WikiMT, including the MusicXML files from Wikifonia and the original entries from Wikipedia.
WikiMT was curated from publicly available sources and is believed to be in the public domain. However, copyright issues cannot be entirely ruled out, so users of the dataset should exercise caution when using it. The authors of WikiMT do not assume any legal responsibility for the use of the dataset. If you have any questions or concerns regarding the dataset's copyright status, please contact the authors at shangda@mail.ccom.edu.cn.
@misc{wu2023clamp,
  title={CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval},
  author={Shangda Wu and Dingyao Yu and Xu Tan and Maosong Sun},
  year={2023},
  eprint={2304.11029},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}