Dataset: neuclir/csl
Tasks: Text Retrieval
Sub-tasks: document-retrieval
Size: 100K<n<1M
Annotation creators: no-annotation
Source datasets: extended|csl
License: apache-2.0

CSL is the Chinese Scientific Literature Dataset.
The dataset contains the titles, abstracts, and keywords of papers written in Chinese from several academic fields.
Split | Documents |
---|---|
csl | 396k |
en_translation | 396k |
The en_translation split contains the same documents machine-translated by the Google Translation service. All text in that split is in English, so the fields category_eng and discipline_eng are omitted.
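Because the two splits differ in which fields they carry, code that handles both may want to read the English labels defensively. The following is a minimal sketch, not part of the dataset's own tooling: the helper name `english_labels` and the sample records are hypothetical, and the base field names `category` and `discipline` are assumed from the `_eng` field names mentioned above.

```python
from typing import Any, Dict, Mapping


def english_labels(record: Mapping[str, Any]) -> Dict[str, Any]:
    """Return English category/discipline labels for a record from either split.

    Assumption: records in the `csl` split carry `category_eng` and
    `discipline_eng`, while `en_translation` records omit them and keep
    English text in `category` and `discipline` (assumed field names).
    """
    return {
        "category": record.get("category_eng", record.get("category")),
        "discipline": record.get("discipline_eng", record.get("discipline")),
    }


# Hypothetical records for illustration only; not real rows from the dataset.
csl_like = {
    "category": "工学",
    "category_eng": "Engineering",
    "discipline": "计算机科学与技术",
    "discipline_eng": "Computer Science and Technology",
}
translated_like = {
    "category": "Engineering",
    "discipline": "Computer Science and Technology",
}

labels_a = english_labels(csl_like)
labels_b = english_labels(translated_like)
```

Under these assumptions, both calls yield the same English labels, so downstream code does not need to branch on the split name.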
Using 🤗 Datasets:

    from datasets import load_dataset

    dataset = load_dataset('neuclir/csl')['csl']

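Retrieval experiments usually need to look documents up by identifier. As a sketch under stated assumptions, the snippet below builds an in-memory index keyed by `doc_id`. The function name `build_doc_index` and the sample rows are hypothetical, and the `title` field name is assumed rather than confirmed by this card.

```python
from typing import Any, Dict, Iterable, Mapping


def build_doc_index(
    records: Iterable[Mapping[str, Any]],
) -> Dict[str, Mapping[str, Any]]:
    """Map each record's `doc_id` to the record, keeping the first occurrence."""
    index: Dict[str, Mapping[str, Any]] = {}
    for record in records:
        # setdefault leaves an existing entry untouched, so duplicates are ignored
        index.setdefault(record["doc_id"], record)
    return index


# Hypothetical rows standing in for dataset records (values are made up).
sample_rows = [
    {"doc_id": "doc-1", "title": "Example title A"},
    {"doc_id": "doc-2", "title": "Example title B"},
]
doc_index = build_doc_index(sample_rows)
print(doc_index["doc-2"]["title"])
```

With the real data, the loaded `dataset` object can be passed in place of `sample_rows`, since iterating over a `datasets.Dataset` yields one dictionary per row.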
This dataset is based on the Chinese Scientific Literature Dataset, which is released under Apache 2.0. The primary changes are the addition of doc_ids, English translations of the category and discipline descriptions by a native speaker, and basic de-duplication. The code that performed these modifications is available in this repository.
If you use this data, please cite:
@inproceedings{li-etal-2022-csl,
    title = "{CSL}: A Large-scale {C}hinese Scientific Literature Dataset",
    author = "Li, Yudong and Zhang, Yuqing and Zhao, Zhe and Shen, Linlin and Liu, Weijie and Mao, Weiquan and Zhang, Hui",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.344",
    pages = "3917--3923",
}