Dataset: neuclir/csl
Sub-tasks: document-retrieval
Size: 100K<n<1M
Annotations creators: no-annotation
Source datasets: extended|csl
CSL is the Chinese Scientific Literature Dataset.
The dataset contains the titles, abstracts, and keywords of papers written in Chinese from several academic fields.
| Split | Documents |
|---|---|
| csl | 396k |
| en_translation | 396k |
The en_translation split contains the same documents translated into English with the Google Translation service. Because all text in that split is in English, the fields category_eng and discipline_eng are omitted.
Using 🤗 Datasets:
from datasets import load_dataset
dataset = load_dataset('neuclir/csl')['csl']
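Because the two splits differ in which fields they carry, downstream code is easier to share if it tolerates missing fields. The following is a minimal sketch, not part of the dataset's own tooling: the field names `title`, `abstract`, and `keywords` are assumptions inferred from the description above (only `doc_id`, `category_eng`, and `discipline_eng` are named in this card), and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterable


def doc_to_text(record: Dict[str, Any]) -> str:
    """Join the textual fields of one record into a single string for indexing.

    Assumed field names: "title", "abstract", "keywords".
    """
    title = record.get("title") or ""
    abstract = record.get("abstract") or ""
    keywords = record.get("keywords") or []
    # Keywords may be stored as a list or as a single string; accept both.
    if isinstance(keywords, str):
        keywords = [keywords]
    parts = [title, abstract, " ".join(keywords)]
    # Skip empty parts so missing fields do not leave blank lines.
    return "\n".join(p for p in parts if p)


def english_labels(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return category_eng / discipline_eng when present.

    These fields are omitted from the en_translation split, so the result
    is an empty dict for records from that split.
    """
    return {k: record[k] for k in ("category_eng", "discipline_eng") if k in record}


def index_by_doc_id(records: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Map each doc_id to its concatenated text."""
    return {r["doc_id"]: doc_to_text(r) for r in records}
```

Under these assumptions, `index_by_doc_id(dataset)` would work on either split returned by `load_dataset('neuclir/csl')`, since iterating a split yields one dictionary per document.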
This dataset is based on the Chinese Scientific Literature Dataset, which is released under Apache 2.0. The primary changes are the addition of doc_ids, English translations of the category and discipline descriptions by a native speaker, and basic de-duplication. The code that performed these modifications is available in this repository.
If you use this data, please cite:
@inproceedings{li-etal-2022-csl,
title = "{CSL}: A Large-scale {C}hinese Scientific Literature Dataset",
author = "Li, Yudong and
Zhang, Yuqing and
Zhao, Zhe and
Shen, Linlin and
Liu, Weijie and
Mao, Weiquan and
Zhang, Hui",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2022.coling-1.344",
pages = "3917--3923",
}