Dataset:

neuclir/csl

Languages:

zh en

Size:

100K<n<1M

Annotation creators:

no-annotation

Source datasets:

extended|csl

License:

apache-2.0

Dataset Card for CSL

Dataset Description

CSL is the Chinese Scientific Literature Dataset.

Dataset Summary

The dataset contains the titles, abstracts, and keywords of papers written in Chinese from several academic fields.

Languages

  • Chinese
  • English (translation)

Dataset Structure

Data Instances

Split            Documents
csl              396k
en_translation   396k

Data Fields

  • doc_id : unique identifier for this document
  • title : title of the paper
  • abstract : abstract of the paper
  • keywords : keywords associated with the paper
  • category : the broad category of the paper
  • category_eng : English translation of the broad category (e.g., Engineering)
  • discipline : academic discipline of the paper
  • discipline_eng : English translation of the academic discipline (e.g., Agricultural Engineering)

The en_translation split contains the documents as translated by Google's translation service. All text is in English, so the fields category_eng and discipline_eng are omitted.
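The field layout described above can be sketched as a small presence check. This is an illustration under assumptions, not part of the dataset's tooling: `expected_fields` and `missing_fields` are hypothetical helpers, this card does not specify value types so only key names are checked, and the sample record is invented.

```python
# Field names as listed under Data Fields for the `csl` split.
CSL_FIELDS = (
    "doc_id", "title", "abstract", "keywords",
    "category", "category_eng", "discipline", "discipline_eng",
)


def expected_fields(split):
    """Return the field names this card lists for the given split."""
    if split == "csl":
        return CSL_FIELDS
    if split == "en_translation":
        # The card states that the *_eng fields are omitted from this split.
        return tuple(f for f in CSL_FIELDS if not f.endswith("_eng"))
    raise ValueError(f"unknown split: {split!r}")


def missing_fields(record, split):
    """List expected field names that are absent from a dict-like record."""
    return [f for f in expected_fields(split) if f not in record]


# Invented record for illustration; real values come from the dataset.
record = {
    "doc_id": "example-id", "title": "...", "abstract": "...",
    "keywords": "...", "category": "...", "discipline": "...",
}
print(missing_fields(record, "en_translation"))  # []
print(missing_fields(record, "csl"))  # ['category_eng', 'discipline_eng']
```

A check like this may be handy when code written against one split is later pointed at the other.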

Dataset Usage

Using 🤗 Datasets:

from datasets import load_dataset

dataset = load_dataset('neuclir/csl')['csl']
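As a rough sketch of working with a loaded split, the helper below tallies documents per English category label. It assumes each record behaves like a dict keyed by the field names listed under Data Fields. `count_by_category` is a hypothetical helper, not part of the datasets library, and the sample records (including their category values) are made up so the example runs without downloading anything.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_category(records: Iterable[Dict], field: str = "category_eng") -> Counter:
    """Count records per value of `field`, skipping records that lack it."""
    counts = Counter()
    for record in records:
        value = record.get(field)
        if value is not None:
            counts[value] += 1
    return counts


# Made-up records standing in for rows of the `csl` split.
sample = [
    {"doc_id": "a", "category_eng": "Engineering"},
    {"doc_id": "b", "category_eng": "Engineering"},
    {"doc_id": "c", "category_eng": "Medicine"},
]
print(count_by_category(sample))
```

With the real data, passing the `dataset` object loaded above should work the same way, since iterating over a split yields one dict per document. For the en_translation split, which omits category_eng, `field="category"` would be the analogous choice.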

License & Citation

This dataset is based on the Chinese Scientific Literature Dataset, which is released under Apache 2.0. The primary changes are the addition of doc_ids, English translations of the category and discipline descriptions by a native speaker, and basic de-duplication. The code that performed these modifications is available in this repository.

If you use this data, please cite:

@inproceedings{li-etal-2022-csl,
    title = "{CSL}: A Large-scale {C}hinese Scientific Literature Dataset",
    author = "Li, Yudong  and
      Zhang, Yuqing  and
      Zhao, Zhe  and
      Shen, Linlin  and
      Liu, Weijie  and
      Mao, Weiquan  and
      Zhang, Hui",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.344",
    pages = "3917--3923",
}