Dataset Card for CSL

Dataset Description

CSL is the Chinese Scientific Literature Dataset.

Paper: https://aclanthology.org/2022.coling-1.344
Repository: https://github.com/ydli-ai/CSL

Dataset Summary

The dataset contains the titles, abstracts, and keywords of papers written in Chinese from several academic fields.

Languages

Chinese
English (translation)

Dataset Structure

Data Instances

Split	Documents
csl	396k
en_translation	396k

Data Fields

doc_id : unique identifier for this document
title : title of the paper
abstract : abstract of the paper
keywords : keywords associated with the paper
category : the broad category of the paper
category_eng : English translation of the broad category (e.g., Engineering)
discipline : academic discipline of the paper
discipline_eng : English translation of the academic discipline (e.g., Agricultural Engineering)

The en_translation split contains the same documents machine-translated into English with the Google Translation service. Because all of its text is in English, the fields category_eng and discipline_eng are omitted.
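As a rough sketch of how these fields might be combined, the snippet below builds one text string per document, for example before indexing. The helper name `record_to_text` and the sample record are hypothetical, and the code assumes that `keywords` may be either a single string or a list of strings, which this card does not specify.

```python
from typing import Any, Mapping


def record_to_text(record: Mapping[str, Any]) -> str:
    """Join title, abstract, and keywords into one string (hypothetical helper)."""
    keywords = record.get("keywords") or ""
    # Assumption: keywords may be a list of strings or a single string.
    if isinstance(keywords, (list, tuple)):
        keywords = " ".join(str(k) for k in keywords)
    parts = [record.get("title") or "", record.get("abstract") or "", keywords]
    # Drop empty fields so the result has no stray separators.
    return "\n".join(p.strip() for p in parts if p and p.strip())


# Placeholder record for illustration only; not taken from the dataset.
sample = {
    "doc_id": "doc-0",
    "title": "A title",
    "abstract": "An abstract.",
    "keywords": ["k1", "k2"],
}
print(record_to_text(sample))
```

Because the helper reads fields with `.get`, it also works on en_translation records, which lack category_eng and discipline_eng.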

Dataset Usage

Using 🤗 Datasets:

from datasets import load_dataset

dataset = load_dataset('neuclir/csl')['csl']
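The English translations can be loaded the same way by selecting the en_translation split. The sketch below also pairs each Chinese record with its translation. It assumes that doc_id values are shared between the two splits, which this card does not state explicitly; `load_splits` and `pair_by_doc_id` are hypothetical helper names.

```python
from typing import Any, Dict, Iterable, Mapping, Tuple

Row = Mapping[str, Any]


def load_splits():
    """Load both splits named in this card (downloads data, needs network)."""
    # Imported lazily so the pairing helper below works without the library.
    from datasets import load_dataset

    data = load_dataset("neuclir/csl")
    return data["csl"], data["en_translation"]


def pair_by_doc_id(zh_rows: Iterable[Row], en_rows: Iterable[Row]) -> Dict[str, Tuple[Row, Row]]:
    """Pair rows with the same doc_id.

    Assumption: doc_id values match across the csl and en_translation splits.
    Rows without a counterpart are skipped.
    """
    en_index = {row["doc_id"]: row for row in en_rows}
    return {
        row["doc_id"]: (row, en_index[row["doc_id"]])
        for row in zh_rows
        if row["doc_id"] in en_index
    }


# Typical use (not run here because it downloads the dataset):
#   zh, en = load_splits()
#   pairs = pair_by_doc_id(zh, en)
```

Building the index as a plain dictionary keeps the example simple; with roughly 396k documents per split, it holds one reference per translated record in memory.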

License & Citation

This dataset is based on the Chinese Scientific Literature Dataset, which is released under Apache 2.0. The primary changes are the addition of doc_id values, English translations of the category and discipline descriptions by a native speaker, and basic de-duplication. The code that performed these modifications is available in this repository.

If you use this data, please cite:

@inproceedings{li-etal-2022-csl,
    title = "{CSL}: A Large-scale {C}hinese Scientific Literature Dataset",
    author = "Li, Yudong  and
      Zhang, Yuqing  and
      Zhao, Zhe  and
      Shen, Linlin  and
      Liu, Weijie  and
      Mao, Weiquan  and
      Zhang, Hui",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.344",
    pages = "3917--3923",
}

Author:

neuclir

Dataset size:

223.73 MB