Datasets:

clue

Tasks:

Text classification

Multiple choice

Sub-tasks:

topic-classification semantic-similarity-scoring natural-language-inference

Languages:

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

other

Annotation creators:

other

Source datasets:

original

Other:

coreference-nli qa-nli

License:

license:unknown

Dataset Card for "clue"

Dataset Summary

CLUE, a Chinese Language Understanding Evaluation benchmark (https://www.cluebenchmarks.com/), is a collection of resources for training, evaluating, and analyzing Chinese language understanding systems.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

afqmc

Size of downloaded dataset files: 1.20 MB
Size of the generated dataset: 4.20 MB
Total amount of disk used: 5.40 MB

An example of 'validation' looks as follows.

{
    "idx": 0,
    "label": 0,
    "sentence1": "双十一花呗提额在哪",
    "sentence2": "里可以提花呗额度"
}

c3

Size of downloaded dataset files: 3.20 MB
Size of the generated dataset: 15.69 MB
Total amount of disk used: 18.90 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "answer": "比人的灵敏",
    "choice": ["没有人的灵敏", "和人的差不多", "和人的一样好", "比人的灵敏"],
    "context": "[\"许多动物的某些器官感觉特别灵敏，它们能比人类提前知道一些灾害事件的发生，例如，海洋中的水母能预报风暴，老鼠能事先躲避矿井崩塌或有害气体，等等。地震往往能使一些动物的某些感觉器官受到刺激而发生异常反应。如一个地区的重力发生变异，某些动物可能通过它们的平衡...",
    "id": 1,
    "question": "动物的器官感觉与人的相比有什么不同?"
}
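
The multiple-choice record above stores the correct option as text in answer rather than as a position in choice. A model that scores each option usually needs the position instead. The following is a minimal sketch of that conversion, assuming the answer string appears verbatim in choice (which holds for the record shown). The helper name answer_index is hypothetical and not part of the dataset.

```python
from typing import Sequence


def answer_index(choice: Sequence[str], answer: str) -> int:
    """Return the position of `answer` within `choice`.

    Raises ValueError if the answer text is not one of the options.
    """
    return list(choice).index(answer)


# Fields copied from the record shown above ("context" is omitted
# because it is cropped in the card).
record = {
    "answer": "比人的灵敏",
    "choice": ["没有人的灵敏", "和人的差不多", "和人的一样好", "比人的灵敏"],
    "id": 1,
    "question": "动物的器官感觉与人的相比有什么不同?",
}

gold = answer_index(record["choice"], record["answer"])
print(gold)  # 3
```

Raising on a missing answer, rather than returning a sentinel, makes a malformed record fail loudly during preprocessing.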

chid

Size of downloaded dataset files: 139.20 MB
Size of the generated dataset: 274.08 MB
Total amount of disk used: 413.28 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "answers": {
        "candidate_id": [3, 5, 6, 1, 7, 4, 0],
        "text": ["碌碌无为", "无所作为", "苦口婆心", "得过且过", "未雨绸缪", "软硬兼施", "传宗接代"]
    },
    "candidates": "[\"传宗接代\", \"得过且过\", \"咄咄逼人\", \"碌碌无为\", \"软硬兼施\", \"无所作为\", \"苦口婆心\", \"未雨绸缪\", \"和衷共济\", \"人老珠黄\"]...",
    "content": "[\"谈到巴萨目前的成就，瓜迪奥拉用了“坚持”两个字来形容。自从上世纪90年代克鲁伊夫带队以来，巴萨就坚持每年都有拉玛西亚球员进入一队的传统。即便是范加尔时代，巴萨强力推出的“巴萨五鹰”德拉·佩纳、哈维、莫雷罗、罗杰·加西亚和贝拉乌桑几乎#idiom0000...",
    "idx": 0
}
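
In the chid record above, each value in answers.candidate_id appears to be an index into candidates, with answers.text holding the idiom found at that index. The sketch below checks that reading against the visible data. It assumes the ten candidates shown before the crop marker form the list being indexed; the helper name resolve_answers is hypothetical.

```python
from typing import List


def resolve_answers(candidates: List[str], candidate_ids: List[int]) -> List[str]:
    """Look up each answer id in the candidate list."""
    return [candidates[i] for i in candidate_ids]


# The ten candidates visible in the cropped record above.
candidates = [
    "传宗接代", "得过且过", "咄咄逼人", "碌碌无为", "软硬兼施",
    "无所作为", "苦口婆心", "未雨绸缪", "和衷共济", "人老珠黄",
]
answers = {
    "candidate_id": [3, 5, 6, 1, 7, 4, 0],
    "text": ["碌碌无为", "无所作为", "苦口婆心", "得过且过",
             "未雨绸缪", "软硬兼施", "传宗接代"],
}

resolved = resolve_answers(candidates, answers["candidate_id"])
consistent = resolved == answers["text"]
print(consistent)  # True
```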

cluewsc2020

Size of downloaded dataset files: 0.28 MB
Size of the generated dataset: 1.03 MB
Total amount of disk used: 1.29 MB

An example of 'train' looks as follows.

{
    "idx": 0,
    "label": 1,
    "target": {
        "span1_index": 3,
        "span1_text": "伤口",
        "span2_index": 27,
        "span2_text": "它们"
    },
    "text": "裂开的伤口涂满尘土，里面有碎石子和木头刺，我小心翼翼把它们剔除出去。"
}
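
The span1_index and span2_index values in the record above can be read as 0-based character offsets into text. The sketch below verifies that interpretation for this one record; whether it holds for every record is an assumption, and the helper name span_matches is hypothetical.

```python
def span_matches(text: str, index: int, span_text: str) -> bool:
    """Check that `span_text` occurs in `text` starting at character `index`."""
    return text[index:index + len(span_text)] == span_text


# Record copied from the example above.
record = {
    "idx": 0,
    "label": 1,
    "target": {
        "span1_index": 3,
        "span1_text": "伤口",
        "span2_index": 27,
        "span2_text": "它们",
    },
    "text": "裂开的伤口涂满尘土，里面有碎石子和木头刺，我小心翼翼把它们剔除出去。",
}

target = record["target"]
span1_ok = span_matches(record["text"], target["span1_index"], target["span1_text"])
span2_ok = span_matches(record["text"], target["span2_index"], target["span2_text"])
print(span1_ok, span2_ok)  # True True
```

Python strings are indexed by Unicode code point, so the offsets line up with Chinese characters directly; a byte-based tokenizer would need a separate offset mapping.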

cmnli

Size of downloaded dataset files: 31.40 MB
Size of the generated dataset: 72.12 MB
Total amount of disk used: 103.53 MB

An example of 'train' looks as follows.

{
    "idx": 0,
    "label": 0,
    "sentence1": "从概念上讲，奶油略读有两个基本维度-产品和地理。",
    "sentence2": "产品和地理位置是使奶油撇油起作用的原因。"
}

Data Fields

The data fields are the same among all splits.

afqmc

sentence1 : a string feature.
sentence2 : a string feature.
label : a classification label, with possible values including 0 (0), 1 (1).
idx : an int32 feature.

c3

id : an int32 feature.
context : a list of string features.
question : a string feature.
choice : a list of string features.
answer : a string feature.

chid

idx : an int32 feature.
candidates : a list of string features.
content : a list of string features.
answers : a dictionary feature containing:
- text : a string feature.
- candidate_id : an int32 feature.

cluewsc2020

idx : an int32 feature.
text : a string feature.
label : a classification label, with possible values including true (0), false (1).
span1_text : a string feature.
span2_text : a string feature.
span1_index : an int32 feature.
span2_index : an int32 feature.

cmnli

sentence1 : a string feature.
sentence2 : a string feature.
label : a classification label, with possible values including neutral (0), entailment (1), contradiction (2).
idx : an int32 feature.
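
The cmnli label is stored as an integer, so reading predictions or examples requires the id-to-name mapping listed above. A minimal sketch of that mapping follows. The order is taken from this card; the names CMNLI_LABELS, label_name, and label_id are hypothetical, and it is worth confirming the order against the features of the loaded dataset before relying on it.

```python
from typing import Tuple

# Order as listed in this card: neutral (0), entailment (1), contradiction (2).
CMNLI_LABELS: Tuple[str, ...] = ("neutral", "entailment", "contradiction")


def label_name(label: int) -> str:
    """Convert an integer label to its name, rejecting out-of-range ids."""
    if not 0 <= label < len(CMNLI_LABELS):
        raise ValueError(f"unknown cmnli label id: {label}")
    return CMNLI_LABELS[label]


def label_id(name: str) -> int:
    """Convert a label name back to its integer id."""
    return CMNLI_LABELS.index(name)


print(label_name(0))  # neutral
```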

Data Splits

name	train	validation	test
afqmc	34334	4316	3861
c3	11869	3816	3892
chid	84709	3218	3231
cluewsc2020	1244	304	290
cmnli	391783	12241	13880
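
To compare configurations by overall size, the split counts above can be parsed and summed per configuration. This is a small sketch using only the Python standard library; the numbers are copied from the table, and the names SPLITS_TSV and parse_splits are hypothetical.

```python
import csv
import io
from typing import Dict

# Split counts copied from the table above (tab-separated).
SPLITS_TSV = """name\ttrain\tvalidation\ttest
afqmc\t34334\t4316\t3861
c3\t11869\t3816\t3892
chid\t84709\t3218\t3231
cluewsc2020\t1244\t304\t290
cmnli\t391783\t12241\t13880"""


def parse_splits(tsv: str) -> Dict[str, Dict[str, int]]:
    """Parse the split table into {config: {split: count}}."""
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
    return {
        row["name"]: {k: int(v) for k, v in row.items() if k != "name"}
        for row in reader
    }


splits = parse_splits(SPLITS_TSV)
totals = {name: sum(counts.values()) for name, counts in splits.items()}
print(totals["afqmc"])  # 42511
```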

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@inproceedings{xu-etal-2020-clue,
    title = "{CLUE}: A {C}hinese Language Understanding Evaluation Benchmark",
    author = "Xu, Liang  and
      Hu, Hai  and
      Zhang, Xuanwei  and
      Li, Lu  and
      Cao, Chenjie  and
      Li, Yudong  and
      Xu, Yechen  and
      Sun, Kai  and
      Yu, Dian  and
      Yu, Cong  and
      Tian, Yin  and
      Dong, Qianqian  and
      Liu, Weitang  and
      Shi, Bo  and
      Cui, Yiming  and
      Li, Junyi  and
      Zeng, Jun  and
      Wang, Rongzhao  and
      Xie, Weijian  and
      Li, Yanting  and
      Patterson, Yina  and
      Tian, Zuoyu  and
      Zhang, Yiwen  and
      Zhou, He  and
      Liu, Shaoweihua  and
      Zhao, Zhe  and
      Zhao, Qipeng  and
      Yue, Cong  and
      Zhang, Xinrui  and
      Yang, Zhengliang  and
      Richardson, Kyle  and
      Lan, Zhenzhong",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2020.coling-main.419",
    doi = "10.18653/v1/2020.coling-main.419",
    pages = "4762--4772",
}

Contributions

Thanks to @thomwolf and @JetRunner for adding this dataset.

Author:

Anonymous

Dataset size:

72.66 KB
