Dataset: shibing624/nli_zh

Language: Chinese

Dataset Card for NLI_zh

Dataset Summary

A collection of common Chinese semantic matching datasets covering five tasks: ATEC, BQ, LCQMC, PAWSX, and STS-B.

Data sources:

Supported Tasks and Leaderboards

Supported Tasks: Chinese text matching, text similarity computation, and related tasks.

Results on Chinese matching tasks are still rarely reported in top-conference papers, so here are results from my own training runs:

Leaderboard: NLI_zh leaderboard

Languages

All datasets consist of Simplified Chinese text.

Dataset Structure

Data Instances

An example of 'train' looks as follows.

{
  "sentence1": "刘诗诗杨幂谁漂亮",
  "sentence2": "刘诗诗和杨幂谁漂亮",
  "label": 1
}
{
  "sentence1": "汇理财怎么样",
  "sentence2": "怎么样去理财",
  "label": 0
}
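
To make the similarity labels concrete, here is a minimal sketch of a character-overlap baseline applied to the two instances above. It is a toy heuristic for illustration only, not the method behind any reported results; the function names and the 0.8 threshold are assumptions chosen for this example.

```python
def char_jaccard(sentence1: str, sentence2: str) -> float:
    """Jaccard overlap between the character sets of two sentences."""
    chars1, chars2 = set(sentence1), set(sentence2)
    if not chars1 and not chars2:
        return 1.0  # two empty strings are treated as identical
    return len(chars1 & chars2) / len(chars1 | chars2)


def predict_label(sentence1: str, sentence2: str, threshold: float = 0.8) -> int:
    """Return 1 (similar) if the overlap reaches the threshold, else 0.

    The default threshold is an arbitrary value picked for illustration.
    """
    return int(char_jaccard(sentence1, sentence2) >= threshold)


if __name__ == "__main__":
    pairs = [
        ("刘诗诗杨幂谁漂亮", "刘诗诗和杨幂谁漂亮"),  # labeled 1 in the card
        ("汇理财怎么样", "怎么样去理财"),            # labeled 0 in the card
    ]
    for s1, s2 in pairs:
        print(s1, s2, round(char_jaccard(s1, s2), 3), predict_label(s1, s2))
```

A heuristic like this ignores word order and meaning, so it serves only as a sanity check or a lower bound when comparing trained models.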

Data Fields

The data fields are the same among all splits.

  • sentence1 : a string feature.
  • sentence2 : a string feature.
  • label : a classification label with two possible values: similar (1) and dissimilar (0).
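
The field list above can be expressed as a small schema check. This is a sketch under the assumption that every record is a mapping with exactly these three fields and a binary label, as described in this card; `SentencePair` and `validate_pair` are hypothetical names, not part of the dataset's loading code.

```python
from typing import TypedDict


class SentencePair(TypedDict):
    sentence1: str
    sentence2: str
    label: int


def validate_pair(record: dict) -> SentencePair:
    """Check that a record has the three documented fields with the expected types."""
    for field in ("sentence1", "sentence2"):
        if not isinstance(record.get(field), str):
            raise ValueError(f"{field} must be a string")
    label = record.get("label")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if not isinstance(label, int) or isinstance(label, bool) or label not in (0, 1):
        raise ValueError("label must be 0 (dissimilar) or 1 (similar)")
    return SentencePair(
        sentence1=record["sentence1"],
        sentence2=record["sentence2"],
        label=label,
    )


if __name__ == "__main__":
    print(validate_pair({"sentence1": "汇理财怎么样", "sentence2": "怎么样去理财", "label": 0}))
```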

Data Splits

ATEC
$ wc -l ATEC/*
   20000 ATEC/ATEC.test.data
   62477 ATEC/ATEC.train.data
   20000 ATEC/ATEC.valid.data
  102477 total
BQ
$ wc -l BQ/*
   10000 BQ/BQ.test.data
  100000 BQ/BQ.train.data
   10000 BQ/BQ.valid.data
  120000 total
LCQMC
$ wc -l LCQMC/*
   12500 LCQMC/LCQMC.test.data
  238766 LCQMC/LCQMC.train.data
    8802 LCQMC/LCQMC.valid.data
  260068 total
PAWSX
$ wc -l PAWSX/*
    2000 PAWSX/PAWSX.test.data
   49401 PAWSX/PAWSX.train.data
    2000 PAWSX/PAWSX.valid.data
   53401 total
STS-B
$ wc -l STS-B/*
    1361 STS-B/STS-B.test.data
    5231 STS-B/STS-B.train.data
    1458 STS-B/STS-B.valid.data
    8050 total
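
The line counts above can be summarized programmatically. The sketch below copies the numbers from the `wc -l` output and computes per-task totals and the training share; it assumes each line in a `.data` file is one example, which the card does not state explicitly. The names `SPLIT_SIZES` and `summarize` are introduced here for illustration.

```python
# Line counts copied from the wc -l listings above.
SPLIT_SIZES = {
    "ATEC": {"train": 62477, "valid": 20000, "test": 20000},
    "BQ": {"train": 100000, "valid": 10000, "test": 10000},
    "LCQMC": {"train": 238766, "valid": 8802, "test": 12500},
    "PAWSX": {"train": 49401, "valid": 2000, "test": 2000},
    "STS-B": {"train": 5231, "valid": 1458, "test": 1361},
}


def summarize(split_sizes: dict) -> dict:
    """Return the total line count and the training share for each task."""
    summary = {}
    for task, sizes in split_sizes.items():
        total = sum(sizes.values())
        summary[task] = {
            "total": total,
            "train_fraction": sizes["train"] / total,
        }
    return summary


if __name__ == "__main__":
    for task, stats in summarize(SPLIT_SIZES).items():
        print(f"{task}: total={stats['total']}, train share={stats['train_fraction']:.1%}")
```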

Dataset Creation

Curation Rationale

This is a Chinese NLI (natural language inference) dataset. It has been uploaded to Hugging Face datasets so that it is easy for everyone to use.

Source Data

Initial Data Collection and Normalization

Who are the source language producers?

Copyright of each dataset belongs to its original authors. Please respect the copyright of the original datasets when using them.

BQ: Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, Buzhou Tang. The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification. EMNLP 2018.

Annotations

Annotation process

Who are the annotators?

The original authors.

Personal and Sensitive Information

Considerations for Using the Data

Social Impact of Dataset

This dataset was developed as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, in the task of predicting truth conditions in a given context.

Systems that are successful at such a task may be more successful in modeling semantic representations.

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

  • 苏剑林 (Su Jianlin) organized the file names.
  • I uploaded the data to Hugging Face datasets.

Licensing Information

For academic research use.

The BQ corpus is free to the public for academic research.

Contributions

Thanks to @shibing624 for adding this dataset.