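Dataset: shibing624/nli_zh
Tasks: Text Classification
Languages: zh
Multilinguality: monolingual
Language creators: shibing624
Annotations creators: shibing624
Preprint: arxiv:1908.11828
License: cc-by-4.0

A collection of common Chinese semantic matching datasets, covering five tasks: ATEC, BQ, LCQMC, PAWSX, and STS-B.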
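Data sources:
Supported Tasks: Chinese text matching, text similarity computation, and related tasks.
Results on Chinese matching tasks rarely appear in top-conference papers at present, so I list the results from my own training runs here: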
Leaderboard: NLI_zh leaderboard
All datasets consist of Simplified Chinese text.
An example of 'train' looks as follows.
{
  "sentence1": "刘诗诗杨幂谁漂亮",
  "sentence2": "刘诗诗和杨幂谁漂亮",
  "label": 1
}
{
  "sentence1": "汇理财怎么样",
  "sentence2": "怎么样去理财",
  "label": 0
}
The data fields are the same among all splits.
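As a minimal sketch of working with records of this shape, the snippet below parses one JSON object and checks that the three fields seen in the example (sentence1, sentence2, label) are present. The names PairRecord and parse_record are hypothetical helpers introduced for illustration, not part of the dataset or of any library, and the only assumption made about label is that it is an integer.

```python
import json
from typing import TypedDict


class PairRecord(TypedDict):
    """One sentence-pair record, using the field names from the example above."""
    sentence1: str
    sentence2: str
    label: int


def parse_record(line: str) -> PairRecord:
    """Parse one JSON object and check that the three expected fields exist."""
    obj = json.loads(line)
    missing = {"sentence1", "sentence2", "label"} - set(obj)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return PairRecord(
        sentence1=str(obj["sentence1"]),
        sentence2=str(obj["sentence2"]),
        label=int(obj["label"]),
    )


# The first example record from the card, written as valid JSON.
record = parse_record(
    '{"sentence1": "刘诗诗杨幂谁漂亮", "sentence2": "刘诗诗和杨幂谁漂亮", "label": 1}'
)
print(record["sentence1"], record["sentence2"], record["label"])
```

A record that lacks any of the three fields raises ValueError, which makes malformed rows easy to spot before training.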
ATEC
$ wc -l ATEC/*
   20000 ATEC/ATEC.test.data
   62477 ATEC/ATEC.train.data
   20000 ATEC/ATEC.valid.data
  102477 total

BQ
$ wc -l BQ/*
   10000 BQ/BQ.test.data
  100000 BQ/BQ.train.data
   10000 BQ/BQ.valid.data
  120000 total

LCQMC
$ wc -l LCQMC/*
   12500 LCQMC/LCQMC.test.data
  238766 LCQMC/LCQMC.train.data
    8802 LCQMC/LCQMC.valid.data
  260068 total

PAWSX
$ wc -l PAWSX/*
    2000 PAWSX/PAWSX.test.data
   49401 PAWSX/PAWSX.train.data
    2000 PAWSX/PAWSX.valid.data
   53401 total

STS-B
$ wc -l STS-B/*
    1361 STS-B/STS-B.test.data
    5231 STS-B/STS-B.train.data
    1458 STS-B/STS-B.valid.data
    8050 total
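To compare how the five tasks are split, the line counts from the wc -l listings above can be turned into per-split fractions. The sketch below copies those counts by hand; SPLIT_SIZES and split_fractions are hypothetical names used only for this illustration.

```python
# Line counts copied from the `wc -l` listings above.
SPLIT_SIZES = {
    "ATEC": {"train": 62477, "valid": 20000, "test": 20000},
    "BQ": {"train": 100000, "valid": 10000, "test": 10000},
    "LCQMC": {"train": 238766, "valid": 8802, "test": 12500},
    "PAWSX": {"train": 49401, "valid": 2000, "test": 2000},
    "STS-B": {"train": 5231, "valid": 1458, "test": 1361},
}


def split_fractions(task: str) -> dict[str, float]:
    """Return each split's share of the task's total line count."""
    sizes = SPLIT_SIZES[task]
    total = sum(sizes.values())
    return {split: count / total for split, count in sizes.items()}


for task, sizes in SPLIT_SIZES.items():
    fractions = split_fractions(task)
    print(
        f"{task:6s} total={sum(sizes.values()):6d} "
        f"train={fractions['train']:.1%} "
        f"valid={fractions['valid']:.1%} "
        f"test={fractions['test']:.1%}"
    )
```

The per-task totals computed this way match the "total" rows in the listings, which is a quick consistency check on the copied numbers.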
As a Chinese NLI (natural language inference) dataset, this collection has been uploaded to Hugging Face datasets to make it easy for everyone to use.
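A minimal loading sketch, under stated assumptions: the third-party datasets library is installed, the Hub can be reached, and each of the five tasks is exposed as a configuration whose name matches the task name listed in this card. That last point is an assumption and should be checked against the dataset page. load_nli_zh is a hypothetical wrapper, not part of the library.

```python
# Task names as listed in this card; assumed (not confirmed here) to match
# the configuration names on the Hugging Face Hub.
TASKS = ("ATEC", "BQ", "LCQMC", "PAWSX", "STS-B")


def load_nli_zh(task: str):
    """Load one task of shibing624/nli_zh with the `datasets` library."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    # Imported lazily so the name check above works without the dependency.
    from datasets import load_dataset

    # Downloads from the Hub on first use, so network access is required.
    return load_dataset("shibing624/nli_zh", task)
```

Calling load_nli_zh("ATEC") should then return the splits for that task if the assumptions above hold; an unknown task name fails fast with ValueError before any download starts.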
Copyright of the datasets belongs to their original authors. Please respect the copyright of each original dataset when using it.
BQ: Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, Buzhou Tang. The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification. EMNLP 2018.
The original authors.
This dataset was developed as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, in the task of predicting truth conditions in a given context.
Systems that are successful at such a task may be more successful in modeling semantic representations.
For academic research use.
The BQ corpus is free to the public for academic research.
Thanks to @shibing624 for adding this dataset.