Dataset:

shibing624/sts-sohu2021

License:

cc-by-4.0

Source datasets:

https

Annotation creators:

shibing624

Language creators:

shibing624

Size:

size_categories:100K<n<20M

Computer processing:

Language:

Subtasks:

text-scoring semantic-similarity-scoring natural-language-inference

Tasks:

Sentence Similarity

Text Classification

Chinese

Dataset Card for sts-sohu2021

Dataset Summary

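Dataset from the 2021 Sohu Campus Text Matching Algorithm Competition.

Data source: https://www.biendata.xyz/competition/sohu_2021/data/

The data is split into two files, A and B, which use different matching criteria. Each file is further divided into three subsets: short-short, short-long, and long-long text matching. File A uses a looser criterion: two texts match if they are about the same topic. File B uses a stricter criterion: two texts match only if they describe the same event.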

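Data types:

type	Data type
dda	Short-short matching, type A
ddb	Short-short matching, type B
dca	Short-long matching, type A
dcb	Short-long matching, type B
cca	Long-long matching, type A
ccb	Long-long matching, type B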
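
The type codes in the table above appear to follow a simple pattern: the first two letters give the length pairing (dd, dc, cc) and the last letter gives the file, and therefore the matching criterion. A minimal sketch of decoding them is shown here; the helper name `describe_type` and the dictionary names are hypothetical and are not part of the dataset or any library.

```python
# Hypothetical helper: decodes the type codes listed in the table above.
LENGTH_PAIRS = {"dd": "short-short", "dc": "short-long", "cc": "long-long"}
CRITERIA = {"a": "same topic (looser)", "b": "same event (stricter)"}


def describe_type(code: str) -> dict:
    """Split a code such as 'dca' into its length pairing and matching criterion."""
    pair, criterion = code[:2], code[2:]
    if pair not in LENGTH_PAIRS or criterion not in CRITERIA:
        raise ValueError(f"unknown type code: {code!r}")
    return {
        "type": code,
        "lengths": LENGTH_PAIRS[pair],
        "criterion": CRITERIA[criterion],
    }


# All six codes, built from the two components.
ALL_TYPES = [pair + crit for pair in LENGTH_PAIRS for crit in CRITERIA]

for code in ALL_TYPES:
    print(describe_type(code))
```

Keeping the two components separate makes it easy to select, for example, every subset that uses the stricter B criterion.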

Supported Tasks and Leaderboards

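Supported Tasks: Chinese text matching, text similarity computation, and related tasks.

Results on Chinese matching tasks rarely appear in top-conference papers, so I list results from my own training: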

Leaderboard: NLI_zh leaderboard

Languages

All text in the dataset is Simplified Chinese.

Dataset Structure

Data Instances

An example of 'train' looks as follows.

# Type A, short-short sample
{
    "sentence1": "小艺的故事让爱回家2021年2月16日大年初五19：30带上你最亲爱的人与团团君相约《小艺的故事》直播间！",
    "sentence2": "香港代购了不起啊，宋点卷竟然在直播间“炫富”起来",
    "label": 0
}

# Type B, short-short sample
{
    "sentence1": "让很多网友好奇的是，张柏芝在一小时后也在社交平台发文：“给大家拜年啦。”还有网友猜测：谢霆锋的经纪人发文，张柏芝也发文，并且配图，似乎都在证实，谢霆锋依旧和王菲在一起，而张柏芝也有了新的恋人，并且生了孩子，两人也找到了各自的归宿，有了自己的幸福生活，让传言不攻自破。",
    "sentence2": "陈晓东谈旧爱张柏芝，一个口误暴露她的秘密，难怪谢霆锋会离开她", 
    "label": 0
}

label: 0 means no match, 1 means match.

Data Fields

The data fields are the same among all splits.

sentence1 : a string feature.
sentence2 : a string feature.
label : a classification label, with possible values including similarity (1), dissimilarity (0).
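
As a rough illustration of the three fields, the sketch below parses one JSONL line and checks that it has the documented shape. It uses only the Python standard library; the function name `parse_record` and the placeholder sentences are made up for this example and are not taken from the dataset.

```python
import json


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check the three documented fields."""
    record = json.loads(line)
    for field in ("sentence1", "sentence2"):
        if not isinstance(record.get(field), str):
            raise ValueError(f"{field} must be a string")
    label = record.get("label")
    # Reject booleans explicitly, since True == 1 in Python.
    if type(label) is not int or label not in (0, 1):
        raise ValueError("label must be 0 (no match) or 1 (match)")
    return record


# Placeholder line, not a real sample from the dataset.
line = '{"sentence1": "text a", "sentence2": "text b", "label": 0}'
record = parse_record(line)
print(record["sentence1"], record["sentence2"], record["label"])
```

Applying a check like this while reading each `.jsonl` file catches malformed rows before they reach training code.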

Data Splits

> wc -l *.jsonl
    11690 cca.jsonl
    11690 ccb.jsonl
    11592 dca.jsonl
    11593 dcb.jsonl
    11512 dda.jsonl
    11501 ddb.jsonl
    69578 total
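
The line counts above can be tallied per matching criterion with a few lines of Python. This is only a sketch: the numbers are copied from the `wc -l` listing, and the variable names are chosen for illustration.

```python
# Line counts copied from the `wc -l` listing above.
counts = {
    "cca": 11690,
    "ccb": 11690,
    "dca": 11592,
    "dcb": 11593,
    "dda": 11512,
    "ddb": 11501,
}

# The last letter of each type code identifies the file (A or B).
by_file = {"a": 0, "b": 0}
for type_code, n_lines in counts.items():
    by_file[type_code[-1]] += n_lines

total = sum(counts.values())
print(by_file)  # lines per matching criterion
print(total)    # 69578, matching the "total" row
```

The two criteria are almost evenly represented, with 34794 lines in the A files and 34784 in the B files.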

Curation Rationale

This is a Chinese NLI (natural language inference) dataset, uploaded to Hugging Face datasets to make it easy for everyone to use.

Who are the source language producers?

Copyright of the dataset belongs to the original authors. Please respect the original dataset's copyright when using it.

Who are the annotators?

The original authors.

Social Impact of Dataset

This dataset was developed as a benchmark for evaluating representational systems for text, especially including those induced by representation learning methods, in the task of predicting truth conditions in a given context.

Systems that are successful at such a task may be more successful in modeling semantic representations.

Licensing Information

For academic research use.

Contributions

shibing624 uploaded this dataset.

Author:

shibing624

Dataset size:

211.96 MB