Dataset:

stsb_mt_sv

Tasks:

Text classification

Sub-tasks:

text-scoring semantic-similarity-scoring

Languages:

Multilinguality:

monolingual

Size:

1K<n<10K

Language creators:

crowdsourced machine-generated

Annotation creators:

crowdsourced

Source datasets:

extended|other-sts-b

Preprint:

arxiv:2009.03116

License:

license:unknown

Dataset Card for Swedish Machine Translated STS-B

Dataset Summary

This dataset is a Swedish machine-translated version of the STS-B benchmark for semantic textual similarity.

Supported Tasks and Leaderboards

This dataset can be used to evaluate semantic textual similarity models for Swedish.
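As an illustration of what such an evaluation might look like, the sketch below computes the Pearson correlation between gold scores and model outputs, a metric commonly reported for STS-style benchmarks. The card itself does not prescribe a metric, so the choice of Pearson is an assumption, and the `gold` and `predicted` values are made-up numbers, not dataset contents.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical gold scores (0.0 to 5.0) and hypothetical model similarities.
gold = [4.2, 0.5, 3.0, 1.8]
predicted = [0.91, 0.12, 0.65, 0.40]

print(round(pearson(gold, predicted), 3))
```

Because correlation is invariant to linear rescaling, the model outputs do not need to be mapped onto the 0.0 to 5.0 range before scoring.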

Languages

The text in the dataset is in Swedish. The associated BCP-47 code is sv.

Dataset Structure

Data Instances

What a sample looks like:

{'score': '4.2',
 'sentence1': 'Undrar om jultomten kommer i år pga Corona..?',
 'sentence2': 'Jag undrar om jultomen kommer hit i år med tanke på covid-19',
}

Data Fields

score: a float representing the semantic similarity score, where 0.0 is the lowest score and 5.0 is the highest
sentence1: a string containing the first text
sentence2: a string containing the second text, whose meaning is compared with sentence1
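A minimal sketch of reading these fields, assuming each record is a plain dictionary like the sample shown under Data Instances. In that sample the score is stored as a string, so the sketch converts it to a float and rescales it to the range 0.0 to 1.0. The helper name `parse_example` and the rescaling step are illustrative choices, not part of the dataset.

```python
# Sample record copied from the "Data Instances" section of this card.
sample = {
    "score": "4.2",
    "sentence1": "Undrar om jultomten kommer i år pga Corona..?",
    "sentence2": "Jag undrar om jultomen kommer hit i år med tanke på covid-19",
}


def parse_example(example):
    """Return (sentence1, sentence2, normalized_score) for one record.

    The score is parsed as a float, checked against the documented
    0.0 to 5.0 range, and divided by 5.0 (hypothetical normalization).
    """
    score = float(example["score"])
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    return example["sentence1"], example["sentence2"], score / 5.0


first, second, normalized = parse_example(sample)
print(first)
print(second)
print(normalized)
```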

Data Splits

The data is split into training, validation, and test sets. The final split sizes are as follows:

Train	Valid	Test
5749	1500	1379
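The sketch below shows one way to confirm that a loaded copy matches the sizes in the table. It assumes the Hugging Face `datasets` library and the dataset identifier `stsb_mt_sv`; the split names `train`, `validation`, and `test` are also assumptions, since the table only gives the labels Train, Valid, and Test. The loader is wrapped in a function and imported lazily so the size check can be used without network access.

```python
# Split sizes as documented in the table above.
# The key names are assumed, not confirmed by the card.
EXPECTED_SPLIT_SIZES = {"train": 5749, "validation": 1500, "test": 1379}


def check_split_sizes(split_sizes):
    """Return the names of splits whose size differs from the documented one."""
    return [
        name
        for name, expected in EXPECTED_SPLIT_SIZES.items()
        if split_sizes.get(name) != expected
    ]


def load_split_sizes(dataset_id="stsb_mt_sv"):
    """Load the dataset and return a {split_name: row_count} mapping.

    Requires the `datasets` package and network access; the identifier
    may differ depending on where the dataset is hosted.
    """
    from datasets import load_dataset  # imported lazily on purpose

    dataset = load_dataset(dataset_id)
    return {name: len(split) for name, split in dataset.items()}


total = sum(EXPECTED_SPLIT_SIZES.values())
print(total)  # 8628 sentence pairs across all three splits
```

Calling `check_split_sizes(load_split_sizes())` would return an empty list if the loaded copy agrees with the table.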

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

The machine-translated version was put together by @timpal0l.

Licensing Information

[Needs More Information]

Citation Information

@article{isbister2020not,
  title={Why Not Simply Translate? A First Swedish Evaluation Benchmark for Semantic Similarity},
  author={Isbister, Tim and Sahlgren, Magnus},
  journal={arXiv preprint arXiv:2009.03116},
  year={2020}
}

Contributions

Thanks to @timpal0l for adding this dataset.

Author:

Anonymous

Dataset size:

10.1 KB