Dataset: stsb_mt_sv
Multilinguality: monolingual
Size: 1K<n<10K
Annotation creators: crowdsourced
Source datasets: extended|other-sts-b
Preprint: arxiv:2009.03116
This dataset is a Swedish machine-translated version of the STS-B dataset for semantic textual similarity.
This dataset can be used to evaluate semantic textual similarity models on Swedish text.
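As a rough illustration of such an evaluation, the sketch below computes Pearson and Spearman correlation between gold scores and model predictions using only the standard library. This card does not name an official metric; correlation is the usual choice for STS-B-style benchmarks and is an assumption here. The `gold` and `predicted` values are made up for the example and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical gold scores and model similarities, for illustration only.
gold = [4.2, 1.0, 3.5, 0.4]
predicted = [0.91, 0.20, 0.75, 0.35]

print("Pearson:", pearson(gold, predicted))
print("Spearman:", spearman(gold, predicted))
```

Spearman depends only on the ordering of the predictions, so it is unaffected by whether a model outputs cosine similarities or scores on the dataset's own scale.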
The text in the dataset is in Swedish. The associated BCP-47 code is sv.
What a sample looks like:
{'score': '4.2',
'sentence1': 'Undrar om jultomten kommer i år pga Corona..?',
'sentence2': 'Jag undrar om jultomen kommer hit i år med tanke på covid-19',
}
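Note that the score in the sample above is stored as a string. A minimal sketch of parsing one record follows. The helper name `parse_sample` is hypothetical, and the 0 to 5 score range is an assumption carried over from the original STS-B benchmark; this card does not state the range.

```python
# The record shown above, copied verbatim (the score is a string).
sample = {
    'score': '4.2',
    'sentence1': 'Undrar om jultomten kommer i år pga Corona..?',
    'sentence2': 'Jag undrar om jultomen kommer hit i år med tanke på covid-19',
}

# Assumed score range, taken from the original STS-B benchmark.
MIN_SCORE = 0.0
MAX_SCORE = 5.0


def parse_sample(record):
    """Return (sentence1, sentence2, score, normalized_score) for one record.

    Hypothetical helper: converts the string score to a float and also
    rescales it to [0, 1] so it can be compared with cosine similarities.
    """
    score = float(record['score'])
    if not MIN_SCORE <= score <= MAX_SCORE:
        raise ValueError(f"score {score} is outside the assumed range")
    normalized = (score - MIN_SCORE) / (MAX_SCORE - MIN_SCORE)
    return record['sentence1'], record['sentence2'], score, normalized


sentence1, sentence2, score, normalized = parse_sample(sample)
print(score, normalized)
```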
The data is split into training, validation, and test sets. The final split sizes are as follows:
| Train | Valid | Test |
|---|---|---|
| 5749 | 1500 | 1379 |
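The sizes in the table can serve as a sanity check after loading the data. The sketch below is one way to do that, under stated assumptions: the split keys `train`, `valid`, and `test` simply mirror the table headers and may differ from the names used by whatever loader you use, and `check_split_sizes` is a hypothetical helper, not part of any library.

```python
# Split sizes as listed in the table above.
EXPECTED_SIZES = {"train": 5749, "valid": 1500, "test": 1379}


def split_fractions(sizes):
    """Return each split's share of the total number of sentence pairs."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def check_split_sizes(splits, expected=EXPECTED_SIZES):
    """Raise ValueError if any split is missing or has an unexpected length.

    `splits` maps a split name to any sized collection of records.
    """
    for name, expected_count in expected.items():
        if name not in splits:
            raise ValueError(f"missing split: {name}")
        actual_count = len(splits[name])
        if actual_count != expected_count:
            raise ValueError(
                f"{name}: expected {expected_count} rows, got {actual_count}"
            )


total_pairs = sum(EXPECTED_SIZES.values())
print("total pairs:", total_pairs)
for name, fraction in split_fractions(EXPECTED_SIZES).items():
    print(f"{name}: {fraction:.1%}")

# Stand-in collections with the right lengths, for illustration only.
check_split_sizes({name: range(n) for name, n in EXPECTED_SIZES.items()})
```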
The machine-translated version was put together by @timpal0l.
@article{isbister2020not,
title={Why Not Simply Translate? A First Swedish Evaluation Benchmark for Semantic Similarity},
author={Isbister, Tim and Sahlgren, Magnus},
journal={arXiv preprint arXiv:2009.03116},
year={2020}
}
Thanks to @timpal0l for adding this dataset.