Dataset: stsb_mt_sv
Task: text classification
Language: sv
Multilinguality: monolingual
Size: 1K<n<10K
Annotation creators: crowdsourced
Source datasets: extended|other-sts-b
Preprint: arxiv:2009.03116
License: unknown
This dataset is a Swedish machine-translated version of STS-B for semantic textual similarity.
This dataset can be used to evaluate semantic textual similarity models on Swedish text.
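Evaluation on similarity data of this kind is often reported as a correlation between model scores and the gold scores. The following is a minimal sketch, assuming Spearman rank correlation as the metric (the card itself does not name one). The `gold` and `predicted` values are hypothetical and only illustrate the computation.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(gold, predicted):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(gold), average_ranks(predicted))


# Hypothetical gold scores and model predictions, for illustration only.
gold = [4.2, 0.5, 2.8, 5.0]
predicted = [0.81, 0.10, 0.65, 0.90]
print(spearman(gold, predicted))
```

Because only the ranking matters for this metric, the model's scores do not need to be on the same scale as the gold scores.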
The text in the dataset is in Swedish. The associated BCP-47 code is sv.
What a sample looks like:
{'score': '4.2', 'sentence1': 'Undrar om jultomten kommer i år pga Corona..?', 'sentence2': 'Jag undrar om jultomen kommer hit i år med tanke på covid-19', }
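Note that the score in the sample above is a string rather than a number. A minimal sketch of converting such a record for downstream use follows. The helper `to_pair` is hypothetical, and the 0 to 5 score range is an assumption carried over from the original STS-B annotation scale, not something this card states.

```python
# The sample record shown above, copied verbatim.
sample = {
    'score': '4.2',
    'sentence1': 'Undrar om jultomten kommer i år pga Corona..?',
    'sentence2': 'Jag undrar om jultomen kommer hit i år med tanke på covid-19',
}


def to_pair(record, max_score=5.0):
    """Hypothetical helper: return (sentence1, sentence2, score scaled to 0..1).

    max_score=5.0 assumes the STS-B annotation scale of 0 to 5.
    """
    score = float(record['score'])  # the score field is stored as a string
    return record['sentence1'], record['sentence2'], score / max_score


sentence1, sentence2, normalized = to_pair(sample)
print(sentence1)
print(sentence2)
print(normalized)
```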
The data is split into training, validation, and test sets. The final split sizes are as follows:
| Train | Valid | Test |
|---|---|---|
| 5749 | 1500 | 1379 |
Who are the source language producers? [Needs More Information]
Who are the annotators? [Needs More Information]
The machine-translated version was put together by @timpal0l.
@article{isbister2020not,
  title={Why Not Simply Translate? A First Swedish Evaluation Benchmark for Semantic Similarity},
  author={Isbister, Tim and Sahlgren, Magnus},
  journal={arXiv preprint arXiv:2009.03116},
  year={2020}
}
Thanks to @timpal0l for adding this dataset.