Dataset:

stsb_mt_sv

Dataset Card for Swedish Machine Translated STS-B

Dataset Summary

This dataset is a Swedish machine-translated version of STS-B, a benchmark for semantic textual similarity.

Supported Tasks and Leaderboards

This dataset can be used to evaluate semantic textual similarity models on Swedish text.
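
As a rough illustration of such an evaluation, the sketch below compares model similarity scores against gold scores using Spearman rank correlation. The choice of metric is an assumption (the card does not name one), and the `gold` and `predicted` values are made-up numbers, not taken from the dataset.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical gold scores (0.0-5.0 scale) and model outputs (e.g. cosine similarities).
gold = [4.2, 0.5, 3.0, 1.8, 5.0]
predicted = [0.91, 0.10, 0.55, 0.62, 0.97]

print(round(spearman(gold, predicted), 3))
```

Because rank correlation only depends on ordering, the model outputs do not need to be on the same 0.0 to 5.0 scale as the gold scores.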

Languages

The text in the dataset is in Swedish. The associated BCP-47 code is sv.

Dataset Structure

Data Instances

What a sample looks like:

{'score': '4.2',
 'sentence1': 'Undrar om jultomten kommer i år pga Corona..?',
 'sentence2': 'Jag undrar om jultomen kommer hit i år med tanke på covid-19',
}

Data Fields

  • score : a float representing the semantic similarity score, where 0.0 is the lowest score and 5.0 is the highest
  • sentence1 : a string containing the first text
  • sentence2 : a string containing the second text, whose meaning is compared with sentence1
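
A minimal sketch of reading one record according to the fields above. The helper name `parse_record` is hypothetical. It converts the score with `float()` because the sample in this card shows the score quoted as a string, and it checks the value against the documented range.

```python
def parse_record(record):
    """Return (sentence1, sentence2, score) with score as a float in [0.0, 5.0]."""
    # float() accepts both a float and a numeric string such as '4.2'.
    score = float(record["score"])
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score {score} is outside the documented range 0.0-5.0")
    return record["sentence1"], record["sentence2"], score


# The sample shown under "Data Instances".
sample = {
    "score": "4.2",
    "sentence1": "Undrar om jultomten kommer i år pga Corona..?",
    "sentence2": "Jag undrar om jultomen kommer hit i år med tanke på covid-19",
}

sentence1, sentence2, score = parse_record(sample)
print(score)
```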

Data Splits

The data is split into training, validation and test sets. The final split sizes are as follows:

Train   Valid   Test
5749    1500    1379
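
For reference, the share of each split can be derived from the sizes above with a few lines of arithmetic. This is only a convenience sketch, and the dictionary keys simply mirror the column names in the table.

```python
# Split sizes as listed in the table above.
split_sizes = {"train": 5749, "valid": 1500, "test": 1379}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name}: {size} examples ({shares[name]:.1%} of {total})")
```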

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

The machine-translated version was put together by @timpal0l.

Licensing Information

[Needs More Information]

Citation Information

@article{isbister2020not,
  title={Why Not Simply Translate? A First Swedish Evaluation Benchmark for Semantic Similarity},
  author={Isbister, Tim and Sahlgren, Magnus},
  journal={arXiv preprint arXiv:2009.03116},
  year={2020}
}

Contributions

Thanks to @timpal0l for adding this dataset.