Dataset Card for STSb Multi MT

Dataset Summary

STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. The selection of datasets includes text from image captions, news headlines and user forums. (source)

This dataset contains machine translations of the STSbenchmark dataset into several languages, together with the English original. The translations were produced with deepl.com. The dataset can be used to train sentence embedding models such as T-Systems-onsite/cross-en-de-roberta-sentence-transformer.

Examples of Use

Load the German dev split:

from datasets import load_dataset
dataset = load_dataset("stsb_multi_mt", name="de", split="dev")

Load the English train split:

from datasets import load_dataset
dataset = load_dataset("stsb_multi_mt", name="en", split="train")

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Available languages are: de, en, es, fr, it, nl, pl, pt, ru, zh
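The language codes above and the split names used in the loading examples can be checked before a download is attempted. The following is only a minimal sketch: `check_config` is a hypothetical helper, not part of the `datasets` library, and the returned keys are chosen to match the `load_dataset` arguments used in the examples above.

```python
# Language codes listed in this card and the split names it documents.
LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "pl", "pt", "ru", "zh")
SPLITS = ("train", "dev", "test")


def check_config(language: str, split: str) -> dict:
    """Hypothetical helper: validate a language/split pair and
    return keyword arguments suitable for load_dataset."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language {language!r}; expected one of {LANGUAGES}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    return {"path": "stsb_multi_mt", "name": language, "split": split}


kwargs = check_config("de", "dev")
```

With the `datasets` library installed, the result can be passed on as `load_dataset(**kwargs)`.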

Dataset Structure

Data Instances

This dataset provides pairs of sentences and a score of their similarity.

| score | sentence 1 | sentence 2 | explanation |
|-------|------------|------------|-------------|
| 5 | The bird is bathing in the sink. | Birdie is washing itself in the water basin. | The two sentences are completely equivalent, as they mean the same thing. |
| 4 | Two boys on a couch are playing video games. | Two boys are playing a video game. | The two sentences are mostly equivalent, but some unimportant details differ. |
| 3 | John said he is considered a witness but not a suspect. | “He is not a suspect anymore.” John said. | The two sentences are roughly equivalent, but some important information differs/missing. |
| 2 | They flew out of the nest in groups. | They flew into the nest together. | The two sentences are not equivalent, but share some details. |
| 1 | The woman is playing the violin. | The young lady enjoys listening to the guitar. | The two sentences are not equivalent, but are on the same topic. |
| 0 | The black dog is running through the snow. | A race car driver is driving his car through the mud. | The two sentences are completely dissimilar. |
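Because the stored scores are floats rather than integers, it can help to map a score back onto the rubric above when inspecting samples. The sketch below is illustrative only: `describe_score` is a hypothetical helper, and rounding to the nearest integer level (halves rounded up) is an assumption, not something the dataset defines.

```python
import math

# Short forms of the rubric explanations, keyed by integer score.
RUBRIC = {
    5: "completely equivalent",
    4: "mostly equivalent, some unimportant details differ",
    3: "roughly equivalent, some important information differs or is missing",
    2: "not equivalent, but share some details",
    1: "not equivalent, but on the same topic",
    0: "completely dissimilar",
}


def describe_score(score: float) -> str:
    """Hypothetical helper: return the rubric text for the nearest integer level."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score {score} is outside the range [0.0, 5.0]")
    # Round half up explicitly; Python's built-in round() rounds halves to even.
    nearest = math.floor(score + 0.5)
    return RUBRIC[nearest]


print(describe_score(3.8))
```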

An example:

{
    "sentence1": "A man is playing a large flute.",
    "sentence2": "A man is playing a flute.",
    "similarity_score": 3.8
}

Data Fields

  • sentence1: The first sentence, as a str.
  • sentence2: The second sentence, as a str.
  • similarity_score: The similarity score, as a float between 0.0 and 5.0 inclusive.
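The field description above can be turned into a small per-record check. The sketch below also rescales the score to the range 0 to 1, which is a common convention when training sentence embeddings against a cosine similarity target; that rescaling, the `label` key, and the `normalize_record` helper itself are assumptions for illustration and are not part of the dataset.

```python
def normalize_record(record: dict) -> dict:
    """Hypothetical helper: validate one record and rescale its score to [0, 1]."""
    sentence1 = record["sentence1"]
    sentence2 = record["sentence2"]
    if not isinstance(sentence1, str) or not isinstance(sentence2, str):
        raise TypeError("sentence1 and sentence2 must both be str")
    score = float(record["similarity_score"])
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"similarity_score {score} is outside [0.0, 5.0]")
    # "label" is an illustrative key name, not a field of the dataset.
    return {"sentence1": sentence1, "sentence2": sentence2, "label": score / 5.0}


example = {
    "sentence1": "A man is playing a large flute.",
    "sentence2": "A man is playing a flute.",
    "similarity_score": 3.8,
}
normalized = normalize_record(example)
```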

Data Splits

  • train with 5749 samples
  • dev with 1500 samples
  • test with 1379 samples
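The split sizes listed above can serve as a quick sanity check after loading. The following sketch only does the arithmetic; `check_split_size` is a hypothetical helper, and it assumes the number of loaded rows is obtained separately, for example with `len(dataset)`.

```python
# Split sizes as documented in this card.
SPLIT_SIZES = {"train": 5749, "dev": 1500, "test": 1379}

total = sum(SPLIT_SIZES.values())  # 8628 sentence pairs per language
shares = {name: size / total for name, size in SPLIT_SIZES.items()}


def check_split_size(split: str, num_rows: int) -> bool:
    """Hypothetical helper: compare a loaded row count with the documented size."""
    return SPLIT_SIZES[split] == num_rows


for name, share in shares.items():
    print(f"{name}: {SPLIT_SIZES[name]} samples ({share:.1%} of {total})")
```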

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

See the LICENSE file and the download page of the original dataset.

Citation Information

@InProceedings{huggingface:dataset:stsb_multi_mt,
title = {Machine translated multilingual STS benchmark dataset.},
author={Philip May},
year={2021},
url={https://github.com/PhilipMay/stsb-multi-mt}
}

Contributions

Thanks to @PhilipMay for adding this dataset.