Dataset:

deutsche-telekom/ger-backtrans-paraphrase

German Backtranslated Paraphrase Dataset

This is a dataset of more than 21 million German paraphrases: text pairs that have the same meaning but are expressed with different words. The paraphrases originate from several parallel German/English text corpora; the English texts were machine-translated back into German to obtain the paraphrases.

This dataset can be used, for example, to train semantic text embeddings. One way to do this is with SentenceTransformers and the MultipleNegativesRankingLoss.
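
Below is a minimal training sketch using the SentenceTransformers fit API. The base model deepset/gbert-base is only a placeholder choice, and df is assumed to be the (filtered) dataset loaded as a Pandas DataFrame as described further below:

# pip install sentence-transformers
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# placeholder base model; any German or multilingual model could be used
model = SentenceTransformer("deepset/gbert-base")

# one InputExample per paraphrase pair (df as in the Pandas section below)
train_examples = [
    InputExample(texts=[de, en_de]) for de, en_de in zip(df["de"], df["en_de"])
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# MultipleNegativesRankingLoss treats the other pairs in a batch as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)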

Maintainers

This dataset is open-sourced by Philip May and maintained by the One Conversation team of Deutsche Telekom AG.

Our pre-processing

Apart from the back translation, we have added more columns (for details see below). We have carried out the following pre-processing and filtering:

  • We dropped text pairs where one text was longer than 499 characters.
  • In the GlobalVoices v2018q4 texts we have removed the " · Global Voices" suffix.

Your post-processing

You probably don't want to use the dataset as it is, but filter it further. This is what the additional columns of the dataset are for. For us, it has proven useful to delete pairs of sentences with the following properties (a Pandas sketch follows after the list):

  • min_char_len less than 15
  • jaccard_similarity greater than 0.3
  • de_token_count greater than 30
  • en_de_token_count greater than 30
  • cos_sim less than 0.85
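
A possible Pandas filter implementing these thresholds could look like this (a sketch assuming the dataset has been loaded into a DataFrame df, as shown in the Pandas section below):

# keep only the rows that pass all thresholds listed above
df = df[
    (df["min_char_len"] >= 15)
    & (df["jaccard_similarity"] <= 0.3)
    & (df["de_token_count"] <= 30)
    & (df["en_de_token_count"] <= 30)
    & (df["cos_sim"] >= 0.85)
]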

Columns description

The meaning of the columns follows from their names and from the filtering advice above:

  • de: the German source text
  • en_de: the German back translation (the paraphrase of de)
  • min_char_len: the character length of the shorter of the two texts
  • jaccard_similarity: the Jaccard similarity of the token sets of both texts (see below)
  • de_token_count and en_de_token_count: the token counts of the two texts
  • cos_sim: the cosine similarity of the embeddings of both texts

Anomalies in the texts

It is noticeable that the OpenSubtitles texts contain odd dash prefixes. For example:

- Hast du was draufgetan?

To remove them you could apply this function:

import re

def clean_text(text):
    # strip leading and trailing dashes and whitespace
    text = re.sub(r"^[-\s]*", "", text)
    text = re.sub(r"[-\s]*$", "", text)
    return text

df["de"] = df["de"].apply(clean_text)
df["en_de"] = df["en_de"].apply(clean_text)

Parallel text corpora used

Corpus name & link       Number of paraphrases
OpenSubtitles                       18,764,810
WikiMatrix v1                        1,569,231
Tatoeba v2022-03-03                    313,105
TED2020 v1                             289,374
News-Commentary v16                    285,722
GlobalVoices v2018q4                    70,547
sum                                 21,292,789

Back translation

We performed the back translation from English to German with the help of Fairseq, using the transformer.wmt19.en-de model:

import torch

# load the WMT19 English-to-German ensemble via torch.hub
en2de = torch.hub.load(
    "pytorch/fairseq",
    "transformer.wmt19.en-de",
    checkpoint_file="model1.pt:model2.pt:model3.pt:model4.pt",
    tokenizer="moses",
    bpe="fastbpe",
)
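
The returned hub interface can then be used to translate English text. A small illustration (the German output shown is only indicative):

# translate one sentence; generation settings are left at their defaults
en2de.translate("This is a paraphrase.")
# -> a German translation such as: "Das ist eine Paraphrase."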

How the Jaccard similarity was calculated

To calculate the Jaccard similarity coefficient, we use the SoMaJo tokenizer to split the texts into tokens. We then lower() the tokens so that upper and lower case no longer make a difference. Below you can find a code snippet with the details:

from somajo import SoMaJo

LANGUAGE = "de_CMC"
somajo_tokenizer = SoMaJo(LANGUAGE)

def get_token_set(text, somajo_tokenizer):
    # tokenize and lower-case so that case makes no difference
    sentences = somajo_tokenizer.tokenize_text([text])
    tokens = [token.text.lower() for sentence in sentences for token in sentence]
    return set(tokens)

def jaccard_similarity(text1, text2, somajo_tokenizer):
    token_set1 = get_token_set(text1, somajo_tokenizer=somajo_tokenizer)
    token_set2 = get_token_set(text2, somajo_tokenizer=somajo_tokenizer)
    intersection = token_set1.intersection(token_set2)
    union = token_set1.union(token_set2)
    # |intersection| / |union| of the two token sets
    return len(intersection) / len(union)
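
For illustration, here is a made-up pair based on the OpenSubtitles example from above (the exact value depends on SoMaJo's tokenization):

jaccard_similarity(
    "Hast du was draufgetan?",
    "Hast du da was draufgetan?",
    somajo_tokenizer=somajo_tokenizer,
)
# 5 shared tokens out of 6 distinct ones -> about 0.83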

Load this dataset

With Hugging Face Datasets

# pip install datasets
from datasets import load_dataset

dataset = load_dataset("deutsche-telekom/ger-backtrans-paraphrase")
train_dataset = dataset["train"]

With Pandas

If you want to download the CSV file and then load it with Pandas, you can do it like this:

import pandas as pd

df = pd.read_csv("train.csv")
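
If the file is not on disk yet, one way to fetch it is with huggingface_hub (assuming the file is stored as train.csv in the dataset repository):

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# the filename is an assumption; adjust it to the actual file name in the repo
csv_path = hf_hub_download(
    repo_id="deutsche-telekom/ger-backtrans-paraphrase",
    filename="train.csv",
    repo_type="dataset",
)
df = pd.read_csv(csv_path)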

Citations, Acknowledgements and Licenses

OpenSubtitles

WikiMatrix v1

Tatoeba v2022-03-03

TED2020 v1

News-Commentary v16

  • citation: J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
  • license: no special license has been provided at OPUS for this dataset

GlobalVoices v2018q4

  • citation: J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
  • license: no special license has been provided at OPUS for this dataset

Citation

@misc{ger-backtrans-paraphrase,
  title={Deutsche-Telekom/ger-backtrans-paraphrase - dataset at Hugging Face},
  url={https://huggingface.co/datasets/deutsche-telekom/ger-backtrans-paraphrase},
  year={2022},
  author={May, Philip}
}

Licensing

Copyright (c) 2022 Philip May, Deutsche Telekom AG

This work is licensed under CC-BY-SA 4.0.