Dataset: deutsche-telekom/ger-backtrans-paraphrase
This is a dataset of more than 21 million German paraphrases: text pairs that have the same meaning but are expressed with different words. The sources of the paraphrases are various parallel German/English text corpora. The English texts were machine-translated back into German to obtain the paraphrases.
This dataset can be used, for example, to train semantic text embeddings, e.g. with SentenceTransformers and the MultipleNegativesRankingLoss (see the sketch below).
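A minimal sketch of such a training setup, assuming the classic SentenceTransformers fit API; the base model deepset/gbert-base and the hard-coded example pairs are placeholder assumptions, and in practice you would build the InputExamples from the dataset's text pairs:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# placeholder base model; any German or multilingual encoder would do
model = SentenceTransformer("deepset/gbert-base")

# each paraphrase pair becomes one (anchor, positive) training example
train_examples = [
    InputExample(texts=["Das ist ein Beispiel.", "Dies ist ein Beispiel."]),
    InputExample(texts=["Er kam zu spät.", "Er ist zu spät gekommen."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# MultipleNegativesRankingLoss treats the other pairs in a batch as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```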
This dataset is open-sourced by Philip May and maintained by the One Conversation team of Deutsche Telekom AG.
Apart from the back translation, we have added more columns (for details see below) and carried out several pre-processing and filtering steps.
You probably don't want to use the dataset as it is, but filter it further; this is what the additional columns of the dataset are for. For us it has proven useful to delete certain pairs of sentences based on these columns, as shown in the sketch below.
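As an illustration only, such a filter could look like the following Pandas sketch; both the thresholds and the column names (jaccard_similarity, min_char_len) are assumptions for the sake of the example, not the maintainers' recommendation:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# example thresholds, not a recommendation; tune them for your use case
filtered = df[
    (df["jaccard_similarity"] < 0.3)  # drop near-duplicate pairs
    & (df["min_char_len"] >= 15)      # drop very short texts
]
```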
It is noticeable that the OpenSubtitles texts have strange dash prefixes, which look like this:

```
- Hast du was draufgetan?
```
To remove them, you could apply this function:

```python
import re

def clean_text(text):
    # strip leading dashes and whitespace
    text = re.sub(r"^[-\s]*", "", text)
    # strip trailing dashes and whitespace
    text = re.sub(r"[-\s]*$", "", text)
    return text

df["de"] = df["de"].apply(clean_text)
df["en_de"] = df["en_de"].apply(clean_text)
```
| Corpus name & link | Number of paraphrases |
|---|---|
| OpenSubtitles | 18,764,810 |
| WikiMatrix v1 | 1,569,231 |
| Tatoeba v2022-03-03 | 313,105 |
| TED2020 v1 | 289,374 |
| News-Commentary v16 | 285,722 |
| GlobalVoices v2018q4 | 70,547 |
| sum | 21,292,789 |
We have made the back translation from English to German with the help of Fairseq. We used the transformer.wmt19.en-de model for this purpose:
```python
import torch

en2de = torch.hub.load(
    "pytorch/fairseq",
    "transformer.wmt19.en-de",
    checkpoint_file="model1.pt:model2.pt:model3.pt:model4.pt",
    tokenizer="moses",
    bpe="fastbpe",
)
```
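The returned hub object exposes a translate method, so a single back translation looks like this (the beam size is just an example value):

```python
# back-translate one English sentence into German
german = en2de.translate("Machine learning is great!", beam=5)
print(german)
```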
To calculate the Jaccard similarity coefficient, we use the SoMaJo tokenizer to split the texts into tokens. We then lower() the tokens so that upper and lower case no longer make a difference. Below you can find a code snippet with the details:
```python
from somajo import SoMaJo

LANGUAGE = "de_CMC"
somajo_tokenizer = SoMaJo(LANGUAGE)

def get_token_set(text, somajo_tokenizer):
    # tokenize the text and lowercase every token
    sentences = somajo_tokenizer.tokenize_text([text])
    tokens = [t.text.lower() for sentence in sentences for t in sentence]
    return set(tokens)

def jaccard_similarity(text1, text2, somajo_tokenizer):
    token_set1 = get_token_set(text1, somajo_tokenizer=somajo_tokenizer)
    token_set2 = get_token_set(text2, somajo_tokenizer=somajo_tokenizer)
    # |intersection| / |union| of the two token sets
    intersection = token_set1.intersection(token_set2)
    union = token_set1.union(token_set2)
    return float(len(intersection)) / len(union)
```
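For instance, calling it on two made-up paraphrases (just to show the interface):

```python
score = jaccard_similarity(
    "Hast du was draufgetan?",
    "Hast du da etwas draufgetan?",
    somajo_tokenizer=somajo_tokenizer,
)
print(score)  # 1.0 would mean identical token sets
```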
You can load the dataset with Hugging Face Datasets:

```python
# pip install datasets
from datasets import load_dataset

dataset = load_dataset("deutsche-telekom/ger-backtrans-paraphrase")
train_dataset = dataset["train"]
```
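Each row then contains the German original and its back-translated paraphrase; the de and en_de column names below follow the cleaning snippet above:

```python
# print one paraphrase pair
example = train_dataset[0]
print(example["de"], "|", example["en_de"])
```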
If you want to download the CSV file and then load it with Pandas, you can do it like this:
```python
import pandas as pd

df = pd.read_csv("train.csv")
```
If you use this dataset, you can cite it as follows:

```bibtex
@misc{ger-backtrans-paraphrase,
  title={Deutsche-Telekom/ger-backtrans-paraphrase - dataset at Hugging Face},
  url={https://huggingface.co/datasets/deutsche-telekom/ger-backtrans-paraphrase},
  year={2022},
  author={May, Philip}
}
```
Copyright (c) 2022 Philip May, Deutsche Telekom AG

This work is licensed under CC-BY-SA 4.0.