Dataset: deutsche-telekom/ger-backtrans-paraphrase
This is a dataset of more than 21 million German paraphrases: text pairs that have the same meaning but are expressed with different words. The sources of the paraphrases are various parallel German/English text corpora. The English texts were machine-translated back into German to obtain the paraphrases.
This dataset can be used, for example, to train semantic text embeddings, e.g. with SentenceTransformers and the MultipleNegativesRankingLoss (see the sketch below).
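A minimal sketch of such a training setup, assuming the classic SentenceTransformers fit API; the base model deepset/gbert-base and the hard-coded example pairs are placeholder assumptions, and in practice you would build the InputExamples from the dataset's text pairs:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# placeholder base model; any German or multilingual encoder would do
model = SentenceTransformer("deepset/gbert-base")

# each paraphrase pair becomes one (anchor, positive) training example
train_examples = [
    InputExample(texts=["Das ist ein Beispiel.", "Dies ist ein Beispiel."]),
    InputExample(texts=["Er kam zu spät.", "Er ist zu spät gekommen."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# MultipleNegativesRankingLoss treats the other pairs in a batch as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```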
This dataset is open-sourced by Philip May and maintained by the One Conversation team of Deutsche Telekom AG.
Apart from the back translation, we have added more columns (for details see below) and carried out several pre-processing and filtering steps.
You probably don't want to use the dataset as it is, but filter it further; this is what the additional columns of the dataset are for. For us it has proven useful to delete certain pairs of sentences based on these columns, as shown in the sketch below.
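As an illustration only, such a filter could look like the following Pandas sketch; both the thresholds and the column names (jaccard_similarity, min_char_len) are assumptions for the sake of the example, not the maintainers' recommendation:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# example thresholds, not a recommendation; tune them for your use case
filtered = df[
    (df["jaccard_similarity"] < 0.3)  # drop near-duplicate pairs
    & (df["min_char_len"] >= 15)      # drop very short texts
]
```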
It is noticeable that the OpenSubtitles texts have strange dash prefixes, which look like this:

```
- Hast du was draufgetan?
```
To remove them, you could apply this function:

```python
import re

def clean_text(text):
    # strip leading dashes and whitespace
    text = re.sub(r"^[-\s]*", "", text)
    # strip trailing dashes and whitespace
    text = re.sub(r"[-\s]*$", "", text)
    return text

df["de"] = df["de"].apply(clean_text)
df["en_de"] = df["en_de"].apply(clean_text)
```
| Corpus name & link | Number of paraphrases |
|---|---|
| OpenSubtitles | 18,764,810 |
| WikiMatrix v1 | 1,569,231 |
| Tatoeba v2022-03-03 | 313,105 |
| TED2020 v1 | 289,374 |
| News-Commentary v16 | 285,722 |
| GlobalVoices v2018q4 | 70,547 |
| sum | 21,292,789 |
We have made the back translation from English to German with the help of Fairseq. We used the transformer.wmt19.en-de model for this purpose:
```python
import torch

en2de = torch.hub.load(
    "pytorch/fairseq",
    "transformer.wmt19.en-de",
    checkpoint_file="model1.pt:model2.pt:model3.pt:model4.pt",
    tokenizer="moses",
    bpe="fastbpe",
)
```
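The returned hub object exposes a translate method, so a single back translation looks like this (the beam size is just an example value):

```python
# back-translate one English sentence into German
german = en2de.translate("Machine learning is great!", beam=5)
print(german)
```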
To calculate the Jaccard similarity coefficient, we use the SoMaJo tokenizer to split the texts into tokens. We then lower() the tokens so that upper and lower case no longer make a difference. Below you can find a code snippet with the details:
```python
from somajo import SoMaJo

LANGUAGE = "de_CMC"
somajo_tokenizer = SoMaJo(LANGUAGE)

def get_token_set(text, somajo_tokenizer):
    # tokenize the text and lowercase every token
    sentences = somajo_tokenizer.tokenize_text([text])
    tokens = [t.text.lower() for sentence in sentences for t in sentence]
    return set(tokens)

def jaccard_similarity(text1, text2, somajo_tokenizer):
    token_set1 = get_token_set(text1, somajo_tokenizer=somajo_tokenizer)
    token_set2 = get_token_set(text2, somajo_tokenizer=somajo_tokenizer)
    # |intersection| / |union| of the two token sets
    intersection = token_set1.intersection(token_set2)
    union = token_set1.union(token_set2)
    return float(len(intersection)) / len(union)
```
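For instance, calling it on two made-up paraphrases (just to show the interface):

```python
score = jaccard_similarity(
    "Hast du was draufgetan?",
    "Hast du da etwas draufgetan?",
    somajo_tokenizer=somajo_tokenizer,
)
print(score)  # 1.0 would mean identical token sets
```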
You can load the dataset with Hugging Face Datasets:

```python
# pip install datasets
from datasets import load_dataset

dataset = load_dataset("deutsche-telekom/ger-backtrans-paraphrase")
train_dataset = dataset["train"]
```
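Each row then contains the German original and its back-translated paraphrase; the de and en_de column names below follow the cleaning snippet above:

```python
# print one paraphrase pair
example = train_dataset[0]
print(example["de"], "|", example["en_de"])
```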
If you want to download the CSV file and then load it with Pandas, you can do it like this:
```python
import pandas as pd

df = pd.read_csv("train.csv")
```
If you use this dataset, you can cite it as follows:

```bibtex
@misc{ger-backtrans-paraphrase,
  title={Deutsche-Telekom/ger-backtrans-paraphrase - dataset at Hugging Face},
  url={https://huggingface.co/datasets/deutsche-telekom/ger-backtrans-paraphrase},
  year={2022},
  author={May, Philip}
}
```
Copyright (c) 2022 Philip May, Deutsche Telekom AG

This work is licensed under CC-BY-SA 4.0.