Dataset: cointegrated/ru-paraphrase-NMT-Leipzig
Tasks: text generation
Languages: ru
Multilinguality: translation
Size: 100K<n<1M
Language creators: machine-generated
Annotation creators: no-annotation
Source datasets: extended|other
License: cc-by-4.0

The dataset contains 1 million Russian sentences and their automatically generated paraphrases.
It was created by David Dale ( @cointegrated ) by translating the rus-ru_web-public_2019_1M corpus from the Leipzig collection into English and back into Russian. A fraction of the resulting paraphrases are invalid, and should be filtered out.
The blogpost "Перефразирование русских текстов: корпуса, модели, метрики" provides a detailed description of the dataset and its properties.
The dataset can be loaded with the following code:
    import datasets

    data = datasets.load_dataset(
        'cointegrated/ru-paraphrase-NMT-Leipzig',
        data_files={"train": "train.csv", "val": "val.csv", "test": "test.csv"},
    )
Printing data should give output like:

    DatasetDict({
        train: Dataset({
            features: ['idx', 'original', 'en', 'ru', 'chrf_sim', 'labse_sim'],
            num_rows: 980000
        })
        val: Dataset({
            features: ['idx', 'original', 'en', 'ru', 'chrf_sim', 'labse_sim'],
            num_rows: 10000
        })
        test: Dataset({
            features: ['idx', 'original', 'en', 'ru', 'chrf_sim', 'labse_sim'],
            num_rows: 10000
        })
    })
The dataset can be used to train and validate models for paraphrase generation or (if negative sampling is used) for paraphrase detection.
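The negative sampling mentioned above could be sketched as follows. This is only an illustration under assumptions: the helper name add_negative_pairs and the strategy of pairing each sentence with the paraphrase of a different, randomly chosen row are hypothetical and are not described in the dataset card. Only the field names original and ru come from the dataset.

```python
import random


def add_negative_pairs(rows, seed=0):
    """Build (text_a, text_b, label) triples for paraphrase detection.

    Hypothetical helper: `rows` is a list of dicts with the dataset's
    `original` and `ru` fields and must contain at least two rows.
    """
    rng = random.Random(seed)
    pairs = []
    for i, row in enumerate(rows):
        # Positive example: a sentence and its own paraphrase.
        pairs.append((row["original"], row["ru"], 1))
        # Negative example: the same sentence with the paraphrase of
        # another row, drawn uniformly from all rows except row i.
        j = rng.randrange(len(rows) - 1)
        if j >= i:
            j += 1  # shift past index i so a row is never paired with itself
        pairs.append((row["original"], rows[j]["ru"], 0))
    return pairs


# Toy rows standing in for real dataset records.
toy_rows = [
    {"original": "sentence A", "ru": "paraphrase A"},
    {"original": "sentence B", "ru": "paraphrase B"},
    {"original": "sentence C", "ru": "paraphrase C"},
]
toy_pairs = add_negative_pairs(toy_rows)
```

Random negatives of this kind tend to be easy for a classifier to reject, so harder negatives (for example, lexically similar but unrelated sentences) may be worth adding in practice.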
Russian (main), English (auxiliary).
Data instances look like
    {
        "labse_sim": 0.93502015,
        "chrf_sim": 0.4946451012684782,
        "idx": 646422,
        "ru": "О перспективах развития новых медиа-технологий в РФ расскажут на медиафоруме Енисея.",
        "original": "Перспективы развития новых медиатехнологий в Российской Федерации обсудят участники медиафорума «Енисей.",
        "en": "Prospects for the development of new media technologies in the Russian Federation will be discussed at the Yenisey Media Forum."
    }
Here, original is the original sentence, en is its English translation, and ru is the machine-generated paraphrase.
Train – 980K, validation – 10K, test – 10K. The splits were generated randomly.
There are other Russian paraphrase corpora, but they have major drawbacks:
The current corpus is generated with a dual objective: the paraphrases should be semantically as close as possible to the original sentences, while being lexically different from them. Back-translation with restricted vocabulary seems to achieve this goal often enough.
The source data is the rus-ru_web-public_2019_1M corpus from the Leipzig collection, used as is.
The process of its creation is described in this paper:
D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the 8th International Language Resources and Evaluation (LREC'12), 2012.
Automatic paraphrasing
The paraphrasing was carried out by translating the original sentence to English and then back to Russian. The models facebook/wmt19-ru-en and facebook/wmt19-en-ru were used for translation. To ensure that the back-translated texts are not identical to the original texts, the final decoder was prohibited from using the token n-grams from the original texts. The code below implements the paraphrasing function.
    import torch
    from transformers import FSMTModel, FSMTTokenizer, FSMTForConditionalGeneration

    # English -> Russian model, used for the final (backward) translation
    tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-en-ru")
    model = FSMTForConditionalGeneration.from_pretrained("facebook/wmt19-en-ru")
    # Russian -> English model, used for the first (forward) translation
    inverse_tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-ru-en")
    inverse_model = FSMTForConditionalGeneration.from_pretrained("facebook/wmt19-ru-en")
    model.cuda()
    inverse_model.cuda()

    def paraphrase(text, gram=4, num_beams=5, **kwargs):
        """ Generate a paraphrase using back translation.
        Parameter `gram` denotes size of token n-grams of the original sentence that cannot appear in the paraphrase.
        """
        # Step 1: translate the Russian input into English
        input_ids = inverse_tokenizer.encode(text, return_tensors="pt")
        with torch.no_grad():
            outputs = inverse_model.generate(input_ids.to(inverse_model.device), num_beams=num_beams, **kwargs)
        other_lang = inverse_tokenizer.decode(outputs[0], skip_special_tokens=True)
        # print(other_lang)
        # Step 2: collect token n-grams of the original (without the trailing special token) to ban
        input_ids = input_ids[0, :-1].tolist()
        bad_word_ids = [input_ids[i:(i+gram)] for i in range(len(input_ids)-gram)]
        # Step 3: translate back into Russian with the banned n-grams excluded
        input_ids = tokenizer.encode(other_lang, return_tensors="pt")
        with torch.no_grad():
            outputs = model.generate(input_ids.to(model.device), num_beams=num_beams, bad_words_ids=bad_word_ids, **kwargs)
        decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
        return decoded
The corpus was created by running the above paraphrase function on the original sentences with parameters gram=3, num_beams=5, repetition_penalty=3.14, no_repeat_ngram_size=6.
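To see what the n-gram restriction amounts to for gram=3, the list comprehension from the paraphrase function can be run in isolation on made-up token IDs. The function name banned_ngrams and the sample IDs are assumptions for illustration; real IDs would come from the tokenizer.

```python
def banned_ngrams(token_ids, gram):
    """Mirror the bad_word_ids comprehension from the paraphrase function."""
    return [token_ids[i:(i + gram)] for i in range(len(token_ids) - gram)]


# Made-up token IDs standing in for a tokenized sentence.
sample_ids = [11, 12, 13, 14, 15]
print(banned_ngrams(sample_ids, 3))  # [[11, 12, 13], [12, 13, 14]]
```

As written, the range stops one position short of the end, so the last n-gram of the sequence ([13, 14, 15] in this toy case) is not included in the banned list. Sequences no longer than gram tokens produce an empty list.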
The dataset was annotated by several automatic metrics:
Human annotation was involved only for a small subset used to train the model for p_good. It was conducted by the dataset author, @cointegrated.
The dataset is not known to contain any personal or sensitive information. The sources and processes of original data collection are described at https://wortschatz.uni-leipzig.de/en/download.
The dataset may enable the creation of paraphrasing systems that can be used both for "good" purposes (such as assisting writers or augmenting text datasets) and for "bad" purposes (such as disguising plagiarism). The authors are not responsible for any uses of the dataset.
The dataset may inherit some of the biases of the underlying Leipzig web corpus or the neural machine translation models (facebook/wmt19-ru-en and facebook/wmt19-en-ru) with which it was generated.
Most of the paraphrases in the dataset are valid (by a rough estimate, at least 80%). However, in some sentence pairs there are faults:
The field labse_sim reflects semantic similarity between the sentences, and it can be used to filter out at least some poor paraphrases.
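A minimal sketch of such filtering is shown below. The threshold of 0.8 and the helper name filter_by_similarity are assumptions made for illustration; the dataset card does not recommend a specific cutoff, so a suitable value would need to be chosen by inspecting the data.

```python
def filter_by_similarity(rows, min_labse_sim=0.8):
    """Keep only rows whose labse_sim is at least `min_labse_sim`.

    The default threshold is a hypothetical value, not a recommendation
    from the dataset author.
    """
    return [row for row in rows if row["labse_sim"] >= min_labse_sim]


# Toy rows standing in for real dataset records.
sample_rows = [
    {"idx": 1, "labse_sim": 0.93},
    {"idx": 2, "labse_sim": 0.41},
]
kept_rows = filter_by_similarity(sample_rows)
```

When working with the loaded datasets object rather than a list of dicts, the same predicate could be passed to its filter method, for example data["train"].filter(lambda row: row["labse_sim"] >= 0.8).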
The dataset was created by David Dale, a.k.a. @cointegrated.
This corpus, as well as the original Leipzig corpora, is licensed under CC BY.
This blog post can be cited:
    @misc{dale_paraphrasing_2021,
        author = "Dale, David",
        title = "Перефразирование русских текстов: корпуса, модели, метрики",
        editor = "habr.com",
        url = "https://habr.com/ru/post/564916/",
        month = {June},
        year = {2021},
        note = {[Online; posted 28-June-2021]},
    }
Thanks to @avidale for adding this dataset.