This repository contains information about the ParaDetox dataset, the first parallel corpus for the detoxification task, as well as models and an evaluation methodology for the detoxification of English texts. The original paper, "ParaDetox: Detoxification with Parallel Data", was presented at the ACL 2022 main conference.
The ParaDetox dataset was collected on the Yandex.Toloka crowdsourcing platform in three steps.
These steps were designed to ensure high data quality and to automate the collection process. For more details, please refer to the original paper.
As a result, we obtained paraphrases for 11,939 toxic sentences (1.66 paraphrases per sentence on average), for a total of 19,766 paraphrases. The whole dataset can be found here. Examples of samples from the ParaDetox dataset:
In addition to the full ParaDetox dataset, we also release the samples that annotators marked as "cannot rewrite" in Task 1 of the crowdsourcing pipeline.
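The dataset statistics reported above (number of toxic sentences, number of paraphrases, and the average per sentence) could be recomputed along the following lines. This is a minimal sketch, assuming the corpus has been loaded as a list of `(toxic, neutral)` string pairs; the function name `paraphrase_stats` and the placeholder pairs are hypothetical and not part of the released code.

```python
from collections import defaultdict


def paraphrase_stats(pairs):
    """Return (n_toxic_sentences, n_paraphrases, mean_paraphrases_per_sentence).

    `pairs` is an iterable of (toxic, neutral) strings. This layout is an
    assumption made for illustration, not the documented dataset format.
    """
    by_toxic = defaultdict(set)
    for toxic, neutral in pairs:
        # Group distinct paraphrases under the toxic sentence they rewrite.
        by_toxic[toxic].add(neutral)
    n_toxic = len(by_toxic)
    n_paraphrases = sum(len(neutrals) for neutrals in by_toxic.values())
    mean = n_paraphrases / n_toxic if n_toxic else 0.0
    return n_toxic, n_paraphrases, mean


# Placeholder pairs standing in for real dataset rows.
toy_pairs = [
    ("toxic sentence A", "neutral rewrite A1"),
    ("toxic sentence A", "neutral rewrite A2"),
    ("toxic sentence B", "neutral rewrite B1"),
]
stats = paraphrase_stats(toy_pairs)

# Sanity check on the reported figures: 19,766 paraphrases / 11,939 sentences.
reported_mean = round(19766 / 11939, 2)
```

On the reported totals, the same ratio rounds to the stated average of 1.66 paraphrases per sentence.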
The automatic evaluation of the models was based on three parameters:
All code used in our experiments to evaluate different detoxification models can be run via a Colab notebook.
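As a rough illustration of how per-sentence evaluation scores might be combined into a single number, the sketch below assumes the three parameters are those described in the ParaDetox paper: style transfer accuracy (STA), content preservation (SIM), and fluency (FL), aggregated into a joint score J as the mean of their per-sentence product. The function name `joint_score` and the example values are hypothetical; the actual metric implementations are the ones in the Colab notebook.

```python
def joint_score(sta, sim, fl):
    """Mean over sentences of STA * SIM * FL.

    Each argument is a list of per-sentence scores in [0, 1]. The choice of
    metrics and their combination is an assumption based on the paper, not
    the repository's evaluation code.
    """
    if not (len(sta) == len(sim) == len(fl)):
        raise ValueError("all score lists must have the same length")
    if not sta:
        raise ValueError("score lists must not be empty")
    products = [a * s * f for a, s, f in zip(sta, sim, fl)]
    return sum(products) / len(products)


# Made-up scores for three model outputs (illustration only).
sta = [1.0, 0.0, 1.0]  # 1.0 = output classified as non-toxic
sim = [0.8, 0.9, 0.5]  # similarity between input and output
fl = [1.0, 1.0, 0.0]   # 1.0 = output classified as fluent
j = joint_score(sta, sim, fl)
```

Because the scores are multiplied per sentence, an output that fails any one criterion contributes nothing to J, so a model cannot compensate for toxic or disfluent outputs with high similarity alone.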
The new SOTA for the detoxification task, a BART (base) model trained on the ParaDetox dataset, is released in the HuggingFace repository here.
You can also check out our demo and Telegram bot.
@inproceedings{logacheva-etal-2022-paradetox,
    title = "{P}ara{D}etox: Detoxification with Parallel Data",
    author = "Logacheva, Varvara and Dementieva, Daryna and Ustyantsev, Sergey and Moskovskiy, Daniil and Dale, David and Krotova, Irina and Semenov, Nikita and Panchenko, Alexander",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.469",
    pages = "6804--6818",
    abstract = "We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.",
}
If you find an issue, do not hesitate to report it in GitHub Issues.
For any questions, please contact: Daryna Dementieva ( dardem96@gmail.com )