Dataset:
tapaco
Multilinguality:
multilingual
Language creators:
crowdsourced
Annotations creators:
machine-generated
Source datasets:
extended|other-tatoeba
License:
cc-by-2.0

A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with between 200 and 250 000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
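The graph-based extraction described above can be illustrated with a minimal sketch. It assumes that a paraphrase set is a connected component of the equivalence graph, restricted to sentences of one language; the function name `paraphrase_sets`, the toy sentence ids, and the per-language split are illustrative assumptions, not the corpus author's actual pipeline, and the filtering and pruning steps are omitted.

```python
from collections import defaultdict


def paraphrase_sets(links, language_of):
    """Return (language, sorted sentence ids) pairs for every group of two or
    more same-language sentences that are connected by equivalence links.

    `links` is an iterable of (id_a, id_b) pairs; `language_of` maps each
    sentence id to a language code. Both are hypothetical inputs.
    """
    # Build an undirected adjacency list from the equivalence links.
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    result = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Iterative depth-first traversal collects one connected component.
        component = []
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        # Assumption: paraphrases must share a language, so split by language
        # and keep only groups with at least two sentences.
        by_language = defaultdict(list)
        for node in component:
            by_language[language_of[node]].append(node)
        for language, members in sorted(by_language.items()):
            if len(members) > 1:
                result.append((language, sorted(members)))
    return result


# Toy data: sentences 1 and 3 are English and both linked to French sentence 2.
toy_links = [(1, 2), (2, 3), (4, 5)]
toy_languages = {1: "en", 2: "fr", 3: "en", 4: "en", 5: "de"}
print(paraphrase_sets(toy_links, toy_languages))
```

In this toy graph, sentences 1 and 3 end up in the same set because both are linked to sentence 2, whereas sentences 4 and 5 are linked but differ in language and so yield no set.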
Paraphrase detection and generation have become popular tasks in NLP and are increasingly integrated into a wide variety of common downstream tasks such as machine translation, information retrieval, question answering, and semantic parsing. Most of the existing datasets cover only a single language (in most cases English) or a small number of languages. Furthermore, some paraphrase datasets focus on lexical and phrasal rather than sentential paraphrases, while others are created (semi-)automatically using machine translation.
The number of sentences per language ranges from 200 to 250 000, which makes the dataset more suitable for fine-tuning and evaluation purposes than for training. It is well-suited for multi-reference evaluation of paraphrase generation models, as there is generally not a single correct way of paraphrasing a given input sentence.
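A minimal sketch of what multi-reference evaluation could look like: a generated paraphrase is scored against every reference in its paraphrase set, and the best score is kept. The token-overlap F1 metric and the function names below are assumptions chosen for brevity; the dataset card does not prescribe a metric, and an established one could be substituted.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Whitespace-token F1 between two sentences (a deliberately simple metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # occurs in both sentences.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(candidate, references):
    """Score a candidate against several references and keep the best match."""
    return max((token_f1(candidate, ref) for ref in references), default=0.0)


# Invented example sentences, for illustration only.
references = ["read this book", "have a look at this book"]
print(multi_reference_score("read this book", references))
```

Taking the maximum rewards a model for matching any one valid paraphrase instead of penalizing it for missing the others.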
The dataset contains paraphrases in Afrikaans, Arabic, Azerbaijani, Belarusian, Berber languages, Bulgarian, Bengali, Breton, Catalan (Valencian), Chavacano, Mandarin, Czech, Danish, German, Modern Greek (1453-), English, Esperanto, Spanish (Castilian), Estonian, Basque, Finnish, French, Galician, Gronings, Hebrew, Hindi, Croatian, Hungarian, Armenian, Interlingua (International Auxiliary Language Association), Indonesian, Interlingue (Occidental), Ido, Icelandic, Italian, Japanese, Lojban, Kabyle, Korean, Cornish, Latin, Lingua Franca Nova, Lithuanian, Macedonian, Marathi, Norwegian Bokmål, Low German (Low Saxon), Dutch (Flemish), Old Russian, Ottoman Turkish (1500-1928), Iranian Persian, Polish, Portuguese, Rundi, Romanian (Moldavian, Moldovan), Russian, Slovenian, Serbian, Swedish, Turkmen, Tagalog, Klingon (tlhIngan-Hol), Toki Pona, Turkish, Tatar, Uighur (Uyghur), Ukrainian, Urdu, Vietnamese, Volapük, Waray, Wu Chinese, and Yue Chinese.
Each data instance corresponds to a paraphrase, e.g.:
{ 'paraphrase_set_id': '1483', 'sentence_id': '5778896', 'paraphrase': 'Ɣremt adlis-a.', 'lists': ['7546'], 'tags': [''], 'language': 'ber' }
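To recover paraphrase sets from individual instances like the one above, one plausible approach is to group instances by `language` and `paraphrase_set_id`. The sketch below assumes that instances sharing both values are paraphrases of one another; the helper name `group_by_set` is hypothetical, and the two English records are invented for illustration.

```python
from collections import defaultdict


def group_by_set(instances):
    """Map (language, paraphrase_set_id) to the list of paraphrase strings."""
    groups = defaultdict(list)
    for instance in instances:
        key = (instance["language"], instance["paraphrase_set_id"])
        groups[key].append(instance["paraphrase"])
    return dict(groups)


instances = [
    # Instance taken from the example above.
    {"paraphrase_set_id": "1483", "sentence_id": "5778896",
     "paraphrase": "Ɣremt adlis-a.", "lists": ["7546"], "tags": [""],
     "language": "ber"},
    # Invented records, not taken from the corpus.
    {"paraphrase_set_id": "1", "sentence_id": "100",
     "paraphrase": "Read this book.", "lists": [], "tags": [""],
     "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "101",
     "paraphrase": "Have a read of this book.", "lists": [], "tags": [""],
     "language": "en"},
]

for key, paraphrases in group_by_set(instances).items():
    print(key, paraphrases)
```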
Each data instance has the following fields: `paraphrase_set_id`, `sentence_id`, `paraphrase`, `lists`, `tags`, and `language`.
The dataset has a single train split containing a total of 1.9 million sentences, with between 200 and 250 000 sentences per language.
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
Creative Commons Attribution 2.0 Generic
@dataset{scherrer_yves_2020_3707949,
  author    = {Scherrer, Yves},
  title     = {{TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages}},
  month     = mar,
  year      = 2020,
  publisher = {Zenodo},
  version   = {1.0},
  doi       = {10.5281/zenodo.3707949},
  url       = {https://doi.org/10.5281/zenodo.3707949}
}
Thanks to @pacman100 for adding this dataset.