数据集:
para_crawl
任务:
计算机处理:
translation大小:
10M<n<100M语言创建人:
found批注创建人:
no-annotation源数据集:
original许可:
Web-Scale Parallel Corpora for Official European Languages.
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"translation": "{\"bg\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
encs
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"translation": "{\"cs\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
enda
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"translation": "{\"da\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
ende
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"translation": "{\"de\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
enel
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"translation": "{\"el\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
The data fields are the same among all splits.
enbg| name | train |
|---|---|
| enbg | 1039885 |
| encs | 2981949 |
| enda | 2414895 |
| ende | 16264448 |
| enel | 1985233 |
Creative Commons CC0 license ("no rights reserved") .
@inproceedings{banon-etal-2020-paracrawl,
title = "{P}ara{C}rawl: Web-Scale Acquisition of Parallel Corpora",
author = "Ba{\~n}{\'o}n, Marta and
Chen, Pinzhen and
Haddow, Barry and
Heafield, Kenneth and
Hoang, Hieu and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Kamran, Amir and
Kirefu, Faheem and
Koehn, Philipp and
Ortiz Rojas, Sergio and
Pla Sempere, Leopoldo and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Sarr{\'\i}as, Elsa and
Strelec, Marek and
Thompson, Brian and
Waites, William and
Wiggins, Dion and
Zaragoza, Jaume",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.acl-main.417",
doi = "10.18653/v1/2020.acl-main.417",
pages = "4555--4567",
abstract = "We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.",
}
Thanks to @thomwolf , @lewtun , @patrickvonplaten , @mariamabarham for adding this dataset.