Dataset:

para_crawl

Tasks:

Translation

Languages:

Multilinguality:

translation

Size:

10M<n<100M

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original

License:

cc0-1.0

Dataset Card for "para_crawl"

Dataset Summary

Web-Scale Parallel Corpora for Official European Languages.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

enbg

Size of downloaded dataset files: 103.75 MB
Size of the generated dataset: 356.54 MB
Total amount of disk used: 460.27 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"bg\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}

encs

Size of downloaded dataset files: 196.41 MB
Size of the generated dataset: 638.07 MB
Total amount of disk used: 834.48 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"cs\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}

enda

Size of downloaded dataset files: 182.81 MB
Size of the generated dataset: 598.62 MB
Total amount of disk used: 781.43 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"da\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}

ende

Size of downloaded dataset files: 1.31 GB
Size of the generated dataset: 4.00 GB
Total amount of disk used: 5.30 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"de\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}

enel

Size of downloaded dataset files: 193.56 MB
Size of the generated dataset: 688.07 MB
Total amount of disk used: 881.62 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"el\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}

Data Fields

The data fields are the same among all splits.

enbg

translation: a multilingual string variable, with possible languages including en, bg.

encs

translation: a multilingual string variable, with possible languages including en, cs.

enda

translation: a multilingual string variable, with possible languages including en, da.

ende

translation: a multilingual string variable, with possible languages including en, de.

enel

translation: a multilingual string variable, with possible languages including en, el.
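The field layout above can be sketched in Python. This is a minimal illustration, not the dataset's official loader: it assumes each record exposes `translation` as a mapping from language code to sentence (which is how the Hugging Face `datasets` library usually presents translation features), and that every config name is the two language codes concatenated. The helper names `config_languages`, `to_pair` and `load_config` are hypothetical, and the sample record is invented for illustration rather than taken from the corpus. Depending on the installed `datasets` version, `load_dataset` may need extra arguments.

```python
from typing import Dict, Tuple


def config_languages(config: str) -> Tuple[str, str]:
    """Split a config name such as 'enbg' into ('en', 'bg').

    Assumption: config names are two 2-letter language codes joined together.
    """
    if len(config) != 4:
        raise ValueError(f"expected a four-letter config name, got {config!r}")
    return config[:2], config[2:]


def to_pair(record: Dict[str, Dict[str, str]], config: str) -> Tuple[str, str]:
    """Return (English text, other-language text) from one record."""
    source, target = config_languages(config)
    translation = record["translation"]
    return translation[source], translation[target]


def load_config(config: str = "enbg"):
    """Load the 'train' split of one config (downloads data; not called here)."""
    # Imported lazily so the rest of the sketch runs without the library.
    from datasets import load_dataset
    return load_dataset("para_crawl", config, split="train")


# Hypothetical record, invented for illustration; not taken from the corpus.
example = {"translation": {"en": "Good morning.", "bg": "Добро утро."}}
print(to_pair(example, "enbg"))
```

Keeping the language pair derived from the config name means the same helper works unchanged for encs, enda, ende and enel.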

Data Splits

name	train
enbg	1039885
encs	2981949
enda	2414895
ende	16264448
enel	1985233
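As a quick sanity check on the table above, the row counts can be totalled and compared with the 10M<n<100M size category given in the card header. The snippet below is only a sketch; the dictionary name `train_rows` is an illustrative choice, and the numbers are copied directly from the table.

```python
# Train split sizes copied from the Data Splits table.
train_rows = {
    "enbg": 1_039_885,
    "encs": 2_981_949,
    "enda": 2_414_895,
    "ende": 16_264_448,
    "enel": 1_985_233,
}

total = sum(train_rows.values())
print(f"total\t{total}")  # total	24686410

# List configs from largest to smallest with their share of all listed rows.
for name, rows in sorted(train_rows.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}\t{rows}\t{rows / total:.1%}")
```

The total of the five listed configs falls inside the 10M<n<100M bucket, and ende alone accounts for well over half of the listed rows.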

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Creative Commons CC0 license ("no rights reserved").

Citation Information

@inproceedings{banon-etal-2020-paracrawl,
    title = "{P}ara{C}rawl: Web-Scale Acquisition of Parallel Corpora",
    author = "Ba{\~n}{\'o}n, Marta  and
      Chen, Pinzhen  and
      Haddow, Barry  and
      Heafield, Kenneth  and
      Hoang, Hieu  and
      Espl{\`a}-Gomis, Miquel  and
      Forcada, Mikel L.  and
      Kamran, Amir  and
      Kirefu, Faheem  and
      Koehn, Philipp  and
      Ortiz Rojas, Sergio  and
      Pla Sempere, Leopoldo  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Sarr{\'\i}as, Elsa  and
      Strelec, Marek  and
      Thompson, Brian  and
      Waites, William  and
      Wiggins, Dion  and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.417",
    doi = "10.18653/v1/2020.acl-main.417",
    pages = "4555--4567",
    abstract = "We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.",
}

Contributions

Thanks to @thomwolf, @lewtun, @patrickvonplaten, @mariamabarham for adding this dataset.

Author:

Anonymous

Dataset size:

44.87 KB
