Dataset:

para_crawl

Tasks:

translation

Multilinguality:

translation

Size:

10M<n<100M

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original

License:

cc0-1.0

Dataset Card for "para_crawl"

Dataset Summary

Web-Scale Parallel Corpora for Official European Languages.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

enbg
  • Size of downloaded dataset files: 103.75 MB
  • Size of the generated dataset: 356.54 MB
  • Total amount of disk used: 460.27 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"bg\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
encs
  • Size of downloaded dataset files: 196.41 MB
  • Size of the generated dataset: 638.07 MB
  • Total amount of disk used: 834.48 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"cs\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
enda
  • Size of downloaded dataset files: 182.81 MB
  • Size of the generated dataset: 598.62 MB
  • Total amount of disk used: 781.43 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"da\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
ende
  • Size of downloaded dataset files: 1.31 GB
  • Size of the generated dataset: 4.00 GB
  • Total amount of disk used: 5.30 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"de\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
enel
  • Size of downloaded dataset files: 193.56 MB
  • Size of the generated dataset: 688.07 MB
  • Total amount of disk used: 881.62 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"el\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
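
As a rough sketch of how one of the configurations above might be loaded: the helper below wraps `load_dataset` from the Hugging Face `datasets` package. It assumes that package is installed and that the dataset is served under the id "para_crawl" with the configuration names used in this card. `load_train_split` and `CONFIGS` are names chosen here for illustration, and whether the call works unchanged depends on the installed library version.

```python
def load_train_split(config="enbg"):
    """Load the 'train' split of one para_crawl configuration.

    Assumption: the Hugging Face `datasets` package is installed and
    serves this dataset under the id "para_crawl".
    """
    from datasets import load_dataset  # third-party; imported lazily
    return load_dataset("para_crawl", config, split="train")

# Configuration names as they appear in this card.
CONFIGS = ["enbg", "encs", "enda", "ende", "enel"]

# Example call (downloads data, so it is left commented out):
# train = load_train_split("enbg")
# print(train[0]["translation"])
```

The call is left commented out because even the smallest configuration listed here reports a download of about 100 MB.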

Data Fields

The data fields are the same among all splits.

enbg
  • translation: a multilingual string variable, with possible languages including en, bg.
encs
  • translation: a multilingual string variable, with possible languages including en, cs.
enda
  • translation: a multilingual string variable, with possible languages including en, da.
ende
  • translation: a multilingual string variable, with possible languages including en, de.
enel
  • translation: a multilingual string variable, with possible languages including en, el.
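
A minimal sketch of how the `translation` field might be consumed: the helper below, `split_pair` (a hypothetical name, not part of any library), returns the English side and the other-language side of one record. It assumes the field is either a mapping keyed by language code or, as in the cropped examples above, a JSON-encoded string of such a mapping. The sample record is a placeholder, not real data.

```python
import json

def split_pair(record, src="en", tgt="bg"):
    """Return (source_text, target_text) from one record.

    Assumes record["translation"] is a dict keyed by language code,
    or a JSON string that encodes such a dict.
    """
    translation = record["translation"]
    if isinstance(translation, str):
        translation = json.loads(translation)
    return translation[src], translation[tgt]

# Illustrative record only; the strings are placeholders, not corpus text.
sample = {
    "translation": {
        "en": "example English sentence",
        "bg": "example Bulgarian sentence",
    }
}
source_text, target_text = split_pair(sample)
print(source_text, "|", target_text)
```

For another configuration, pass its language code as `tgt`, for example `split_pair(record, tgt="de")` for ende.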

Data Splits

name      train
enbg    1039885
encs    2981949
enda    2414895
ende   16264448
enel    1985233
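
The split sizes above can be tallied with a few lines of Python. This is only a sketch: the counts are copied from the table, and `TRAIN_SIZES` and `share` are names chosen here for illustration.

```python
# Training-pair counts copied from the Data Splits table above.
TRAIN_SIZES = {
    "enbg": 1039885,
    "encs": 2981949,
    "enda": 2414895,
    "ende": 16264448,
    "enel": 1985233,
}

total = sum(TRAIN_SIZES.values())

def share(config):
    """Fraction of the listed training pairs that belong to `config`."""
    return TRAIN_SIZES[config] / total

print(f"total pairs listed: {total:,}")
# Print configurations from largest to smallest.
for name in sorted(TRAIN_SIZES, key=TRAIN_SIZES.get, reverse=True):
    print(f"{name}: {TRAIN_SIZES[name]:>10,} ({share(name):.1%})")
```

The five configurations listed here sum to 24,686,410 training pairs, which falls inside the 10M<n<100M size category stated for the dataset. The ende configuration alone accounts for roughly two thirds of them.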

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Creative Commons CC0 license ("no rights reserved").

Citation Information

@inproceedings{banon-etal-2020-paracrawl,
    title = "{P}ara{C}rawl: Web-Scale Acquisition of Parallel Corpora",
    author = "Ba{\~n}{\'o}n, Marta  and
      Chen, Pinzhen  and
      Haddow, Barry  and
      Heafield, Kenneth  and
      Hoang, Hieu  and
      Espl{\`a}-Gomis, Miquel  and
      Forcada, Mikel L.  and
      Kamran, Amir  and
      Kirefu, Faheem  and
      Koehn, Philipp  and
      Ortiz Rojas, Sergio  and
      Pla Sempere, Leopoldo  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Sarr{\'\i}as, Elsa  and
      Strelec, Marek  and
      Thompson, Brian  and
      Waites, William  and
      Wiggins, Dion  and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.417",
    doi = "10.18653/v1/2020.acl-main.417",
    pages = "4555--4567",
    abstract = "We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.",
}

Contributions

Thanks to @thomwolf, @lewtun, @patrickvonplaten, and @mariamabarham for adding this dataset.