Dataset:
wmt16
Warning: There are issues with the Common Crawl corpus data (training-parallel-commoncrawl.tgz).
We have contacted the WMT organizers.
Translation dataset based on the data from statmt.org.
Versions exist for different years using a combination of data sources. The base wmt allows you to create a custom dataset by choosing your own data/language pair. This can be done as follows:

    import datasets
    from datasets import inspect_dataset, load_dataset_builder

    # Copy the dataset scripts to a local directory so they can be inspected
    inspect_dataset("wmt16", "path/to/scripts")

    builder = load_dataset_builder(
        "path/to/scripts/wmt_utils.py",
        language_pair=("fr", "de"),
        subsets={
            datasets.Split.TRAIN: ["commoncrawl_frde"],
            datasets.Split.VALIDATION: ["euelections_dev2019"],
        },
    )

    # Standard version
    builder.download_and_prepare()
    ds = builder.as_dataset()

    # Streamable version
    ds = builder.as_streaming_dataset()

An example of 'validation' looks as follows.
The data fields are the same among all splits.
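
As a rough sketch of how such records might be consumed, the helper below pulls source/target sentence pairs out of an iterable of examples. It assumes each record carries a single `translation` field keyed by language code, which is how the Hugging Face WMT builders usually expose sentence pairs. The name `iter_pairs` and the sample sentences are hypothetical placeholders, not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed record shape: {"translation": {"<src>": "...", "<tgt>": "..."}}
Record = Dict[str, Dict[str, str]]


def iter_pairs(records: Iterable[Record], src: str, tgt: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from translation records."""
    for record in records:
        pair = record["translation"]
        yield pair[src], pair[tgt]


# Hypothetical placeholder records, used only to exercise the helper.
sample_records = [
    {"translation": {"cs": "Dobrý den.", "en": "Good day."}},
    {"translation": {"cs": "Děkuji.", "en": "Thank you."}},
]

pairs = list(iter_pairs(sample_records, "cs", "en"))
```

Because the helper only iterates, it should work the same way on a prepared split or on a streaming dataset, provided the records have the assumed shape.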
cs-en

name | train | validation | test |
---|---|---|---|
cs-en | 997240 | 2656 | 2999 |
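
To put the split sizes in proportion, the short sketch below computes each split's share of the total. The counts are copied from the cs-en row above; the variable names are illustrative only.

```python
# Example counts copied from the cs-en row of the split table.
split_sizes = {"train": 997240, "validation": 2656, "test": 2999}

total = sum(split_sizes.values())
shares = {name: count / total for name, count in split_sizes.items()}

# Print one line per split: name, raw count, and share of all sentence pairs.
for name, share in shares.items():
    print(f"{name:<10} {split_sizes[name]:>7} {share:.4%}")
```

The validation and test splits together account for well under 1% of the sentence pairs, so nearly all of the data is available for training.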

    @InProceedings{bojar-EtAl:2016:WMT1,
      author    = {Bojar, Ond\v{r}ej and Chatterjee, Rajen and Federmann, Christian and
                   Graham, Yvette and Haddow, Barry and Huck, Matthias and
                   Jimeno Yepes, Antonio and Koehn, Philipp and Logacheva, Varvara and
                   Monz, Christof and Negri, Matteo and Neveol, Aurelie and
                   Neves, Mariana and Popel, Martin and Post, Matt and
                   Rubino, Raphael and Scarton, Carolina and Specia, Lucia and
                   Turchi, Marco and Verspoor, Karin and Zampieri, Marcos},
      title     = {Findings of the 2016 Conference on Machine Translation},
      booktitle = {Proceedings of the First Conference on Machine Translation},
      month     = {August},
      year      = {2016},
      address   = {Berlin, Germany},
      publisher = {Association for Computational Linguistics},
      pages     = {131--198},
      url       = {http://www.aclweb.org/anthology/W/W16/W16-2301}
    }

Thanks to @thomwolf and @patrickvonplaten for adding this dataset.