Dataset:
wmt17
Warning: There are issues with the Common Crawl corpus data (training-parallel-commoncrawl.tgz).
We have contacted the WMT organizers.
Translation dataset based on the data from statmt.org.
Versions exist for different years, each using a combination of data sources. The base wmt builder allows you to create a custom dataset by choosing your own data sources and language pair. This can be done as follows:
```python
import datasets
from datasets import inspect_dataset, load_dataset_builder

# Copy the dataset scripts locally so they can be inspected and customized
inspect_dataset("wmt17", "path/to/scripts")

# Build a custom dataset from the chosen language pair and data subsets
builder = load_dataset_builder(
    "path/to/scripts/wmt_utils.py",
    language_pair=("fr", "de"),
    subsets={
        datasets.Split.TRAIN: ["commoncrawl_frde"],
        datasets.Split.VALIDATION: ["euelections_dev2019"],
    },
)

# Standard version
builder.download_and_prepare()
ds = builder.as_dataset()

# Streamable version
ds = builder.as_streaming_dataset()
```
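Once a dataset has been built, its records can be consumed one at a time, which works for both the standard and the streamable version. The following is a minimal sketch, not part of the dataset scripts. It assumes each record is a dictionary with a `translation` key that maps language codes to sentences; that layout is not shown in this card, so check it against an actual record first. The helper name `iter_sentence_pairs` and the sample records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed record layout: {"translation": {"<lang>": "<sentence>", ...}}
Record = Dict[str, Dict[str, str]]


def iter_sentence_pairs(
    records: Iterable[Record], source_lang: str, target_lang: str
) -> Iterator[Tuple[str, str]]:
    """Lazily yield (source, target) pairs, skipping incomplete records."""
    for record in records:
        translation = record.get("translation", {})
        source = translation.get(source_lang)
        target = translation.get(target_lang)
        # Skip records where either side is missing or empty
        if source and target:
            yield source, target


# Hypothetical in-memory records standing in for a built split
sample_records = [
    {"translation": {"fr": "Bonjour.", "de": "Guten Tag."}},
    {"translation": {"fr": "Merci.", "de": ""}},  # skipped: empty target
]
pairs = list(iter_sentence_pairs(sample_records, "fr", "de"))
```

Because the helper is a generator, it never holds the full split in memory, so it can also be fed an iterable obtained from the streamable version.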
An example of 'train' looks as follows.
The data fields are the same among all splits.
cs-en

name | train | validation | test |
---|---|---|---|
cs-en | 1018291 | 2999 | 3005 |
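To put the split sizes in perspective, the short sketch below computes each split's share of the cs-en data. The counts are copied from the table above; the variable names are illustrative only.

```python
# Split sizes for the cs-en configuration, copied from the table above
split_sizes = {"train": 1_018_291, "validation": 2_999, "test": 3_005}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

# Print each split with its example count and share of the total
for name, size in split_sizes.items():
    print(f"{name:<10} {size:>9,} {fractions[name]:.2%}")
```

The validation and test splits together make up well under 1% of the examples, so nearly all of the data is available for training.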
```
@InProceedings{bojar-EtAl:2017:WMT1,
  author    = {Bojar, Ond\v{r}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huang, Shujian and Huck, Matthias and Koehn, Philipp and Liu, Qun and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Post, Matt and Rubino, Raphael and Specia, Lucia and Turchi, Marco},
  title     = {Findings of the 2017 Conference on Machine Translation (WMT17)},
  booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {169--214},
  url       = {http://www.aclweb.org/anthology/W17-4717}
}
```
Thanks to @patrickvonplaten and @thomwolf for adding this dataset.