Dataset:
wmt15
Warning: there are known issues with the Common Crawl corpus data (`training-parallel-commoncrawl.tgz`); the WMT organizers have been contacted.
Translation dataset based on the data from statmt.org.
Versions exist for different years, using a combination of data sources. The base `wmt` script allows you to create a custom dataset by choosing your own data/language pair. This can be done as follows:
```python
import datasets
from datasets import inspect_dataset, load_dataset_builder

inspect_dataset("wmt15", "path/to/scripts")
builder = load_dataset_builder(
    "path/to/scripts/wmt_utils.py",
    language_pair=("fr", "de"),
    subsets={
        datasets.Split.TRAIN: ["commoncrawl_frde"],
        datasets.Split.VALIDATION: ["euelections_dev2019"],
    },
)

# Standard version
builder.download_and_prepare()
ds = builder.as_dataset()

# Streamable version
ds = builder.as_streaming_dataset()
```
An example of 'validation' looks as follows.
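The actual example is omitted here; below is a minimal sketch of the record shape, with placeholder strings standing in for the real sentences (the `cs`/`en` keys assume the cs-en configuration listed in the table below):

```python
# Hypothetical record from the 'validation' split: each example is a
# single 'translation' dict keyed by language code. The sentence
# strings here are placeholders, not actual corpus data.
example = {
    "translation": {
        "cs": "placeholder Czech sentence",
        "en": "placeholder English sentence",
    }
}

# Both language keys are present in every record.
print(sorted(example["translation"].keys()))  # ['cs', 'en']
```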
The data fields are the same among all splits.
| name  | train  | validation | test |
|-------|--------|------------|------|
| cs-en | 959768 | 3003       | 2656 |
```
@InProceedings{bojar-EtAl:2015:WMT,
  author    = {Bojar, Ond\v{r}ej and Chatterjee, Rajen and Federmann, Christian and Haddow, Barry and Huck, Matthias and Hokamp, Chris and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Post, Matt and Scarton, Carolina and Specia, Lucia and Turchi, Marco},
  title     = {Findings of the 2015 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Tenth Workshop on Statistical Machine Translation},
  month     = {September},
  year      = {2015},
  address   = {Lisbon, Portugal},
  publisher = {Association for Computational Linguistics},
  pages     = {1--46},
  url       = {http://aclweb.org/anthology/W15-3001}
}
```
Thanks to @thomwolf and @patrickvonplaten for adding this dataset.