Dataset:
gsarti/flores_101
FLORES is a benchmark dataset for machine translation between English and low-resource languages.
Abstract from the original paper:
One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low quality because they are constructed using semi-automatic procedures. In this work, we introduce the FLORES evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains. These sentences have been translated in 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are multilingually aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.
Disclaimer: The FLORES-101 dataset is hosted by Facebook and licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
Refer to the Dynabench leaderboard for additional details on model evaluation on FLORES-101 in the context of the WMT2021 shared task on Large-Scale Multilingual Machine Translation.
The dataset contains parallel sentences for 101 languages, as listed on the project's original GitHub page. Languages are identified by their ISO 639-3 codes (e.g. eng, fra, rus), as in the original dataset.
New: Use the configuration all to access the full set of parallel sentences for all the available languages in a single command.
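As a minimal sketch of how a configuration might be loaded, the snippet below wraps `load_dataset` from the Hugging Face `datasets` library. The helper name `load_flores` and the constants are illustrative and not part of the dataset. Depending on the installed `datasets` version, extra arguments may be needed for this repository; that has not been verified here.

```python
DATASET_ID = "gsarti/flores_101"
PUBLIC_SPLITS = ("dev", "devtest")  # the two splits listed in this card


def load_flores(config: str, split: str = "dev"):
    """Load one configuration (a language code such as "rus", or "all").

    Hypothetical convenience wrapper, not an official API.
    """
    if split not in PUBLIC_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {PUBLIC_SPLITS}")
    # Imported lazily so the argument check above runs even without the library.
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(DATASET_ID, config, split=split)
```

For example, `load_flores("rus")` would request the dev split of the Russian configuration, and `load_flores("all", "devtest")` the devtest split for all languages at once.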
A sample from the dev split for the Russian language (rus config) is provided below. All configurations have the same structure, and all sentences are aligned across configurations and splits.
{
    'id': 1,
    'sentence': 'В понедельник ученые из Медицинской школы Стэнфордского университета объявили об изобретении нового диагностического инструмента, который может сортировать клетки по их типу; это маленький чип, который можно напечатать, используя стандартный струйный принтер примерно за 1 цент США.',
    'URL': 'https://en.wikinews.org/wiki/Scientists_say_new_medical_diagnostic_chip_can_sort_cells_anywhere_with_an_inkjet',
    'domain': 'wikinews',
    'topic': 'health',
    'has_image': 0,
    'has_hyperlink': 0
}
The text is provided as in the original dataset, without further preprocessing or tokenization.
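Because sentences are aligned across configurations, rows from two single-language configurations can be paired into translation examples. The sketch below assumes that alignment can be recovered by matching the `id` field shown in the sample. The function `align_by_id` is a hypothetical helper, and the rows are toy placeholders rather than real dataset content.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def align_by_id(src_rows: Iterable[Row], tgt_rows: Iterable[Row]) -> List[Tuple[str, str]]:
    """Pair source and target sentences that share the same 'id'."""
    # Index the target side once so each lookup is O(1).
    tgt_by_id = {row["id"]: row["sentence"] for row in tgt_rows}
    pairs = []
    for row in src_rows:
        # Skip source rows with no counterpart instead of failing.
        if row["id"] in tgt_by_id:
            pairs.append((row["sentence"], tgt_by_id[row["id"]]))
    return pairs


# Toy placeholder rows, not taken from FLORES-101.
src = [{"id": 1, "sentence": "source sentence 1"},
       {"id": 2, "sentence": "source sentence 2"}]
tgt = [{"id": 2, "sentence": "target sentence 2"},
       {"id": 1, "sentence": "target sentence 1"}]
pairs = align_by_id(src, tgt)
```

The same function accepts any iterable of row dictionaries, so two loaded splits of the same name (for instance the dev splits of two language configurations) could be passed directly.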
config | dev | devtest |
---|---|---|
all configurations | 997 | 1012 |
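The split sizes in the table can serve as a quick sanity check after loading. The following is a small sketch: `EXPECTED_ROWS` and `check_split_size` are illustrative names, and the numbers are copied from the table above.

```python
# Row counts per split, as listed in the table above.
EXPECTED_ROWS = {"dev": 997, "devtest": 1012}


def check_split_size(split: str, n_rows: int) -> None:
    """Raise ValueError if a loaded split does not have the documented size."""
    expected = EXPECTED_ROWS[split]
    if n_rows != expected:
        raise ValueError(f"{split}: expected {expected} rows, got {n_rows}")
```

With a loaded split `ds`, the check would be `check_split_size("dev", len(ds))`; since every configuration has the same structure, the same check applies to any language.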
Please refer to the original article The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation for additional information on dataset creation.
The original authors of FLORES-101 are the curators of the original dataset. For problems or updates on this 🤗 Datasets version, please contact gabriele.sarti996@gmail.com.
Licensed with Creative Commons Attribution Share Alike 4.0. License available here.
Please cite the authors if you use these corpora in your work:
@inproceedings{flores101,
  title={The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation},
  author={Goyal, Naman and Gao, Cynthia and Chaudhary, Vishrav and Chen, Peng-Jen and Wenzek, Guillaume and Ju, Da and Krishnan, Sanjana and Ranzato, Marc'Aurelio and Guzm\'{a}n, Francisco and Fan, Angela},
  journal={arXiv preprint arXiv:2106.03193},
  year={2021}
}