Dataset:
allenai/wmt22_african
This dataset was created based on metadata for mined bitext released by Meta AI. It contains bitext for 248 language pairs covering the African languages that are part of the 2022 WMT Shared Task on Large Scale Machine Translation Evaluation for African Languages.
How to use the data

There are two ways to access the data:
    from datasets import load_dataset
    dataset = load_dataset("allenai/wmt22_african")

    git lfs install
    git clone https://huggingface.co/datasets/allenai/wmt22_african
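To load a single language pair rather than the whole collection, a sketch along the following lines may help. It assumes, without confirmation from this card, that each pair is exposed as a configuration named with the two language codes sorted alphabetically and joined by a hyphen (for example `afr-eng`). `pair_config` and `load_pair` are hypothetical helpers, not part of the dataset or the `datasets` library.

```python
def pair_config(lang_a: str, lang_b: str) -> str:
    """Build an ASSUMED config name such as 'afr-eng' (codes sorted alphabetically)."""
    return "-".join(sorted([lang_a, lang_b]))


def load_pair(lang_a: str, lang_b: str):
    """Load one language pair; requires the `datasets` package and network access."""
    # Imported lazily so that pair_config can be used without `datasets` installed.
    from datasets import load_dataset

    # The second positional argument of load_dataset is the configuration name.
    return load_dataset("allenai/wmt22_african", pair_config(lang_a, lang_b))


if __name__ == "__main__":
    print(pair_config("eng", "afr"))
```

If the configuration names follow a different scheme, only `pair_config` needs to change.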
This dataset is one of the resources allowed under the Constrained Track for the 2022 WMT Shared Task on Large Scale Machine Translation Evaluation for African Languages.
Language | Code |
---|---|
Afrikaans | afr |
Amharic | amh |
Chichewa | nya |
Nigerian Fulfulde | fuv |
Hausa | hau |
Igbo | ibo |
Kamba | kam |
Kinyarwanda | kin |
Lingala | lin |
Luganda | lug |
Luo | luo |
Northern Sotho | nso |
Oromo | orm |
Shona | sna |
Somali | som |
Swahili | swh |
Swati | ssw |
Tswana | tsn |
Umbundu | umb |
Wolof | wol |
Xhosa | xho |
Xitsonga | tso |
Yoruba | yor |
Zulu | zul |
Colonial linguae francae: English - eng, French - fra
The dataset contains gzipped, tab-delimited text files, one for each direction. Each line of a text file holds a pair of parallel sentences.
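For readers working with the cloned files directly, here is a minimal sketch of reading one gzipped, tab-delimited file with the Python standard library. The file name `afr-eng.tsv.gz` and the two-column layout of the demo file are assumptions made for illustration; the real files may carry additional columns, which this reader would return unchanged.

```python
import gzip
import os
import tempfile


def read_bitext(path):
    """Yield each line of a gzipped, tab-delimited file as a list of columns."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Strip only the line terminator so sentence-internal spaces survive.
            yield line.rstrip("\r\n").split("\t")


# Build a tiny demo file (hypothetical name and content) so the sketch is runnable.
demo_path = os.path.join(tempfile.mkdtemp(), "afr-eng.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("Goeie môre.\tGood morning.\n")
    handle.write("Baie dankie.\tThank you very much.\n")

rows = list(read_bitext(demo_path))
for columns in rows:
    print(columns)
```

Reading line by line keeps memory use flat, which matters for the larger pairs.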
The dataset contains 248 language pairs.
Sentence counts for each pair can be found here.
Every instance for a language pair contains the following fields: 'translation' (the sentence pair, keyed by language code), 'laser_score', 'source_sentence_lid', and 'target_sentence_lid', where 'lid' is the language classification probability for the corresponding sentence.
Example:
    {
        'translation': {
            'afr': 'In Mei 2007, in ooreenstemming met die spesifikasies van die Java Gemeenskapproses, het Sun Java tegnologie geherlisensieer onder die GNU General Public License.',
            'eng': 'As of May 2007, in compliance with the specifications of the Java Community Process, Sun relicensed most of its Java technologies under the GNU General Public License.'
        },
        'laser_score': 1.0717015266418457,
        'source_sentence_lid': 0.9996600151062012,
        'target_sentence_lid': 0.9972000122070312
    }
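Because each instance carries a mining score and two language identification probabilities, one plausible use is to filter out low-confidence pairs before training. The sketch below shows the idea on in-memory records. The thresholds `min_laser=1.05` and `min_lid=0.9` are arbitrary illustrative values, not recommendations from the dataset authors, and the second record is invented for the demonstration.

```python
def keep_pair(record, min_laser=1.05, min_lid=0.9):
    """Return True if a record clears all three (hypothetical) thresholds."""
    return (
        record["laser_score"] >= min_laser
        and record["source_sentence_lid"] >= min_lid
        and record["target_sentence_lid"] >= min_lid
    )


records = [
    {   # Scores copied from the example instance above (sentences shortened).
        "translation": {"afr": "In Mei 2007 ...", "eng": "As of May 2007 ..."},
        "laser_score": 1.0717015266418457,
        "source_sentence_lid": 0.9996600151062012,
        "target_sentence_lid": 0.9972000122070312,
    },
    {   # Invented low-confidence record, for illustration only.
        "translation": {"afr": "...", "eng": "..."},
        "laser_score": 1.01,
        "source_sentence_lid": 0.55,
        "target_sentence_lid": 0.98,
    },
]

filtered = [record for record in records if keep_pair(record)]
print(len(filtered))
```

With a dataset loaded through the `datasets` library, the same predicate could be passed to `Dataset.filter`.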
The data is not split into train, dev, and test.
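Since no splits are provided, anyone who wants a held-out portion has to create it. The following is a minimal sketch of a reproducible split over example indices. The fractions and the seed are arbitrary assumptions, and `split_indices` is a hypothetical helper rather than anything shipped with the dataset.

```python
import random


def split_indices(n, dev_fraction=0.01, test_fraction=0.01, seed=0):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into train/dev/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # private RNG, global state untouched
    n_dev = round(n * dev_fraction)
    n_test = round(n * test_fraction)
    dev = indices[:n_dev]
    test = indices[n_dev:n_dev + n_test]
    train = indices[n_dev + n_test:]
    return train, dev, test


train, dev, test = split_indices(1000)
print(len(train), len(dev), len(test))
```

Fixing the seed means the same examples land in the same split on every run, which keeps experiments comparable.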
Parallel sentences from monolingual data in Common Crawl and ParaCrawl were identified via Language-Agnostic Sentence Representation (LASER) encoders.
Monolingual data was obtained from Common Crawl and ParaCrawl.
Who are the source language producers?

Contributors to web text in Common Crawl and ParaCrawl.
The data was not human annotated. The metadata used to create the dataset can be found here: https://github.com/facebookresearch/LASER/tree/main/data/wmt22_african
Who are the annotators?

The data was not human annotated. Parallel text from Common Crawl and ParaCrawl monolingual data was identified automatically via LASER encoders.
[Needs More Information]
This dataset provides data for training machine learning systems for many languages that have low resources available for NLP.
Biases in the data have not been studied.
[Needs More Information]
[Needs More Information]
The dataset is released under the terms of ODC-BY. By using this, you are also bound by the Internet Archive Terms of Use in respect of the content contained in the dataset.
NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022.
We thank the AllenNLP team at AI2 for hosting and releasing this data, including Akshita Bhagia (for engineering efforts to create the Hugging Face dataset) and Jesse Dodge (for organizing the connection).