Dataset:

allenai/wmt22_african

Dataset Card for allenai/wmt22_african

Dataset Summary

This dataset was created based on metadata for mined bitext released by Meta AI. It contains bitext for 248 language pairs covering the African languages that are part of the 2022 WMT Shared Task on Large Scale Machine Translation Evaluation for African Languages.

How to use the data

There are two ways to access the data:

Via the Hugging Face Python datasets library

from datasets import load_dataset
dataset = load_dataset("allenai/wmt22_african")

Clone the git repo

git lfs install
git clone https://huggingface.co/datasets/allenai/wmt22_african

Supported Tasks and Leaderboards

This dataset is one of the resources allowed under the Constrained Track for the 2022 WMT Shared Task on Large Scale Machine Translation Evaluation for African Languages.

Languages

Focus languages

Language	Code
Afrikaans	afr
Amharic	amh
Chichewa	nya
Nigerian Fulfulde	fuv
Hausa	hau
Igbo	ibo
Kamba	kam
Kinyarwanda	kin
Lingala	lin
Luganda	lug
Luo	luo
Northern Sotho	nso
Oromo	orm
Shona	sna
Somali	som
Swahili	swh
Swati	ssw
Tswana	tsn
Umbundu	umb
Wolof	wol
Xhosa	xho
Xitsonga	tso
Yoruba	yor
Zulu	zul

Colonial linguae francae: English - eng, French - fra

Dataset Structure

The dataset contains one gzipped, tab-delimited text file for each direction. Each line of a file holds one pair of parallel sentences.
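For readers working from the cloned repository rather than the datasets library, the following is a minimal sketch of reading one gzipped, tab-delimited file. The column order is not documented in this card, so the function returns the raw fields of each line without naming them, and the file path passed in is whatever file you choose from your local clone.

```python
import gzip
from typing import Iterator, List


def read_tsv_gz(path: str) -> Iterator[List[str]]:
    """Yield the tab-separated fields of each line in a gzipped text file.

    The meaning of each column is an assumption left to the caller:
    inspect a few rows before relying on a particular layout.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Strip only the line terminator so empty trailing fields survive.
            yield line.rstrip("\r\n").split("\t")
```

Because the function is a generator, a large file can be processed line by line without loading it into memory.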

Data Instances

The dataset contains 248 language pairs.

Sentence counts for each pair can be found here.

Data Fields

Every instance for a language pair contains the following fields: 'translation' (the sentence pair, keyed by language code), 'laser_score', 'source_sentence_lid', and 'target_sentence_lid', where 'lid' is the language classification probability.

Example:

{
'translation': 
    {
        'afr': 'In Mei 2007, in ooreenstemming met die spesifikasies van die Java Gemeenskapproses, het Sun Java tegnologie geherlisensieer onder die GNU General Public License.', 
        'eng': 'As of May 2007, in compliance with the specifications of the Java Community Process, Sun relicensed most of its Java technologies under the GNU General Public License.'
    }, 
'laser_score': 1.0717015266418457, 
'source_sentence_lid': 0.9996600151062012, 
'target_sentence_lid': 0.9972000122070312
}
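Since the bitext is mined automatically, it can be useful to drop low-confidence pairs before training. The sketch below filters records shaped like the example above. The threshold values and the sample records are made up for illustration, and it assumes that a higher 'laser_score' indicates a more confident alignment; suitable cutoffs should be chosen per language pair.

```python
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]


def filter_pairs(
    records: Iterable[Record],
    min_laser_score: float = 1.05,  # hypothetical cutoff, not from the card
    min_lid: float = 0.9,           # hypothetical cutoff, not from the card
) -> Iterator[Record]:
    """Yield only records whose scores meet all three thresholds."""
    for rec in records:
        if (
            rec["laser_score"] >= min_laser_score
            and rec["source_sentence_lid"] >= min_lid
            and rec["target_sentence_lid"] >= min_lid
        ):
            yield rec


# Made-up records that follow the field layout shown in the example.
sample_records = [
    {"translation": {"afr": "Goeie more.", "eng": "Good morning."},
     "laser_score": 1.07, "source_sentence_lid": 0.99, "target_sentence_lid": 0.99},
    {"translation": {"afr": "Dankie.", "eng": "Thank you."},
     "laser_score": 1.01, "source_sentence_lid": 0.99, "target_sentence_lid": 0.99},
    {"translation": {"afr": "Totsiens.", "eng": "Goodbye."},
     "laser_score": 1.20, "source_sentence_lid": 0.50, "target_sentence_lid": 0.99},
]

kept = list(filter_pairs(sample_records))
```

With these cutoffs only the first sample record is kept: the second falls below the score threshold and the third below the source language identification threshold.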

Data Splits

The data is not split into train, dev, and test.
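If held-out data is needed, the split has to be made by the user. The following is a minimal sketch of a seeded random split over a list of records; the function name, fractions, and seed are illustrative choices, not part of the dataset.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_records(
    records: Sequence[T],
    dev_fraction: float = 0.1,   # illustrative value
    test_fraction: float = 0.1,  # illustrative value
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle records with a fixed seed and return (train, dev, test)."""
    items = list(records)
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_fraction)
    n_test = int(len(items) * test_fraction)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test
```

Fixing the seed makes the split reproducible across runs. For mined web data, near-duplicate sentences may land in different splits, so deduplicating first is worth considering.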

Dataset Creation

Curation Rationale

Parallel sentences from monolingual data in Common Crawl and ParaCrawl were identified via Language-Agnostic Sentence Representation (LASER) encoders.

Source Data

Initial Data Collection and Normalization

Monolingual data was obtained from Common Crawl and ParaCrawl.

Who are the source language producers?

Contributors to web text in Common Crawl and ParaCrawl.

Annotations

Annotation process

The data was not human annotated. The metadata used to create the dataset can be found here: https://github.com/facebookresearch/LASER/tree/main/data/wmt22_african

Who are the annotators?

The data was not human annotated. Parallel text in Common Crawl and ParaCrawl monolingual data was identified automatically via LASER encoders.

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

This dataset provides data for training machine learning systems for many languages that have few resources available for NLP.

Discussion of Biases

Biases in the data have not been studied.

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

The dataset is released under the terms of ODC-BY. By using this dataset, you are also bound by the Internet Archive Terms of Use in respect of the content contained in the dataset.

Citation Information

NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022.

Contributions

We thank the AllenNLP team at AI2 for hosting and releasing this data, including Akshita Bhagia (for engineering efforts to create the Hugging Face dataset) and Jesse Dodge (for organizing the connection).

Author:

allenai

Dataset size:

5.82 GB