Dataset:
allenai/nllb
This dataset was created based on metadata for mined bitext released by Meta AI. It contains bitext for 148 English-centric and 1465 non-English-centric language pairs, mined with the stopes mining library and the LASER3 encoders (Heffernan et al., 2022). The complete dataset is ~450GB.
CCMatrix contains previous versions of the mined bitext.
How to use the data
There are two ways to access the data:
1. For accessing a particular language pair:
from datasets import load_dataset
dataset = load_dataset("allenai/nllb", "ace_Latn-ban_Latn")
2. For downloading the whole dataset with git LFS:
git lfs install
git clone https://huggingface.co/datasets/allenai/nllb
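Because the complete dataset is ~450GB, you can also iterate over a single pair without downloading it in full. The following is a minimal sketch using the datasets library's streaming mode; the language pair and the single "train" split name are assumptions based on the loader shown above:

from datasets import load_dataset

# Stream one language pair instead of materializing it on disk.
stream = load_dataset("allenai/nllb", "ace_Latn-ban_Latn", split="train", streaming=True)

# Peek at the first few mined sentence pairs.
for i, example in enumerate(stream):
    print(example["translation"])
    if i >= 2:
        break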
Language pairs can be found here.
The dataset contains gzipped, tab-delimited text files for each direction. Each text file contains lines with parallel sentences.
The number of instances for each language pair can be found in the dataset_infos.json file.
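As a quick sketch of reading those counts after cloning the repository, the snippet below assumes the standard Hugging Face dataset_infos.json layout, with one configuration per language pair and a single "train" split:

import json

# Read per-language-pair instance counts from dataset_infos.json.
# Assumes the standard layout: one config per pair, one "train" split.
with open("dataset_infos.json") as f:
    infos = json.load(f)

for pair, info in sorted(infos.items()):
    print(pair, info["splits"]["train"]["num_examples"])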
Every instance for a language pair contains the following fields: 'translation' (containing the sentence pair), 'laser_score', 'source_sentence_lid', 'target_sentence_lid' (where 'lid' is the language identification probability), 'source_sentence_source', 'source_sentence_url', 'target_sentence_source', 'target_sentence_url'.
The lines are sorted by LASER3 score in decreasing order.
Example:
{'translation': {'ace_Latn': 'Gobnyan hana geupeukeucewa gata atawa geutinggai meunan mantong gata."',
                 'ban_Latn': 'Ida nenten jaga manggayang wiadin ngutang semeton."'},
 'laser_score': 1.2499876022338867,
 'source_sentence_lid': 1.0000100135803223,
 'target_sentence_lid': 0.9991400241851807,
 'source_sentence_source': 'paracrawl9_hieu',
 'source_sentence_url': '_',
 'target_sentence_source': 'crawl-data/CC-MAIN-2020-10/segments/1581875144165.4/wet/CC-MAIN-20200219153707-20200219183707-00232.warc.wet.gz',
 'target_sentence_url': 'https://alkitab.mobi/tb/Ula/31/6/\n'}
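A minimal sketch of accessing these fields with the datasets library (the "train" split name is the single split exposed by the loader, and the language pair is only an example):

from datasets import load_dataset

# Load one language pair and inspect the fields described above.
ds = load_dataset("allenai/nllb", "ace_Latn-ban_Latn", split="train")

first = ds[0]  # lines are sorted by LASER3 score, so this is the highest-scoring pair
print(first["translation"]["ace_Latn"])
print(first["translation"]["ban_Latn"])
print(first["laser_score"], first["source_sentence_lid"], first["target_sentence_lid"])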
The data is not split. Given the noisy nature of the overall process, we recommend using the data only for training and using other datasets, such as Flores-200, for evaluation. The data includes some development and test sets from other datasets, such as xlsum. In addition, sourcing data from multiple web crawls is likely to produce incidental overlap with other test sets.
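As a sketch of the training-only recommendation, one might drop lower-confidence alignments before training; the thresholds below are purely illustrative and not a recommendation from the dataset authors:

from datasets import load_dataset

# The mined data is exposed as a single split; there is no dev/test split.
train = load_dataset("allenai/nllb", "ace_Latn-ban_Latn", split="train")

# Illustrative confidence filter on the LASER3 margin score and LID probabilities.
train = train.filter(
    lambda ex: ex["laser_score"] >= 1.06
    and ex["source_sentence_lid"] >= 0.9
    and ex["target_sentence_lid"] >= 0.9
)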
Data was filtered based on language identification, emoji-based filtering, and, for some high-resource languages, filtering with a language model. For more details on data filtering, please refer to Section 5.2 of NLLB Team et al. (2022).
Monolingual text was collected from the web and from various monolingual data sets, many of which are themselves web crawls. This text may have been written by people, generated by templates, or in some cases be machine translation output.
Parallel sentences in the monolingual data were identified using LASER3 encoders (Heffernan et al., 2022).
Who are the annotators?
The data was not human annotated.
Data may contain personally identifiable information, sensitive content, or toxic content that was publicly shared on the Internet.
This dataset provides data for training machine learning systems for many languages that have few resources available for NLP.
Biases in the data have not been specifically studied; however, as the original source of the data is the World Wide Web, it is likely that the data exhibits biases similar to those prevalent on the Internet. The data may also exhibit biases introduced by language identification and data filtering techniques; lower-resource languages generally have lower identification accuracy.
Some of the translations are in fact machine translations. While some website machine translation tools are identifiable from HTML source, these tools were not filtered out en masse because raw HTML was not available from some sources and CommonCrawl processing started from WET files.
The data was not curated.
The dataset is released under the terms of ODC-BY. By using this dataset, you are also bound by the respective Terms of Use and License of the original data sources.
Schwenk et al., CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web. ACL 2021. https://aclanthology.org/2021.acl-long.507/
Heffernan et al., Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages. arXiv, 2022. https://arxiv.org/abs/2205.12654
NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv, 2022. https://arxiv.org/abs/2207.04672
We thank the NLLB Meta AI team for open-sourcing the metadata and instructions on how to use it, with special thanks to Bapi Akula, Pierre Andrews, Onur Çelebi, Sergey Edunov, Kenneth Heafield, Philipp Koehn, Alex Mourachko, Safiyyah Saleem, Holger Schwenk, and Guillaume Wenzek. We also thank the AllenNLP team at AI2 for hosting and releasing this data, including Akshita Bhagia (for engineering efforts to host the data and create the Hugging Face dataset) and Jesse Dodge (for organizing the connection).