Dataset:
castorini/afriberta-corpus
This is the corpus on which AfriBERTa was trained. Most of the data comes from the BBC news website, but some languages also include data from Common Crawl.
The AfriBERTa corpus is intended mainly for pre-training language models.
afaanoromoo
amharic
gahuza
hausa
igbo
pidgin
somali
swahili
tigrinya
yoruba
An example to load the train split of the Somali corpus:
from datasets import load_dataset
dataset = load_dataset("castorini/afriberta-corpus", "somali", split="train")
An example to load the test split of the Pidgin corpus:
dataset = load_dataset("castorini/afriberta-corpus", "pidgin", split="test")
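The two calls above differ only in the language name and the split. As a minimal sketch, assuming the ten language names listed in this card are the valid configuration names, a small wrapper can check both arguments before any download starts. The helper `load_afriberta` and the constants `LANGUAGES` and `SPLITS` are hypothetical names introduced here for illustration; they are not part of the dataset or of the `datasets` library.

```python
# Configuration names as listed in this card, one per language.
LANGUAGES = [
    "afaanoromoo", "amharic", "gahuza", "hausa", "igbo",
    "pidgin", "somali", "swahili", "tigrinya", "yoruba",
]
SPLITS = ("train", "test")


def load_afriberta(language, split="train"):
    """Hypothetical helper: validate the arguments, then load one split."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language {language!r}; expected one of {LANGUAGES}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the validation above runs even without the library installed.
    from datasets import load_dataset
    return load_dataset("castorini/afriberta-corpus", language, split=split)
```

With this in place, `load_afriberta("somali")` would behave like the first example above, while a misspelled language name fails immediately with a `ValueError` instead of a failed lookup on the Hub.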
Each data point is a line of text. An example from the Igbo dataset:
{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}
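To make the record shape concrete, the sketch below parses the Igbo example shown above and reads its two keys. It assumes a record is available as a JSON string in exactly that form; the whitespace split is only a rough illustration, not the tokenizer used for AfriBERTa.

```python
import json

# The Igbo example from this card, as a JSON string.
line = '{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}'

record = json.loads(line)
example_id = record["id"]   # the identifier is a string, not an integer
text = record["text"]       # one line of raw text

# Rough whitespace split, for illustration only.
tokens = text.split()
print(example_id, len(tokens))
```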
The data fields are:
id: the identifier of the example, as a string
text: the content of the example, as a string
Each language has a train and test split, with varying sizes.
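Since the split sizes differ by language, it can help to tabulate them before training. The following is a sketch under stated assumptions: `split_sizes` is a hypothetical helper, and its `loader` parameter exists only so the function can be tried with a stand-in instead of downloading data. When no loader is passed, it falls back to `load_dataset` from the `datasets` library.

```python
def split_sizes(languages, loader=None, splits=("train", "test")):
    """Hypothetical helper: return {language: {split: number of examples}}."""
    if loader is None:
        # Default loader: downloads each split from the Hugging Face Hub.
        from datasets import load_dataset

        def loader(language, split):
            return load_dataset("castorini/afriberta-corpus", language, split=split)

    return {
        language: {split: len(loader(language, split)) for split in splits}
        for language in languages
    }
```

Calling `split_sizes(["somali", "pidgin"])` would download both splits of each language and report their lengths, so it is best run once and cached.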
Since the majority of the data is obtained from the BBC's news website, models trained on this dataset are likely to be biased towards the news domain.
Also, because some of the data is obtained from Common Crawl, personal and sensitive information might be present, so care should be taken, especially with text generation models.
@inproceedings{ogueji-etal-2021-small,
title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
author = "Ogueji, Kelechi and
Zhu, Yuxin and
Lin, Jimmy",
booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.mrl-1.11",
pages = "116--126",
}
Thanks to Kelechi Ogueji for adding this dataset.