Dataset:
castorini/afriberta-corpus
This is the corpus on which AfriBERTa was trained. The dataset is mostly from the BBC news website, but some languages also have data from Common Crawl.
The AfriBERTa corpus is intended mainly for pre-training language models.
Languages (also used as configuration names): afaanoromoo, amharic, gahuza, hausa, igbo, pidgin, somali, swahili, tigrinya, yoruba
An example to load the train split of the Somali corpus:
from datasets import load_dataset
dataset = load_dataset("castorini/afriberta-corpus", "somali", split="train")
An example to load the test split of the Pidgin corpus:
dataset = load_dataset("castorini/afriberta-corpus", "pidgin", split="test")
Each data point is a line of text. An example from the Igbo dataset:
{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}
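The data fields are "id" (the identifier of the example) and "text" (the line of text), both strings, as in the example above.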
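Going by the example record above, with its "id" and "text" fields, a minimal sketch of turning records into plain lines of text for pre-training could look like the following. The helper name records_to_lines and the second, whitespace-only record are hypothetical and only there for illustration; the first record is the Igbo example from this card.

```python
import json


def records_to_lines(records):
    """Yield the stripped, non-empty "text" field of each record.

    Hypothetical helper, not part of the dataset or of any library.
    """
    for record in records:
        text = record["text"].strip()
        if text:
            yield text


# The Igbo example shown on this card, parsed from its JSON form.
sample = json.loads(
    '{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}'
)

# A made-up whitespace-only record, to show that empty lines are skipped.
blank = {"id": "7", "text": "   "}

lines = list(records_to_lines([sample, blank]))
print(len(lines))
```

The same generator should work on any iterable of dictionaries with a "text" key, so it could be fed a loaded split directly, assuming the split yields records in the shape shown above.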
Each language has a train and test split, with varying sizes.
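To see how the split sizes vary, one could count the records for every language and split. The sketch below is only an illustration: split_sizes and fake_load_split are hypothetical names, and the loader is passed in as a function so the example runs without downloading anything. The language names are the ones listed on this card.

```python
from typing import Callable, Dict, Iterable, Tuple

# Language names as listed on this card.
LANGUAGES = [
    "afaanoromoo", "amharic", "gahuza", "hausa", "igbo",
    "pidgin", "somali", "swahili", "tigrinya", "yoruba",
]
SPLITS = ["train", "test"]


def split_sizes(
    load_split: Callable[[str, str], Iterable],
) -> Dict[Tuple[str, str], int]:
    """Count the records in each (language, split) pair.

    load_split is any function that returns an iterable of records
    for a given language and split (an assumption of this sketch).
    """
    sizes = {}
    for language in LANGUAGES:
        for split in SPLITS:
            sizes[(language, split)] = sum(1 for _ in load_split(language, split))
    return sizes


def fake_load_split(language: str, split: str):
    """Hypothetical stand-in loader: 3 train records, 1 test record."""
    count = 3 if split == "train" else 1
    return [{"id": str(i), "text": f"{language} {i}"} for i in range(count)]


sizes = split_sizes(fake_load_split)
print(sizes[("somali", "train")], sizes[("pidgin", "test")])
```

With the datasets library installed and network access available, the stand-in could presumably be replaced by a function that calls load_dataset("castorini/afriberta-corpus", language, split=split), as in the loading examples on this card.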
Since the majority of the data is obtained from the BBC's news website, models trained on this dataset are likely to be biased towards the news domain.
Also, because some of the data is obtained from Common Crawl, personal and sensitive information might be present, so care should be taken, especially with text generation models.
@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}
Thanks to Kelechi Ogueji for adding this dataset.