模型:
castorini/afriberta_base
language:
AfriBERTa base is a pretrained multilingual language model with around 111 million parameters. The model has 8 layers, 6 attention heads, 768 hidden units and 3072 feed forward size. The model was pretrained on 11 African languages namely - Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá. The model has been shown to obtain competitive downstream performances on text classification and Named Entity Recognition on several African languages, including those it was not pretrained on.
You can use this model with Transformers for any downstream task. For example, assuming we want to finetune this model on a token classification task, we do the following:
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification >>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_base") >>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_base") # we have to manually set the model max length because it is an imported sentencepiece model, which huggingface does not properly support right now >>> tokenizer.model_max_length = 512Limitations and bias
The model was trained on an aggregation of datasets from the BBC news website and Common Crawl.
For information on training procedures, please refer to the AfriBERTa paper or repository
@inproceedings{ogueji-etal-2021-small, title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages", author = "Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy", booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning", month = nov, year = "2021", address = "Punta Cana, Dominican Republic", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.mrl-1.11", pages = "116--126", }