Dataset:

masakhane/afriqa

Tasks:

Question Answering

Languages:

language:bem

language:fon

Multilinguality:

multilingual

Size:

10K<n<100K

Other:

cross-lingual question-answering qa

License:

cc-by-sa-4.0

Preprint:

arxiv:2305.06897

Dataset Card for AfriQA

Dataset Summary

AfriQA is the first cross-lingual question answering (QA) dataset with a focus on African languages. The dataset includes over 12,000 XOR QA examples across 10 African languages, making it an invaluable resource for developing more equitable QA technology.

Train, validation, and test sets are available for all 10 languages.

Supported Tasks and Leaderboards

question-answering: Performance on this task is measured with F1 and Exact Match accuracy (higher is better for both).
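The card does not say how F1 and Exact Match are computed. The sketch below assumes the common SQuAD-style token-level definitions, with a simplified normalization (lowercasing and ASCII punctuation removal only). The function names `normalize`, `exact_match`, `f1_score`, and `best_over_golds` are illustrative and not part of any official AfriQA evaluation script.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match; a single empty answer counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(metric, prediction: str, golds: list[str]) -> float:
    """Score the prediction against every reference answer and keep the best score."""
    return max(metric(prediction, gold) for gold in golds)
```

Taking the maximum over reference answers follows the usual convention when a question has several acceptable answers. Language-specific normalization (for example, handling of diacritics) is deliberately left out here and may matter for some of the languages.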

Languages

There are 10 languages available:

Bemba (bem)
Fon (fon)
Hausa (hau)
Igbo (ibo)
Kinyarwanda (kin)
Swahili (swa)
Twi (twi)
Wolof (wol)
Yorùbá (yor)
Zulu (zul)

Dataset Structure

Data Instances

Data Format:
id : Question ID
question : Question in African Language
translated_question : Question translated into a pivot language (English/French)
answers : Answer in African Language
lang : Datapoint Language (African Language), e.g. bem
split : Dataset Split
translated_answer : Answer in Pivot Language
translation_type : Translation type of question and answers

{
    "id": 0,
    "question": "Bushe icaalo ca Egypt caali tekwapo ne caalo cimbi?",
    "translated_question": "Has the country of Egypt been colonized before?",
    "answers": "['Emukwai']",
    "lang": "bem",
    "split": "dev",
    "translated_answer": "['yes']",
    "translation_type": "human_translation"
}
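In the example instance, `answers` and `translated_answer` are strings that contain a Python-style list literal rather than real lists. A minimal sketch of decoding them follows, assuming every record uses this encoding. The helper name `parse_answer_list` is hypothetical, and it falls back to treating the field as a single plain answer when it is not a valid literal.

```python
import ast

# The example record from the card, copied verbatim.
record = {
    "id": 0,
    "question": "Bushe icaalo ca Egypt caali tekwapo ne caalo cimbi?",
    "translated_question": "Has the country of Egypt been colonized before?",
    "answers": "['Emukwai']",
    "lang": "bem",
    "split": "dev",
    "translated_answer": "['yes']",
    "translation_type": "human_translation",
}


def parse_answer_list(field: str) -> list[str]:
    """Turn a stringified list such as "['yes']" into a list of strings."""
    try:
        value = ast.literal_eval(field)
    except (ValueError, SyntaxError):
        # Not a Python literal: assume the field holds one plain answer.
        return [field]
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]


answers = parse_answer_list(record["answers"])
translated_answers = parse_answer_list(record["translated_answer"])
print(answers, translated_answers)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for untrusted dataset fields.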

Data Splits

For all languages, there are three splits.

The original splits were named train, dev and test, and they correspond to the train, validation and test splits.

The splits have the following sizes:

Language	train	dev	test
Bemba	502	503	314
Fon	427	428	386
Hausa	435	436	300
Igbo	417	418	409
Kinyarwanda	407	409	347
Swahili	415	417	302
Twi	451	452	490
Wolof	503	504	334
Yorùbá	360	361	332
Zulu	387	388	325
Total	4333	4346	3560
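One way to load a single language and split is through the Hugging Face `datasets` library. The sketch below assumes that each language is exposed as a configuration named by its code (such as `bem`) and that the splits are published under the names train, validation and test. The card confirms neither the configuration names nor the loading arguments, so treat both as assumptions. `SPLIT_ALIASES` and `load_afriqa` are illustrative names.

```python
# Map the original split names to the names used on the Hub, as described in the card.
SPLIT_ALIASES = {"train": "train", "dev": "validation", "test": "test"}


def load_afriqa(lang: str, split: str = "dev"):
    """Load one language and split of masakhane/afriqa.

    Assumes a per-language configuration (for example "bem"); depending on the
    installed `datasets` version, additional arguments may be required.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset("masakhane/afriqa", lang, split=SPLIT_ALIASES[split])


if __name__ == "__main__":
    # Requires network access; prints the first Bemba validation example.
    bemba_dev = load_afriqa("bem", "dev")
    print(bemba_dev[0])
```

Keeping the alias table separate means callers can use either the original name (`dev`) or look up the Hub name without hard-coding it in several places.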

Dataset Creation

Curation Rationale

The dataset was created to provide question-answering resources for 10 languages that are under-served in natural language processing.

[More Information Needed]

Source Data

...

Initial Data Collection and Normalization

...

Who are the source language producers?

...

Annotations

Annotation process

Details can be found here ...

Who are the annotators?

Annotators were recruited from the Masakhane community.

Personal and Sensitive Information

...

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

Users should keep in mind that the dataset only contains news text, which might limit the applicability of the developed systems to other domains.

Additional Information

Dataset Curators

Licensing Information

The licensing status of the data is CC 4.0 Non-Commercial

Citation Information

BibTeX-formatted reference for the dataset:

@misc{ogundepo2023afriqa,
      title={AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages}, 
      author={Odunayo Ogundepo and Tajuddeen R. Gwadabe and Clara E. Rivera and Jonathan H. Clark and Sebastian Ruder and David Ifeoluwa Adelani and Bonaventure F. P. Dossou and Abdou Aziz DIOP and Claytone Sikasote and Gilles Hacheme and Happy Buzaaba and Ignatius Ezeani and Rooweither Mabuya and Salomey Osei and Chris Emezue and Albert Njoroge Kahira and Shamsuddeen H. Muhammad and Akintunde Oladipo and Abraham Toluwase Owodunni and Atnafu Lambebo Tonja and Iyanuoluwa Shode and Akari Asai and Tunde Oluwaseyi Ajayi and Clemencia Siro and Steven Arthur and Mofetoluwa Adeyemi and Orevaoghene Ahia and Aremu Anuoluwapo and Oyinkansola Awosan and Chiamaka Chukwuneke and Bernard Opoku and Awokoya Ayodele and Verrah Otiende and Christine Mwase and Boyd Sinkala and Andre Niyongabo Rubungo and Daniel A. Ajisafe and Emeka Felix Onwuegbuzia and Habib Mbow and Emile Niyomutabazi and Eunice Mukonde and Falalu Ibrahim Lawan and Ibrahim Said Ahmad and Jesujoba O. Alabi and Martin Namukombo and Mbonu Chinedu and Mofya Phiri and Neo Putini and Ndumiso Mngoma and Priscilla A. Amuok and Ruqayya Nasir Iro and Sonia Adhiambo},
      year={2023},
      eprint={2305.06897},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @ToluClassics for adding this dataset.

Author:

masakhane

Dataset size:

11.86 KB