Dataset: tobiolatunji/afrispeech-200
Task: Automatic Speech Recognition
Language: en
Multilinguality: monolingual
Size: 10K<n<100K
Annotation creators: expert-generated
Source datasets: original
License: cc-by-nc-sa-4.0

AFRISPEECH-200 is a 200hr Pan-African speech corpus for clinical and general domain English-accented ASR; a dataset with 120 African accents from 13 countries and 2,463 unique African speakers. Our goal is to raise awareness for and advance Pan-African English ASR research, especially for the clinical domain.
The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared on your local drive in one call using the `load_dataset` function.
```python
from datasets import load_dataset

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "all")
```
The entire dataset is ~120GB and may take about 2hrs to download, depending on internet speed/bandwidth. If you have disk-space or bandwidth limitations, you can use the streaming mode described below to work with smaller subsets of the data.
Alternatively, you can pass a config name to the `load_dataset` function to download only the subset of the data corresponding to a specific accent of interest. For example, to download the `isizulu` config, simply specify the corresponding accent config name. The list of supported accents is provided in the accent list section below:
```python
from datasets import load_dataset

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train")
```
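If you prefer to enumerate the available configs programmatically rather than copying names from the accent table, the `datasets` library exposes a helper for that. A minimal sketch (not part of the original card):

```python
from datasets import get_dataset_config_names

# List the config names registered for the dataset
# (e.g. "all" plus the per-accent configs such as "isizulu").
configs = get_dataset_config_names("tobiolatunji/afrispeech-200")
print(len(configs))
print(configs[:10])
```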
Using the `datasets` library, you can also stream the dataset on-the-fly by adding a `streaming=True` argument to the `load_dataset` function call. Loading a dataset in streaming mode loads individual samples of the dataset one at a time, rather than downloading the entire dataset to disk.
```python
from datasets import load_dataset

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train", streaming=True)

print(next(iter(afrispeech)))
print(list(afrispeech.take(5)))
```
You can also wrap the dataset in a PyTorch DataLoader. With a locally downloaded (non-streaming) dataset, use a batch sampler:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train")
batch_sampler = BatchSampler(RandomSampler(afrispeech), batch_size=32, drop_last=False)
dataloader = DataLoader(afrispeech, batch_sampler=batch_sampler)
```
With a streaming dataset, you can pass it to the DataLoader directly:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train", streaming=True)
dataloader = DataLoader(afrispeech, batch_size=32)
```
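In streaming mode, samples arrive in storage order, so for training you may want approximate shuffling before batching. A minimal sketch, assuming the buffer-based `shuffle` of streaming datasets (the seed and buffer size are illustrative choices, not values from the card):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

afrispeech = load_dataset(
    "tobiolatunji/afrispeech-200", "isizulu", split="train", streaming=True
)

# Buffer-based shuffling: keeps `buffer_size` samples in memory and yields them
# in random order as new samples stream in.
shuffled = afrispeech.shuffle(seed=42, buffer_size=500)

dataloader = DataLoader(shuffled, batch_size=32)
```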
Note that until the end of the ongoing AfriSpeech ASR Challenge (February to May 2023), the transcripts in the validation set are hidden, and the test set will remain unreleased until May 19, 2023.
To walk through a complete Colab tutorial that fine-tunes a wav2vec2 model on the afrispeech-200 dataset with `transformers`, take a look at this Colab notebook: afrispeech/wav2vec2-colab-tutorial.
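Before committing to full fine-tuning, a quick sanity check with an off-the-shelf English checkpoint can be useful. The sketch below is illustrative and not taken from the notebook: it assumes `facebook/wav2vec2-base-960h` as the checkpoint, resamples one clip to the 16 kHz that model expects, and runs greedy CTC decoding.

```python
import torch
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load one accent config and resample the 44.1 kHz audio to 16 kHz on access.
afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train")
afrispeech = afrispeech.cast_column("audio", Audio(sampling_rate=16_000))

# Hypothetical checkpoint choice; the linked tutorial may start from a different one.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

sample = afrispeech[0]
inputs = processor(
    sample["audio"]["array"],
    sampling_rate=sample["audio"]["sampling_rate"],
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = logits.argmax(dim=-1)
print("prediction:", processor.batch_decode(predicted_ids)[0])
print("reference: ", sample["transcript"])
```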
English (Accented)
A typical data point comprises the path to the audio file, called `path`, and its transcription, called `transcript`. Some additional information about the speaker is provided.
```python
{
    'speaker_id': 'b545a4ca235a7b72688a1c0b3eb6bde6',
    'path': 'aad9bd69-7ca0-4db1-b650-1eeea17a0153/5dcb6ee086e392376cd3b7131a250397.wav',
    'audio_id': 'aad9bd69-7ca0-4db1-b650-1eeea17a0153/5dcb6ee086e392376cd3b7131a250397',
    'audio': {
        'path': 'aad9bd69-7ca0-4db1-b650-1eeea17a0153/5dcb6ee086e392376cd3b7131a250397.wav',
        'array': array([0.00018311, 0.00061035, 0.00012207, ..., 0.00192261, 0.00195312, 0.00216675]),
        'sampling_rate': 44100
    },
    'transcript': 'His mother is in her 50 s and has hypertension .',
    'age_group': '26-40',
    'gender': 'Male',
    'accent': 'yoruba',
    'domain': 'clinical',
    'country': 'US',
    'duration': 3.241995464852608
}
```
speaker_id: An id identifying the speaker (voice) who made the recording
path: The path to the audio file
audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column, `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column, i.e. `dataset[0]["audio"]` should always be preferred over `dataset["audio"][0]` (see the sketch after this field list).
transcript: The sentence the user was prompted to speak
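A minimal sketch of the access pattern described above, with an explicit resampling step via `cast_column` (the 16 kHz target is just an example; pick whatever rate your model expects):

```python
from datasets import load_dataset, Audio

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train")

# Index the sample first, then read "audio": each clip is decoded lazily on access.
clip = afrispeech[0]["audio"]
print(clip["sampling_rate"], clip["array"].shape)  # 44100 by default

# Resample on access, e.g. to 16 kHz for typical ASR models.
afrispeech = afrispeech.cast_column("audio", Audio(sampling_rate=16_000))
print(afrispeech[0]["audio"]["sampling_rate"])  # 16000
```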
The speech material has been subdivided into portions for train, dev, and test.
Speech was recorded in a quiet environment with a high-quality microphone; speakers were asked to read one sentence at a time.
| | Train | Dev | Test |
|---|---|---|---|
| # Speakers | 1466 | 247 | 750 |
| # Seconds | 624228.83 | 31447.09 | 67559.10 |
| # Hours | 173.4 | 8.74 | 18.77 |
| # Accents | 71 | 45 | 108 |
| Avg secs/speaker | 425.81 | 127.32 | 90.08 |
| Avg num clips/speaker | 39.56 | 13.08 | 8.46 |
| Avg num speakers/accent | 20.65 | 5.49 | 6.94 |
| Avg secs/accent | 8791.96 | 698.82 | 625.55 |
| # clips general domain | 21682 | 1407 | 2723 |
| # clips clinical domain | 36318 | 1824 | 3623 |
Africa has a very low doctor-to-patient ratio. At very busy clinics, doctors can see 30+ patients per day, a heavy patient burden compared with developed countries, yet productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. In contrast, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the performance of recent general-domain ASR approaches human accuracy. However, several gaps remain: multiple publications have highlighted racial bias in speech-to-text algorithms, and performance on minority accents lags significantly. To our knowledge, there is no publicly available research or benchmark on accented African clinical ASR, and speech data is non-existent for the majority of African accents. We release AfriSpeech, 200hrs of Pan-African speech (67,577 clips from 2,463 unique speakers, across 120 indigenous accents from 13 countries) for clinical and general-domain ASR, together with a benchmark test set and publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.
Country | Clips | Speakers | Duration (seconds) | Duration (hrs) |
---|---|---|---|---|
NG | 45875 | 1979 | 512646.88 | 142.40 |
KE | 8304 | 137 | 75195.43 | 20.89 |
ZA | 7870 | 223 | 81688.11 | 22.69 |
GH | 2018 | 37 | 18581.13 | 5.16 |
BW | 1391 | 38 | 14249.01 | 3.96 |
UG | 1092 | 26 | 10420.42 | 2.89 |
RW | 469 | 9 | 5300.99 | 1.47 |
US | 219 | 5 | 1900.98 | 0.53 |
TR | 66 | 1 | 664.01 | 0.18 |
ZW | 63 | 3 | 635.11 | 0.18 |
MW | 60 | 1 | 554.61 | 0.15 |
TZ | 51 | 2 | 645.51 | 0.18 |
LS | 7 | 1 | 78.40 | 0.02 |
Accent | Clips | Speakers | Duration (s) | Country | Splits |
---|---|---|---|---|---|
yoruba | 15407 | 683 | 161587.55 | US,NG | train,test,dev |
igbo | 8677 | 374 | 93035.79 | US,NG,ZA | train,test,dev |
swahili | 6320 | 119 | 55932.82 | KE,TZ,ZA,UG | train,test,dev |
hausa | 5765 | 248 | 70878.67 | NG | train,test,dev |
ijaw | 2499 | 105 | 33178.9 | NG | train,test,dev |
afrikaans | 2048 | 33 | 20586.49 | ZA | train,test,dev |
idoma | 1877 | 72 | 20463.6 | NG | train,test,dev |
zulu | 1794 | 52 | 18216.97 | ZA,TR,LS | dev,train,test |
setswana | 1588 | 39 | 16553.22 | BW,ZA | dev,test,train |
twi | 1566 | 22 | 14340.12 | GH | test,train,dev |
isizulu | 1048 | 48 | 10376.09 | ZA | test,train,dev |
igala | 919 | 31 | 9854.72 | NG | train,test |
izon | 838 | 47 | 9602.53 | NG | train,dev,test |
kiswahili | 827 | 6 | 8988.26 | KE | train,test |
ebira | 757 | 42 | 7752.94 | NG | train,test,dev |
luganda | 722 | 22 | 6768.19 | UG,BW,KE | test,dev,train |
urhobo | 646 | 32 | 6685.12 | NG | train,dev,test |
nembe | 578 | 16 | 6644.72 | NG | train,test,dev |
ibibio | 570 | 39 | 6489.29 | NG | train,test,dev |
pidgin | 514 | 20 | 5871.57 | NG | test,train,dev |
luhya | 508 | 4 | 4497.02 | KE | train,test |
kinyarwanda | 469 | 9 | 5300.99 | RW | train,test,dev |
xhosa | 392 | 12 | 4604.84 | ZA | train,dev,test |
tswana | 387 | 18 | 4148.58 | ZA,BW | train,test,dev |
esan | 380 | 13 | 4162.63 | NG | train,test,dev |
alago | 363 | 8 | 3902.09 | NG | train,test |
tshivenda | 353 | 5 | 3264.77 | ZA | test,train |
fulani | 312 | 18 | 5084.32 | NG | test,train |
isoko | 298 | 16 | 4236.88 | NG | train,test,dev |
akan (fante) | 295 | 9 | 2848.54 | GH | train,dev,test |
ikwere | 293 | 14 | 3480.43 | NG | test,train,dev |
sepedi | 275 | 10 | 2751.68 | ZA | dev,test,train |
efik | 269 | 11 | 2559.32 | NG | test,train,dev |
edo | 237 | 12 | 1842.32 | NG | train,test,dev |
luo | 234 | 4 | 2052.25 | UG,KE | test,train,dev |
kikuyu | 229 | 4 | 1949.62 | KE | train,test,dev |
bekwarra | 218 | 3 | 2000.46 | NG | train,test |
isixhosa | 210 | 9 | 2100.28 | ZA | train,dev,test |
hausa/fulani | 202 | 3 | 2213.53 | NG | test,train |
epie | 202 | 6 | 2320.21 | NG | train,test |
isindebele | 198 | 2 | 1759.49 | ZA | train,test |
venda and xitsonga | 188 | 2 | 2603.75 | ZA | train,test |
sotho | 182 | 4 | 2082.21 | ZA | dev,test,train |
akan | 157 | 6 | 1392.47 | GH | test,train |
nupe | 156 | 9 | 1608.24 | NG | dev,train,test |
anaang | 153 | 8 | 1532.56 | NG | test,dev |
english | 151 | 11 | 2445.98 | NG | dev,test |
afemai | 142 | 2 | 1877.04 | NG | train,test |
shona | 138 | 8 | 1419.98 | ZA,ZW | test,train,dev |
eggon | 137 | 5 | 1833.77 | NG | test |
luganda and kiswahili | 134 | 1 | 1356.93 | UG | train |
ukwuani | 133 | 7 | 1269.02 | NG | test |
sesotho | 132 | 10 | 1397.16 | ZA | train,dev,test |
benin | 124 | 4 | 1457.48 | NG | train,test |
kagoma | 123 | 1 | 1781.04 | NG | train |
nasarawa eggon | 120 | 1 | 1039.99 | NG | train |
tiv | 120 | 14 | 1084.52 | NG | train,test,dev |
south african english | 119 | 2 | 1643.82 | ZA | train,test |
borana | 112 | 1 | 1090.71 | KE | train |
swahili ,luganda ,arabic | 109 | 1 | 929.46 | UG | train |
ogoni | 109 | 4 | 1629.7 | NG | train,test |
mada | 109 | 2 | 1786.26 | NG | test |
bette | 106 | 4 | 930.16 | NG | train,test |
berom | 105 | 4 | 1272.99 | NG | dev,test |
bini | 104 | 4 | 1499.75 | NG | test |
ngas | 102 | 3 | 1234.16 | NG | train,test |
etsako | 101 | 4 | 1074.53 | NG | train,test |
okrika | 100 | 3 | 1887.47 | NG | train,test |
venda | 99 | 2 | 938.14 | ZA | train,test |
siswati | 96 | 5 | 1367.45 | ZA | dev,train,test |
damara | 92 | 1 | 674.43 | NG | train |
yoruba, hausa | 89 | 5 | 928.98 | NG | test |
southern sotho | 89 | 1 | 889.73 | ZA | train |
kanuri | 86 | 7 | 1936.78 | NG | test,dev |
itsekiri | 82 | 3 | 778.47 | NG | test,dev |
ekpeye | 80 | 2 | 922.88 | NG | test |
mwaghavul | 78 | 2 | 738.02 | NG | test |
bajju | 72 | 2 | 758.16 | NG | test |
luo, swahili | 71 | 1 | 616.57 | KE | train |
dholuo | 70 | 1 | 669.07 | KE | train |
ekene | 68 | 1 | 839.31 | NG | test |
jaba | 65 | 2 | 540.66 | NG | test |
ika | 65 | 4 | 576.56 | NG | test,dev |
angas | 65 | 1 | 589.99 | NG | test |
ateso | 63 | 1 | 624.28 | UG | train |
brass | 62 | 2 | 900.04 | NG | test |
ikulu | 61 | 1 | 313.2 | NG | test |
eleme | 60 | 2 | 1207.92 | NG | test |
chichewa | 60 | 1 | 554.61 | MW | train |
oklo | 58 | 1 | 871.37 | NG | test |
meru | 58 | 2 | 865.07 | KE | train,test |
agatu | 55 | 1 | 369.11 | NG | test |
okirika | 54 | 1 | 792.65 | NG | test |
igarra | 54 | 1 | 562.12 | NG | test |
ijaw(nembe) | 54 | 2 | 537.56 | NG | test |
khana | 51 | 2 | 497.42 | NG | test |
ogbia | 51 | 4 | 461.15 | NG | test,dev |
gbagyi | 51 | 4 | 693.43 | NG | test |
portuguese | 50 | 1 | 525.02 | ZA | train |
delta | 49 | 2 | 425.76 | NG | test |
bassa | 49 | 1 | 646.13 | NG | test |
etche | 49 | 1 | 637.48 | NG | test |
kubi | 46 | 1 | 495.21 | NG | test |
jukun | 44 | 2 | 362.12 | NG | test |
igbo and yoruba | 43 | 2 | 466.98 | NG | test |
urobo | 43 | 3 | 573.14 | NG | test |
kalabari | 42 | 5 | 305.49 | NG | test |
ibani | 42 | 1 | 322.34 | NG | test |
obolo | 37 | 1 | 204.79 | NG | test |
idah | 34 | 1 | 533.5 | NG | test |
bassa-nge/nupe | 31 | 3 | 267.42 | NG | test,dev |
yala mbembe | 29 | 1 | 237.27 | NG | test |
eket | 28 | 1 | 238.85 | NG | test |
afo | 26 | 1 | 171.15 | NG | test |
ebiobo | 25 | 1 | 226.27 | NG | test |
nyandang | 25 | 1 | 230.41 | NG | test |
ishan | 23 | 1 | 194.12 | NG | test |
bagi | 20 | 1 | 284.54 | NG | test |
estako | 20 | 1 | 480.78 | NG | test |
gerawa | 13 | 1 | 342.15 | NG | test |
[Needs More Information]
Who are the source language producers?
[Needs More Information]
[Needs More Information]
Who are the annotators?
[Needs More Information]
The dataset consists of recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in this dataset.
[More Information Needed]
[More Information Needed]
The dataset is provided for research purposes only. Please check the dataset license for additional information.
The dataset was initially prepared by Intron and refined for public release by CLAIR Lab.
Public Domain, Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)
[More Information Needed]
Thanks to @tobiolatunji for adding this dataset.