数据集:

tobiolatunji/afrispeech-200

任务:

自动语音识别

语言:

计算机处理:

monolingual

大小:

10K<n<100K

语言创建人:

crowdsourced expert-generated

批注创建人:

expert-generated

源数据集:

original

许可:

cc-by-nc-sa-4.0

数据集介绍文件清单

中文

Dataset Card for AfriSpeech-200

Dataset Summary

AFRISPEECH-200 is a 200hr Pan-African speech corpus for clinical and general domain English accented ASR; a dataset with 120 African accents from 13 countries and 2,463 unique African speakers. Our goal is to raise awareness for and advance Pan-African English ASR research, especially for the clinical domain.

How to use

The datasets library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the load_dataset function.

from datasets import load_dataset

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "all")

The entire dataset is ~120GB and may take about 2hrs to download depending on internet speed/bandwidth. If you have disk space or bandwidth limitations, you can use streaming mode described below to work with smaller subsets of the data.

Alterntively you are able to pass a config to the load_dataset function and download only a subset of the data corresponding to a specific accent of interest. The example provided below is isizulu .

For example, to download the isizulu config, simply specify the corresponding accent config name. The list of supported accents is provided in the accent list section below:

from datasets import load_dataset

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train")

Using the datasets library, you can also stream the dataset on-the-fly by adding a streaming=True argument to the load_dataset function call. Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk.

from datasets import load_dataset

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train", streaming=True)

print(next(iter(afrispeech)))
print(list(afrispeech.take(5)))

Local

from datasets import load_dataset
from torch.utils.data.sampler import BatchSampler, RandomSampler

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train")
batch_sampler = BatchSampler(RandomSampler(afrispeech), batch_size=32, drop_last=False)
dataloader = DataLoader(afrispeech, batch_sampler=batch_sampler)

Streaming

from datasets import load_dataset
from torch.utils.data import DataLoader

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train", streaming=True)
dataloader = DataLoader(afrispeech, batch_size=32)

Caveats

Note that till the end of the ongoing AfriSpeech ASR Challenge event (Feb - May 2023), the transcripts in the validation set are hidden and the test set will be unreleased till May 19, 2023.

Fine-tuning Colab tutorial

To walk through a complete colab tutorial that finetunes a wav2vec2 model on the afrispeech-200 dataset with transformers , take a look at this colab notebook afrispeech/wav2vec2-colab-tutorial .

Supported Tasks and Leaderboards

Automatic Speech Recognition
Speech Synthesis (Text-to-Speech)

Languages

English (Accented)

Dataset Structure

Data Instances

A typical data point comprises the path to the audio file, called path and its transcription, called transcript . Some additional information about the speaker is provided.

{
    'speaker_id': 'b545a4ca235a7b72688a1c0b3eb6bde6', 
    'path': 'aad9bd69-7ca0-4db1-b650-1eeea17a0153/5dcb6ee086e392376cd3b7131a250397.wav', 
    'audio_id': 'aad9bd69-7ca0-4db1-b650-1eeea17a0153/5dcb6ee086e392376cd3b7131a250397',
    'audio': {
        'path': 'aad9bd69-7ca0-4db1-b650-1eeea17a0153/5dcb6ee086e392376cd3b7131a250397.wav', 
        'array': array([0.00018311, 0.00061035, 0.00012207, ..., 0.00192261, 0.00195312, 0.00216675]), 
        'sampling_rate': 44100}, 
    'transcript': 'His mother is in her 50 s and has hypertension .', 
    'age_group': '26-40', 
    'gender': 'Male', 
    'accent': 'yoruba', 
    'domain': 'clinical', 
    'country': 'US', 
    'duration': 3.241995464852608
}

Data Fields

speaker_id: An id for which speaker (voice) made the recording
path: The path to the audio file
audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: dataset[0]["audio"] the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate . Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the "audio" column, i.e. dataset[0]["audio"] should always be preferred over dataset["audio"][0] .
transcript: The sentence the user was prompted to speak

Data Splits

The speech material has been subdivided into portions for train, dev, and test.

Speech was recorded in a quiet environment with high quality microphone, speakers were asked to read one sentence at a time.

Total Number of Unique Speakers: 2,463
Female/Male/Other Ratio: 57.11/42.41/0.48
Data was first split on speakers. Speakers in Train/Dev/Test do not cross partitions

Train	Dev	Test
# Speakers	1466	247	750
# Seconds	624228.83	31447.09	67559.10
# Hours	173.4	8.74	18.77
# Accents	71	45	108
Avg secs/speaker	425.81	127.32	90.08
Avg num clips/speaker	39.56	13.08	8.46
Avg num speakers/accent	20.65	5.49	6.94
Avg secs/accent	8791.96	698.82	625.55
# clips general domain	21682	1407	2723
# clips clinical domain	36318	1824	3623

Dataset Creation

Curation Rationale

Africa has a very low doctor-to-patient ratio. At very busy clinics, doctors could see 30+ patients per day-- a heavy patient burden compared with developed countries-- but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. However, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the recent performance of general domain ASR is approaching human accuracy. However, several gaps exist. Several publications have highlighted racial bias with speech-to-text algorithms and performance on minority accents lags significantly. To our knowledge, there is no publicly available research or benchmark on accented African clinical ASR, and speech data is non-existent for the majority of African accents. We release AfriSpeech, 200hrs of Pan-African speech, 67,577 clips from 2,463 unique speakers, across 120 indigenous accents from 13 countries for clinical and general domain ASR, a benchmark test set, with publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.

Source Data

Country Stats

Country	Clips	Speakers	Duration (seconds)	Duration (hrs)
NG	45875	1979	512646.88	142.40
KE	8304	137	75195.43	20.89
ZA	7870	223	81688.11	22.69
GH	2018	37	18581.13	5.16
BW	1391	38	14249.01	3.96
UG	1092	26	10420.42	2.89
RW	469	9	5300.99	1.47
US	219	5	1900.98	0.53
TR	66	1	664.01	0.18
ZW	63	3	635.11	0.18
MW	60	1	554.61	0.15
TZ	51	2	645.51	0.18
LS	7	1	78.40	0.02

Accent Stats

Accent	Clips	Speakers	Duration (s)	Country	Splits
yoruba	15407	683	161587.55	US,NG	train,test,dev
igbo	8677	374	93035.79	US,NG,ZA	train,test,dev
swahili	6320	119	55932.82	KE,TZ,ZA,UG	train,test,dev
hausa	5765	248	70878.67	NG	train,test,dev
ijaw	2499	105	33178.9	NG	train,test,dev
afrikaans	2048	33	20586.49	ZA	train,test,dev
idoma	1877	72	20463.6	NG	train,test,dev
zulu	1794	52	18216.97	ZA,TR,LS	dev,train,test
setswana	1588	39	16553.22	BW,ZA	dev,test,train
twi	1566	22	14340.12	GH	test,train,dev
isizulu	1048	48	10376.09	ZA	test,train,dev
igala	919	31	9854.72	NG	train,test
izon	838	47	9602.53	NG	train,dev,test
kiswahili	827	6	8988.26	KE	train,test
ebira	757	42	7752.94	NG	train,test,dev
luganda	722	22	6768.19	UG,BW,KE	test,dev,train
urhobo	646	32	6685.12	NG	train,dev,test
nembe	578	16	6644.72	NG	train,test,dev
ibibio	570	39	6489.29	NG	train,test,dev
pidgin	514	20	5871.57	NG	test,train,dev
luhya	508	4	4497.02	KE	train,test
kinyarwanda	469	9	5300.99	RW	train,test,dev
xhosa	392	12	4604.84	ZA	train,dev,test
tswana	387	18	4148.58	ZA,BW	train,test,dev
esan	380	13	4162.63	NG	train,test,dev
alago	363	8	3902.09	NG	train,test
tshivenda	353	5	3264.77	ZA	test,train
fulani	312	18	5084.32	NG	test,train
isoko	298	16	4236.88	NG	train,test,dev
akan (fante)	295	9	2848.54	GH	train,dev,test
ikwere	293	14	3480.43	NG	test,train,dev
sepedi	275	10	2751.68	ZA	dev,test,train
efik	269	11	2559.32	NG	test,train,dev
edo	237	12	1842.32	NG	train,test,dev
luo	234	4	2052.25	UG,KE	test,train,dev
kikuyu	229	4	1949.62	KE	train,test,dev
bekwarra	218	3	2000.46	NG	train,test
isixhosa	210	9	2100.28	ZA	train,dev,test
hausa/fulani	202	3	2213.53	NG	test,train
epie	202	6	2320.21	NG	train,test
isindebele	198	2	1759.49	ZA	train,test
venda and xitsonga	188	2	2603.75	ZA	train,test
sotho	182	4	2082.21	ZA	dev,test,train
akan	157	6	1392.47	GH	test,train
nupe	156	9	1608.24	NG	dev,train,test
anaang	153	8	1532.56	NG	test,dev
english	151	11	2445.98	NG	dev,test
afemai	142	2	1877.04	NG	train,test
shona	138	8	1419.98	ZA,ZW	test,train,dev
eggon	137	5	1833.77	NG	test
luganda and kiswahili	134	1	1356.93	UG	train
ukwuani	133	7	1269.02	NG	test
sesotho	132	10	1397.16	ZA	train,dev,test
benin	124	4	1457.48	NG	train,test
kagoma	123	1	1781.04	NG	train
nasarawa eggon	120	1	1039.99	NG	train
tiv	120	14	1084.52	NG	train,test,dev
south african english	119	2	1643.82	ZA	train,test
borana	112	1	1090.71	KE	train
swahili ,luganda ,arabic	109	1	929.46	UG	train
ogoni	109	4	1629.7	NG	train,test
mada	109	2	1786.26	NG	test
bette	106	4	930.16	NG	train,test
berom	105	4	1272.99	NG	dev,test
bini	104	4	1499.75	NG	test
ngas	102	3	1234.16	NG	train,test
etsako	101	4	1074.53	NG	train,test
okrika	100	3	1887.47	NG	train,test
venda	99	2	938.14	ZA	train,test
siswati	96	5	1367.45	ZA	dev,train,test
damara	92	1	674.43	NG	train
yoruba, hausa	89	5	928.98	NG	test
southern sotho	89	1	889.73	ZA	train
kanuri	86	7	1936.78	NG	test,dev
itsekiri	82	3	778.47	NG	test,dev
ekpeye	80	2	922.88	NG	test
mwaghavul	78	2	738.02	NG	test
bajju	72	2	758.16	NG	test
luo, swahili	71	1	616.57	KE	train
dholuo	70	1	669.07	KE	train
ekene	68	1	839.31	NG	test
jaba	65	2	540.66	NG	test
ika	65	4	576.56	NG	test,dev
angas	65	1	589.99	NG	test
ateso	63	1	624.28	UG	train
brass	62	2	900.04	NG	test
ikulu	61	1	313.2	NG	test
eleme	60	2	1207.92	NG	test
chichewa	60	1	554.61	MW	train
oklo	58	1	871.37	NG	test
meru	58	2	865.07	KE	train,test
agatu	55	1	369.11	NG	test
okirika	54	1	792.65	NG	test
igarra	54	1	562.12	NG	test
ijaw(nembe)	54	2	537.56	NG	test
khana	51	2	497.42	NG	test
ogbia	51	4	461.15	NG	test,dev
gbagyi	51	4	693.43	NG	test
portuguese	50	1	525.02	ZA	train
delta	49	2	425.76	NG	test
bassa	49	1	646.13	NG	test
etche	49	1	637.48	NG	test
kubi	46	1	495.21	NG	test
jukun	44	2	362.12	NG	test
igbo and yoruba	43	2	466.98	NG	test
urobo	43	3	573.14	NG	test
kalabari	42	5	305.49	NG	test
ibani	42	1	322.34	NG	test
obolo	37	1	204.79	NG	test
idah	34	1	533.5	NG	test
bassa-nge/nupe	31	3	267.42	NG	test,dev
yala mbembe	29	1	237.27	NG	test
eket	28	1	238.85	NG	test
afo	26	1	171.15	NG	test
ebiobo	25	1	226.27	NG	test
nyandang	25	1	230.41	NG	test
ishan	23	1	194.12	NG	test
bagi	20	1	284.54	NG	test
estako	20	1	480.78	NG	test
gerawa	13	1	342.15	NG	test

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

Dataset provided for research purposes only. Please check dataset license for additional information.

Additional Information

Dataset Curators

The dataset was initially prepared by Intron and refined for public release by CLAIR Lab.

Licensing Information

Public Domain, Creative Commons Attribution NonCommercial ShareAlike v4.0 ( CC BY-NC-SA 4.0 )

Citation Information

[More Information Needed]

Contributions

Thanks to @tobiolatunji for adding this dataset.

作者:

tobiolatunji

数据集大小:

15.01 GB