数据集:
anton-l/superb
SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data.
The SUPERB leaderboard can be found here https://superbbenchmark.org/leaderboard and consists of the following tasks:
prPhoneme Recognition (PR) transcribes an utterance into the smallest content units. This task includes alignment modeling to avoid potentially inaccurate forced alignment. LibriSpeech train-clean-100/dev-clean/test-clean subsets are adopted in SUPERB for training/validation/testing. Phoneme transcriptions are obtained from the LibriSpeech official g2p-model-5 and the conversion script in Kaldi librispeech s5 recipe. The evaluation metric is phone error rate (PER).
asrAutomatic Speech Recognition (ASR) transcribes utterances into words. While PR analyzes the improvement in modeling phonetics, ASR reflects the significance of the improvement in a real-world scenario. LibriSpeech train-clean-100/devclean/test-clean subsets are used for training/validation/testing. The evaluation metric is word error rate (WER).
ksKeyword Spotting (KS) detects preregistered keywords by classifying utterances into a predefined set of words. The task is usually performed on-device for the fast response time. Thus, accuracy, model size, and inference time are all crucial. SUPERB uses the widely used Speech Commands dataset v1.0 for the task. The dataset consists of ten classes of keywords, a class for silence, and an unknown class to include the false positive. The evaluation metric is accuracy (ACC)
Example of usage:Use these auxillary functions to:
For other examples of handling long _silence_ clips see the S3PRL or TFDS implementations.
def map_to_array(example):
import soundfile as sf
speech_array, sample_rate = sf.read(example["file"])
example["speech"] = speech_array
example["sample_rate"] = sample_rate
return example
def sample_noise(example):
# Use this function to extract random 1 sec slices of each _silence_ utterance,
# e.g. inside `torch.utils.data.Dataset.__getitem__()`
from random import randint
if example["label"] == "_silence_":
random_offset = randint(0, len(example["speech"]) - example["sample_rate"] - 1)
example["speech"] = example["speech"][random_offset : random_offset + example["sample_rate"]]
return example
Query by Example Spoken Term Detection (QbE) detects a spoken term (query) in an audio database (documents) by binary discriminating a given pair of query and document into a match or not. The English subset in QUESST 2014 challenge is adopted since we focus on investigating English as the first step. The evaluation metric is maximum term weighted value (MTWV) which balances misses and false alarms.
icIntent Classification (IC) classifies utterances into predefined classes to determine the intent of speakers. SUPERB uses the Fluent Speech Commands dataset , where each utterance is tagged with three intent labels: action, object, and location. The evaluation metric is accuracy (ACC).
sfSlot Filling (SF) predicts a sequence of semantic slot-types from an utterance, like a slot-type FromLocation for a spoken word Taipei, which is known as a slot-value. Both slot-types and slot-values are essential for an SLU system to function. The evaluation metrics thus include slot-type F1 score and slotvalue CER. Audio SNIPS is adopted, which synthesized multi-speaker utterances for SNIPS. Following the standard split in SNIPS, US-accent speakers are further selected for training, and others are for validation/testing.
siSpeaker Identification (SI) classifies each utterance for its speaker identity as a multi-class classification, where speakers are in the same predefined set for both training and testing. The widely used VoxCeleb1 dataset is adopted, and the evaluation metric is accuracy (ACC).
asvAutomatic Speaker Verification (ASV) verifies whether the speakers of a pair of utterances match as a binary classification, and speakers in the testing set may not appear in the training set. Thus, ASV is more challenging than SID. VoxCeleb1 is used without VoxCeleb2 training data and noise augmentation. The evaluation metric is equal error rate (EER).
sdSpeaker Diarization (SD) predicts who is speaking when for each timestamp, and multiple speakers can speak simultaneously. The model has to encode rich speaker characteristics for each frame and should be able to represent mixtures of signals. LibriMix is adopted where LibriSpeech train-clean-100/dev-clean/test-clean are used to generate mixtures for training/validation/testing. We focus on the two-speaker scenario as the first step. The time-coded speaker labels were generated using alignments from Kaldi LibriSpeech ASR model. The evaluation metric is diarization error rate (DER).
Example of usageUse these auxiliary functions to:
def load_audio_file(example, frame_shift=160):
import soundfile as sf
example["array"], example["sample_rate"] = sf.read(
example["file"], start=example["start"] * frame_shift, stop=example["end"] * frame_shift
)
return example
def generate_label(example, frame_shift=160, num_speakers=2, rate=16000):
import numpy as np
start = example["start"]
end = example["end"]
frame_num = end - start
speakers = sorted({speaker["speaker_id"] for speaker in example["speakers"]})
label = np.zeros((frame_num, num_speakers), dtype=np.int32)
for speaker in example["speakers"]:
speaker_index = speakers.index(speaker["speaker_id"])
start_frame = np.rint(speaker["start"] * rate / frame_shift).astype(int)
end_frame = np.rint(speaker["end"] * rate / frame_shift).astype(int)
rel_start = rel_end = None
if start <= start_frame < end:
rel_start = start_frame - start
if start < end_frame <= end:
rel_end = end_frame - start
if rel_start is not None or rel_end is not None:
label[rel_start:rel_end, speaker_index] = 1
example["label"] = label
return example
Emotion Recognition (ER) predicts an emotion class for each utterance. The most widely used ER dataset IEMOCAP is adopted, and we follow the conventional evaluation protocol: we drop the unbalance emotion classes to leave the final four classes with a similar amount of data points and cross-validates on five folds of the standard splits. The evaluation metric is accuracy (ACC).
The language data in SUPERB is in English (BCP-47 en )
An example from each split looks like:
{'chapter_id': 1240,
'file': 'path/to/file.flac',
'audio': {'path': 'path/to/file.flac',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 16000},
'id': '103-1240-0000',
'speaker_id': 103,
'text': 'CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE '
'LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE '
'HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A '
'BROOK'}
An example from each split looks like:
{
'file': '/path/yes/af7a8296_nohash_1.wav',
'audio': {'path': '/path/yes/af7a8296_nohash_1.wav',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 16000},
'label': 0 # 'yes'
}
{
'file': "/path/wavs/speakers/2BqVo8kVB2Skwgyb/063aa8f0-4479-11e9-a9a5-5dbec3b8816a.wav",
'audio': {'path': '/path/wavs/speakers/2BqVo8kVB2Skwgyb/063aa8f0-4479-11e9-a9a5-5dbec3b8816a.wav',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 16000},
'speaker_id': '2BqVo8kVB2Skwgyb',
'text': 'Turn the bedroom lights off',
'action': 3, # 'deactivate'
'object': 7, # 'lights'
'location': 0 # 'bedroom'
}
{
'file': '/path/wav/id10003/na8-QEFmj44/00003.wav',
'audio': {'path': '/path/wav/id10003/na8-QEFmj44/00003.wav',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 16000},
'label': 2 # 'id10003'
}
An example from each split looks like:
{
'record_id': '1578-6379-0038_6415-111615-0009',
'file': 'path/to/file.wav',
'audio': {'path': 'path/to/file.wav',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 16000},
'start': 0,
'end': 1590,
'speakers': [
{'speaker_id': '1578', 'start': 28, 'end': 657},
{'speaker_id': '6415', 'start': 28, 'end': 1576}
]
}
####Note abouth the audio fields
When accessing the audio column: dataset[0]["audio"] the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate . Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the "audio" column, i.e. dataset[0]["audio"] should always be preferred over dataset["audio"][0] .
pr asrThe data fields in all splits are:
train | validation | test | |
---|---|---|---|
asr | 28539 | 2703 | 2620 |
train | validation | test | |
---|---|---|---|
ks | 51094 | 6798 | 3081 |
train | validation | test | |
---|---|---|---|
ic | 23132 | 3118 | 3793 |
train | validation | test | |
---|---|---|---|
si | 138361 | 6904 | 8251 |
The data is split into "train", "dev" and "test" sets, each containing the following number of examples:
train | dev | test | |
---|---|---|---|
sd | 13901 | 3014 | 3002 |
The data is split into 5 sets intended for 5-fold cross-validation:
session1 | session2 | session3 | session4 | session5 | |
---|---|---|---|---|---|
er | 1085 | 1023 | 1151 | 1031 | 1241 |
[More Information Needed]
[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
@article{DBLP:journals/corr/abs-2105-01051,
author = {Shu{-}Wen Yang and
Po{-}Han Chi and
Yung{-}Sung Chuang and
Cheng{-}I Jeff Lai and
Kushal Lakhotia and
Yist Y. Lin and
Andy T. Liu and
Jiatong Shi and
Xuankai Chang and
Guan{-}Ting Lin and
Tzu{-}Hsien Huang and
Wei{-}Cheng Tseng and
Ko{-}tik Lee and
Da{-}Rong Liu and
Zili Huang and
Shuyan Dong and
Shang{-}Wen Li and
Shinji Watanabe and
Abdelrahman Mohamed and
Hung{-}yi Lee},
title = {{SUPERB:} Speech processing Universal PERformance Benchmark},
journal = {CoRR},
volume = {abs/2105.01051},
year = {2021},
url = {https://arxiv.org/abs/2105.01051},
archivePrefix = {arXiv},
eprint = {2105.01051},
timestamp = {Thu, 01 Jul 2021 13:30:22 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2105-01051.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Note that each SUPERB dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
Thanks to @lewtun , @albertvillanova and @anton-l for adding this dataset.