Dataset:
edinburghcstr/ami
The AMI Meeting Corpus consists of 100 hours of meeting recordings. The recordings use a range of signals synchronized to a common timeline. These include close-talking and far-field microphones, individual and room-view video cameras, and output from a slide projector and an electronic whiteboard. During the meetings, the participants also have unsynchronized pens available to them that record what is written. The meetings were recorded in English using three different rooms with different acoustic properties, and include mostly non-native speakers.
Note: This dataset corresponds to the data processing of KALDI's AMI S5 recipe. This means the text is normalized and the audio data is chunked according to the scripts of that recipe. To make the user experience as simple as possible, we provide the already chunked data here so that the following can be done:

    from datasets import load_dataset

    ds = load_dataset("edinburghcstr/ami", "ihm")
    print(ds)

gives:

    DatasetDict({
        train: Dataset({
            features: ['meeting_id', 'audio_id', 'text', 'audio', 'begin_time', 'end_time', 'microphone_id', 'speaker_id'],
            num_rows: 108502
        })
        validation: Dataset({
            features: ['meeting_id', 'audio_id', 'text', 'audio', 'begin_time', 'end_time', 'microphone_id', 'speaker_id'],
            num_rows: 13098
        })
        test: Dataset({
            features: ['meeting_id', 'audio_id', 'text', 'audio', 'begin_time', 'end_time', 'microphone_id', 'speaker_id'],
            num_rows: 12643
        })
    })

ds["train"][0]
automatically loads the audio into memory:

    {'meeting_id': 'EN2001a',
     'audio_id': 'AMI_EN2001a_H00_MEE068_0000557_0000594',
     'text': 'OKAY',
     'audio': {'path': '/cache/dir/path/downloads/extracted/2d75d5b3e8a91f44692e2973f08b4cac53698f92c2567bd43b41d19c313a5280/EN2001a/train_ami_en2001a_h00_mee068_0000557_0000594.wav',
      'array': array([0.        , 0.        , 0.        , ..., 0.00033569, 0.00030518,
             0.00030518], dtype=float32),
      'sampling_rate': 16000},
     'begin_time': 5.570000171661377,
     'end_time': 5.940000057220459,
     'microphone_id': 'H00',
     'speaker_id': 'MEE068'}

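As a quick sanity check on a sample like the one above, the segment length implied by `begin_time` and `end_time` can be compared with the length of the decoded audio. The sketch below is only illustrative: `segment_duration` is a hypothetical helper, not part of the dataset or of `datasets`, and the stand-in sample uses a list of zeros in place of the real waveform so that it runs without downloading anything.

```python
def segment_duration(sample):
    """Return (seconds from timestamps, seconds from the audio array)."""
    from_times = sample["end_time"] - sample["begin_time"]
    audio = sample["audio"]
    from_audio = len(audio["array"]) / audio["sampling_rate"]
    return from_times, from_audio


# Stand-in for ds["train"][0]: same field names and timestamps as the sample
# above, but the waveform is a placeholder (assumed length: 0.37 s at 16 kHz).
sample = {
    "begin_time": 5.570000171661377,
    "end_time": 5.940000057220459,
    "audio": {"array": [0.0] * 5920, "sampling_rate": 16000},
}

from_times, from_audio = segment_duration(sample)
print(f"timestamps: {from_times:.2f} s, audio: {from_audio:.2f} s")
```

With a real sample, the two values should agree up to rounding if the chunking follows the timestamps. That is an expectation to verify, not something this card guarantees.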
The dataset was tested for correctness by fine-tuning a Wav2Vec2-Large model on it, more specifically the wav2vec2-large-lv60 checkpoint.
As can be seen in these experiments, training the model for fewer than 2 epochs gives
Result (WER):
"dev" | "eval" |
---|---|
25.27 | 25.21 |
as can be seen here.
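For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between reference and hypothesis (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal pure-Python sketch of that definition; `word_error_rate` is a hypothetical helper and is not the evaluation code used to produce the numbers above. The example transcripts are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("SO") and one substitution ("START" -> "STAR") over 5 words.
print(word_error_rate("OKAY SO LET US START", "OKAY LET US STAR"))  # 0.4
```

The function returns a fraction, so multiply by 100 to express it as a percentage.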
The results are in line with the results of published papers.
You can run `run.sh` to reproduce the result.
Thanks to @sanchit-gandhi, @patrickvonplaten, and @polinaeterna for adding this dataset.