s2t-small-librispeech-asr is a Speech to Text Transformer (S2T) model trained for automatic speech recognition (ASR). The S2T model was proposed in the paper "fairseq S2T: Fast Speech-to-Text Modeling with fairseq" and released in the fairseq repository.
S2T is an end-to-end sequence-to-sequence transformer model. It is trained with standard autoregressive cross-entropy loss and generates the transcripts autoregressively.
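To make the training objective and the generation loop concrete, here is a minimal, library-free sketch of teacher-forced cross-entropy and greedy autoregressive decoding. The `toy_decoder_step` function, the four-token vocabulary, and the token ids are hypothetical stand-ins; the real decoder also attends to the encoded speech features, which this toy omits.

```python
import math

BOS, EOS = 0, 1   # hypothetical special-token ids for this toy example
VOCAB_SIZE = 4


def toy_decoder_step(prefix):
    """Hypothetical stand-in for the decoder: one logit per vocabulary entry.

    It prefers token 2 after BOS, then the next id, then EOS after token 3.
    """
    last = prefix[-1]
    target = EOS if last == 3 else (2 if last == BOS else last + 1)
    return [5.0 if tok == target else 0.0 for tok in range(VOCAB_SIZE)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(target_ids):
    """Teacher forcing: score each gold token given the gold prefix before it."""
    prefix, total = [BOS], 0.0
    for tok in target_ids:
        total -= log_softmax(toy_decoder_step(prefix))[tok]
        prefix.append(tok)  # feed the gold token, not the model's prediction
    return total / len(target_ids)


def greedy_decode(max_len=10):
    """Autoregressive generation: feed each predicted token back as input."""
    prefix = [BOS]
    for _ in range(max_len):
        logits = toy_decoder_step(prefix)
        next_tok = max(range(VOCAB_SIZE), key=logits.__getitem__)
        prefix.append(next_tok)
        if next_tok == EOS:
            break
    return prefix[1:]  # drop BOS
```

In this toy, `greedy_decode()` follows the sequence the step function prefers, and `cross_entropy` is low for that sequence and high for any other, which is the behavior the loss rewards during training.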
This model can be used for end-to-end speech recognition (ASR). See the model hub to look for other S2T checkpoints.
As this is a standard sequence-to-sequence transformer model, you can use the generate method to generate the transcripts by passing the speech features to the model.
Note: The Speech2TextProcessor object uses torchaudio to extract the filter bank features, and its tokenizer depends on sentencepiece, so be sure to install both packages before running the examples.
You can either install them as extra speech dependencies with pip install "transformers[speech,sentencepiece]" or install the packages separately with pip install torchaudio sentencepiece.
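As a quick sanity check before running the examples, a short script along these lines can report whether the two optional packages are importable. This is only a sketch; the helper name `missing_packages` is made up for illustration and is not part of transformers.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["torchaudio", "sentencepiece"])
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("torchaudio and sentencepiece are available.")
```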
```python
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

ds = load_dataset(
    "patrickvonplaten/librispeech_asr_dummy",
    "clean",
    split="validation"
)

input_features = processor(
    ds[0]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt"
).input_features  # Batch size 1
generated_ids = model.generate(input_features=input_features)

transcription = processor.batch_decode(generated_ids)
```

Evaluation on LibriSpeech Test
The following script shows how to evaluate this model on the LibriSpeech "clean" and "other" test sets.
```python
from datasets import load_dataset
from evaluate import load
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")  # change to "other" for other test dataset
wer = load("wer")

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr").to("cuda")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr", do_upper_case=True)

def map_to_pred(batch):
    features = processor(batch["audio"]["array"], sampling_rate=16000, padding=True, return_tensors="pt")
    input_features = features.input_features.to("cuda")
    attention_mask = features.attention_mask.to("cuda")

    gen_tokens = model.generate(input_features=input_features, attention_mask=attention_mask)
    batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)[0]
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))
```
Result (WER, in %):

| "clean" | "other" |
|---|---|
| 4.3 | 9.0 |
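For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a minimal single-utterance illustration, not the implementation used by the evaluate library; the function name `word_error_rate` and the example sentences are made up. It returns a fraction, so multiply by 100 to compare with the percentages in the table.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 edits over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A corpus-level score would typically sum the edits and the reference word counts over all utterances before dividing, rather than averaging per-utterance rates.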
The S2T-SMALL-LIBRISPEECH-ASR model is trained on the LibriSpeech ASR Corpus, a dataset consisting of approximately 1000 hours of 16kHz read English speech.
The speech data is pre-processed by extracting Kaldi-compliant 80-channel log mel-filter bank features automatically from WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization) is then applied to each example.
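Utterance-level CMVN can be illustrated with a small, dependency-free sketch: each feature channel is shifted to zero mean and scaled to unit variance using statistics computed over the frames of that one utterance. The function name `utterance_cmvn`, the `eps` guard, and the use of population variance are assumptions made for illustration; the processor's own normalization may differ in such details, and real inputs would have 80 channels rather than the 2 used here.

```python
import math


def utterance_cmvn(features, eps=1e-8):
    """Normalize each channel to zero mean and unit variance over time.

    `features` is a list of frames; each frame is a list of per-channel values.
    """
    num_frames = len(features)
    num_channels = len(features[0])

    # Per-channel statistics over all frames of this single utterance.
    means = [sum(frame[c] for frame in features) / num_frames
             for c in range(num_channels)]
    stds = [
        math.sqrt(sum((frame[c] - means[c]) ** 2 for frame in features) / num_frames)
        for c in range(num_channels)
    ]

    # eps avoids division by zero for a constant channel (an assumption of this sketch).
    return [
        [(frame[c] - means[c]) / (stds[c] + eps) for c in range(num_channels)]
        for frame in features
    ]


# Toy utterance: 3 frames, 2 channels with very different scales.
normalized = utterance_cmvn([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

After normalization both toy channels hold the same values, since they differed only by scale.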
The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 10,000.
The model is trained with standard autoregressive cross-entropy loss and SpecAugment. The encoder receives speech features, and the decoder generates the transcripts autoregressively.
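SpecAugment augments training data by masking random bands of the feature matrix. The sketch below shows only the frequency-masking and time-masking parts in plain Python; it omits time warping, and the mask counts, maximum widths, and fill value of 0.0 are illustrative assumptions, not the policy used to train this checkpoint.

```python
import random


def spec_augment(features, num_freq_masks=1, max_freq_width=2,
                 num_time_masks=1, max_time_width=2, seed=None):
    """Return a copy of `features` (frames x channels) with random bands zeroed."""
    rng = random.Random(seed)
    num_frames, num_channels = len(features), len(features[0])
    masked = [list(frame) for frame in features]  # copy; the input stays untouched

    # Frequency masking: zero a contiguous block of channels in every frame.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, num_channels))
        start = rng.randint(0, num_channels - width)
        for frame in masked:
            for c in range(start, start + width):
                frame[c] = 0.0

    # Time masking: zero a contiguous block of whole frames.
    for _ in range(num_time_masks):
        width = rng.randint(0, min(max_time_width, num_frames))
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            masked[t] = [0.0] * num_channels

    return masked


# Toy input: 10 frames of 8 channels, all ones, so masked cells are easy to spot.
augmented = spec_augment([[1.0] * 8 for _ in range(10)], seed=0)
```

When the features have been mean-normalized, a fill value of 0.0 corresponds to the per-channel mean, which is one common reason to mask with zeros.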
```bibtex
@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}
```