Model:
facebook/s2t-wav2vec2-large-en-tr
s2t-wav2vec2-large-en-tr is a Speech to Text Transformer model trained for end-to-end Speech Translation (ST). The S2T2 model was proposed in "Large-Scale Self- and Semi-Supervised Learning for Speech Translation" and officially released in Fairseq.
S2T2 is a transformer-based seq2seq (speech encoder-decoder) model designed for end-to-end Automatic Speech Recognition (ASR) and Speech Translation (ST). It uses a pretrained Wav2Vec2 as the encoder and a transformer-based decoder. The model is trained with standard autoregressive cross-entropy loss and generates the translations autoregressively.
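The training objective mentioned above can be illustrated with a minimal sketch in plain Python. It assumes the decoder has already produced one vector of vocabulary scores per target position, each conditioned on the reference tokens before it (teacher forcing). The names `sequence_cross_entropy`, `toy_logits`, and `toy_targets` are hypothetical and only serve the illustration; this is not the Fairseq training code.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    step_logits[t] holds the decoder's scores over the vocabulary at step t,
    assumed to be computed with the reference tokens before t as input.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= math.log(softmax(logits)[target])
    return nll / len(target_ids)

# Hypothetical 3-token vocabulary and a 2-token reference translation.
toy_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
toy_targets = [0, 2]
loss = sequence_cross_entropy(toy_logits, toy_targets)
```

The loss shrinks as the decoder assigns more probability to each reference token. With uniform scores over a three-token vocabulary it equals log(3).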
This model can be used for end-to-end English speech to Turkish text translation. See the model hub to look for other S2T2 checkpoints.
As this is a standard sequence-to-sequence transformer model, you can use the generate method to produce translations by passing the speech features to the model.
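Conceptually, autoregressive generation feeds each predicted token back in as input for the next step. The sketch below shows the simplest variant, greedy decoding, against a hypothetical scoring function (`toy_scores`) that stands in for the real decoder. The actual generate method in transformers supports further strategies such as beam search, so treat this as an illustration of the loop, not of the library's implementation.

```python
def greedy_decode(next_token_scores, bos_id, eos_id, max_new_tokens=20):
    """Pick the highest-scoring token at each step until EOS or the length cap."""
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)  # one score per vocabulary entry
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == eos_id:
            break
    return tokens

def toy_scores(prefix):
    """Hypothetical stand-in for the decoder: prefers token 3, then 4, then EOS (id 2)."""
    script = {1: 3, 2: 4}  # prefix length -> preferred next token
    preferred = script.get(len(prefix), 2)
    return [1.0 if i == preferred else 0.0 for i in range(5)]

decoded = greedy_decode(toy_scores, bos_id=0, eos_id=2)
```

In the real model the scores at each step also depend on the encoded speech, which the toy function leaves out.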
You can use the model directly via the ASR pipeline:
    from datasets import load_dataset
    from transformers import pipeline

    # Small dummy split of LibriSpeech, used here only as sample English audio
    librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    asr = pipeline("automatic-speech-recognition", model="facebook/s2t-wav2vec2-large-en-tr", feature_extractor="facebook/s2t-wav2vec2-large-en-tr")

    # The pipeline accepts a path to an audio file
    translation = asr(librispeech_en[0]["file"])
or step-by-step as follows:
    import torch
    from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
    from datasets import load_dataset
    import soundfile as sf

    model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-tr")
    processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-tr")

    # Read each audio file into a raw waveform array
    def map_to_array(batch):
        speech, _ = sf.read(batch["file"])
        batch["speech"] = speech
        return batch

    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    ds = ds.map(map_to_array)

    # The processor wraps a Wav2Vec2 feature extractor, so the waveform comes back under "input_values"
    inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
    generated_ids = model.generate(inputs=inputs["input_values"], attention_mask=inputs["attention_mask"])

    # Decode the generated token ids into Turkish text
    translation = processor.batch_decode(generated_ids)
CoVoST-V2 test results for en-tr (BLEU score): 17.5
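To give a feel for what a BLEU score measures, here is a deliberately simplified, sentence-level sketch: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It assumes whitespace tokenization and applies no smoothing, and `simple_bleu` is a hypothetical helper. Reported scores are normally computed over a whole test corpus with a standard evaluation tool, so this function will not reproduce the number above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU on whitespace tokens, scaled to 0-100."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Counter intersection keeps the minimum count, i.e. clipped matches
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty n-gram level zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Penalize hypotheses that are shorter than the reference
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * brevity * math.exp(sum(log_precisions) / max_n)

exact = simple_bleu("the cat sat on the mat", "the cat sat on the mat")
partial = simple_bleu("the cat sat on the mat", "the cat sat on a mat")
```

An exact match scores 100, a hypothesis sharing no words with the reference scores 0, and partial overlap falls in between.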
For more information, please have a look at the official paper - especially row 10 of Table 2.
    @article{DBLP:journals/corr/abs-2104-06678,
      author        = {Changhan Wang and Anne Wu and Juan Miguel Pino and Alexei Baevski and Michael Auli and Alexis Conneau},
      title         = {Large-Scale Self- and Semi-Supervised Learning for Speech Translation},
      journal       = {CoRR},
      volume        = {abs/2104.06678},
      year          = {2021},
      url           = {https://arxiv.org/abs/2104.06678},
      archivePrefix = {arXiv},
      eprint        = {2104.06678},
      timestamp     = {Thu, 12 Aug 2021 15:37:06 +0200},
      biburl        = {https://dblp.org/rec/journals/corr/abs-2104-06678.bib},
      bibsource     = {dblp computer science bibliography, https://dblp.org}
    }