Model:
viktor-enzell/wav2vec2-large-voxrex-swedish-4gram
Training of the acoustic model is the work of KBLab. See VoxRex-C for more details. This repo extends the acoustic model with a social media 4-gram language model for boosted performance.
VoxRex-C is extended with a 4-gram language model estimated from a subset extracted from The Swedish Culturomics Gigaword Corpus from Språkbanken. The subset contains 40M words from the social media genre between 2010 and 2015.
```python
import torch
from transformers import pipeline

# Load the model, using GPU if available
model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
pipe = pipeline(model=model_name, device=device)

# Run inference on an audio file
output = pipe('path/to/audio.mp3')['text']
```

More verbose usage example with audio pre-processing
Example of transcribing 1% of the Common Voice test split. The model expects 16 kHz audio, so audio with a different sampling rate is resampled to 16 kHz.
```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from datasets import load_dataset
import torch
import torchaudio.functional as F

# Import model and processor, using GPU if available
model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)

# Import and process speech data
common_voice = load_dataset('common_voice', 'sv-SE', split='test[:1%]')

def speech_file_to_array(sample):
    # Convert the speech file to an array and resample to 16 kHz
    sampling_rate = sample['audio']['sampling_rate']
    sample['speech'] = F.resample(torch.tensor(sample['audio']['array']), sampling_rate, 16_000)
    return sample

common_voice = common_voice.map(speech_file_to_array)

# Run inference
inputs = processor(common_voice['speech'], sampling_rate=16_000, return_tensors='pt', padding=True).to(device)

with torch.no_grad():
    logits = model(**inputs).logits

transcripts = processor.batch_decode(logits.cpu().numpy()).text
```
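To sanity-check the output, the predicted transcripts can be compared against the reference sentences shipped with Common Voice (the `sentence` column); the loop below is just an illustration, not part of the original example:

```python
# Compare predictions to the reference transcripts in the dataset
for transcript, reference in zip(transcripts, common_voice['sentence']):
    print(f'Predicted: {transcript}')
    print(f'Reference: {reference}')
```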
Text data for the n-gram model is pre-processed by removing characters not part of the wav2vec 2.0 vocabulary and uppercasing all characters. After pre-processing and storing each text sample on a new line in a text file, a KenLM model is estimated. See this tutorial for more details.
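A minimal sketch of this pre-processing step is shown below. The character set and file names are assumptions for illustration; in practice the allowed characters should be read from the model's tokenizer vocabulary:

```python
import re

# Assumed approximation of the wav2vec 2.0 vocabulary:
# uppercase Swedish letters, apostrophe, and space.
NOT_IN_VOCAB = re.compile(r"[^A-ZÅÄÖ' ]")

def preprocess(text):
    # Uppercase the text and drop characters outside the vocabulary
    return NOT_IN_VOCAB.sub('', text.upper())

# Write one pre-processed sample per line, ready for KenLM estimation
with open('corpus.txt', 'w') as f:
    for sample in ['Ett exempel på en mening.', 'En annan mening!']:
        f.write(preprocess(sample) + '\n')
```

With KenLM built, a 4-gram model can then be estimated from such a file with something like `lmplz -o 4 < corpus.txt > 4gram.arpa`.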
The model was evaluated on the full Common Voice 6.1 test set. VoxRex-C achieved a WER of 9.03% without the language model and 6.47% with the language model.
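For reference, a WER of this kind can be computed from predictions and reference sentences with, e.g., the jiwer package. This is a sketch with made-up strings, not the exact evaluation script used for the numbers above:

```python
from jiwer import wer

# Hypothetical example values; in practice, predictions come from
# processor.batch_decode and references from the dataset's 'sentence' column
predictions = ['DET HÄR ÄR ETT EXEMPEL']
references = ['DET HÄR ÄR ETT EXEMPEL']

print(f'WER: {wer(references, predictions):.2%}')
```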