中文

KBLab's wav2vec 2.0 large VoxRex Swedish (C) with 4-gram model

Training of the acoustic model is the work of KBLab. See VoxRex-C for more details. This repo extends the acoustic model with a social media 4-gram language model for boosted performance.

Model description

VoxRex-C is extended with a 4-gram language model estimated from a subset extracted from The Swedish Culturomics Gigaword Corpus from Språkbanken. The subset contains 40M words from the social media genre between 2010 and 2015.

How to use

Simple usage example with pipeline
import torch
from transformers import pipeline

# Load the model. Using GPU if available
model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
pipe = pipeline(model=model_name).to(device)

# Run inference on an audio file
output = pipe('path/to/audio.mp3')['text']
More verbose usage example with audio pre-processing

Example of transcribing 1% of the Common Voice test split. The model expects 16kHz audio, so audio with another sampling rate is resampled to 16kHz.

from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from datasets import load_dataset
import torch
import torchaudio.functional as F

# Import model and processor. Using GPU if available
model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device);
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)

# Import and process speech data 
common_voice = load_dataset('common_voice', 'sv-SE', split='test[:1%]')

def speech_file_to_array(sample):
    # Convert speech file to array and downsample to 16 kHz
    sampling_rate = sample['audio']['sampling_rate']
    sample['speech'] = F.resample(torch.tensor(sample['audio']['array']), sampling_rate, 16_000)
    return sample

common_voice = common_voice.map(speech_file_to_array)

# Run inference
inputs = processor(common_voice['speech'], sampling_rate=16_000, return_tensors='pt', padding=True).to(device)

with torch.no_grad():
    logits = model(**inputs).logits

transcripts = processor.batch_decode(logits.cpu().numpy()).text

Training procedure

Text data for the n-gram model is pre-processed by removing characters not part of the wav2vec 2.0 vocabulary and uppercasing all characters. After pre-processing and storing each text sample on a new line in a text file, a KenLM model is estimated. See this tutorial for more details.

Evaluation results

The model was evaluated on the full Common Voice test set version 6.1. VoxRex-C achieved a WER of 9.03% without the language model and 6.47% with the language model.