Model:
voidful/wav2vec2-xlsr-multilingual-56
This model can be used for the task of automatic-speech-recognition
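For a quick transcription, the model can be loaded through the Transformers `pipeline` API. The snippet below is only a minimal sketch ("sample.wav" is a placeholder path); the language-specific decoding shown in the get-started code further down generally gives better results when the target language is known.

```python
from transformers import pipeline

# Minimal sketch: load the checkpoint as an automatic-speech-recognition pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="voidful/wav2vec2-xlsr-multilingual-56",
)

# "sample.wav" is a placeholder; input should be 16 kHz mono audio.
print(asr("sample.wav")["text"])
```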
More information needed
The model should not be used to intentionally create hostile or alienating environments for people.
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
See the common_voice dataset card. Fine-tuned facebook/wav2vec2-large-xlsr-53 on 56 languages using the Common Voice dataset.
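As a hedged illustration only (not part of the original training code), a single Common Voice language split can be inspected with the `datasets` library; the language code "ar" below is just an example:

```python
from datasets import load_dataset

# Example sketch: load the Arabic test split of the legacy Common Voice dataset.
# Any of the 56 language codes listed in the results table could be used instead.
cv_ar = load_dataset("common_voice", "ar", split="test")
print(cv_ar[0]["path"], cv_ar[0]["sentence"])
```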
More information needed
When using this model, make sure that your speech input is sampled at 16kHz.
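If your audio is recorded at a different rate, one way to resample it to 16 kHz is with torchaudio. This is only a sketch ("sample.wav" is a placeholder), and the get-started code below performs the same step inside `load_file_to_data`.

```python
import torchaudio

# Sketch: load an arbitrary audio file and resample it to the 16 kHz the model expects.
speech, orig_sr = torchaudio.load("sample.wav")  # "sample.wav" is a placeholder path
if orig_sr != 16_000:
    speech = torchaudio.transforms.Resample(orig_freq=orig_sr, new_freq=16_000)(speech)
speech_16k = speech.squeeze(0).numpy()
```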
More information needed
More information needed
Common Voice Language | Num. of data | Hours | WER (%) | CER (%) |
---|---|---|---|---|
ar | 21744 | 81.5 | 75.29 | 31.23 |
as | 394 | 1.1 | 95.37 | 46.05 |
br | 4777 | 7.4 | 93.79 | 41.16 |
ca | 301308 | 692.8 | 24.80 | 10.39 |
cnh | 1563 | 2.4 | 68.11 | 23.10 |
cs | 9773 | 39.5 | 67.86 | 12.57 |
cv | 1749 | 5.9 | 95.43 | 34.03 |
cy | 11615 | 106.7 | 67.03 | 23.97 |
de | 262113 | 822.8 | 27.03 | 6.50 |
dv | 4757 | 18.6 | 92.16 | 30.15 |
el | 3717 | 11.1 | 94.48 | 58.67 |
en | 580501 | 1763.6 | 34.87 | 14.84 |
eo | 28574 | 162.3 | 37.77 | 6.23 |
es | 176902 | 337.7 | 19.63 | 5.41 |
et | 5473 | 35.9 | 86.87 | 20.79 |
eu | 12677 | 90.2 | 44.80 | 7.32 |
fa | 12806 | 290.6 | 53.81 | 15.09 |
fi | 875 | 2.6 | 93.78 | 27.57 |
fr | 314745 | 664.1 | 33.16 | 13.94 |
fy-NL | 6717 | 27.2 | 72.54 | 26.58 |
ga-IE | 1038 | 3.5 | 92.57 | 51.02 |
hi | 292 | 2.0 | 90.95 | 57.43 |
hsb | 980 | 2.3 | 89.44 | 27.19 |
hu | 4782 | 9.3 | 97.15 | 36.75 |
ia | 5078 | 10.4 | 52.00 | 11.35 |
id | 3965 | 9.9 | 82.50 | 22.82 |
it | 70943 | 178.0 | 39.09 | 8.72 |
ja | 1308 | 8.2 | 99.21 | 62.06 |
ka | 1585 | 4.0 | 90.53 | 18.57 |
ky | 3466 | 12.2 | 76.53 | 19.80 |
lg | 1634 | 17.1 | 98.95 | 43.84 |
lt | 1175 | 3.9 | 92.61 | 26.81 |
lv | 4554 | 6.3 | 90.34 | 30.81 |
mn | 4020 | 11.6 | 82.68 | 30.14 |
mt | 3552 | 7.8 | 84.18 | 22.96 |
nl | 14398 | 71.8 | 57.18 | 19.01 |
or | 517 | 0.9 | 90.93 | 27.34 |
pa-IN | 255 | 0.8 | 87.95 | 42.03 |
pl | 12621 | 112.0 | 56.14 | 12.06 |
pt | 11106 | 61.3 | 53.24 | 16.32 |
rm-sursilv | 2589 | 5.9 | 78.17 | 23.31 |
rm-vallader | 931 | 2.3 | 73.67 | 21.76 |
ro | 4257 | 8.7 | 83.84 | 21.95 |
ru | 23444 | 119.1 | 61.83 | 15.18 |
sah | 1847 | 4.4 | 94.38 | 38.46 |
sl | 2594 | 6.7 | 84.21 | 20.54 |
sv-SE | 4350 | 20.8 | 83.68 | 30.79 |
ta | 3788 | 18.4 | 84.19 | 21.60 |
th | 4839 | 11.7 | 141.87 | 37.16 |
tr | 3478 | 22.3 | 66.77 | 15.55 |
tt | 13338 | 26.7 | 86.80 | 33.57 |
uk | 7271 | 39.4 | 70.23 | 14.34 |
vi | 421 | 1.7 | 96.06 | 66.25 |
zh-CN | 27284 | 58.7 | 89.67 | 23.96 |
zh-HK | 12678 | 92.1 | 81.77 | 18.82 |
zh-TW | 6402 | 56.6 | 85.08 | 29.07 |
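WER and CER above are word- and character-level error rates reported as percentages. As an illustration of how such scores can be computed for a batch of predictions, here is a minimal sketch using the third-party `jiwer` package (an assumption; it is not used elsewhere in this card):

```python
import jiwer

references = ["the cat sat on the mat"]  # ground-truth transcripts
hypotheses = ["the cat sit on mat"]      # model outputs

# jiwer returns error rates as fractions; multiply by 100 for percentages as in the table.
print("WER:", 100 * jiwer.wer(references, hypotheses))
print("CER:", 100 * jiwer.cer(references, hypotheses))
```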
More information needed
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
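As an alternative sketch, and only an assumption rather than what the model authors used, a training or inference run can also be instrumented with the `codecarbon` package to log an emissions estimate:

```python
from codecarbon import EmissionsTracker

# Sketch: wrap any training or batched-inference run to log an estimate in kg CO2eq.
tracker = EmissionsTracker()
tracker.start()
# ... run training or inference here ...
emissions_kg = tracker.stop()
print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")
```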
More information needed
More information needed
More information needed
More information needed
BibTeX:
More information needed
APA:
More information needed
More information needed
More information needed
voidful in collaboration with Ezi Ozoani and the Hugging Face team
More information needed
Use the code below to get started with the model.
```bash
pip install torchaudio
pip install datasets transformers
pip install asrp
wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk
```
```python
import pickle

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "voidful/wav2vec2-xlsr-multilingual-56"
processor_name = "voidful/wav2vec2-xlsr-multilingual-56"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Mapping from language code to the vocabulary indices used by that language,
# downloaded above as lang_ids.pk.
with open("lang_ids.pk", "rb") as f:
    lang_ids = pickle.load(f)

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(processor_name)
model.eval()


def load_file_to_data(file, sampling_rate=16_000):
    """Load an audio file and resample it to 16 kHz if necessary."""
    batch = {}
    speech, _ = torchaudio.load(file)
    if sampling_rate != 16_000:
        resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)
        batch["speech"] = resampler.forward(speech.squeeze(0)).numpy()
        batch["sampling_rate"] = resampler.new_freq
    else:
        batch["speech"] = speech.squeeze(0).numpy()
        batch["sampling_rate"] = 16_000
    return batch


def predict(data):
    """Greedy CTC decoding over the full multilingual vocabulary."""
    features = processor(data["speech"], sampling_rate=data["sampling_rate"],
                         padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    decoded_results = []
    for logit in logits:
        pred_ids = torch.argmax(logit, dim=-1)
        # Keep only frames whose argmax is not the padding/blank token (id 0).
        mask = pred_ids.ge(1).unsqueeze(-1).expand(logit.size())
        vocab_size = logit.size()[-1]
        voice_prob = torch.nn.functional.softmax(
            torch.masked_select(logit, mask).view(-1, vocab_size), dim=-1)
        comb_pred_ids = torch.argmax(voice_prob, dim=-1)
        decoded_results.append(processor.decode(comb_pred_ids))
    return decoded_results


def predict_lang_specific(data, lang_code):
    """Greedy CTC decoding restricted to the vocabulary of one language."""
    features = processor(data["speech"], sampling_rate=data["sampling_rate"],
                         padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    decoded_results = []
    for logit in logits:
        pred_ids = torch.argmax(logit, dim=-1)
        # Drop frames predicted as the padding/blank token before re-normalising.
        mask = ~pred_ids.eq(processor.tokenizer.pad_token_id).unsqueeze(-1).expand(logit.size())
        vocab_size = logit.size()[-1]
        voice_prob = torch.nn.functional.softmax(
            torch.masked_select(logit, mask).view(-1, vocab_size), dim=-1)
        filtered_input = pred_ids[pred_ids != processor.tokenizer.pad_token_id].view(1, -1).to(device)
        if len(filtered_input[0]) == 0:
            decoded_results.append("")
        else:
            # Zero out every token that does not belong to the requested language.
            lang_mask = torch.empty(voice_prob.shape[-1]).fill_(0)
            lang_index = torch.tensor(sorted(lang_ids[lang_code]))
            lang_mask.index_fill_(0, lang_index, 1)
            lang_mask = lang_mask.to(device)
            comb_pred_ids = torch.argmax(lang_mask * voice_prob, dim=-1)
            decoded_results.append(processor.decode(comb_pred_ids))
    return decoded_results


predict(load_file_to_data('audio file path', sampling_rate=16_000))  # beware of the audio file sampling rate

predict_lang_specific(load_file_to_data('audio file path', sampling_rate=16_000), 'en')  # beware of the audio file sampling rate
```