Tahsin-Mayeesha/wav2vec2-bn-300m | ATYUN.COM 官网-人工智能教程资讯全方位服务平台

模型:

Tahsin-Mayeesha/wav2vec2-bn-300m

任务:

自动语音识别

类库:

PyTorch TensorBoard Transformers

数据集:

openslr SLR53 Harveenchadha/indic-text 3AHarveenchadha/indic-text 3ASLR53 3Aopenslr

语言:

其他:

wav2vec2 hf-asr-leaderboard openslr_SLR53 robust-speech-event Eval Results

许可:

apache-2.0

模型介绍文件清单

中文

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the OPENSLR_SLR53 - bengali dataset. It achieves the following results on the evaluation set.

Without language model :

Wer: 0.3110
Cer : 0.072

With 5 gram language model trained on indic-text dataset :

Wer: 0.17776
Cer : 0.04394

Note : 10% of a total 218703 samples have been used for evaluation. Evaluation set has 21871 examples. Training was stopped after 30k steps. Output predictions are available under files section.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 7.5e-05
train_batch_size: 16
eval_batch_size: 16
gradient_accumulation_steps: 4
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 2000
mixed_precision_training: Native AMP

Framework versions

Transformers 4.16.0.dev0
Pytorch 1.10.1+cu102
Datasets 1.17.1.dev0
Tokenizers 0.11.0

Note : Training and evaluation script modified from https://huggingface.co/chmanoj/xls-r-300m-te and https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event . Bengali speech data was not available from common voice or librispeech multilingual datasets, so OpenSLR53 has been used.

Note 2 : Minimum audio duration of 0.1s has been used to filter the training data which excluded may be 10-20 samples.

作者:

Tahsin Mayeesha

数据集大小:

3.69 GB