Massively Multilingual Speech (MMS) - Finetuned LID
This checkpoint is a model fine-tuned for speech language identification (LID) and is part of Facebook's Massively Multilingual Speech (MMS) project.
This checkpoint is based on the Wav2Vec2 architecture and maps raw audio input to a probability distribution over 1024 output classes, each class representing a language.
The checkpoint consists of 1 billion parameters and has been fine-tuned from facebook/mms-1b on 1024 languages.
Table of Contents
- Example
- Supported Languages
- Model details
- Additional links
Example
This MMS checkpoint can be used with Transformers to identify the spoken language of an audio clip. It can recognize the 1024 languages listed under Supported Languages.
Let's look at a simple example.
First, we install transformers and some other libraries:
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
Note: To use MMS you need at least transformers >= 4.30 installed. If version 4.30 is not yet available on PyPI, install transformers from source:
pip install git+https://github.com/huggingface/transformers.git
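To confirm the version requirement from Python, a small check along these lines could be used. This is a minimal sketch: the helper names `meets_minimum_version` and `installed_transformers_ok` are hypothetical, and the comparison assumes a version string of the form `major.minor...` and looks only at those two components.

```python
from importlib import metadata


def meets_minimum_version(version, minimum=(4, 30)):
    """Return True if the major.minor part of `version` is at least `minimum`."""
    # Keep only the first two numeric components, e.g. "4.31.0.dev0" -> (4, 31).
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= tuple(minimum)


def installed_transformers_ok():
    """Check the locally installed transformers package, if there is one."""
    try:
        return meets_minimum_version(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        # transformers is not installed at all.
        return False
```

If `installed_transformers_ok()` returns `False`, upgrading via one of the `pip install` commands above should resolve it.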
Next, we load a couple of audio samples via datasets. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).
from datasets import load_dataset, Audio
# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]
# Arabic
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
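Since the model expects 16,000 samples per second, it can help to validate a decoded sample before passing it on. The following is a minimal sketch rather than part of the MMS code: `check_audio` is a hypothetical helper, and it assumes the decoded audio is a mapping with `array` and `sampling_rate` keys, which is how `datasets` typically exposes an audio column.

```python
EXPECTED_SAMPLING_RATE = 16_000  # Hz, i.e. 16 kHz


def check_audio(audio, expected_rate=EXPECTED_SAMPLING_RATE):
    """Validate the sampling rate and return the clip duration in seconds."""
    rate = audio["sampling_rate"]
    if rate != expected_rate:
        raise ValueError(
            f"Expected audio sampled at {expected_rate} Hz, got {rate} Hz; "
            "resample it first, e.g. with cast_column as shown above."
        )
    # Duration = number of samples / samples per second.
    return len(audio["array"]) / rate


# Stand-in for a decoded sample: 2 seconds of silence at 16 kHz.
dummy_audio = {"array": [0.0] * 32_000, "sampling_rate": 16_000}
duration = check_audio(dummy_audio)
```

With a real sample, the same function could be called on the `["audio"]` entry of a dataset row before extracting its `["array"]`.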
Next, we load the model and processor
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch
model_id = "facebook/mms-lid-1024"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
Now we process the audio data and pass it to the model to classify it into a language, just as we usually do for Wav2Vec2 audio classification models such as ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition:
# English
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
# Pick the class with the highest logit and map it to its language code
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'eng'
# Arabic
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
# Pick the class with the highest logit and map it to its language code
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'ara'
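Because the model produces one logit per language, it can be more informative to look at the few most likely languages and their probabilities than at the argmax alone. The sketch below illustrates the idea in plain Python on made-up logits. `top_k_languages` is a hypothetical helper, and the toy `toy_id2label` mapping stands in for `model.config.id2label`; with the real model, one could pass `outputs[0].tolist()` as the logits.

```python
import math


def top_k_languages(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs from raw logits."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy example with made-up logits for three classes.
toy_id2label = {0: "ara", 1: "eng", 2: "fra"}
toy_logits = [0.5, 3.0, 1.0]
top = top_k_languages(toy_logits, toy_id2label, k=2)
```

A small gap between the first and second probability is a hint that the prediction is uncertain, for example for very short clips.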
To see all the supported languages of a checkpoint, you can print out the language codes stored in the model configuration as follows:
model.config.id2label.values()
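To check whether particular languages are covered, the label mapping can be inverted into a lookup from language code to class index. This is a minimal sketch on a toy mapping: `build_label2id` is a hypothetical helper, and with the real checkpoint the mapping would be taken from the model configuration (assumed here to be `model.config.id2label`, as used in the classification example above).

```python
def build_label2id(id2label):
    """Invert an id -> label mapping into a label -> id lookup."""
    return {label: idx for idx, label in id2label.items()}


# Toy mapping standing in for the checkpoint's 1024-entry label map.
toy_id2label = {0: "ara", 1: "cmn", 2: "eng"}
label2id = build_label2id(toy_id2label)

# ISO 639-3 codes we want to check for support.
wanted = ["eng", "deu"]
supported = {code: code in label2id for code in wanted}
```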
For more details about the architecture, please have a look at the official docs.
Supported Languages
This model supports 1024 languages. The full list of languages supported by this checkpoint is given below as ISO 639-3 codes.
You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview.
- ara
- cmn
- eng
- spa
- fra
- mlg
- swe
- por
- vie
- ful
- sun
- asm
- ben
- zlm
- kor
- ind
- hin
- tuk
- urd
- aze
- slv
- mon
- hau
- tel
- swh
- bod
- rus
- tur
- heb
- mar
- som
- tgl
- tat
- tha
- cat
- ron
- mal
- bel
- pol
- yor
- nld
- bul
- hat
- afr
- isl
- amh
- tam
- hun
- hrv
- lit
- cym
- fas
- mkd
- ell
- bos
- deu
- sqi
- jav
- kmr
- nob
- uzb
- snd
- lat
- nya
- grn
- mya
- orm
- lin
- hye
- yue
- pan
- jpn
- kaz
- npi
- kik
- kat
- guj
- kan
- tgk
- ukr
- ces
- lav
- bak
- khm
- cak
- fao
- glg
- ltz
- xog
- lao
- mlt
- sin
- aka
- sna
- che
- mam
- ita
- quc
- aiw
- srp
- mri
- tuv
- nno
- pus
- eus
- kbp
- gur
- ory
- lug
- crh
- bre
- luo
- nhx
- slk
- ewe
- xsm
- fin
- rif
- dan
- saq
- yid
- yao
- mos
- quh
- hne
- xon
- new
- dtp
- quy
- est
- ddn
- dyu
- ttq
- bam
- pse
- uig
- sck
- ngl
- tso
- mup
- dga
- seh
- lis
- wal
- ctg
- mip
- bfz
- bxk
- ceb
- kru
- war
- khg
- bbc
- thl
- nzi
- vmw
- mzi
- ycl
- zne
- sid
- asa
- tpi
- bmq
- box
- zpu
- gof
- nym
- cla
- bgq
- bfy
- hlb
- qxl
- teo
- fon
- sda
- kfx
- bfa
- mag
- tzh
- pil
- maj
- maa
- kdt
- ksb
- lns
- btd
- rej
- pap
- ayr
- any
- mnk
- adx
- gud
- krc
- onb
- xal
- ctd
- nxq
- ava
- blt
- lbw
- hyw
- udm
- zar
- tzo
- kpv
- san
- xnj
- kek
- chv
- kcg
- kri
- ati
- bgw
- mxt
- ybb
- btx
- dgi
- nhy
- dnj
- zpz
- yba
- lon
- smo
- men
- ium
- mgd
- taq
- nga
- nsu
- zaj
- tly
- prk
- zpt
- akb
- mhr
- mxb
- nuj
- obo
- kir
- bom
- run
- zpg
- hwc
- mnw
- ubl
- kin
- xtm
- hnj
- mpm
- rkt
- miy
- luc
- mih
- kne
- mib
- flr
- myv
- xmm
- knk
- iba
- gux
- pis
- zmz
- ses
- dav
- lif
- qxr
- dig
- kdj
- wsg
- tir
- gbm
- mai
- zpc
- kus
- nyy
- mim
- nan
- nyn
- gog
- ngu
- tbz
- hoc
- nyf
- sus
- guk
- gwr
- yaz
- bcc
- sbd
- spp
- hak
- grt
- kno
- oss
- suk
- spy
- nij
- lsm
- kaa
- bem
- rmy
- kqn
- nim
- ztq
- nus
- bib
- xtd
- ach
- mil
- keo
- mpg
- gjn
- zaq
- kdh
- dug
- sah
- awa
- kff
- dip
- rim
- nhe
- pcm
- kde
- tem
- quz
- mfq
- las
- bba
- kbr
- taj
- dyo
- zao
- lom
- shk
- dik
- dgo
- zpo
- fij
- bgc
- xnr
- bud
- kac
- laj
- mev
- maw
- quw
- kao
- dag
- ktb
- lhu
- zab
- mgh
- shn
- otq
- lob
- pbb
- oci
- zyb
- bsq
- mhi
- dzo
- zas
- guc
- alz
- ctu
- wol
- guw
- mnb
- nia
- zaw
- mxv
- bci
- sba
- kab
- dwr
- nnb
- ilo
- mfe
- srx
- ruf
- srn
- zad
- xpe
- pce
- ahk
- bcl
- myk
- haw
- mad
- ljp
- bky
- gmv
- nag
- nav
- nyo
- kxm
- nod
- sag
- zpl
- sas
- myx
- sgw
- old
- irk
- acf
- mak
- kfy
- zai
- mie
- zpm
- zpi
- ote
- jam
- kpz
- lgg
- lia
- nhi
- mzm
- bdq
- xtn
- mey
- mjl
- sgj
- kdi
- kxc
- miz
- adh
- tap
- hay
- kss
- pam
- gor
- heh
- nhw
- ziw
- gej
- yua
- itv
- shi
- qvw
- mrw
- hil
- mbt
- pag
- vmy
- lwo
- cce
- kum
- klu
- ann
- mbb
- npl
- zca
- pww
- toc
- ace
- mio
- izz
- kam
- zaa
- krj
- bts
- eza
- zty
- hns
- kki
- min
- led
- alw
- tll
- rng
- pko
- toi
- iqw
- ncj
- toh
- umb
- mog
- hno
- wob
- gxx
- hig
- nyu
- kby
- ban
- syl
- bxg
- nse
- xho
- zae
- mkw
- nch
- ibg
- mas
- qvz
- bum
- bgd
- mww
- epo
- tzm
- zul
- bcq
- lrc
- xdy
- tyv
- ibo
- loz
- mza
- abk
- azz
- guz
- arn
- ksw
- lus
- tos
- gvr
- top
- ckb
- mer
- pov
- lun
- rhg
- knc
- sfw
- bev
- tum
- lag
- nso
- bho
- ndc
- maf
- gkp
- bax
- awn
- ijc
- qug
- lub
- srr
- mni
- zza
- ige
- dje
- mkn
- bft
- tiv
- otn
- kck
- kqs
- gle
- lua
- pdt
- swk
- mgw
- ebu
- ada
- lic
- skr
- gaa
- mfa
- vmk
- mcn
- bto
- lol
- bwr
- unr
- dzg
- hdy
- kea
- bhi
- glk
- mua
- ast
- nup
- sat
- ktu
- bhb
- zpq
- coh
- bkm
- gya
- sgc
- dks
- ncl
- tui
- emk
- urh
- ego
- ogo
- tsc
- idu
- igb
- ijn
- njz
- ngb
- tod
- jra
- mrt
- zav
- tke
- its
- ady
- bzw
- kng
- kmb
- lue
- jmx
- tsn
- bin
- ble
- gom
- ven
- sef
- sco
- her
- iso
- trp
- glv
- haq
- toq
- okr
- kha
- wof
- rmn
- sot
- kaj
- bbj
- sou
- mjt
- trd
- gno
- mwn
- igl
- rag
- eyo
- div
- efi
- nde
- mfv
- mix
- rki
- kjg
- fan
- khw
- wci
- bjn
- pmy
- bqi
- ina
- hni
- mjx
- kuj
- aoz
- the
- tog
- tet
- nuz
- ajg
- ccp
- mau
- ymm
- fmu
- tcz
- xmc
- nyk
- ztg
- knx
- snk
- zac
- esg
- srb
- thq
- pht
- wes
- rah
- pnb
- ssy
- zpv
- kpo
- phr
- atd
- eto
- xta
- mxx
- mui
- uki
- tkt
- mgp
- xsq
- enq
- nnh
- qxp
- zam
- bug
- bxr
- maq
- tdt
- khb
- mrr
- kas
- zgb
- kmw
- lir
- vah
- dar
- ssw
- hmd
- jab
- iii
- peg
- shr
- brx
- rwr
- bmb
- kmc
- mji
- dib
- pcc
- nbe
- mrd
- ish
- kai
- yom
- zyn
- hea
- ewo
- bas
- hms
- twh
- kfq
- thr
- xtl
- wbr
- bfb
- wtm
- mjc
- blk
- lot
- dhd
- swv
- wbm
- zzj
- kge
- mgm
- niq
- zpj
- bwx
- bde
- mtr
- gju
- kjp
- mbz
- haz
- lpo
- yig
- qud
- shy
- gjk
- ztp
- nbl
- aii
- kun
- say
- mde
- sjp
- bns
- brh
- ywq
- msi
- anr
- mrg
- mjg
- tan
- tsg
- tcy
- kbl
- mdr
- mks
- noe
- tyz
- zpa
- ahr
- aar
- wuu
- khr
- kbd
- kex
- bca
- nku
- pwr
- hsn
- ort
- ott
- swi
- kua
- tdd
- msm
- bgp
- nbm
- mxy
- abs
- zlj
- ebo
- lea
- dub
- sce
- xkb
- vav
- bra
- ssb
- sss
- nhp
- kad
- kvx
- lch
- tts
- zyj
- kxp
- lmn
- qvi
- lez
- scl
- cqd
- ayb
- xbr
- nqg
- dcc
- cjk
- bfr
- zyg
- mse
- gru
- mdv
- bew
- wti
- arg
- dso
- zdj
- pll
- mig
- qxs
- bol
- drs
- anp
- chw
- bej
- vmc
- otx
- xty
- bjj
- vmz
- ibb
- gby
- twx
- tig
- thz
- tku
- hmz
- pbm
- mfn
- nut
- cyo
- mjw
- cjm
- tlp
- naq
- rnd
- stj
- sym
- jax
- btg
- tdg
- sng
- nlv
- kvr
- pch
- fvr
- mxs
- wni
- mlq
- kfr
- mdj
- osi
- nhn
- ukw
- tji
- qvj
- nih
- bcy
- hbb
- zpx
- hoj
- cpx
- ogc
- cdo
- bgn
- bfs
- vmx
- tvn
- ior
- mxa
- btm
- anc
- jit
- mfb
- mls
- ets
- goa
- bet
- ikw
- pem
- trf
- daq
- max
- rad
- njo
- bnx
- mxl
- mbi
- nba
- zpn
- zts
- mut
- hnd
- mta
- hav
- hac
- ryu
- abr
- yer
- cld
- zag
- ndo
- sop
- vmm
- gcf
- chr
- cbk
- sbk
- bhp
- odk
- mbd
- nap
- gbr
- mii
- czh
- xti
- vls
- gdx
- sxw
- zaf
- wem
- mqh
- ank
- yaf
- vmp
- otm
- sdh
- anw
- src
- mne
- wss
- meh
- kzc
- tma
- ttj
- ots
- ilp
- zpr
- saz
- ogb
- akl
- nhg
- pbv
- rcf
- cgg
- mku
- bez
- mwe
- mtb
- gul
- ifm
- mdh
- scn
- lki
- xmf
- sgd
- aba
- cos
- luz
- zpy
- stv
- kjt
- mbf
- kmz
- nds
- mtq
- tkq
- aee
- knn
- mbs
- mnp
- ema
- bar
- unx
- plk
- psi
- mzn
- cja
- sro
- mdw
- ndh
- vmj
- zpw
- kfu
- bgx
- gsw
- fry
- zpe
- zpd
- bta
- psh
- zat
Model details
- Developed by: Vineel Pratap et al.
- Model type: Multilingual speech language identification (LID) model
- Language(s): 1024 languages, see Supported Languages
- License: CC-BY-NC 4.0
- Num parameters: 1 billion
- Audio sampling rate: 16,000 Hz (16 kHz)
- Cite as:
@article{pratap2023mms,
title={Scaling Speech Technology to 1,000+ Languages},
author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
journal={arXiv},
year={2023}
}
Additional Links