Model:

facebook/xmod-base

English

xmod-base

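X-MOD is a multilingual masked language model trained on filtered CommonCrawl data covering 81 languages. It was introduced in the paper Lifting the Curse of Multilinguality by Pre-training Modular Transformers (Pfeiffer et al., NAACL 2022) and first released in this repository.

X-MOD differs from earlier multilingual models such as XLM-R because it was pre-trained with language-specific modular components (language adapters). During fine-tuning, the language adapters in each transformer layer are frozen.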
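To make the modular layout concrete, the following sketch builds a tiny, randomly initialised X-MOD model and lists the adapters stored in each transformer layer. The configuration sizes are hypothetical and chosen only so the example runs quickly without downloading the released checkpoint. The attribute path `encoder.layer[i].output.adapter_modules` reflects the current transformers implementation and is an internal detail that may change.

```python
from transformers import XmodConfig, XmodModel

# Hypothetical tiny configuration with random weights; not the released checkpoint.
config = XmodConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    languages=("en_XX", "de_DE"),
)
model = XmodModel(config)

# Each transformer layer holds one adapter module per configured language.
for index, layer in enumerate(model.encoder.layer):
    print(index, list(layer.output.adapter_modules.keys()))
```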

Usage

Tokenizer

This model reuses the tokenizer of XLM-R, so you can load the tokenizer as follows:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

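Input Language

Because this model uses language adapters, you need to specify the language of your input so that the correct adapter can be activated: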

from transformers import XmodModel

model = XmodModel.from_pretrained("facebook/xmod-base")
model.set_default_language("en_XX")

A directory of the language adapters in this model is found at the bottom of this model card.
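Besides a single default language, the transformers implementation also accepts a `lang_ids` tensor in the forward pass, with one adapter index per sample, which may be useful for mixed-language batches. The sketch below uses a hypothetical tiny configuration with random weights so that it runs without downloading the checkpoint. With the released model, the indices would be the `lang_id` values from the adapter directory.

```python
import torch
from transformers import XmodConfig, XmodModel

# Hypothetical tiny configuration with random weights; not the released checkpoint.
config = XmodConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    languages=("en_XX", "de_DE"),
)
model = XmodModel(config).eval()

# Two dummy sequences of token IDs (0 = BOS, 2 = EOS in the default configuration).
input_ids = torch.tensor([[0, 5, 6, 2], [0, 7, 8, 2]])

# One adapter index per sample: 0 selects "en_XX", 1 selects "de_DE" in this toy configuration.
lang_ids = torch.tensor([0, 1])

with torch.no_grad():
    outputs = model(input_ids=input_ids, lang_ids=lang_ids)

print(outputs.last_hidden_state.shape)
```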

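Fine-tuning

In the experiments in the original paper, the embedding layer and the language adapters are frozen during fine-tuning. The code provides a method for doing this: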

model.freeze_embeddings_and_language_adapters()
# Fine-tune the model ...
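One way to check the effect of the freeze is to list which parameters still require gradients afterwards. The sketch below assumes a sequence classification task and uses `XmodForSequenceClassification` with a hypothetical tiny, randomly initialised configuration. The parameter-name substrings it filters on come from the current transformers implementation.

```python
from transformers import XmodConfig, XmodForSequenceClassification

# Hypothetical tiny configuration with random weights; not the released checkpoint.
config = XmodConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    languages=("en_XX", "de_DE"),
)
model = XmodForSequenceClassification(config)
model.freeze_embeddings_and_language_adapters()

# Split parameter names by whether they will be updated during fine-tuning.
frozen = [name for name, param in model.named_parameters() if not param.requires_grad]
trainable = [name for name, param in model.named_parameters() if param.requires_grad]

print(f"{len(frozen)} frozen tensors, {len(trainable)} trainable tensors")
print("example frozen:", frozen[0])
print("example trainable:", trainable[0])
```

Passing only the trainable parameters to the optimizer, for example `[p for p in model.parameters() if p.requires_grad]`, avoids allocating optimizer state for the frozen weights.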

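Cross-lingual Transfer

After fine-tuning, zero-shot cross-lingual transfer can be tested by activating the language adapter of the target language: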

model.set_default_language("de_DE")
# Evaluate the model on German examples ...
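The switch can be sketched as follows: the same weights are reused, and only the active adapter changes between the two forward passes. The example relies on a hypothetical tiny configuration with random weights, so the hidden states themselves are meaningless. It only illustrates the control flow.

```python
import torch
from transformers import XmodConfig, XmodModel

# Hypothetical tiny configuration with random weights; not the released checkpoint.
config = XmodConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    languages=("en_XX", "de_DE"),
)
model = XmodModel(config).eval()
input_ids = torch.tensor([[0, 5, 6, 2]])


def encode(language):
    """Activate the adapter for `language` and return the final hidden states."""
    model.set_default_language(language)
    with torch.no_grad():
        return model(input_ids=input_ids).last_hidden_state


hidden_en = encode("en_XX")  # source language used during fine-tuning
hidden_de = encode("de_DE")  # target language for zero-shot evaluation

print(model.config.default_language)
print(torch.allclose(hidden_en, hidden_de))
```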

Bias, Risks, and Limitations

Please refer to the model card of XLM-R, because X-MOD has a similar architecture and was trained on similar data.

Citation

BibTeX:

@inproceedings{pfeiffer-etal-2022-lifting,
    title = "Lifting the Curse of Multilinguality by Pre-training Modular Transformers",
    author = "Pfeiffer, Jonas  and
      Goyal, Naman  and
      Lin, Xi  and
      Li, Xian  and
      Cross, James  and
      Riedel, Sebastian  and
      Artetxe, Mikel",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.255",
    doi = "10.18653/v1/2022.naacl-main.255",
    pages = "3479--3495"
}

Languages

This model contains the following language adapters:

lang_id (Adapter index) Language code Language
0 en_XX English
1 id_ID Indonesian
2 vi_VN Vietnamese
3 ru_RU Russian
4 fa_IR Persian
5 sv_SE Swedish
6 ja_XX Japanese
7 fr_XX French
8 de_DE German
9 ro_RO Romanian
10 ko_KR Korean
11 hu_HU Hungarian
12 es_XX Spanish
13 fi_FI Finnish
14 uk_UA Ukrainian
15 da_DK Danish
16 pt_XX Portuguese
17 no_XX Norwegian
18 th_TH Thai
19 pl_PL Polish
20 bg_BG Bulgarian
21 nl_XX Dutch
22 zh_CN Chinese (simplified)
23 he_IL Hebrew
24 el_GR Greek
25 it_IT Italian
26 sk_SK Slovak
27 hr_HR Croatian
28 tr_TR Turkish
29 ar_AR Arabic
30 cs_CZ Czech
31 lt_LT Lithuanian
32 hi_IN Hindi
33 zh_TW Chinese (traditional)
34 ca_ES Catalan
35 ms_MY Malay
36 sl_SI Slovenian
37 lv_LV Latvian
38 ta_IN Tamil
39 bn_IN Bengali
40 et_EE Estonian
41 az_AZ Azerbaijani
42 sq_AL Albanian
43 sr_RS Serbian
44 kk_KZ Kazakh
45 ka_GE Georgian
46 tl_XX Tagalog
47 ur_PK Urdu
48 is_IS Icelandic
49 hy_AM Armenian
50 ml_IN Malayalam
51 mk_MK Macedonian
52 be_BY Belarusian
53 la_VA Latin
54 te_IN Telugu
55 eu_ES Basque
56 gl_ES Galician
57 mn_MN Mongolian
58 kn_IN Kannada
59 ne_NP Nepali
60 sw_KE Swahili
61 si_LK Sinhala
62 mr_IN Marathi
63 af_ZA Afrikaans
64 gu_IN Gujarati
65 cy_GB Welsh
66 eo_EO Esperanto
67 km_KH Central Khmer
68 ky_KG Kirghiz
69 uz_UZ Uzbek
70 ps_AF Pashto
71 pa_IN Punjabi
72 ga_IE Irish
73 ha_NG Hausa
74 am_ET Amharic
75 lo_LA Lao
76 ku_TR Kurdish
77 so_SO Somali
78 my_MM Burmese
79 or_IN Oriya
80 sa_IN Sanskrit
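When adapter indices are needed in code, for instance to look up the `lang_id` for a language code, the rows of this table can be parsed into a lookup. The sketch below is a minimal standard-library example. `parse_adapter_table` and `ADAPTER_INDEX` are hypothetical names, and only a few rows are included for brevity.

```python
# A few rows copied from the table above: "<lang_id> <language code> <language name>".
TABLE_ROWS = """\
0 en_XX English
1 id_ID Indonesian
8 de_DE German
22 zh_CN Chinese (simplified)
80 sa_IN Sanskrit
"""


def parse_adapter_table(text):
    """Map each language code to its adapter index."""
    index_by_code = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The language name may contain spaces, so split at most twice.
        lang_id, code, _name = line.split(maxsplit=2)
        index_by_code[code] = int(lang_id)
    return index_by_code


ADAPTER_INDEX = parse_adapter_table(TABLE_ROWS)
print(ADAPTER_INDEX["de_DE"])
```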