Model:
facebook/xmod-base
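X-MOD is a multilingual masked language model trained on filtered CommonCrawl data covering 81 languages. It was introduced in the paper Lifting the Curse of Multilinguality by Pre-training Modular Transformers (Pfeiffer et al., NAACL 2022) and first released in this repository.
Because it was pre-trained with language-specific modular components (language adapters), X-MOD differs from previous multilingual models such as XLM-R. During fine-tuning, the language adapters in each transformer layer are frozen.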
This model reuses the tokenizer of XLM-R, so you can load the tokenizer as follows:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
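Because the model uses language adapters, the input language must be specified so that the correct adapter is activated:

    from transformers import XmodModel

    model = XmodModel.from_pretrained("facebook/xmod-base")
    # Use the English adapter for all inputs unless told otherwise
    model.set_default_language("en_XX")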
A directory of the language adapters in this model can be found at the bottom of this model card.
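When a batch mixes several languages, a single default language is not enough. The sketch below selects an adapter per example. It assumes that `XmodModel.forward` accepts a `lang_ids` tensor holding one adapter index per example, and that `model.config.languages` lists the language codes in adapter-index order; both are assumptions about the `transformers` implementation that this card does not state. The helper names `lang_ids_for` and `encode_mixed_batch` are hypothetical.

```python
def lang_ids_for(codes, languages):
    """Map language codes to adapter indices, given the ordered list of codes."""
    index = {code: i for i, code in enumerate(languages)}
    missing = [code for code in codes if code not in index]
    if missing:
        raise ValueError(f"no language adapter for: {missing}")
    return [index[code] for code in codes]


def encode_mixed_batch(model, tokenizer, texts, codes):
    """Run one forward pass in which each text uses its own language adapter."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    # Assumption: config.languages is ordered by adapter index.
    ids = lang_ids_for(codes, model.config.languages)
    # new_tensor reuses the integer dtype and device of input_ids.
    lang_ids = inputs["input_ids"].new_tensor(ids)
    # Assumption: forward() takes lang_ids of shape (batch_size,).
    return model(**inputs, lang_ids=lang_ids).last_hidden_state
```

With the `model` and `tokenizer` loaded as described in this card, a call such as `encode_mixed_batch(model, tokenizer, ["Hello", "Hallo"], ["en_XX", "de_DE"])` would route the first text through the English adapter and the second through the German one.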
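In the experiments of the original paper, the embedding layer and the language adapters are frozen during fine-tuning. The code provides a method for doing this:

    model.freeze_embeddings_and_language_adapters()
    # Fine-tune the model ...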
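To check the effect of the freeze before training starts, one option is to count the parameters that still require gradients. This is a minimal sketch that relies only on the standard PyTorch `named_parameters()` method and the `requires_grad` flag; the helper name `summarize_trainable` is hypothetical and not part of the model's API.

```python
def summarize_trainable(model):
    """Return (trainable_count, total_count, frozen_names) for a PyTorch module."""
    trainable, total, frozen_names = 0, 0, []
    for name, param in model.named_parameters():
        size = param.numel()
        total += size
        if param.requires_grad:
            trainable += size
        else:
            # Frozen parameters receive no gradient updates during fine-tuning.
            frozen_names.append(name)
    return trainable, total, frozen_names
```

Calling `summarize_trainable(model)` after the freeze should report fewer trainable than total parameters, and the returned names show which tensors were frozen.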
After fine-tuning, zero-shot cross-lingual transfer can be tested by activating the language adapter of the target language:

    model.set_default_language("de_DE")
    # Evaluate the model on German examples ...
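The same pattern extends to several target languages. The following is a minimal sketch, not the authors' evaluation code: `evaluate_per_language`, `eval_sets`, `score_fn`, and `restore` are hypothetical names, and only `set_default_language` comes from this model card.

```python
def evaluate_per_language(model, eval_sets, score_fn, restore="en_XX"):
    """Score a fine-tuned model on each target language with that adapter active.

    eval_sets maps a language code such as "de_DE" to its evaluation examples;
    score_fn(model, examples) is a user-supplied metric function.
    """
    results = {}
    try:
        for code, examples in eval_sets.items():
            # Activate the adapter of the target language before scoring.
            model.set_default_language(code)
            results[code] = score_fn(model, examples)
    finally:
        # Switch back to the language used for fine-tuning.
        model.set_default_language(restore)
    return results
```

Restoring the source language at the end keeps later calls from silently running with the adapter of the last evaluated language.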
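For bias, risks, and limitations, please refer to the model card of XLM-R, because X-MOD has a similar architecture and was trained on similar data.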
BibTeX:

    @inproceedings{pfeiffer-etal-2022-lifting,
        title = "Lifting the Curse of Multilinguality by Pre-training Modular Transformers",
        author = "Pfeiffer, Jonas and Goyal, Naman and Lin, Xi and Li, Xian and Cross, James and Riedel, Sebastian and Artetxe, Mikel",
        booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
        month = jul,
        year = "2022",
        address = "Seattle, United States",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.naacl-main.255",
        doi = "10.18653/v1/2022.naacl-main.255",
        pages = "3479--3495"
    }
This model contains the following language adapters:
lang_id (Adapter index) | Language code | Language |
---|---|---|
0 | en_XX | English |
1 | id_ID | Indonesian |
2 | vi_VN | Vietnamese |
3 | ru_RU | Russian |
4 | fa_IR | Persian |
5 | sv_SE | Swedish |
6 | ja_XX | Japanese |
7 | fr_XX | French |
8 | de_DE | German |
9 | ro_RO | Romanian |
10 | ko_KR | Korean |
11 | hu_HU | Hungarian |
12 | es_XX | Spanish |
13 | fi_FI | Finnish |
14 | uk_UA | Ukrainian |
15 | da_DK | Danish |
16 | pt_XX | Portuguese |
17 | no_XX | Norwegian |
18 | th_TH | Thai |
19 | pl_PL | Polish |
20 | bg_BG | Bulgarian |
21 | nl_XX | Dutch |
22 | zh_CN | Chinese (simplified) |
23 | he_IL | Hebrew |
24 | el_GR | Greek |
25 | it_IT | Italian |
26 | sk_SK | Slovak |
27 | hr_HR | Croatian |
28 | tr_TR | Turkish |
29 | ar_AR | Arabic |
30 | cs_CZ | Czech |
31 | lt_LT | Lithuanian |
32 | hi_IN | Hindi |
33 | zh_TW | Chinese (traditional) |
34 | ca_ES | Catalan |
35 | ms_MY | Malay |
36 | sl_SI | Slovenian |
37 | lv_LV | Latvian |
38 | ta_IN | Tamil |
39 | bn_IN | Bengali |
40 | et_EE | Estonian |
41 | az_AZ | Azerbaijani |
42 | sq_AL | Albanian |
43 | sr_RS | Serbian |
44 | kk_KZ | Kazakh |
45 | ka_GE | Georgian |
46 | tl_XX | Tagalog |
47 | ur_PK | Urdu |
48 | is_IS | Icelandic |
49 | hy_AM | Armenian |
50 | ml_IN | Malayalam |
51 | mk_MK | Macedonian |
52 | be_BY | Belarusian |
53 | la_VA | Latin |
54 | te_IN | Telugu |
55 | eu_ES | Basque |
56 | gl_ES | Galician |
57 | mn_MN | Mongolian |
58 | kn_IN | Kannada |
59 | ne_NP | Nepali |
60 | sw_KE | Swahili |
61 | si_LK | Sinhala |
62 | mr_IN | Marathi |
63 | af_ZA | Afrikaans |
64 | gu_IN | Gujarati |
65 | cy_GB | Welsh |
66 | eo_EO | Esperanto |
67 | km_KH | Central Khmer |
68 | ky_KG | Kirghiz |
69 | uz_UZ | Uzbek |
70 | ps_AF | Pashto |
71 | pa_IN | Punjabi |
72 | ga_IE | Irish |
73 | ha_NG | Hausa |
74 | am_ET | Amharic |
75 | lo_LA | Lao |
76 | ku_TR | Kurdish |
77 | so_SO | Somali |
78 | my_MM | Burmese |
79 | or_IN | Oriya |
80 | sa_IN | Sanskrit |