Model:
facebook/xmod-base
X-MOD is a multilingual masked language model trained on filtered CommonCrawl data covering 81 languages. It was introduced in the paper "Lifting the Curse of Multilinguality by Pre-training Modular Transformers" (Pfeiffer et al., NAACL 2022) and first released in this repository.
X-MOD differs from previous multilingual models such as XLM-R in that it has been pre-trained with language-specific modular components (language adapters). During fine-tuning, the language adapters in each transformer layer are frozen.
This model reuses the tokenizer of XLM-R, so you can load the tokenizer as follows:
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
Because this model uses language adapters, you need to specify the language of your input so that the correct adapter can be activated:
    from transformers import XmodModel

    model = XmodModel.from_pretrained("facebook/xmod-base")
    model.set_default_language("en_XX")
A directory of the language adapters in this model is found at the bottom of this model card.
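For batches that mix several languages, the `transformers` implementation also accepts a per-example `lang_ids` tensor in the forward pass, as an alternative to a single default language. The following is a minimal sketch, assuming that the adapter index of a language equals its position in `model.config.languages` (which should match the adapter table in this model card). The helper names `build_lang_ids` and `encode_mixed_batch` are hypothetical and not part of the library.

```python
from typing import List, Sequence


def build_lang_ids(codes: Sequence[str], languages: Sequence[str]) -> List[int]:
    """Map language codes such as "de_DE" to adapter indices.

    `languages` is the ordered list of adapter languages, e.g.
    `model.config.languages` (assumption: index == position in this list).
    """
    index = {code: i for i, code in enumerate(languages)}
    unknown = [code for code in codes if code not in index]
    if unknown:
        raise ValueError(f"No language adapter for: {unknown}")
    return [index[code] for code in codes]


def encode_mixed_batch(texts: Sequence[str], codes: Sequence[str]):
    """Encode a batch in which each example may use a different adapter."""
    # Imported lazily so that build_lang_ids works without these packages.
    import torch
    from transformers import AutoTokenizer, XmodModel

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = XmodModel.from_pretrained("facebook/xmod-base")
    model.eval()

    inputs = tokenizer(list(texts), padding=True, return_tensors="pt")
    lang_ids = torch.tensor(
        build_lang_ids(codes, model.config.languages), dtype=torch.long
    )
    with torch.no_grad():
        outputs = model(**inputs, lang_ids=lang_ids)
    return outputs.last_hidden_state
```

For example, `encode_mixed_batch(["Hello world", "Hallo Welt"], ["en_XX", "de_DE"])` would route the first sentence through the English adapter and the second through the German one.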
In the experiments in the original paper, the embedding layer and the language adapters are frozen during fine-tuning. A method for doing this is provided in the code:
    model.freeze_embeddings_and_language_adapters()
    # Fine-tune the model ...
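To check what the freezing step leaves trainable, one option is to count parameters by their `requires_grad` flag. The helper below is a hypothetical sketch (`summarize_trainable` is not part of `transformers`); it only assumes the standard PyTorch `named_parameters()` interface.

```python
from typing import Dict, Iterable, Tuple


def summarize_trainable(
    named_parameters: Iterable[Tuple[str, object]]
) -> Dict[str, int]:
    """Count trainable and frozen parameter elements.

    Each parameter only needs a `requires_grad` attribute and a `numel()`
    method, both of which torch.nn.Parameter provides.
    """
    counts = {"trainable": 0, "frozen": 0}
    for _name, param in named_parameters:
        key = "trainable" if param.requires_grad else "frozen"
        counts[key] += param.numel()
    return counts
```

Calling `summarize_trainable(model.named_parameters())` before and after `model.freeze_embeddings_and_language_adapters()` should show the frozen count increasing while the total stays the same.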
After fine-tuning, zero-shot cross-lingual transfer can be tested by activating the language adapter of the target language:
    model.set_default_language("de_DE")
    # Evaluate the model on German examples ...
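When several target languages are evaluated, the adapter switch can be wrapped in a small loop. This is only a sketch: `evaluate_zero_shot` and the user-supplied `evaluate` callback are hypothetical names, and the only model method it relies on is `set_default_language`.

```python
from typing import Callable, Dict, Iterable


def evaluate_zero_shot(
    model,
    languages: Iterable[str],
    evaluate: Callable[[object, str], float],
) -> Dict[str, float]:
    """Evaluate one fine-tuned model on several target languages.

    `evaluate(model, language)` is a user-supplied function that runs the
    task-specific evaluation for that language and returns a score.
    """
    scores = {}
    for language in languages:
        # Activate the adapter of the target language before evaluating.
        model.set_default_language(language)
        scores[language] = evaluate(model, language)
    return scores
```

A call such as `evaluate_zero_shot(model, ["de_DE", "fr_XX"], my_eval_fn)` would return one score per language code, where `my_eval_fn` stands in for whatever evaluation the task requires.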
Please refer to the model card of XLM-R, because X-MOD has a similar architecture and has been trained on similar training data.
BibTeX:
    @inproceedings{pfeiffer-etal-2022-lifting,
        title = "Lifting the Curse of Multilinguality by Pre-training Modular Transformers",
        author = "Pfeiffer, Jonas and Goyal, Naman and Lin, Xi and Li, Xian and Cross, James and Riedel, Sebastian and Artetxe, Mikel",
        booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
        month = jul,
        year = "2022",
        address = "Seattle, United States",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.naacl-main.255",
        doi = "10.18653/v1/2022.naacl-main.255",
        pages = "3479--3495"
    }
This model contains the following language adapters:
lang_id (Adapter index) | Language code | Language |
---|---|---|
0 | en_XX | English |
1 | id_ID | Indonesian |
2 | vi_VN | Vietnamese |
3 | ru_RU | Russian |
4 | fa_IR | Persian |
5 | sv_SE | Swedish |
6 | ja_XX | Japanese |
7 | fr_XX | French |
8 | de_DE | German |
9 | ro_RO | Romanian |
10 | ko_KR | Korean |
11 | hu_HU | Hungarian |
12 | es_XX | Spanish |
13 | fi_FI | Finnish |
14 | uk_UA | Ukrainian |
15 | da_DK | Danish |
16 | pt_XX | Portuguese |
17 | no_XX | Norwegian |
18 | th_TH | Thai |
19 | pl_PL | Polish |
20 | bg_BG | Bulgarian |
21 | nl_XX | Dutch |
22 | zh_CN | Chinese (simplified) |
23 | he_IL | Hebrew |
24 | el_GR | Greek |
25 | it_IT | Italian |
26 | sk_SK | Slovak |
27 | hr_HR | Croatian |
28 | tr_TR | Turkish |
29 | ar_AR | Arabic |
30 | cs_CZ | Czech |
31 | lt_LT | Lithuanian |
32 | hi_IN | Hindi |
33 | zh_TW | Chinese (traditional) |
34 | ca_ES | Catalan |
35 | ms_MY | Malay |
36 | sl_SI | Slovenian |
37 | lv_LV | Latvian |
38 | ta_IN | Tamil |
39 | bn_IN | Bengali |
40 | et_EE | Estonian |
41 | az_AZ | Azerbaijani |
42 | sq_AL | Albanian |
43 | sr_RS | Serbian |
44 | kk_KZ | Kazakh |
45 | ka_GE | Georgian |
46 | tl_XX | Tagalog |
47 | ur_PK | Urdu |
48 | is_IS | Icelandic |
49 | hy_AM | Armenian |
50 | ml_IN | Malayalam |
51 | mk_MK | Macedonian |
52 | be_BY | Belarusian |
53 | la_VA | Latin |
54 | te_IN | Telugu |
55 | eu_ES | Basque |
56 | gl_ES | Galician |
57 | mn_MN | Mongolian |
58 | kn_IN | Kannada |
59 | ne_NP | Nepali |
60 | sw_KE | Swahili |
61 | si_LK | Sinhala |
62 | mr_IN | Marathi |
63 | af_ZA | Afrikaans |
64 | gu_IN | Gujarati |
65 | cy_GB | Welsh |
66 | eo_EO | Esperanto |
67 | km_KH | Central Khmer |
68 | ky_KG | Kirghiz |
69 | uz_UZ | Uzbek |
70 | ps_AF | Pashto |
71 | pa_IN | Punjabi |
72 | ga_IE | Irish |
73 | ha_NG | Hausa |
74 | am_ET | Amharic |
75 | lo_LA | Lao |
76 | ku_TR | Kurdish |
77 | so_SO | Somali |
78 | my_MM | Burmese |
79 | or_IN | Oriya |
80 | sa_IN | Sanskrit |