Model:
xlm-mlm-tlm-xnli15-1024
The XLM model was proposed in Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau. xlm-mlm-tlm-xnli15-1024 is a transformer pretrained using a masked language modeling (MLM) objective in combination with a translation language modeling (TLM) objective and then fine-tuned on the English NLI dataset. The model developers evaluated the capacity of the model to make correct predictions in all 15 XNLI languages (see the XNLI data card for further information on XNLI).
The model is a language model that can be used for cross-lingual text classification. Though the model was fine-tuned on English text data, its ability to classify sentences in the 14 other XNLI languages has been evaluated (see Evaluation).
This model can be used for downstream tasks related to natural language inference in different languages. For more information, see the associated paper.
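As an illustration, here is a minimal usage sketch with the Hugging Face transformers library. The use of XLMForSequenceClassification, the three-way label count, and whether this checkpoint actually ships the fine-tuned classification head are assumptions, not guarantees from this card; verify against the checkpoint before relying on the outputs.

```python
import torch
from transformers import XLMTokenizer, XLMForSequenceClassification

# Hypothetical sketch: premise/hypothesis pair classification.
# If the checkpoint lacks a classification head, transformers will
# randomly initialize one and warn; fine-tuning would then be needed.
tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-tlm-xnli15-1024")
model = XLMForSequenceClassification.from_pretrained(
    "xlm-mlm-tlm-xnli15-1024", num_labels=3
)

premise = "The cat sat on the mat."
hypothesis = "An animal is resting."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = logits.argmax(dim=-1).item()
```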
The model should not be used to intentionally create hostile or alienating environments for people.
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Training details are culled from the associated paper. See the paper for links, citations, and further details. Also see the associated GitHub Repo.
The model developers write:
We use WikiExtractor to extract raw sentences from Wikipedia dumps and use them as monolingual data for the CLM and MLM objectives. For the TLM objective, we only use parallel data that involves English, similar to Conneau et al. (2018b).
For fine-tuning, the developers used the English NLI dataset (see the XNLI data card).
The model developers write:
We use fastBPE to learn BPE codes and split words into subword units. The BPE codes are learned on the concatenation of sentences sampled from all languages, following the method presented in Section 3.1.
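The paper used fastBPE for this step. As a rough analogue (not the authors' pipeline), the sketch below learns a BPE model on a concatenation of text sampled from all languages using the Hugging Face tokenizers library; the file name and special tokens are illustrative assumptions, and the 95k vocabulary size is taken from the quoted fine-tuning details below.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Analogous illustration (not fastBPE): learn BPE merges on text
# concatenated from all languages, then split words into subwords.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=95000, special_tokens=["<unk>"])
tokenizer.train(files=["multilingual_sample.txt"], trainer=trainer)  # hypothetical file

print(tokenizer.encode("internationalization").tokens)  # subword units
```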
The model developers write:
We use a Transformer architecture with 1024 hidden units, 8 heads, GELU activations (Hendrycks and Gimpel, 2016), a dropout rate of 0.1 and learned positional embeddings. We train our models with the Adam optimizer (Kingma and Ba, 2014), a linear warm-up (Vaswani et al., 2017) and learning rates varying from 10^−4 to 5×10^−4.
For the CLM and MLM objectives, we use streams of 256 tokens and mini-batches of size 64. Unlike Devlin et al. (2018), a sequence in a mini-batch can contain more than two consecutive sentences, as explained in Section 3.2. For the TLM objective, we sample mini-batches of 4000 tokens composed of sentences with similar lengths. We use the averaged perplexity over languages as a stopping criterion for training. For machine translation, we only use 6 layers, and we create mini-batches of 2000 tokens.
When fine-tuning on XNLI, we use mini-batches of size 8 or 16, and we clip the sentence length to 256 words. We use 80k BPE splits and a vocabulary of 95k and train a 12-layer model on the Wikipedias of the XNLI languages. We sample the learning rate of the Adam optimizer with values from 5×10^−4 to 2×10^−4, and use small evaluation epochs of 20000 random samples. We use the first hidden state of the last layer of the transformer as input to the randomly initialized final linear classifier, and fine-tune all parameters. In our experiments, using either max-pooling or mean-pooling over the last layer did not work better than using the first hidden state.
We implement all our models in PyTorch (Paszke et al., 2017), and train them on 64 Volta GPUs for the language modeling tasks, and 8 GPUs for the MT tasks. We use float16 operations to speed up training and to reduce the memory usage of our models.
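To make the quoted fine-tuning setup concrete, here is a minimal sketch of the described head: the first hidden state of the last transformer layer feeds a randomly initialized linear classifier, and all parameters would be fine-tuned jointly. The three-way output reflects the NLI labels (entailment/neutral/contradiction); the example sentence pair and the use of transformers' XLMModel are assumptions for illustration.

```python
import torch.nn as nn
from transformers import XLMTokenizer, XLMModel

tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-tlm-xnli15-1024")
encoder = XLMModel.from_pretrained("xlm-mlm-tlm-xnli15-1024")
classifier = nn.Linear(encoder.config.emb_dim, 3)  # randomly initialized head

inputs = tokenizer(
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
    return_tensors="pt",
)
hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, 1024)
logits = classifier(hidden[:, 0])             # first hidden state of last layer
```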
After fine-tuning the model on the English NLI dataset, the model developers evaluated its capacity to make correct predictions in the 15 XNLI languages, using the XNLI data and test accuracy as the metric. See the associated paper for further details.
Language | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Accuracy | 85.0 | 78.7 | 78.9 | 77.8 | 76.6 | 77.4 | 75.3 | 72.5 | 73.1 | 76.1 | 73.2 | 76.5 | 69.6 | 68.4 | 67.3 |
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
Details are culled from the associated paper. See the paper for links, citations, and further details. Also see the associated GitHub Repo.
xlm-mlm-tlm-xnli15-1024 is a transformer pretrained using a masked language modeling (MLM) objective in combination with a translation language modeling (TLM) objective and then fine-tuned on the English NLI dataset. Regarding the TLM objective, the model developers write:
We introduce a new translation language modeling (TLM) objective for improving cross-lingual pretraining. Our TLM objective is an extension of MLM, where instead of considering monolingual text streams, we concatenate parallel sentences as illustrated in Figure 1. We randomly mask words in both the source and target sentences. To predict a word masked in an English sentence, the model can either attend to surrounding English words or to the French translation, encouraging the model to align the English and French representations.
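As a toy illustration of that description (not the authors' code), the sketch below concatenates a parallel sentence pair into a single stream and randomly masks tokens on both sides; the separator token and masking probability are assumptions.

```python
import random

def tlm_mask(src_tokens, tgt_tokens, mask_token="<mask>", p=0.15):
    """Concatenate a parallel pair and randomly mask both sides, so a
    masked source word can be predicted from either the surrounding
    source context or the parallel target sentence."""
    stream = src_tokens + ["</s>"] + tgt_tokens  # separator is an assumption
    masked, targets = [], []
    for tok in stream:
        if tok != "</s>" and random.random() < p:
            masked.append(mask_token)
            targets.append(tok)    # prediction target
        else:
            masked.append(tok)
            targets.append(None)   # not predicted
    return masked, targets

masked, targets = tlm_mask("the cat sleeps".split(), "le chat dort".split())
```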
The developers write:
We implement all our models in PyTorch (Paszke et al., 2017), and train them on 64 Volta GPUs for the language modeling tasks, and 8 GPUs for the MT tasks. We use float16 operations to speed up training and to reduce the memory usage of our models.
BibTeX:
@article{lample2019cross,
  title={Cross-lingual language model pretraining},
  author={Lample, Guillaume and Conneau, Alexis},
  journal={arXiv preprint arXiv:1901.07291},
  year={2019}
}
APA:
Lample, G., & Conneau, A. (2019). Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
This model card was written by the team at Hugging Face.
This model uses language embeddings to specify the language used at inference. See the Hugging Face Multilingual Models for Inference docs for further details.
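For reference, a minimal sketch along the lines of those docs: build a langs tensor from the tokenizer's lang2id mapping and pass it alongside the input IDs (the sample sentence is illustrative).

```python
import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-tlm-xnli15-1024")
model = XLMWithLMHeadModel.from_pretrained("xlm-mlm-tlm-xnli15-1024")

input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])
language_id = tokenizer.lang2id["en"]            # e.g. English
langs = torch.full_like(input_ids, language_id)  # one language id per token

outputs = model(input_ids, langs=langs)
```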