Model: xlm-roberta-large-finetuned-conll03-english
Task: token classification
Language: multilingual

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multilingual language model, trained on 2.5TB of filtered CommonCrawl data. This model is XLM-RoBERTa-large fine-tuned on the English conll2003 dataset.
The model is a language model fine-tuned for token classification, a natural language understanding task in which a label is assigned to some tokens in a text.
Potential downstream use cases include Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. To learn more about token classification and other potential downstream use cases, see the Hugging Face token classification docs.
The model should not be used to intentionally create hostile or alienating environments for people.
CONTENT WARNING: Readers should be made aware that language generated by this model may be disturbing or offensive to some and may propagate historical and current stereotypes.
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). In the context of tasks relevant to this model, Mishra et al. (2020) explore social biases in NER systems for English and find that there is systematic bias in existing NER systems in that they fail to identify named entities from different demographic groups (though this paper did not look at BERT). For example, using a sample sentence from Mishra et al. (2020):
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Alya told Jasmine that Andrew could pay with cash..")
[{'end': 2, 'entity': 'I-PER', 'index': 1, 'score': 0.9997861, 'start': 0, 'word': '▁Al'},
 {'end': 4, 'entity': 'I-PER', 'index': 2, 'score': 0.9998591, 'start': 2, 'word': 'ya'},
 {'end': 16, 'entity': 'I-PER', 'index': 4, 'score': 0.99995816, 'start': 10, 'word': '▁Jasmin'},
 {'end': 17, 'entity': 'I-PER', 'index': 5, 'score': 0.9999584, 'start': 16, 'word': 'e'},
 {'end': 29, 'entity': 'I-PER', 'index': 7, 'score': 0.99998057, 'start': 23, 'word': '▁Andrew'}]
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
See the following resources for training data and training procedure details:
See the associated paper for evaluation details.
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
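This card does not report the hardware, training time, or region used for fine-tuning, so no figure can be given here. As a rough illustration of the kind of arithmetic such an estimate involves (energy consumed multiplied by the carbon intensity of the grid), here is a minimal sketch. The function `estimate_co2_kg`, its parameters, and the input values are all hypothetical placeholders, and this is a simplification rather than the calculator's actual method.

```python
def estimate_co2_kg(power_watts, hours, kg_co2_per_kwh, pue=1.0):
    """Rough CO2-equivalent estimate (kg) for a compute job.

    power_watts     -- average power draw of the hardware (assumed input)
    hours           -- wall-clock duration of the job
    kg_co2_per_kwh  -- carbon intensity of the electricity grid
    pue             -- data-center power usage effectiveness (>= 1.0)
    """
    if min(power_watts, hours, kg_co2_per_kwh) < 0 or pue < 1.0:
        raise ValueError("inputs must be non-negative and pue must be >= 1.0")
    # Convert watt-hours to kilowatt-hours, then scale by facility overhead.
    energy_kwh = power_watts * hours / 1000.0 * pue
    return energy_kwh * kg_co2_per_kwh


# Placeholder numbers only; they do not describe how this model was trained.
estimate = estimate_co2_kg(power_watts=300, hours=10, kg_co2_per_kwh=0.4)
print(f"{estimate:.2f} kg CO2eq")
```

Substituting measured power draw and a region-specific carbon intensity would give a more meaningful number than the placeholders above.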
See the associated paper for further details.
BibTeX:
@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}
APA:
This model card was written by the team at Hugging Face.
Use the code below to get started with the model. You can use this model directly within a pipeline for NER.
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Hello I'm Omar and I live in Zürich.")
[{'end': 14, 'entity': 'I-PER', 'index': 5, 'score': 0.9999175, 'start': 10, 'word': '▁Omar'},
 {'end': 35, 'entity': 'I-LOC', 'index': 10, 'score': 0.9999906, 'start': 29, 'word': '▁Zürich'}]
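The pipeline returns one prediction per subword piece, so a name such as "Alya" in the earlier example comes back as '▁Al' and 'ya'. The sketch below shows one possible way to join such pieces back into whole words using the `start` and `end` character offsets. `merge_pieces` is a hypothetical helper written for this illustration, not part of transformers, and it assumes the dictionary shape shown in the outputs above. The pipeline also accepts an `aggregation_strategy` argument that may be a better fit in practice.

```python
def merge_pieces(text, pieces):
    """Join adjacent subword predictions that share a label into spans.

    Only pieces whose character offsets touch (end == next start) are
    merged, so separate words stay separate even with the same label.
    """
    spans = []
    for piece in sorted(pieces, key=lambda d: d["start"]):
        last = spans[-1] if spans else None
        if (last is not None
                and last["entity"] == piece["entity"]
                and last["end"] == piece["start"]):
            # Extend the current span; keep the least confident score.
            last["end"] = piece["end"]
            last["score"] = min(last["score"], piece["score"])
        else:
            spans.append({
                "entity": piece["entity"],
                "start": piece["start"],
                "end": piece["end"],
                "score": piece["score"],
            })
    for span in spans:
        # Recover the surface text from the original string.
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Predictions copied from the bias example earlier in this card.
sentence = "Alya told Jasmine that Andrew could pay with cash.."
predictions = [
    {"end": 2, "entity": "I-PER", "index": 1, "score": 0.9997861, "start": 0, "word": "▁Al"},
    {"end": 4, "entity": "I-PER", "index": 2, "score": 0.9998591, "start": 2, "word": "ya"},
    {"end": 16, "entity": "I-PER", "index": 4, "score": 0.99995816, "start": 10, "word": "▁Jasmin"},
    {"end": 17, "entity": "I-PER", "index": 5, "score": 0.9999584, "start": 16, "word": "e"},
    {"end": 29, "entity": "I-PER", "index": 7, "score": 0.99998057, "start": 23, "word": "▁Andrew"},
]

for span in merge_pieces(sentence, predictions):
    print(span["entity"], span["text"])
```

Because this model's output uses only `I-` style tags in the examples shown, the helper relies on offsets rather than tag prefixes to decide where a word ends. Multi-word names would need an additional rule, which is left out of this sketch.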