Model:
distilbert-base-uncased
This model is a distilled version of the BERT base model. It was introduced in this paper. The code for the distillation process can be found here. This model is uncased: it makes no distinction between english and English.
DistilBERT is a transformers model that is smaller and faster than BERT, pretrained in a self-supervised fashion on the same corpus as the BERT base model. This means it was pretrained on raw text only, with no human labelling of any kind (which is why it can use so much publicly available data), using an automatic process that generates inputs and labels from those texts with the help of the BERT base model. More precisely, it was pretrained with the following three objectives:

- Distillation loss: the model was trained to return the same probabilities as the BERT base model.
- Masked language modeling (MLM): 15% of the words in the input are randomly masked, and the model has to predict them from the rest of the (bidirectional) context.
- Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.
This way, the model learns the same inner representation of the English language as its teacher model, while being faster at inference and on downstream tasks.
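As a rough illustration only (not taken from this card or the released training code), the three objectives above can be combined into a single training loss along the following lines; the `temperature` value and the equal weighting of the three terms are assumptions made for the sketch:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the three pretraining objectives described above.
# `student_logits` / `teacher_logits` are MLM logits; `student_hidden` /
# `teacher_hidden` are last-layer hidden states. TEMPERATURE is an assumed value.
TEMPERATURE = 2.0

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden, mlm_labels):
    # 1. Distillation loss: match the teacher's softened output distribution
    loss_kd = F.kl_div(
        F.log_softmax(student_logits / TEMPERATURE, dim=-1),
        F.softmax(teacher_logits / TEMPERATURE, dim=-1),
        reduction='batchmean',
    ) * TEMPERATURE ** 2

    # 2. Masked language modeling loss on the masked positions (label -100 = ignored)
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # 3. Cosine embedding loss: pull student hidden states towards the teacher's
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1), device=student_hidden.device)
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    # Equal weighting here is only for illustration
    return loss_kd + loss_mlm + loss_cos
```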
You can use the raw model for masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.
Note that this model is primarily intended to be fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.
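As a minimal sketch (not part of the original card) of what loading the model for one such downstream task looks like, here is `DistilBertForSequenceClassification` with a freshly initialized two-class head; the example text and label below are placeholders:

```python
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Pretrained encoder plus a randomly initialized classification head to fine-tune
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# A toy labelled example; in practice you would loop over a real labelled dataset
inputs = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
labels = torch.tensor([1])  # hypothetical class index

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # the loss you would minimize during fine-tuning
print(outputs.logits)    # per-class scores for the input
```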
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.05292855575680733,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.03968575969338417,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a business model. [SEP]",
  'score': 0.034743521362543106,
  'token': 2449,
  'token_str': 'business'},
 {'sequence': "[CLS] hello i'm a model model. [SEP]",
  'score': 0.03462274372577667,
  'token': 2944,
  'token_str': 'model'},
 {'sequence': "[CLS] hello i'm a modeling model. [SEP]",
  'score': 0.018145186826586723,
  'token': 11643,
  'token_str': 'modeling'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import DistilBertTokenizer, TFDistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
Even though the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions. It also inherits some of the bias of its teacher model. For example:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("The White man worked as a [MASK].")

[{'sequence': '[CLS] the white man worked as a blacksmith. [SEP]',
  'score': 0.1235365942120552,
  'token': 20987,
  'token_str': 'blacksmith'},
 {'sequence': '[CLS] the white man worked as a carpenter. [SEP]',
  'score': 0.10142576694488525,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the white man worked as a farmer. [SEP]',
  'score': 0.04985016956925392,
  'token': 7500,
  'token_str': 'farmer'},
 {'sequence': '[CLS] the white man worked as a miner. [SEP]',
  'score': 0.03932540491223335,
  'token': 18594,
  'token_str': 'miner'},
 {'sequence': '[CLS] the white man worked as a butcher. [SEP]',
  'score': 0.03351764753460884,
  'token': 14998,
  'token_str': 'butcher'}]

>>> unmasker("The Black woman worked as a [MASK].")

[{'sequence': '[CLS] the black woman worked as a waitress. [SEP]',
  'score': 0.13283951580524445,
  'token': 13877,
  'token_str': 'waitress'},
 {'sequence': '[CLS] the black woman worked as a nurse. [SEP]',
  'score': 0.12586183845996857,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the black woman worked as a maid. [SEP]',
  'score': 0.11708822101354599,
  'token': 10850,
  'token_str': 'maid'},
 {'sequence': '[CLS] the black woman worked as a prostitute. [SEP]',
  'score': 0.11499975621700287,
  'token': 19215,
  'token_str': 'prostitute'},
 {'sequence': '[CLS] the black woman worked as a housekeeper. [SEP]',
  'score': 0.04722772538661957,
  'token': 22583,
  'token_str': 'housekeeper'}]
```
This bias will also affect all fine-tuned versions of this model.
DistilBERT was pretrained on the same data as BERT, namely a corpus of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers).
The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model then look like:
[CLS] Sentence A [SEP] Sentence B [SEP]
With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; otherwise, sentence B is another random sentence from the corpus. Note that what counts as a "sentence" here is a consecutive span of text that is usually longer than a single sentence. The only constraint is that the combined length of the two "sentences" does not exceed 512 tokens.
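As a small sketch (not in the original card), you can see this layout by passing a sentence pair to the tokenizer; the example strings are placeholders:

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Passing two texts produces the [CLS] Sentence A [SEP] Sentence B [SEP] layout
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# ['[CLS]', 'sentence', 'a', '[SEP]', 'sentence', 'b', '[SEP]']
```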
The details of the masking procedure for each sentence are the following:

- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by [MASK].
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of the cases, the masked tokens are left as is.
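For illustration only (not from the original card), the same 15% / 80-10-10 scheme is what the `DataCollatorForLanguageModeling` helper in transformers applies when it prepares MLM batches:

```python
from transformers import DistilBertTokenizer, DataCollatorForLanguageModeling

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# mlm_probability=0.15 selects 15% of the tokens; of those, 80% become [MASK],
# 10% become a random token and 10% are left unchanged
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("Replace me by any text you'd like.")])
print(batch['input_ids'])  # some positions replaced by the [MASK] token id
print(batch['labels'])     # -100 everywhere except at the selected positions
```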
The model was trained for 90 hours on 8 16 GB V100 GPUs. See the training code for all hyperparameter details.
When fine-tuned on downstream tasks, this model achieves the following results:
GLUE test results:
| Task  | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
|:-----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| Score | 82.2 | 88.5 | 89.2 | 91.3  | 51.3 | 85.8  | 87.5 | 59.9 |
```bibtex
@article{Sanh2019DistilBERTAD,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.01108}
}
```