Model:
distilbert-base-uncased
This model is a distilled version of the BERT base model. It was introduced in this paper. The code for the distillation process can be found here. This model is uncased: it does not make a difference between english and English.
DistilBERT is a transformers model, smaller and faster than BERT, pretrained in a self-supervised fashion on the same corpus as the BERT base model, using the BERT base model as a teacher. This means it was pretrained on raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automated process that uses the BERT base model to generate inputs and labels from those texts. More precisely, it was pretrained with the following three objectives:
- Distillation loss: the model was trained to return the same probabilities as the BERT base model.
- Masked language modeling (MLM): part of the input tokens are randomly masked and the model has to predict them, which lets it learn a bidirectional representation of the sentence.
- Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.
This way, the model learns the same inner representation of the English language as its teacher model, while being faster at inference and on downstream tasks.
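To make the combination of these objectives concrete, here is a minimal PyTorch sketch of the three loss terms. The tensor names (student_logits, teacher_logits, labels, student_hidden, teacher_hidden), the temperature and the equal weighting are illustrative assumptions, not the exact setup of the released training code.
import torch
import torch.nn.functional as F

def distillation_objectives(student_logits, teacher_logits, labels,
                            student_hidden, teacher_hidden, temperature=2.0):
    # 1) Distillation loss: match the teacher's softened output distribution.
    loss_distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 2) Masked language modeling loss on the masked positions only.
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # convention: -100 marks unmasked positions
    )

    # 3) Cosine embedding loss: pull student hidden states toward the teacher's.
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0))
    loss_cos = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    # Equal weights for simplicity; the actual training uses its own weighting.
    return loss_distill + loss_mlm + loss_cos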
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at a model like GPT2.
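As an illustration, one such fine-tuned version of this checkpoint, distilbert-base-uncased-finetuned-sst-2-english (sequence classification on SST-2), can be loaded directly through a pipeline:
from transformers import pipeline

# Load a fine-tuned DistilBERT checkpoint for sentiment classification
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("I really enjoyed this movie."))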
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")
[{'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.05292855575680733,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.03968575969338417,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a business model. [SEP]",
  'score': 0.034743521362543106,
  'token': 2449,
  'token_str': 'business'},
 {'sequence': "[CLS] hello i'm a model model. [SEP]",
  'score': 0.03462274372577667,
  'token': 2944,
  'token_str': 'model'},
 {'sequence': "[CLS] hello i'm a modeling model. [SEP]",
  'score': 0.018145186826586723,
  'token': 11643,
  'token_str': 'modeling'}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import DistilBertTokenizer, DistilBertModel
# Load the tokenizer and the pretrained model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
# Tokenize the text and return PyTorch tensors
encoded_input = tokenizer(text, return_tensors='pt')
# Forward pass; output.last_hidden_state holds the features
output = model(**encoded_input)
And in TensorFlow:
from transformers import DistilBertTokenizer, TFDistilBertModel
# Load the tokenizer and the pretrained TensorFlow model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
# Tokenize the text and return TensorFlow tensors
encoded_input = tokenizer(text, return_tensors='tf')
# Forward pass; output.last_hidden_state holds the features
output = model(encoded_input)
Even if the training data used for this model could be characterized as fairly neutral, the model can still make biased predictions. It also inherits some of the bias of its teacher model.
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("The White man worked as a [MASK].")
[{'sequence': '[CLS] the white man worked as a blacksmith. [SEP]',
  'score': 0.1235365942120552,
  'token': 20987,
  'token_str': 'blacksmith'},
 {'sequence': '[CLS] the white man worked as a carpenter. [SEP]',
  'score': 0.10142576694488525,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the white man worked as a farmer. [SEP]',
  'score': 0.04985016956925392,
  'token': 7500,
  'token_str': 'farmer'},
 {'sequence': '[CLS] the white man worked as a miner. [SEP]',
  'score': 0.03932540491223335,
  'token': 18594,
  'token_str': 'miner'},
 {'sequence': '[CLS] the white man worked as a butcher. [SEP]',
  'score': 0.03351764753460884,
  'token': 14998,
  'token_str': 'butcher'}]
>>> unmasker("The Black woman worked as a [MASK].")
[{'sequence': '[CLS] the black woman worked as a waitress. [SEP]',
  'score': 0.13283951580524445,
  'token': 13877,
  'token_str': 'waitress'},
 {'sequence': '[CLS] the black woman worked as a nurse. [SEP]',
  'score': 0.12586183845996857,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the black woman worked as a maid. [SEP]',
  'score': 0.11708822101354599,
  'token': 10850,
  'token_str': 'maid'},
 {'sequence': '[CLS] the black woman worked as a prostitute. [SEP]',
  'score': 0.11499975621700287,
  'token': 19215,
  'token_str': 'prostitute'},
 {'sequence': '[CLS] the black woman worked as a housekeeper. [SEP]',
  'score': 0.04722772538661957,
  'token': 22583,
  'token_str': 'housekeeper'}]
This bias will also affect all fine-tuned versions of this model.
DistilBERT was pretrained on the same data as BERT, namely BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables and headers).
The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are then of the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; otherwise, sentence B is another random sentence from the corpus. Note that what counts as a sentence here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the combined length of the two "sentences" is at most 512 tokens.
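As a quick illustration of this format, passing a sentence pair to the tokenizer produces exactly this layout with the special tokens added automatically:
from transformers import DistilBertTokenizer

# Tokenize a sentence pair and decode it back to see the input format above
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.decode(encoded["input_ids"]))
# [CLS] sentence a [SEP] sentence b [SEP]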
The details of the masking procedure for each sentence are as follows:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by [MASK].
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of the cases, the masked tokens are left as is.
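A roughly equivalent masking setup can be sketched with the transformers data collator; this is an illustration, not the actual preprocessing code used for pretraining.
from transformers import DistilBertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
# mlm_probability=0.15 matches the 15% masking rate above; the collator's
# default split is 80% [MASK] / 10% random token / 10% left unchanged.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator([tokenizer("Hello, this is a sample sentence.")])
print(batch["input_ids"])  # some token ids replaced by the [MASK] id or a random id
print(batch["labels"])     # -100 everywhere except at the masked positions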
The model was trained on 8 V100 GPUs (16 GB each) for 90 hours. See the training code for all hyperparameter details.
When fine-tuned on downstream tasks, this model achieves the following results:
GLUE test results:
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:---:|:----:|:-----:|:----:|:-----:|:----:|:---:|
|      | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 |
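A fine-tuning run on one of these tasks could be set up roughly as follows. This is a minimal sketch with placeholder hyperparameters, assuming the datasets library; it is not the script that produced the numbers above.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Fine-tune distilbert-base-uncased on one GLUE task (SST-2) as an example
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-sst2", num_train_epochs=3),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()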
@article{Sanh2019DistilBERTAD,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.01108}
}
