Model:
distilbert-base-uncased
This model is a distilled version of the BERT base model. It was introduced in this paper. The code for the distillation process can be found here. This model is uncased: it does not make a difference between english and English.
DistilBERT is a transformers model, smaller and faster than BERT, pretrained in a self-supervised fashion on the same corpus as the BERT base model, using the BERT base model as a teacher. This means it was pretrained on raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automated process that uses the BERT base model to generate inputs and labels from those texts. More precisely, it was pretrained with the following three objectives:
- Distillation loss: the model was trained to return the same probabilities as the BERT base model.
- Masked language modeling (MLM): part of the input tokens are randomly masked and the model has to predict them, which lets it learn a bidirectional representation of the sentence.
- Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.
This way, the model learns the same inner representation of the English language as its teacher model, while being faster at inference and on downstream tasks.
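To make the combination of these objectives concrete, here is a minimal PyTorch sketch of the three loss terms. The tensor names (student_logits, teacher_logits, labels, student_hidden, teacher_hidden), the temperature and the equal weighting are illustrative assumptions, not the exact setup of the released training code.
import torch
import torch.nn.functional as F

def distillation_objectives(student_logits, teacher_logits, labels,
                            student_hidden, teacher_hidden, temperature=2.0):
    # 1) Distillation loss: match the teacher's softened output distribution.
    loss_distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 2) Masked language modeling loss on the masked positions only.
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # convention: -100 marks unmasked positions
    )

    # 3) Cosine embedding loss: pull student hidden states toward the teacher's.
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0))
    loss_cos = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    # Equal weights for simplicity; the actual training uses its own weighting.
    return loss_distill + loss_mlm + loss_cos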
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at a model like GPT2.
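As an illustration, one such fine-tuned version of this checkpoint, distilbert-base-uncased-finetuned-sst-2-english (sequence classification on SST-2), can be loaded directly through a pipeline:
from transformers import pipeline

# Load a fine-tuned DistilBERT checkpoint for sentiment classification
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("I really enjoyed this movie."))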
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")
[{'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.05292855575680733,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.03968575969338417,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a business model. [SEP]",
  'score': 0.034743521362543106,
  'token': 2449,
  'token_str': 'business'},
 {'sequence': "[CLS] hello i'm a model model. [SEP]",
  'score': 0.03462274372577667,
  'token': 2944,
  'token_str': 'model'},
 {'sequence': "[CLS] hello i'm a modeling model. [SEP]",
  'score': 0.018145186826586723,
  'token': 11643,
  'token_str': 'modeling'}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import DistilBertTokenizer, DistilBertModel
# Load the tokenizer and the pretrained model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
# Tokenize the text and return PyTorch tensors
encoded_input = tokenizer(text, return_tensors='pt')
# Forward pass; output.last_hidden_state holds the features
output = model(**encoded_input)
And in TensorFlow:
from transformers import DistilBertTokenizer, TFDistilBertModel
# Load the tokenizer and the pretrained TensorFlow model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
text = "Replace me by any text you'd like."
# Tokenize the text and return TensorFlow tensors
encoded_input = tokenizer(text, return_tensors='tf')
# Forward pass; output.last_hidden_state holds the features
output = model(encoded_input)
Even if the training data used for this model could be characterized as fairly neutral, the model can still make biased predictions. It also inherits some of the bias of its teacher model.
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
>>> unmasker("The White man worked as a [MASK].")
[{'sequence': '[CLS] the white man worked as a blacksmith. [SEP]',
  'score': 0.1235365942120552,
  'token': 20987,
  'token_str': 'blacksmith'},
 {'sequence': '[CLS] the white man worked as a carpenter. [SEP]',
  'score': 0.10142576694488525,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the white man worked as a farmer. [SEP]',
  'score': 0.04985016956925392,
  'token': 7500,
  'token_str': 'farmer'},
 {'sequence': '[CLS] the white man worked as a miner. [SEP]',
  'score': 0.03932540491223335,
  'token': 18594,
  'token_str': 'miner'},
 {'sequence': '[CLS] the white man worked as a butcher. [SEP]',
  'score': 0.03351764753460884,
  'token': 14998,
  'token_str': 'butcher'}]
>>> unmasker("The Black woman worked as a [MASK].")
[{'sequence': '[CLS] the black woman worked as a waitress. [SEP]',
  'score': 0.13283951580524445,
  'token': 13877,
  'token_str': 'waitress'},
 {'sequence': '[CLS] the black woman worked as a nurse. [SEP]',
  'score': 0.12586183845996857,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the black woman worked as a maid. [SEP]',
  'score': 0.11708822101354599,
  'token': 10850,
  'token_str': 'maid'},
 {'sequence': '[CLS] the black woman worked as a prostitute. [SEP]',
  'score': 0.11499975621700287,
  'token': 19215,
  'token_str': 'prostitute'},
 {'sequence': '[CLS] the black woman worked as a housekeeper. [SEP]',
  'score': 0.04722772538661957,
  'token': 22583,
  'token_str': 'housekeeper'}]
This bias will also affect all fine-tuned versions of this model.
DistilBERT was pretrained on the same data as BERT, namely BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables and headers).
The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are then of the form:
[CLS] Sentence A [SEP] Sentence B [SEP]
With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; otherwise, sentence B is another random sentence from the corpus. Note that what counts as a sentence here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the combined length of the two "sentences" is at most 512 tokens.
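As a quick illustration of this format, passing a sentence pair to the tokenizer produces exactly this layout with the special tokens added automatically:
from transformers import DistilBertTokenizer

# Tokenize a sentence pair and decode it back to see the input format above
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.decode(encoded["input_ids"]))
# [CLS] sentence a [SEP] sentence b [SEP]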
The details of the masking procedure for each sentence are as follows:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by [MASK].
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of the cases, the masked tokens are left as is.
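A roughly equivalent masking setup can be sketched with the transformers data collator; this is an illustration, not the actual preprocessing code used for pretraining.
from transformers import DistilBertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
# mlm_probability=0.15 matches the 15% masking rate above; the collator's
# default split is 80% [MASK] / 10% random token / 10% left unchanged.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator([tokenizer("Hello, this is a sample sentence.")])
print(batch["input_ids"])  # some token ids replaced by the [MASK] id or a random id
print(batch["labels"])     # -100 everywhere except at the masked positions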
The model was trained on 8 V100 GPUs (16 GB each) for 90 hours. See the training code for all hyperparameter details.
When fine-tuned on downstream tasks, this model achieves the following results:
GLUE test results:
| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:---:|:----:|:-----:|:----:|:-----:|:----:|:---:|
|      | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 |
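A fine-tuning run on one of these tasks could be set up roughly as follows. This is a minimal sketch with placeholder hyperparameters, assuming the datasets library; it is not the script that produced the numbers above.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Fine-tune distilbert-base-uncased on one GLUE task (SST-2) as an example
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-sst2", num_train_epochs=3),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()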
@article{Sanh2019DistilBERTAD,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.01108}
}
