Model: Addedk/kbbert-distilled-cased
This model is a distilled version of KB-BERT. It was distilled using Swedish data, specifically the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The code for the distillation process can be found here. This work was done as part of my Master's Thesis: Task-agnostic knowledge distillation of mBERT to Swedish.
This is a 6-layer version of KB-BERT, distilled using the LightMBERT distillation method but without freezing the embedding layer.
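The reduced depth can be checked directly from the model configuration. The snippet below is a minimal sketch using the Transformers library; the expected values in the comments reflect the 6-layer architecture described above.

```python
from transformers import AutoConfig

# Inspect the architecture: the distilled model should report 6 hidden layers,
# compared to the 12 layers of the original KB-BERT.
config = AutoConfig.from_pretrained("Addedk/kbbert-distilled-cased")
print(config.num_hidden_layers)  # expected: 6
print(config.hidden_size)        # hidden width inherited from KB-BERT
```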
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.
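For example, the raw model can be used with a fill-mask pipeline. This is a minimal sketch; the Swedish example sentence is illustrative only and assumes the standard BERT [MASK] token.

```python
from transformers import pipeline

# Load the distilled model for masked language modeling.
unmasker = pipeline("fill-mask", model="Addedk/kbbert-distilled-cased")

# Swedish example: "The capital of Sweden is [MASK]."
for prediction in unmasker("Sveriges huvudstad är [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```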
The data used for distillation was the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The tokenized data had a file size of approximately 7.4 GB.
When evaluated on the SUCX 3.0 dataset, it achieved an average F1 score of 0.887, which is competitive with the 0.894 obtained by KB-BERT.
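For reference, a downstream fine-tuning run of this kind could be set up with the Transformers Trainer as sketched below. This is only an illustrative sketch: the output directory, label count, and hyperparameters are placeholders, not the exact settings used for the reported results, and dataset preparation is omitted.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical NER fine-tuning setup; num_labels must match the label set
# of the dataset you use (the value below is only a placeholder).
tokenizer = AutoTokenizer.from_pretrained("Addedk/kbbert-distilled-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "Addedk/kbbert-distilled-cased", num_labels=9  # placeholder label count
)

training_args = TrainingArguments(
    output_dir="kbbert-distilled-ner",   # hypothetical output directory
    learning_rate=5e-5,                  # placeholder hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=32,
)

# train_dataset / eval_dataset should be tokenized, label-aligned NER splits
# (e.g. from SUCX 3.0); their preparation is omitted here.
trainer = Trainer(
    model=model,
    args=training_args,
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
)
# trainer.train()
```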
Additional results and comparisons are presented in my Master's Thesis.