Model:
dbmdz/bert-base-german-uncased
In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State Library open sources another German BERT model!
In addition to the recently released German BERT model by deepset, we provide another German-language model.
The source data for the model consists of a recent Wikipedia dump, EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl. This results in a dataset with a size of 16GB and 2,350,234,427 tokens.
For sentence splitting, we use spaCy. Our preprocessing steps (SentencePiece model for vocab generation) follow those used for training SciBERT. The model was trained with an initial sequence length of 512 subwords for 1.5M steps.
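As an illustration of the sentence-splitting step, here is a minimal sketch using spaCy's German pipeline `de_core_news_sm`; the pipeline name is our assumption for the example and not necessarily the exact setup used to build the training data:

```python
# Minimal sketch of sentence splitting with spaCy; "de_core_news_sm" is an assumption,
# not necessarily the pipeline used for the dbmdz training data.
# Install the pipeline first: python -m spacy download de_core_news_sm
import spacy

nlp = spacy.load("de_core_news_sm")

def split_sentences(text):
    """Return one sentence per list entry, e.g. for writing pre-training data line by line."""
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

print(split_sentences("Die Bayerische Staatsbibliothek ist in München. Sie ist sehr groß."))
```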
This release includes both cased and uncased models.
Currently only PyTorch-Transformers compatible weights are available. If you need access to TensorFlow checkpoints, please raise an issue!
| Model | Downloads |
|---|---|
| bert-base-german-dbmdz-cased | config.json • pytorch_model.bin • vocab.txt |
| bert-base-german-dbmdz-uncased | config.json • pytorch_model.bin • vocab.txt |
With Transformers >= 2.3, our German BERT models can be loaded like this:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-german-cased")
```
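Continuing from the snippet above, a short usage sketch (the example sentence is our own) that feeds a German sentence through the loaded model and inspects the last hidden layer; it assumes PyTorch is installed:

```python
import torch

# Encode a German example sentence into subword IDs.
text = "Die Bayerische Staatsbibliothek befindet sich in München."
input_ids = tokenizer.encode(text, return_tensors="pt")

# Forward pass without gradient tracking; outputs[0] are the last-layer hidden states.
with torch.no_grad():
    outputs = model(input_ids)

print(outputs[0].shape)  # (batch_size, sequence_length, hidden_size)
```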
For results on downstream tasks like NER or PoS tagging, please refer to this repository.
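As a rough sketch of how one might start fine-tuning the base model for such a token-level task (this is not the code from that repository, and the label list below is a hypothetical placeholder):

```python
# Hedged sketch: attach a token-classification head to the base model for NER/PoS fine-tuning.
# The label set is a hypothetical placeholder, not the tag set used in the linked repository.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # placeholder labels

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "dbmdz/bert-base-german-uncased",
    num_labels=len(labels),
)

# The classification head is randomly initialized; it must be fine-tuned on a labeled
# German NER or PoS corpus before it yields useful predictions.
```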
All models are available on the Hugging Face model hub.
For questions about our BERT models, just open an issue here!
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
Thanks to the generous support from the Hugging Face team, it is possible to download both cased and uncased models from their S3 storage!