This is a version of the ScandiBERT model trained without any Faroese data and a different subword tokenizer.
The model was trained on the data shown in the table below. Batch size was 8.8k, the model was trained for 72 epochs on 24 V100 cards for about 2 weeks.
Language | Data | Size |
---|---|---|
Icelandic | See IceBERT paper | 16 GB |
Danish | Danish Gigaword Corpus (incl Twitter) | 4,7 GB |
Norwegian | NCC corpus | 42 GB |
Swedish | Swedish Gigaword Corpus | 3,4 GB |
If you find this model useful, please cite
@inproceedings{snaebjarnarson-etal-2023-transfer, title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese", author = "Snæbjarnarson, Vésteinn and Simonsen, Annika and Glavaš, Goran and Vulić, Ivan", booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)", month = "may 22--24", year = "2023", address = "Tórshavn, Faroe Islands", publisher = {Link{\"o}ping University Electronic Press, Sweden}, }