This is a longer-input version of the RoBERTa Japanese model, pretrained on approximately 200M Japanese sentences. max_position_embeddings has been increased to 1282, allowing it to handle much longer inputs than the basic RoBERTa model.
The tokenization model and logic are exactly the same as those of nlp-waseda/roberta-base-japanese. The input text should be pretokenized by Juman++ v2.0.0-rc3, and SentencePiece tokenization is then applied to the resulting whitespace-separated token sequence. See tokenizer_config.json for details.
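The pretokenization step described above can be sketched as follows. This is a minimal sketch, not part of the model repository: the helper name juman_segment is hypothetical, and it assumes that a jumanpp binary is on the PATH and that its --segment option prints the input as whitespace-separated tokens.

```python
import shutil
import subprocess


def juman_segment(text: str, command: str = "jumanpp") -> str:
    """Return `text` as whitespace-separated Juman++ tokens (hypothetical helper)."""
    if shutil.which(command) is None:
        raise RuntimeError(f"{command!r} not found; install Juman++ first")
    # Assumption: `jumanpp --segment` reads one sentence per line from stdin
    # and writes the segmented sentence to stdout.
    result = subprocess.run(
        [command, "--segment"],
        input=text + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    # Collapse any run of whitespace (including the trailing newline)
    # into single spaces.
    return " ".join(result.stdout.split())


# Usage (requires Juman++ to be installed):
#   segmented = juman_segment("まさにオールマイティーな商品だ。")
#   encoded = tokenizer(segmented, return_tensors="pt")
```

The segmented string is what the tokenizer expects as input; SentencePiece is then applied by the tokenizer itself.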
Please install Juman++ v2.0.0-rc3 and SentencePiece in advance.
You can load the model and the tokenizer via AutoModel and AutoTokenizer, respectively.
    from transformers import AutoModel, AutoTokenizer
    model = AutoModel.from_pretrained("megagonlabs/roberta-long-japanese")
    tokenizer = AutoTokenizer.from_pretrained("megagonlabs/roberta-long-japanese")
    model(**tokenizer("まさに オール マイ ティー な 商品 だ 。", return_tensors="pt")).last_hidden_state

    tensor([[[ 0.1549, -0.7576,  0.1098,  ...,  0.7124,  0.8062, -0.9880],
             [-0.6586, -0.6138, -0.5253,  ...,  0.8853,  0.4822, -0.6463],
             [-0.4502, -1.4675, -0.4095,  ...,  0.9053, -0.2017, -0.7756],
             ...,
             [ 0.3505, -1.8235, -0.6019,  ..., -0.0906, -0.5479, -0.6899],
             [ 1.0524, -0.8609, -0.6029,  ...,  0.1022, -0.6802,  0.0982],
             [ 0.6519, -0.2042, -0.6205,  ..., -0.0738, -0.0302, -0.1955]]],
           grad_fn=<NativeLayerNormBackward0>)
The model architecture is almost the same as that of nlp-waseda/roberta-base-japanese, except that max_position_embeddings has been increased to 1282: 12 layers, 768-dimensional hidden states, and 12 attention heads.
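To see what max_position_embeddings = 1282 means for input length, the following sketch computes the number of tokens that fit in one sequence. It relies on the fact that the Hugging Face RoBERTa implementation numbers positions starting from pad_token_id + 1. The value pad_token_id = 1 is an assumption here; check config.json for the actual value. The function name max_input_tokens is hypothetical.

```python
def max_input_tokens(max_position_embeddings: int, pad_token_id: int) -> int:
    """Longest sequence (special tokens included) that has valid position ids."""
    # Real tokens receive position ids pad_token_id + 1, pad_token_id + 2, ...,
    # and the largest valid id is max_position_embeddings - 1.
    return max_position_embeddings - pad_token_id - 1


# Assumption: pad_token_id is 1, as in the standard RoBERTa configuration.
print(max_input_tokens(1282, 1))  # 1280
print(max_input_tokens(514, 1))   # 512, the standard roberta-base limit
```

Under that assumption, passing truncation=True and max_length=1280 to the tokenizer keeps inputs within the supported range.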
This model is trained on Japanese texts extracted from mC4, Common Crawl's multilingual web crawl corpus. We used Sudachi to split the texts into sentences and applied a simple rule-based filter to remove nonlinguistic segments of the mC4 multilingual corpus. The extracted texts contain over 600M sentences in total, of which we used approximately 200M for pretraining.
We used the huggingface/transformers RoBERTa implementation for pretraining. Pretraining took about 700 hours on a GCP instance with 8 A100 GPUs, with Automatic Mixed Precision enabled.
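The actual filtering rules are not described here. Purely as an illustration of what a simple rule-based filter for nonlinguistic segments might look like, the sketch below keeps a sentence only if it is long enough and mostly made of Japanese characters. The function name, the thresholds, and the rule itself are all hypothetical and are not the rules used to build this model's corpus.

```python
import re

# Hiragana, katakana, and common CJK ideographs.
JAPANESE_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def looks_linguistic(sentence: str, min_ratio: float = 0.5, min_length: int = 5) -> bool:
    """Hypothetical filter: True if the sentence is mostly Japanese text."""
    stripped = sentence.strip()
    # Very short segments are often menu items or other page furniture.
    if len(stripped) < min_length:
        return False
    japanese = len(JAPANESE_CHAR.findall(stripped))
    # Keep the sentence only if enough of its characters are Japanese.
    return japanese / len(stripped) >= min_ratio


sentences = ["これは日本語の文章です。", "http://example.com/?id=123", "メニュー"]
kept = [s for s in sentences if looks_linguistic(s)]
print(kept)  # ['これは日本語の文章です。']
```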
The pretrained models are distributed under the terms of the MIT License.
Contains information from mC4, which is made available under the ODC Attribution License.
@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}