Model:
megagonlabs/transformers-ud-japanese-electra-base-ginza
This is an ELECTRA model pretrained on approximately 200M Japanese sentences extracted from mC4 and fine-tuned with spaCy v3 on UD_Japanese-BCCWJ r2.8.
The base pretrained model is megagonlabs/transformers-ud-japanese-electra-base-discriminator, which requires SudachiTra for tokenization.
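As a minimal sketch, the base discriminator can also be loaded directly with Hugging Face transformers, substituting SudachiTra's tokenizer for the standard one. The `ElectraSudachipyTokenizer` import and the sample sentence are assumptions, not taken from this card; check the SudachiTra documentation for the exact class name.

```python
import torch
from transformers import ElectraForPreTraining
# Assumed import: SudachiTra's Sudachi-based tokenizer class.
from sudachitra import ElectraSudachipyTokenizer

model_name = "megagonlabs/transformers-ud-japanese-electra-base-discriminator"
tokenizer = ElectraSudachipyTokenizer.from_pretrained(model_name)
model = ElectraForPreTraining.from_pretrained(model_name)

# Illustrative sentence; any Japanese text works here.
inputs = tokenizer("銀座でランチをご一緒しましょう。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Per-token replaced-token-detection logits from the discriminator head.
print(outputs.logits.shape)
```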
The entire spaCy v3 model is distributed as a Python package named ja_ginza_electra on PyPI, along with GiNZA v5, which provides custom pipeline components that recognize Japanese bunsetu (phrase) structures. Try running it as follows:
$ pip install ginza ja-ginza-electra
$ ginza
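The `ginza` command analyzes Japanese text read from standard input. The model can also be used from Python through spaCy; the sketch below assumes the packages are installed as above and uses GiNZA's `bunsetu_spans` helper (the sample sentence is illustrative):

```python
import spacy
import ginza

# Load the ja_ginza_electra pipeline distributed via PyPI.
nlp = spacy.load("ja_ginza_electra")

doc = nlp("銀座でランチをご一緒しましょう。")

# Standard spaCy annotations: token text, lemma, POS, dependency.
for token in doc:
    print(token.i, token.orth_, token.lemma_, token.pos_, token.dep_, token.head.i)

# GiNZA's custom components expose bunsetu (phrase) spans.
for span in ginza.bunsetu_spans(doc):
    print(span.text)
```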
The models are distributed under the terms of the MIT License.
Publication of this model under the MIT License is permitted under a joint research agreement between NINJAL (National Institute for Japanese Language and Linguistics) and Megagon Labs Tokyo.
Contains information from mC4, which is made available under the ODC Attribution License.
@article{2019t5,
  author        = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title         = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal       = {arXiv e-prints},
  year          = {2019},
  archivePrefix = {arXiv},
  eprint        = {1910.10683},
}
Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018.