This repository provides a Japanese BART model. The model was trained by Stockmark Inc.
An introductory article on the model can be found at the following URL.
https://tech.stockmark.co.jp/blog/bart-japanese-base-news/
BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.
BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering).
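The corrupt-then-reconstruct recipe described above can be illustrated with a small, dependency-free sketch. It shows two noising functions from the original BART work, sentence permutation and text infilling. Which noising functions were actually used to pre-train this model is not stated here, so treat the choice as an assumption. The helper names (`split_sentences`, `permute_sentences`, `infill_span`) and the hand-written token segmentation are hypothetical and are not output from the model's tokenizer.

```python
import random
import re

MASK = "<mask>"  # mask token string, as used in the mask-filling example


def split_sentences(text):
    """Split Japanese text into sentences, keeping the trailing '。'."""
    return [s for s in re.findall(r"[^。]*。|[^。]+$", text) if s]


def permute_sentences(text, rng):
    """Sentence permutation: return the text with its sentences shuffled."""
    sentences = split_sentences(text)
    rng.shuffle(sentences)
    return "".join(sentences)


def infill_span(tokens, start, length):
    """Text infilling: replace tokens[start:start+length] with ONE mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]


rng = random.Random(0)
original = "明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。"
corrupted = permute_sentences(original, rng)
print(corrupted)  # same sentences, possibly in a different order

# Hypothetical segmentation, written by hand for illustration only.
tokens = ["今日", "の", "天気", "は", "雨", "の", "ため", "、",
          "傘", "が", "必要", "でしょう", "。"]
masked = "".join(infill_span(tokens, 4, 1))
print(masked)  # 今日の天気は<mask>のため、傘が必要でしょう。
```

During pre-training the model receives the corrupted text as encoder input and is trained to emit the original text from the decoder, which is why the raw checkpoint can reorder sentences and fill masks without fine-tuning.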
You can use the raw model for text infilling. However, the model is mostly meant to be fine-tuned on a supervised dataset.
NOTE: Since we are using a custom tokenizer, please use trust_remote_code=True to initialize the tokenizer.
    from transformers import AutoTokenizer, BartModel

    model_name = "stockmark/bart-base-japanese-news"

    # The custom tokenizer requires trust_remote_code=True.
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = BartModel.from_pretrained(model_name)

    inputs = tokenizer("今日は良い天気です。", return_tensors="pt")
    outputs = model(**inputs)

    last_hidden_states = outputs.last_hidden_state
    import torch
    from transformers import AutoTokenizer, BartForConditionalGeneration

    model_name = "stockmark/bart-base-japanese-news"

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    if torch.cuda.is_available():
        model = model.to("cuda")

    # correct order text is "明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。"
    text = "電車は止まる可能性があります。ですから、自宅から働きます。明日は大雨です。"

    inputs = tokenizer([text], max_length=128, return_tensors="pt", truncation=True)
    text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, max_length=128)
    output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

    print(output)
    # sample output: 明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。
    import torch
    from transformers import AutoTokenizer, BartForConditionalGeneration

    model_name = "stockmark/bart-base-japanese-news"

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    if torch.cuda.is_available():
        model = model.to("cuda")

    text = "今日の天気は<mask>のため、傘が必要でしょう。"

    inputs = tokenizer([text], max_length=128, return_tensors="pt", truncation=True)
    text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, max_length=128)
    output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

    print(output)
    # sample output: 今日の天気は、雨のため、傘が必要でしょう。
NOTE: You can use the raw model for text generation. However, the model is mostly meant to be fine-tuned on a supervised dataset.
    import torch
    from transformers import AutoTokenizer, BartForConditionalGeneration

    model_name = "stockmark/bart-base-japanese-news"

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    if torch.cuda.is_available():
        model = model.to("cuda")

    text = "自然言語処理(しぜんげんごしょり、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。「計算言語学」(computational linguistics)との類似もあるが、自然言語処理は工学的な視点からの言語処理をさすのに対して、計算言語学は言語学的視点を重視する手法をさす事が多い。"

    inputs = tokenizer([text], max_length=512, return_tensors="pt", truncation=True)
    text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, min_length=0, max_length=40)
    output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

    print(output)
    # sample output: 自然言語処理(しぜんげんごしょり、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、言語学の一分野である。
The model was trained on Japanese news articles.
The model uses a SentencePiece-based tokenizer. The vocabulary was first trained on a selected subset of the training data using the official SentencePiece training script.
The pretrained models are distributed under the terms of the MIT License.
NOTE: Only tokenization_bart_japanese_news.py is licensed under the Apache License, Version 2.0. Please see tokenization_bart_japanese_news.py for license details.
If you have any questions, please contact us using our contact form.
This comparison study was supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).