模型:
NlpHUST/t5-en-vi-small
任务:
文生文预印本库:
arxiv:1706.05565The IWSLT'15 English-Vietnamese data is used from Stanford NLP group . For all experiments the corpus was split into training, development and test set:
Data set | Sentences | Download |
---|---|---|
Training | 133,317 | via GitHub or located in data/train-en-vi.tgz |
Development | 1,553 | via GitHub or located in data/dev-2012-en-vi.tgz |
Test | 1,268 | via GitHub or located in data/test-2013-en-vi.tgz |
The results on test set.
Model | BLEU (Beam Search) |
---|---|
Luong & Manning (2015) | 23.30 |
Sequence-to-sequence model with attention | 26.10 |
Neural Phrase-based Machine Translation Huang et. al. (2017) | 27.69 |
Neural Phrase-based Machine Translation + LM Huang et. al. (2017) | 28.07 |
t5-en-vi-small (pretraining, without training data) | 28.46 (cased) / 29.23 (uncased) |
t5-en-vi-small (fineturning with training data) | 32.38 (cased) / 33.19 (uncased) |
import torch from transformers import T5ForConditionalGeneration, T5Tokenizer import torch if torch.cuda.is_available(): device = torch.device("cuda") print('There are %d GPU(s) available.' % torch.cuda.device_count()) print('We will use the GPU:', torch.cuda.get_device_name(0)) else: print('No GPU available, using the CPU instead.') device = torch.device("cpu") model = T5ForConditionalGeneration.from_pretrained("NlpHUST/t5-en-vi-small") tokenizer = T5Tokenizer.from_pretrained("NlpHUST/t5-en-vi-small") model.to(device) src = "In school , we spent a lot of time studying the history of Kim Il-Sung , but we never learned much about the outside world , except that America , South Korea , Japan are the enemies ." tokenized_text = tokenizer.encode(src, return_tensors="pt").to(device) model.eval() summary_ids = model.generate( tokenized_text, max_length=128, num_beams=5, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True ) output = tokenizer.decode(summary_ids[0], skip_special_tokens=True) print(output)Output
Ở trường, chúng tôi dành nhiều thời gian để nghiên cứu về lịch sử Kim Il-Sung, nhưng chúng tôi chưa bao giờ học được nhiều về thế giới bên ngoài, ngoại trừ Mỹ, Hàn Quốc, Nhật Bản là kẻ thù.
For personal communication related to this project, please contact Nha Nguyen Van ( nha282@gmail.com ).