dbmdz/bert-base-turkish-cased (BERT Checkpoint)
MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish . Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.
Set | Metric | Value |
---|---|---|
Test | Rouge2 - mid -precision | 32.41 |
Test | Rouge2 - mid - recall | 28.65 |
Test | Rouge2 - mid - fmeasure | 29.48 |
import torch from transformers import BertTokenizerFast, EncoderDecoderModel device = 'cuda' if torch.cuda.is_available() else 'cpu' ckpt = 'mrm8488/bert2bert_shared-turkish-summarization' tokenizer = BertTokenizerFast.from_pretrained(ckpt) model = EncoderDecoderModel.from_pretrained(ckpt).to(device) def generate_summary(text): inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt") input_ids = inputs.input_ids.to(device) attention_mask = inputs.attention_mask.to(device) output = model.generate(input_ids, attention_mask=attention_mask) return tokenizer.decode(output[0], skip_special_tokens=True) text = "Your text here..." generate_summary(text)
Created by Manuel Romero/@mrm8488 with the support of Narrativa
Made with ♥ in Spain