banglat5_banglaparaphrase

This repository contains the checkpoint of BanglaT5 finetuned on the BanglaParaphrase dataset. BanglaT5 is a sequence-to-sequence transformer model pretrained with the "Span Corruption" objective. Models finetuned from it achieve competitive results on this dataset, as reported under Benchmarks.

For finetuning and inference, refer to the scripts in the official GitHub repository of BanglaNLG.

Note: This model was pretrained using a specific normalization pipeline available here. All finetuning scripts in the official GitHub repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized using this pipeline before tokenizing to get the best results. A basic example is given below:

Using this model in transformers

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from normalizer import normalize # pip install git+https://github.com/csebuetnlp/normalizer

model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/banglat5_banglaparaphrase")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglat5_banglaparaphrase", use_fast=False)

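# Replace the empty string with the Bangla sentence you want to paraphrase.
input_sentence = ""
# Normalize before tokenizing, since the model was trained on normalized text.
input_ids = tokenizer(normalize(input_sentence), return_tensors="pt").input_ids
generated_tokens = model.generate(input_ids)
# skip_special_tokens drops markers such as <pad> and </s> from the output.
decoded_tokens = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]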

print(decoded_tokens)
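
The single-sentence example above can be extended to several sentences at once. The following is a minimal sketch, not part of the official scripts: the helper name `paraphrase_batch` and the default values for `num_beams` and `max_length` are assumptions chosen for illustration, and the model, tokenizer, and normalizer are passed in as arguments so they can be loaded exactly as shown above.

```python
def paraphrase_batch(sentences, model, tokenizer, normalize_fn,
                     num_beams=5, max_length=128):
    """Paraphrase a list of sentences in one generate() call.

    Hypothetical helper: num_beams and max_length are illustrative
    defaults, not values taken from the official finetuning scripts.
    """
    # Apply the normalization pipeline to every input before tokenizing.
    cleaned = [normalize_fn(s) for s in sentences]
    # Pad to the longest sentence so the batch forms a rectangular tensor.
    encoded = tokenizer(cleaned, return_tensors="pt", padding=True)
    generated = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        num_beams=num_beams,
        max_length=max_length,
    )
    # Drop special tokens such as <pad> and </s> from the decoded text.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# Usage, with model, tokenizer and normalize loaded as in the example above:
# outputs = paraphrase_batch(["...", "..."], model, tokenizer, normalize)
```

Passing the attention mask matters for batches, because padded positions would otherwise be treated as real input tokens.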

Benchmarks

Supervised fine-tuning

Test Set	Model	sacreBLEU	ROUGE-L	PINC	BERTScore	BERT-iBLEU
BanglaParaphrase	BanglaT5	32.8	63.58	74.40	94.80	92.18
BanglaParaphrase	IndicBART	5.60	35.61	80.26	91.50	91.16
BanglaParaphrase	IndicBARTSS	4.90	33.66	82.10	91.10	90.95
IndicParaphrase	BanglaT5	11.0	19.99	74.50	94.80	87.738
IndicParaphrase	IndicBART	12.0	21.58	76.83	93.30	90.65
IndicParaphrase	IndicBARTSS	10.7	20.59	77.60	93.10	90.54
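
Of the metrics in the table, PINC is the one that rewards lexical novelty: it measures the fraction of candidate n-grams that do not appear in the source, averaged over n-gram orders. The sketch below illustrates the commonly used definition. It assumes whitespace tokenization, a maximum order of 4, and that orders longer than the candidate are skipped; the evaluation behind the reported numbers may differ on any of these points. The function names `ngrams` and `pinc` are illustrative.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source, candidate, max_n=4):
    """PINC between a source sentence and a candidate paraphrase.

    Returns a value in [0, 1]; multiply by 100 to compare with the
    percentage scale used in the table. Whitespace tokenization and
    max_n=4 are assumptions of this sketch.
    """
    src_tokens = source.split()
    cand_tokens = candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand_tokens, n)
        if not cand_ngrams:
            continue  # candidate is shorter than n tokens; skip this order
        overlap = len(ngrams(src_tokens, n) & cand_ngrams)
        # Fraction of candidate n-grams that are new relative to the source.
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0
```

An exact copy of the source scores 0 and a candidate sharing no words with it scores 1, which is why PINC is read alongside a fidelity metric such as BERTScore rather than on its own.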

The dataset is available at the link below:

BanglaParaphrase

Citation

If you use this model, please cite the following paper:

@article{akil2022banglaparaphrase,
  title={BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset},
  author={Akil, Ajwad and Sultana, Najrin and Bhattacharjee, Abhik and Shahriyar, Rifat},
  journal={arXiv preprint arXiv:2210.05109},
  year={2022}
}

Author:

BUET CSE NLP Group

Dataset size:

945.47 MB