Model:
pedramyazdipoor/persian_xlm_roberta_large
Task:
Question Answering

XLM-RoBERTa is a multilingual language model pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al.
Multilingual XLM-RoBERTa large for QA on various languages has been fine-tuned on several QA datasets, but not on PQuAD, which is the largest Persian QA dataset so far. That second model is the base model we fine-tune further.
Paper presenting PQuAD dataset: arXiv:2202.06219
This model is fine-tuned on the PQuAD train set and is ready to use. Its very long training time encouraged me to publish it in order to make life easier for those who need it.
I set the batch size to 4 due to GPU memory limitations in Google Colab.
```python
batch_size = 4
n_epochs = 1
base_LM_model = "deepset/xlm-roberta-large-squad2"
max_seq_len = 256
learning_rate = 3e-5
evaluation_strategy = "epoch"
save_strategy = "epoch"
warmup_ratio = 0.1
gradient_accumulation_steps = 8
weight_decay = 0.01
```
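For reference, the snippet below is a minimal fine-tuning sketch with the Hugging Face Trainer using the hyperparameters above; it is not the author's exact training script, and the `train_features` / `eval_features` names are hypothetical placeholders for PQuAD examples already tokenized into SQuAD-style QA features (with `max_seq_len = 256`).

```python
# Minimal sketch, not the original training script. Assumes `train_features` and
# `eval_features` are PQuAD splits already converted to QA features
# (input_ids, attention_mask, start_positions, end_positions).
from transformers import (AutoTokenizer, AutoModelForQuestionAnswering,
                          TrainingArguments, Trainer)

base_LM_model = "deepset/xlm-roberta-large-squad2"
tokenizer = AutoTokenizer.from_pretrained(base_LM_model)
model = AutoModelForQuestionAnswering.from_pretrained(base_LM_model)

training_args = TrainingArguments(
    output_dir="persian_xlm_roberta_large",
    num_train_epochs=1,
    per_device_train_batch_size=4,      # limited by Colab GPU memory
    gradient_accumulation_steps=8,      # effective batch size of 32
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_features,   # hypothetical preprocessed PQuAD train split
    eval_dataset=eval_features,     # hypothetical preprocessed PQuAD validation split
    tokenizer=tokenizer,
)
trainer.train()
```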
Evaluated on the Persian test set of the official PQuAD release. I also trained for more than one epoch, but got worse results. Our XLM-RoBERTa outperforms our ParsBERT on PQuAD, but the former is more than 3 times larger than the latter, so the comparison is not entirely fair.
| Metric | Our XLM-Roberta Large | Our ParsBert |
|---|---|---|
| Exact Match | 66.56* | 47.44 |
| F1 | 87.31* | 81.96 |
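If you want to reproduce SQuAD-style EM/F1 numbers yourself, the snippet below is a minimal sketch using the `squad` metric from the `evaluate` library; the official PQuAD evaluation script may differ in details such as answer normalization, so treat this only as an approximation.

```python
# Hedged sketch of SQuAD-style EM/F1 computation; toy data, not PQuAD results.
import evaluate

squad_metric = evaluate.load("squad")
predictions = [{"id": "1", "prediction_text": "26"}]
references  = [{"id": "1", "answers": {"text": ["26"], "answer_start": [15]}}]
print(squad_metric.compute(predictions=predictions, references=references))
# {'exact_match': 100.0, 'f1': 100.0}
```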
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

path = 'pedramyazdipoor/persian_xlm_roberta_large'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForQuestionAnswering.from_pretrained(path)
```
There are some considerations for inference, which the `generate_indexes()` function below enforces:
1. The start index of the answer must not come after the end index.
2. The answer span must lie within the context (the start index must be greater than `min_index`, i.e. after the separator token).
3. The chosen span must be the most probable pair among the top-N start and end candidates.
```python
import numpy as np

def generate_indexes(start_logits, end_logits, N, min_index):
    output_start = start_logits
    output_end = end_logits

    start_indexes = np.arange(len(start_logits))
    start_probs = output_start
    list_start = dict(zip(start_indexes, start_probs.tolist()))
    end_indexes = np.arange(len(end_logits))
    end_probs = output_end
    list_end = dict(zip(end_indexes, end_probs.tolist()))

    # Descending sort by probability
    sorted_start_list = sorted(list_start.items(), key=lambda x: x[1], reverse=True)
    sorted_end_list = sorted(list_end.items(), key=lambda x: x[1], reverse=True)

    final_start_idx, final_end_idx = [[] for l in range(2)]

    start_idx, end_idx, prob = 0, 0, (start_probs.tolist()[0] + end_probs.tolist()[0])
    # Search the top-N start and end candidates for the most probable valid pair:
    # the start must not come after the end, and it must lie inside the context (index > min_index).
    for a in range(0, N):
        for b in range(0, N):
            if (sorted_start_list[a][1] + sorted_end_list[b][1]) > prob:
                if (sorted_start_list[a][0] <= sorted_end_list[b][0]) and (sorted_start_list[a][0] > min_index):
                    prob = sorted_start_list[a][1] + sorted_end_list[b][1]
                    start_idx = sorted_start_list[a][0]
                    end_idx = sorted_end_list[b][0]
    final_start_idx.append(start_idx)
    final_end_idx.append(end_idx)

    return final_start_idx[0], final_end_idx[0]
```
```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.eval().to(device)

text = 'سلام من پدرامم 26 سالمه'
question = 'چند سالمه؟'

encoding = tokenizer(question, text,
                     add_special_tokens=True,
                     return_token_type_ids=True,
                     return_tensors='pt',
                     padding=True,
                     return_offsets_mapping=True,
                     truncation='only_first',
                     max_length=32)

out = model(encoding['input_ids'].to(device),
            encoding['attention_mask'].to(device),
            encoding['token_type_ids'].to(device))

# We had to change some pieces of code to make it compatible with generating one answer at a time.
# If you have unanswerable questions, use out['start_logits'][0][0:] and out['end_logits'][0][0:],
# because <s> (the first token) is reserved for that situation and must be compared with the other tokens.
# You can also set min_index in generate_indexes() to force the chosen tokens to lie within the context
# (the start index must be greater than the separator token).
answer_start_index, answer_end_index = generate_indexes(out['start_logits'][0][1:],
                                                        out['end_logits'][0][1:], 5, 0)

print(tokenizer.tokenize(text + question))
print(tokenizer.tokenize(text + question)[answer_start_index : (answer_end_index + 1)])
>>> ['▁سلام', '▁من', '▁پدر', 'ام', 'م', '▁26', '▁سالم', 'ه', 'نام', 'م', '▁چیست', '؟']
>>> ['▁26']
```
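If you just need a quick answer and do not require the custom decoding constraints above, the generic question-answering pipeline is a simpler alternative; note that it performs its own standard span decoding, so its handling of unanswerable questions may differ from `generate_indexes()`.

```python
# Alternative inference path via the generic QA pipeline (standard SQuAD-style decoding).
from transformers import pipeline

qa = pipeline("question-answering",
              model='pedramyazdipoor/persian_xlm_roberta_large',
              tokenizer='pedramyazdipoor/persian_xlm_roberta_large')
print(qa(question='چند سالمه؟', context='سلام من پدرامم 26 سالمه'))
```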
We hereby express our gratitude to Newsha Shahbodaghkhan for facilitating dataset gathering.
This is the second version of our Persian XLM-RoBERTa Large; there were some problems with the previous version.