Dataset:
SLPL/syntran-fa
A syntactically transformed version of Farsi QA datasets, built to produce fluent responses from questions and short answers. You can load this dataset with the code below:
```python
import datasets

data = datasets.load_dataset('SLPL/syntran-fa', split="train")
```
Generating fluent responses has always been challenging for the question-answering task, especially in low-resource languages like Farsi. In recent years there have been several efforts to grow the size of Farsi datasets. Syntran-fa is a question-answering dataset that collects the short answers from earlier Farsi QA datasets and provides a complete, fluent answer for each (question, short_answer) pair.
This dataset contains nearly 50,000 question-answer rows. The datasets used as our sources are listed in the Source Data section.
The main idea for this dataset comes from Fluent Response Generation for Conversational Question Answering, where a "parser + syntactic rules" module is used to produce different fluent answers from a question and a short answer. In this project, we used stanza as our parser to parse each question and generate a response from it and the short answer (a verbless phrase of up to ~4 words). One could continue this project by generating different permutations of the sentence's parts (thus providing more than one fluent answer per pair) or by training a seq2seq model that does what our rule-based system does (by defining a new text-to-text task).
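To make the rule-based idea concrete, here is a minimal sketch: stanza tokenizes the question, and a single, simplified substitution rule swaps the interrogative word for the short answer. The list of Farsi question words and the rule itself are illustrative assumptions, not the project's actual rule set (which also consults the dependency parse):

```python
# Minimal sketch of the "parser + syntactic rules" idea.
# The single rule below (swap the question word for the short answer,
# drop the question mark) is a simplified assumption, not this
# project's actual rules, which also use the dependency parse.
import stanza

stanza.download("fa")  # fetch the Farsi models once
nlp = stanza.Pipeline("fa", processors="tokenize,pos,lemma,depparse")

# illustrative Farsi interrogatives; not an exhaustive list
QUESTION_WORDS = {"چه", "چی", "کی", "کجا", "چرا", "چگونه", "کدام", "چند"}

def naive_fluent_answer(question: str, short_answer: str) -> str:
    words = [w.text for w in nlp(question).sentences[0].words]
    out = []
    for w in words:
        if w in QUESTION_WORDS:
            out.append(short_answer)   # substitute the interrogative
        elif w not in {"؟", "?"}:      # drop the question mark
            out.append(w)
    return " ".join(out) + "."

print(naive_fluent_answer("باشگاه هاکی ساوتهمپتون چه نام دارد؟",
                          "باشگاه هاکی ساوتهمپتون"))
```

On the example row shown later in this card, even this naive rule already reproduces the expected fluent answer.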
This dataset can be used for the question-answering task, especially when you want to generate fluent responses. You can train a seq2seq model on it to generate fluent responses, as done in Fluent Response Generation for Conversational Question Answering.
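As a hedged sketch of such a text-to-text setup (the model choice google/mt5-small and all hyperparameters below are illustrative assumptions, not part of this dataset), fine-tuning could look like this with Hugging Face transformers:

```python
# Illustrative sketch: fine-tune a seq2seq model to map
# "question + short answer" -> fluent answer. Model choice
# and hyperparameters are assumptions, not recommendations.
import datasets
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

data = datasets.load_dataset("SLPL/syntran-fa", split="train")

model_name = "google/mt5-small"  # assumed multilingual base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # text-to-text format: input concatenates question and short answer
    inputs = [q + " " + a
              for q, a in zip(batch["question"], batch["short_answer"])]
    enc = tokenizer(inputs, truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["fluent_answer"],
                       truncation=True, max_length=128)
    enc["labels"] = labels["input_ids"]
    return enc

tokenized = data.map(preprocess, batched=True,
                     remove_columns=data.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="syntran-fa-seq2seq",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```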
Each row of the dataset looks something like this:
```python
{
    'id': 0,
    'question': 'باشگاه هاکی ساوتهمپتون چه نام دارد؟',
    'short_answer': 'باشگاه هاکی ساوتهمپتون',
    'fluent_answer': 'باشگاه هاکی ساوتهمپتون باشگاه هاکی ساوتهمپتون نام دارد.',
    'bert_loss': 1.110097069682014
}
```
Note: the dataset is sorted in ascending order of bert_loss, so the first rows are more likely to be fluent.
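Because of this ordering, you can keep only the most fluent portion, for example by slicing the head of the dataset or filtering on bert_loss (the cutoff below is an arbitrary assumption):

```python
import datasets

data = datasets.load_dataset('SLPL/syntran-fa', split="train")

# rows are sorted by ascending bert_loss, so the head is the most fluent part
top_10k = data.select(range(10_000))                    # first 10,000 rows
low_loss = data.filter(lambda r: r["bert_loss"] < 2.0)  # arbitrary cutoff
```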
Currently, only the train split is provided. A test split will be added soon.
The source datasets that we used are as follows:
Initial Data Collection and Normalization

We extracted all short-answer entries (verbless phrases of up to ~4 words) from all open-source Farsi QA datasets and applied rules based on each question's parse tree to produce long (fluent) answers.
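As a rough sketch of what such a short-answer filter could look like (the exact criteria used for this dataset may differ), one could check the word count and the absence of verbs with stanza:

```python
# Illustrative filter for "short answers" (verbless, up to ~4 words).
# The exact criteria used for this dataset may differ.
import stanza

stanza.download("fa")
nlp = stanza.Pipeline("fa", processors="tokenize,pos")

def is_short_answer(text: str, max_words: int = 4) -> bool:
    words = [w for s in nlp(text).sentences for w in s.words]
    has_verb = any(w.upos in {"VERB", "AUX"} for w in words)
    return len(words) <= max_words and not has_verb

print(is_short_answer("باشگاه هاکی ساوتهمپتون"))  # True: 3 words, no verb
```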
[More Information Needed]
Who are the annotators?

[More Information Needed]
The dataset is entirely a subset of well-known open-source datasets, so all of its content is already publicly available on the internet. Nevertheless, we do not take responsibility for any of that content.
The dataset was compiled entirely during a summer internship at the Asr Gooyesh Pardaz company, under the supervision of Soroush Gooran and Prof. Hossein Sameti and with the mentorship of Sadra Sabouri. This project was Farhan Farsi's first internship project.
MIT
[More Information Needed]
Thanks to @farhaaaaa and @sadrasabouri for adding this dataset.