数据集:
duorc
语言:
en计算机处理:
monolingual语言创建人:
crowdsourced批注创建人:
crowdsourced源数据集:
original预印本库:
arxiv:1804.07927许可:
mitThe DuoRC dataset is an English language dataset of questions and answers gathered from crowdsourced AMT workers on Wikipedia and IMDb movie plots. The workers were given freedom to pick answer from the plots or synthesize their own answers. It contains two sub-datasets - SelfRC and ParaphraseRC. SelfRC dataset is built on Wikipedia movie plots solely. ParaphraseRC has questions written from Wikipedia movie plots and the answers are given based on corresponding IMDb movie plots.
abstractive-qa : The dataset can be used to train a model for Abstractive Question Answering. An abstractive question answering model is presented with a passage and a question and is expected to generate a multi-word answer. The model performance is measured by exact-match and F1 score, similar to SQuAD V1.1 or SQuAD V2 . A BART-based model with a dense retriever may be used for this task.
extractive-qa : The dataset can be used to train a model for Extractive Question Answering. An extractive question answering model is presented with a passage and a question and is expected to predict the start and end of the answer span in the passage. The model performance is measured by exact-match and F1 score, similar to SQuAD V1.1 or SQuAD V2 . BertForQuestionAnswering or any other similar model may be used for this task.
The text in the dataset is in English, as spoken by Wikipedia writers for movie plots. The associated BCP-47 code is en .
{'answers': ['They arrived by train.'], 'no_answer': False, 'plot': "200 years in the future, Mars has been colonized by a high-tech company.\nMelanie Ballard (Natasha Henstridge) arrives by train to a Mars mining camp which has cut all communication links with the company headquarters. She's not alone, as she is with a group of fellow police officers. They find the mining camp deserted except for a person in the prison, Desolation Williams (Ice Cube), who seems to laugh about them because they are all going to die. They were supposed to take Desolation to headquarters, but decide to explore first to find out what happened.They find a man inside an encapsulated mining car, who tells them not to open it. However, they do and he tries to kill them. One of the cops witnesses strange men with deep scarred and heavily tattooed faces killing the remaining survivors. The cops realise they need to leave the place fast.Desolation explains that the miners opened a kind of Martian construction in the soil which unleashed red dust. Those who breathed that dust became violent psychopaths who started to build weapons and kill the uninfected. They changed genetically, becoming distorted but much stronger.The cops and Desolation leave the prison with difficulty, and devise a plan to kill all the genetically modified ex-miners on the way out. However, the plan goes awry, and only Melanie and Desolation reach headquarters alive. Melanie realises that her bosses won't ever believe her. However, the red dust eventually arrives to headquarters, and Melanie and Desolation need to fight once again.", 'plot_id': '/m/03vyhn', 'question': 'How did the police arrive at the Mars mining camp?', 'question_id': 'b440de7d-9c3f-841c-eaec-a14bdff950d1', 'title': 'Ghosts of Mars'}
The data is split into a training, dev and test set in such a way that the resulting sets contain 70%, 15%, and 15% of the total QA pairs and no QA pairs for any movie seen in train are included in the test set. The final split sizes are as follows:
Name Train Dec Test SelfRC 60721 12961 12599 ParaphraseRC 69524 15591 15857
[Needs More Information]
Wikipedia and IMDb movie plots
Initial Data Collection and Normalization[Needs More Information]
Who are the source language producers?[Needs More Information]
For SelfRC, the annotators were allowed to mark an answer span in the plot or synthesize their own answers after reading Wikipedia movie plots. For ParaphraseRC, questions from the Wikipedia movie plots from SelfRC were used and the annotators were asked to answer based on IMDb movie plots.
Who are the annotators?Amazon Mechanical Turk Workers
[Needs More Information]
[Needs More Information]
[Needs More Information]
[Needs More Information]
The dataset was intially created by Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan in a collaborated work between IIT Madras and IBM Research.
@inproceedings{DuoRC, author = { Amrita Saha and Rahul Aralikatte and Mitesh M. Khapra and Karthik Sankaranarayanan}, title = {{DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension}}, booktitle = {Meeting of the Association for Computational Linguistics (ACL)}, year = {2018} }
Thanks to @gchhablani for adding this dataset.