Dataset:
multi_re_qa
MultiReQA contains sentence boundary annotations from eight publicly available QA datasets: SearchQA, TriviaQA, HotpotQA, NaturalQuestions, SQuAD, BioASQ, RelationExtraction, and TextbookQA. Five of these datasets (SearchQA, TriviaQA, HotpotQA, NaturalQuestions, and SQuAD) contain both training and test data, while the other three (BioASQ, RelationExtraction, and TextbookQA) contain only test data. (The distribution also includes DuoRC, although this is not specified in the official documentation.)
Sentence boundary annotation for SearchQA, TriviaQA, HotpotQA, NaturalQuestions, SQuAD, BioASQ, RelationExtraction, TextbookQA and DuoRC
The general format is:
```
{ "candidate_id": <candidate_id>, "response_start": <response_start>, "response_end": <response_end> } ...
```
An example from SearchQA:
```
{'candidate_id': 'SearchQA_000077f3912049dfb4511db271697bad/_0_1', 'response_end': 306, 'response_start': 243}
```
The data fields are:
```
{ "candidate_id": <STRING>, "response_start": <INT>, "response_end": <INT> } ...
```
Train and dev splits are available only for the following datasets: SearchQA, TriviaQA, HotpotQA, NaturalQuestions, and SQuAD.
Only test splits are available for the following datasets: BioASQ, RelationExtraction, and TextbookQA.
The number of candidate sentences for each dataset is shown in the table below.
| Dataset | train | test |
|---|---|---|
| SearchQA | 629,160 | 454,836 |
| TriviaQA | 335,659 | 238,339 |
| HotpotQA | 104,973 | 52,191 |
| SQuAD | 87,133 | 10,642 |
| NaturalQuestions | 106,521 | 22,118 |
| BioASQ | - | 14,158 |
| RelationExtraction | - | 3,301 |
| TextbookQA | - | 3,701 |
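A minimal loading sketch is shown below, assuming the data is available on the Hugging Face Hub under the identifier `multi_re_qa` with one configuration per source dataset and the split names from the table above; the exact identifier, configuration names, and splits should be checked against the official dataset card.

```python
from datasets import load_dataset

# Assumed identifiers: dataset name "multi_re_qa", one config per source
# dataset (e.g. "SearchQA", "BioASQ"), split names as in the table above.
search_qa = load_dataset("multi_re_qa", "SearchQA")  # assumed: train and test splits
bioasq = load_dataset("multi_re_qa", "BioASQ")       # assumed: test split only

print(search_qa)          # DatasetDict listing the available splits
print(bioasq["test"][0])  # e.g. {'candidate_id': ..., 'response_start': ..., 'response_end': ...}
```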
MultiReQA is a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets from the MRQA shared task. The dataset was curated by converting existing QA datasets from the MRQA shared task to the format of the MultiReQA benchmark.
The initial data collection was performed by converting existing QA datasets from the MRQA shared task to the format of the MultiReQA benchmark.
Who are the source language producers? [More Information Needed]
Who are the annotators? The annotators/curators of the dataset are mandyguo-xyguo and mwurts4google, the contributors to the official MultiReQA GitHub repository.
```bibtex
@misc{m2020multireqa,
  title={MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering Models},
  author={Mandy Guo and Yinfei Yang and Daniel Cer and Qinlan Shen and Noah Constant},
  year={2020},
  eprint={2005.02507},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Thanks to @Karthik-Bhaskar for adding this dataset.