Dataset:
castorini/odqa-wiki-corpora
The Wikipedia corpus variants provided here can serve as knowledge sources for question-answering systems built on a retriever–reader pipeline. These corpus variants and the corresponding experiments are described in detail in the paper:
Pre-Processing Matters! Improved Wikipedia Corpora for Open-Domain Question Answering.
The dataset consists of passages segmented from Wikipedia articles. For each passage, the following fields are provided:
There are six corpus variants in total.
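For illustration, a corpus variant can be loaded with the Hugging Face `datasets` library. This is a minimal sketch: the configuration name `wiki-text-100w-tamber` is an assumption based on the variant naming scheme, so check the dataset card for the exact names of the six variants.

```python
# Minimal sketch of loading one corpus variant with Hugging Face datasets.
# The configuration name "wiki-text-100w-tamber" is an assumption; consult
# the dataset card for the actual names of the six variants.
from datasets import load_dataset

corpus = load_dataset(
    "castorini/odqa-wiki-corpora", "wiki-text-100w-tamber", split="train"
)

# Inspect the fields provided for each passage.
print(corpus.column_names)
print(corpus[0])
```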
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020.
We start by downloading the full December 20, 2018 Wikipedia XML dump (enwiki-20181220-pages-articles.xml) from the Internet Archive: https://archive.org/details/enwiki-20181220. The dump is then pre-processed with WikiExtractor (https://github.com/attardi/wikiextractor), modifying the code to retain lists as desired and to replace any tables with the string "TABLETOREPLACE", and with DrQA (https://github.com/facebookresearch/DrQA/tree/main/scripts/retriever), again modifying the code so that lists are not removed.
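As a rough illustration of the table-replacement modification mentioned above, the snippet below sketches how wiki table markup might be swapped for the "TABLETOREPLACE" sentinel before extraction. The regular expression and function name are hypothetical and are not the actual code change made to WikiExtractor.

```python
import re

# Hypothetical sketch of the modification described above: replace wiki
# table markup ({| ... |}) with the sentinel "TABLETOREPLACE" so tables
# can be located and handled during later pre-processing. The real change
# lives inside WikiExtractor; nested tables are not handled here.
TABLE_PATTERN = re.compile(r"\{\|.*?\|\}", re.DOTALL)

def replace_tables(wikitext: str) -> str:
    return TABLE_PATTERN.sub("TABLETOREPLACE", wikitext)

sample = 'Intro text.\n{| class="wikitable"\n|-\n| cell\n|}\nMore text.'
print(replace_tables(sample))
# -> Intro text.
#    TABLETOREPLACE
#    More text.
```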
We then apply the pre-processing script that we make available in Pyserini to generate the different corpus variants.
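The actual segmentation logic lives in the Pyserini script, but as a hedged sketch under stated assumptions: the 100-word variants can be thought of as non-overlapping 100-word windows over the article text, in the style of the DPR corpus cited above. The function below is illustrative only and is not the Pyserini implementation.

```python
# Illustrative sketch (not the Pyserini script itself): segmenting article
# text into non-overlapping 100-word passages, as in the DPR-style variant.
def segment_100w(text: str, window: int = 100) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]

passages = segment_100w("word " * 250)
print(len(passages))             # -> 3 passages: 100 + 100 + 50 words
print(len(passages[0].split()))  # -> 100
```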