Dataset:
embedding-data/PAQ_pairs
Pairs of questions and answers obtained from Wikipedia.
Disclaimer: The team releasing PAQ QA pairs did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Each example in the dataset contains a pair of sentences and is formatted as a dictionary with the key "set" and a list of the two sentences as the value. The first sentence is a question and the second its answer; thus, the two sentences are semantically similar.
{"set": [sentence_1, sentence_2]} {"set": [sentence_1, sentence_2]} ... {"set": [sentence_1, sentence_2]}
This dataset is useful for training Sentence Transformers models. Refer to the following post on how to train models using pairs of similar sentences.
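As a rough illustration of how such training could look, here is a minimal sketch using the sentence-transformers library with a multiple-negatives ranking loss, which treats the other answers in a batch as negatives. The base model, subset size, batch size, and number of epochs below are illustrative choices, not taken from this card or the post, and the sketch assumes 🤗 Datasets and sentence-transformers are installed.

# Minimal sketch: fine-tuning a Sentence Transformers model on (question, answer) pairs.
# Model name, subset size, and hyperparameters are illustrative.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

dataset = load_dataset("embedding-data/PAQ_pairs", split="train")

# Convert a small slice of the dataset into InputExample objects (question, answer).
train_examples = [
    InputExample(texts=[row["set"][0], row["set"][1]])
    for row in dataset.select(range(10_000))
]

model = SentenceTransformer("distilbert-base-uncased")  # illustrative base model
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch answers act as negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)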
Install the 🤗 Datasets library with pip install datasets and load the dataset from the Hub with:
from datasets import load_dataset
dataset = load_dataset("embedding-data/PAQ_pairs")
The dataset is loaded as a DatasetDict and has the format:
DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 64371441
    })
})
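Because the train split has over 64 million rows, you may prefer not to download it in full. A sketch of loading it in streaming mode, which 🤗 Datasets supports via the streaming argument (the variable name streamed is just for illustration):

from datasets import load_dataset

# Stream examples instead of downloading the entire split.
streamed = load_dataset("embedding-data/PAQ_pairs", split="train", streaming=True)
for example in streamed.take(3):
    print(example["set"])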
Review an example i with:
dataset["train"][i]["set"]
The PAQ QA pairs and metadata are licensed under CC-BY-SA. Other data is licensed according to the accompanying license files.
@article{lewis2021paq,
    title={PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them},
    author={Patrick Lewis and Yuxiang Wu and Linqing Liu and Pasquale Minervini and Heinrich Küttler and Aleksandra Piktus and Pontus Stenetorp and Sebastian Riedel},
    year={2021},
    eprint={2102.07033},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Thanks to @patrick-s-h-lewis for adding this dataset.