Dataset Card for "wiki_dpr"

Dataset Summary

This is the Wikipedia split used to evaluate the Dense Passage Retrieval (DPR) model. It contains 21M passages from Wikipedia along with their DPR embeddings. Each Wikipedia article was split into multiple disjoint text blocks of 100 words, and each block is one passage.

The passages come from the Wikipedia dump of Dec. 20, 2018.
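The splitting into disjoint 100-word blocks can be illustrated with a minimal sketch. It assumes plain whitespace tokenisation and a hypothetical helper name, `split_into_passages`; the preprocessing actually used to build the dataset is not described in this card and may differ.

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into disjoint blocks of at most `words_per_passage` words.

    Assumption: words are separated by whitespace. The real pipeline may
    tokenise differently.
    """
    if words_per_passage <= 0:
        raise ValueError("words_per_passage must be positive")
    words = text.split()
    # Step through the word list in non-overlapping windows.
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# A synthetic 250-word "article" yields two full passages and one short one.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])  # [100, 100, 50]
```

The last block of an article can be shorter than 100 words, which matches the "at most 100 words" wording used for data instances.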

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

Each instance contains a paragraph of at most 100 words, the title of the Wikipedia page it comes from, and the DPR embedding of the paragraph (a 768-dimensional vector).
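As a rough illustration of how such embeddings are typically used, the sketch below ranks passages by the inner product between a question vector and each passage vector, which is the similarity DPR is built around. The 3-dimensional vectors, the question vector, and the titles "Passage B" and "Passage C" are made up for readability; real records carry 768 values, and the helper names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def rank_passages(
    question_embedding: Sequence[float],
    passages: List[Dict],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (title, score) pairs, highest score first."""
    scored = [
        (p["title"], inner_product(question_embedding, p["embeddings"]))
        for p in passages
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy records shaped like the instances in this dataset (values are invented).
toy_passages = [
    {"id": "1", "title": "Aaron", "text": "...", "embeddings": [0.5, -0.5, 0.25]},
    {"id": "2", "title": "Passage B", "text": "...", "embeddings": [0.0, 1.0, 1.0]},
    {"id": "3", "title": "Passage C", "text": "...", "embeddings": [-1.0, 0.5, 0.0]},
]

print(rank_passages([1.0, 0.0, 2.0], toy_passages))
# [('Passage B', 2.0), ('Aaron', 1.0)]
```

A brute-force scan like this is only practical for toy data; with 21M passages an approximate or exact nearest-neighbour index would normally do this work.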

psgs_w100.multiset.compressed
  • Size of downloaded dataset files: 70.97 GB
  • Size of the generated dataset: 78.42 GB
  • Total amount of disk used: 152.26 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{'id': '1',
 'text': 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from',
 'title': 'Aaron',
 'embeddings': [-0.07233893871307373,
   0.48035329580307007,
   0.18650995194911957,
   -0.5287084579467773,
   -0.37329429388046265,
   0.37622880935668945,
   0.25524479150772095,
   ...
   -0.336689829826355,
   0.6313082575798035,
   -0.7025573253631592]}
psgs_w100.multiset.exact
  • Size of downloaded dataset files: 70.97 GB
  • Size of the generated dataset: 78.42 GB
  • Total amount of disk used: 187.38 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{'id': '1',
 'text': 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from',
 'title': 'Aaron',
 'embeddings': [-0.07233893871307373,
   0.48035329580307007,
   0.18650995194911957,
   -0.5287084579467773,
   -0.37329429388046265,
   0.37622880935668945,
   0.25524479150772095,
   ...
   -0.336689829826355,
   0.6313082575798035,
   -0.7025573253631592]}
psgs_w100.multiset.no_index
  • Size of downloaded dataset files: 70.97 GB
  • Size of the generated dataset: 78.42 GB
  • Total amount of disk used: 149.38 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{'id': '1',
 'text': 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from',
 'title': 'Aaron',
 'embeddings': [-0.07233893871307373,
   0.48035329580307007,
   0.18650995194911957,
   -0.5287084579467773,
   -0.37329429388046265,
   0.37622880935668945,
   0.25524479150772095,
   ...
   -0.336689829826355,
   0.6313082575798035,
   -0.7025573253631592]}
psgs_w100.nq.compressed
  • Size of downloaded dataset files: 70.97 GB
  • Size of the generated dataset: 78.42 GB
  • Total amount of disk used: 152.26 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{'id': '1',
 'text': 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from',
 'title': 'Aaron',
 'embeddings': [0.013342111371457577,
   0.582173764705658,
   -0.31309744715690613,
   -0.6991612911224365,
   -0.5583199858665466,
   0.5187504887580872,
   0.7152731418609619,
   ...
   -0.5385938286781311,
   0.8093984127044678,
   -0.4741983711719513]}
psgs_w100.nq.exact
  • Size of downloaded dataset files: 70.97 GB
  • Size of the generated dataset: 78.42 GB
  • Total amount of disk used: 187.38 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{'id': '1',
 'text': 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from',
 'title': 'Aaron',
 'embeddings': [0.013342111371457577,
   0.582173764705658,
   -0.31309744715690613,
   -0.6991612911224365,
   -0.5583199858665466,
   0.5187504887580872,
   0.7152731418609619,
   ...
   -0.5385938286781311,
   0.8093984127044678,
   -0.4741983711719513]}
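The disk figures listed for each configuration above can be compared with a little arithmetic. The sketch below assumes that the gap between a configuration's total and the `no_index` total is the space taken by that configuration's index; the card does not state this, so treat it as a reading of the numbers rather than a documented fact. The `no_index` total is taken from `psgs_w100.multiset.no_index`, the only `no_index` entry listed.

```python
# Figures copied from the configuration listings in this card (GB).
DOWNLOAD_GB = 70.97
GENERATED_GB = 78.42
TOTAL_DISK_GB = {
    "compressed": 152.26,
    "exact": 187.38,
    "no_index": 149.38,
}


def overhead_vs_no_index(variant: str) -> float:
    """Extra disk used by `variant` relative to the no_index total, in GB.

    Assumption: this difference is attributable to the index.
    """
    return round(TOTAL_DISK_GB[variant] - TOTAL_DISK_GB["no_index"], 2)


# Download plus generated data roughly matches the no_index total.
print(round(DOWNLOAD_GB + GENERATED_GB, 2))  # 149.39
print(overhead_vs_no_index("compressed"))    # 2.88
print(overhead_vs_no_index("exact"))         # 38.0
```

Under that assumption, the `exact` variants cost roughly 38 GB more disk than `no_index`, while the `compressed` variants cost under 3 GB more.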

Data Fields

The data fields are the same across all configurations.

psgs_w100.multiset.compressed
  • id : a string feature.
  • text : a string feature.
  • title : a string feature.
  • embeddings : a list of float32 features.
psgs_w100.multiset.exact
  • id : a string feature.
  • text : a string feature.
  • title : a string feature.
  • embeddings : a list of float32 features.
psgs_w100.multiset.no_index
  • id : a string feature.
  • text : a string feature.
  • title : a string feature.
  • embeddings : a list of float32 features.
psgs_w100.nq.compressed
  • id : a string feature.
  • text : a string feature.
  • title : a string feature.
  • embeddings : a list of float32 features.
psgs_w100.nq.exact
  • id : a string feature.
  • text : a string feature.
  • title : a string feature.
  • embeddings : a list of float32 features.
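A small check of these fields could look like the sketch below. The function name `validate_record` is hypothetical, and the expected length of 768 comes from the instance description in this card rather than from the field list itself.

```python
from typing import Dict, List

EXPECTED_DIM = 768  # embedding size stated in the Data Instances description


def validate_record(record: Dict, dim: int = EXPECTED_DIM) -> List[str]:
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = []
    # id, text and title are all described as string features.
    for field in ("id", "text", "title"):
        if not isinstance(record.get(field), str):
            problems.append(f"{field} should be a string")
    embeddings = record.get("embeddings")
    if not isinstance(embeddings, list) or not all(
        isinstance(value, float) for value in embeddings
    ):
        problems.append("embeddings should be a list of floats")
    elif len(embeddings) != dim:
        problems.append(f"embeddings should have {dim} values, found {len(embeddings)}")
    return problems


good = {"id": "1", "text": "Aaron ...", "title": "Aaron", "embeddings": [0.0] * 768}
bad = {"id": 1, "text": "Aaron ...", "title": "Aaron", "embeddings": [0.0] * 3}
print(validate_record(good))  # []
print(validate_record(bad))
# ['id should be a string', 'embeddings should have 768 values, found 3']
```

Note that `id` is a string such as `'1'`, not an integer, so lookups by numeric id need a conversion first.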

Data Splits

name                             train
psgs_w100.multiset.compressed    21015300
psgs_w100.multiset.exact         21015300
psgs_w100.multiset.no_index      21015300
psgs_w100.nq.compressed          21015300
psgs_w100.nq.exact               21015300
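The row count above gives a quick lower bound on how much space the embeddings alone occupy. This back-of-the-envelope sketch assumes 768 float32 values per passage at 4 bytes each and decimal gigabytes (10^9 bytes); whether the sizes quoted in this card use decimal or binary units is not stated, and on-disk storage adds the text fields and format overhead on top.

```python
NUM_PASSAGES = 21_015_300   # rows in the train split of every configuration
EMBEDDING_DIM = 768         # values per embedding
BYTES_PER_FLOAT32 = 4       # assumption: stored as 4-byte floats


def raw_embedding_bytes(rows: int = NUM_PASSAGES, dim: int = EMBEDDING_DIM) -> int:
    """Bytes needed for the embedding matrix alone, without any overhead."""
    return rows * dim * BYTES_PER_FLOAT32


total_bytes = raw_embedding_bytes()
print(total_bytes)                    # 64559001600
print(round(total_bytes / 1e9, 2))    # 64.56
```

Roughly 64.56 GB of raw vectors is consistent with the 78.42 GB generated dataset size reported for each configuration, with the remainder going to text, titles, ids, and storage overhead.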

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@misc{karpukhin2020dense,
    title={Dense Passage Retrieval for Open-Domain Question Answering},
    author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
    year={2020},
    eprint={2004.04906},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Contributions

Thanks to @thomwolf, @lewtun, @lhoestq for adding this dataset.