Wiki Entity Similarity

Usage:

from datasets import load_dataset

# Corpus config: positive (article, link_text) pairs only.
corpus = load_dataset('Exr0n/wiki-entity-similarity', '2018thresh20corpus', split='train')
assert corpus[0] == {'article': 'A1000 road', 'link_text': 'A1000', 'is_same': 1}

# Pairs config: positive and negative examples for training classifiers.
pairs = load_dataset('Exr0n/wiki-entity-similarity', '2018thresh20pairs', split='train')
assert pairs[0] == {'article': 'Rhinobatos', 'link_text': 'Ehinobatos beurleni', 'is_same': 1}
assert len(pairs) == 4_793_180

Corpus (name=*corpus)

The corpora in this dataset are generated by aggregating the link text that refers to each article in context. For instance, if wiki article A refers to article B as C, then C is added to the list of aliases for article B, and the pair (B, C) is included in the dataset.
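The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the project's generation script: the `(source, target, link_text)` triple format, the function names, and the toy links are all assumptions made for the example.

```python
from collections import defaultdict

def collect_aliases(links):
    """Group link texts by the article they point to.

    `links` is an iterable of (source_article, target_article, link_text)
    triples. This input format is an assumption for illustration only.
    """
    aliases = defaultdict(set)
    for _source, target, link_text in links:
        aliases[target].add(link_text)
    return aliases

def corpus_rows(aliases):
    """Flatten the alias map into (article, link_text) rows, all positive."""
    return [
        {"article": article, "link_text": text, "is_same": 1}
        for article in sorted(aliases)
        for text in sorted(aliases[article])
    ]

# Toy links (hypothetical): two source articles link to the same target.
toy_links = [
    ("Article X", "A1000 road", "A1000"),
    ("Article Y", "A1000 road", "A1000"),
    ("Article Y", "A1000 road", "the A1000"),
]
toy_rows = corpus_rows(collect_aliases(toy_links))
```

Using a set per article keeps only distinct link texts, which matches the "Number of Distinct Links" column in spirit; whether the real scripts deduplicate this way is not stated here.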

Following DPR (https://arxiv.org/pdf/2004.04906.pdf), we use the English Wikipedia dump from Dec. 20, 2018 as the source of documents for link collection.

The dataset includes three quality levels, distinguished by the minimum number of inbound links required to include an article in the dataset. This filtering is motivated by the heuristic "better articles have more citations."

| Min. Inbound Links | Number of Articles | Number of Distinct Links |
|---|---|---|
| 5 | 1,080,073 | 5,787,081 |
| 10 | 605,775 | 4,407,409 |
| 20 | 324,949 | 3,195,545 |
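The inbound-link threshold can be illustrated with a short sketch. It assumes that every link occurrence counts toward an article's inbound total (the text does not say whether repeated links are counted once or many times), and the function name and toy data are hypothetical.

```python
from collections import Counter

def filter_by_inbound_links(links, min_inbound):
    """Keep links whose target article has at least `min_inbound` inbound links.

    `links` is a list of (source_article, target_article, link_text) triples.
    Counting every occurrence as one inbound link is an assumption.
    """
    inbound = Counter(target for _source, target, _text in links)
    return [link for link in links if inbound[link[1]] >= min_inbound]

# Toy data (hypothetical): article "B" has two inbound links, "C" has one.
toy_links = [
    ("S1", "B", "b"),
    ("S2", "B", "bee"),
    ("S3", "C", "c"),
]
kept = filter_by_inbound_links(toy_links, min_inbound=2)
```

Raising `min_inbound` shrinks both the set of articles and the number of links, which is the trend the table shows across thresholds 5, 10, and 20.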

Training Pairs (name=*pairs)

This dataset also includes training-pair configurations, with both positive and negative examples, intended for training classifiers. Each corpus is split into train/dev/test at 75%/15%/10%.

Training Data Generation

The training pairs in this dataset are generated by taking each example from the corpus as a positive example, and creating a new negative example from the article title of the positive example and a random link text from a different article.
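A minimal sketch of this pairing scheme, assuming rejection sampling to find a row from a different article. The function name, the seed handling, and the toy corpus are illustrative assumptions; the actual generation scripts may differ.

```python
import random

def make_training_pairs(corpus, seed=0):
    """Emit one positive and one sampled negative per corpus row."""
    articles = {row["article"] for row in corpus}
    if len(articles) < 2:
        raise ValueError("need at least two distinct articles to sample negatives")
    rng = random.Random(seed)
    pairs = []
    for row in corpus:
        # The corpus row itself is the positive example.
        pairs.append({"article": row["article"], "link_text": row["link_text"], "is_same": 1})
        # Resample until the borrowed link text comes from a different article.
        other = rng.choice(corpus)
        while other["article"] == row["article"]:
            other = rng.choice(corpus)
        pairs.append({"article": row["article"], "link_text": other["link_text"], "is_same": 0})
    return pairs

# Toy corpus (hypothetical rows reusing article titles from the usage example).
toy_corpus = [
    {"article": "A1000 road", "link_text": "A1000", "is_same": 1},
    {"article": "A1000 road", "link_text": "the A1000", "is_same": 1},
    {"article": "Rhinobatos", "link_text": "Ehinobatos beurleni", "is_same": 1},
]
toy_pairs = make_training_pairs(toy_corpus)
```

Because each positive yields exactly one negative, the output is balanced by construction. Note that this sketch does not check whether a sampled link text happens to also be a valid alias of the original article.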

The articles featured in each split are disjoint from the other splits, and each split has the same number of positive (semantically the same) and negative (semantically different) examples.
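One way to obtain article-disjoint splits is to partition the article titles first and then assign each row to the split of its article. The sketch below assumes the 75/15/10 ratio is applied to articles rather than to rows, which the text does not specify; the function name and toy titles are hypothetical.

```python
import random

def split_articles(articles, percents=(75, 15, 10), seed=0):
    """Partition article titles into disjoint train/dev/test sets.

    Integer percentages avoid floating-point rounding when sizing splits.
    """
    titles = sorted(set(articles))
    random.Random(seed).shuffle(titles)
    n_train = len(titles) * percents[0] // 100
    n_dev = len(titles) * percents[1] // 100
    return {
        "train": set(titles[:n_train]),
        "dev": set(titles[n_train:n_train + n_dev]),
        # The remainder goes to test, so no article is dropped.
        "test": set(titles[n_train + n_dev:]),
    }

# Toy titles (hypothetical): 20 articles split into 15/3/2.
toy_titles = [f"Article {i}" for i in range(20)]
toy_splits = split_articles(toy_titles)
```

Generating negatives separately inside each split, after this partition, keeps every article title confined to a single split.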

For more details on the dataset motivation, see the paper. If you use this dataset in your work, please cite it using the arXiv reference.

Generation scripts can be found in the GitHub repo.