Dataset:
embedding-data/WikiAnswers
The WikiAnswers corpus contains clusters of questions tagged by WikiAnswers users as paraphrases. Each cluster optionally contains an answer provided by WikiAnswers users. There are 30,370,994 clusters containing an average of 25 questions per cluster. 3,386,256 (11%) of the clusters have an answer.
Each example in the dataset contains 25 equivalent sentences and is formatted as a dictionary with the key `"set"` mapping to a list of the sentences:
{"set": [sentence_1, sentence_2, ..., sentence_25]} {"set": [sentence_1, sentence_2, ..., sentence_25]} ... {"set": [sentence_1, sentence_2, ..., sentence_25]}
This dataset is useful for training Sentence Transformers models. Refer to the post on how to train models using similar sentences; a minimal training sketch is also included after the loading steps below.
Install the 🤗 Datasets library with `pip install datasets` and load the dataset from the Hub with:
```python
from datasets import load_dataset

dataset = load_dataset("embedding-data/WikiAnswers")
```
The dataset is loaded as a `DatasetDict` and has the following format for a dataset of N examples:
```python
DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: N
    })
})
```
Review an example `i` with:

```python
dataset["train"][i]["set"]
```
```bibtex
@inproceedings{Fader14,
  author    = {Anthony Fader and Luke Zettlemoyer and Oren Etzioni},
  title     = {{Open Question Answering Over Curated and Extracted Knowledge Bases}},
  booktitle = {KDD},
  year      = {2014}
}
```