Dataset:
embedding-data/simple-wiki
This dataset contains pairs of equivalent sentences obtained from Wikipedia.
Each example in the dataset is a dictionary with the key "set", whose value is a list containing a pair of equivalent sentences:
{"set": [sentence_1, sentence_2]}
{"set": [sentence_1, sentence_2]}
...
{"set": [sentence_1, sentence_2]}
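As a quick illustration of this shape, each record parses like an ordinary JSON object. A minimal sketch using stand-in sentences (not actual rows from the dataset):

```python
import json

# Hypothetical records in the JSON-lines shape described above
lines = [
    '{"set": ["A cat sat.", "A cat was sitting."]}',
    '{"set": ["Paris is in France.", "Paris lies in France."]}',
]

for line in lines:
    example = json.loads(line)
    # Each record holds one pair of equivalent sentences
    sentence_1, sentence_2 = example["set"]
    print(sentence_1, "<->", sentence_2)
```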
This dataset is useful for training Sentence Transformers models. Refer to the following post on how to train models using similar sentences.
Install the 🤗 Datasets library with pip install datasets and load the dataset from the Hub with:
from datasets import load_dataset

dataset = load_dataset("embedding-data/simple-wiki")
The dataset is loaded as a DatasetDict and has the format:
DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 102225
    })
})
Review an example at index i with:
dataset["train"][i]["set"]
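For training, each "set" pair typically becomes one positive training example. A minimal sketch of that conversion, using plain tuples and hypothetical rows so it runs without extra dependencies (in practice you would wrap each pair in your trainer's example class, e.g. sentence-transformers' InputExample):

```python
# Hypothetical rows in the dataset's format (stand-ins, not real data)
rows = [
    {"set": ["The dog barked.", "The dog was barking."]},
    {"set": ["Rome is old.", "Rome is an old city."]},
]

# Each equivalent-sentence pair becomes one positive training pair
train_pairs = [tuple(row["set"]) for row in rows]
print(len(train_pairs))  # one pair per row
```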