Dataset:
embedding-data/sentence-compression
Dataset with pairs of equivalent sentences. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from using the dataset.
Disclaimer: The team releasing sentence-compression did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Each example in the dataset is a pair of equivalent sentences, formatted as a dictionary with the key "set" whose value is a list containing the two sentences.
{"set": [sentence_1, sentence_2]} {"set": [sentence_1, sentence_2]} ... {"set": [sentence_1, sentence_2]}
This dataset is useful for training Sentence Transformers models on pairs of equivalent sentences; a training sketch is shown at the end of this card.
Install the 🤗 Datasets library with pip install datasets and load the dataset from the Hub with:
from datasets import load_dataset
dataset = load_dataset("embedding-data/sentence-compression")
The dataset is loaded as a DatasetDict and has the format:
DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 180000
    })
})
Review the example at index i with:
dataset["train"][i]["set"]