Dataset:
embedding-data/coco_captions_quintets
COCO is a large-scale object detection, segmentation, and captioning dataset. This repo contains five captions per image; useful for sentence similarity tasks.
Disclaimer: The team releasing COCO did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Each example in the dataset contains a quintet of similar sentences and is formatted as a dictionary with the key "set" whose value is a list of the five sentences:
{"set": [sentence_1, sentence_2, sentence3, sentence4, sentence5]} {"set": [sentence_1, sentence_2, sentence3, sentence4, sentence5]} ... {"set": [sentence_1, sentence_2, sentence3, sentence4, sentence5]}
This dataset is useful for training Sentence Transformers models. Refer to the following post on how to train models using similar pairs of sentences.
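One possible way to obtain similar pairs from a quintet is to enumerate every unordered pair of its sentences. This is only an illustrative sketch, not a procedure prescribed by the dataset authors; the helper name quintet_to_pairs and the placeholder sentences are hypothetical.

```python
from itertools import combinations


def quintet_to_pairs(example):
    """Return every unordered pair of sentences from one {"set": [...]} record."""
    return list(combinations(example["set"], 2))


# Placeholder record in the dataset's format; the strings are not real captions.
example = {"set": ["s1", "s2", "s3", "s4", "s5"]}
pairs = quintet_to_pairs(example)
print(len(pairs))  # 10, since five sentences yield C(5, 2) pairs
```

Each pair can then be treated as a positive example by whichever training setup is used for similar sentence pairs.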
Install the 🤗 Datasets library with pip install datasets and load the dataset from the Hub with:
from datasets import load_dataset

dataset = load_dataset("embedding-data/coco_captions_quintets")
The dataset is loaded as a DatasetDict and has the format:
DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 82783
    })
})
Review an example i with:
dataset["train"][i]["set"]
The annotations in this dataset along with this website belong to the COCO Consortium and are licensed under a Creative Commons Attribution 4.0 License.
Thanks to:
for adding this dataset.