License: unknown

SBU Captioned Photo Dataset is a collection of associated captions and images from Flickr.
This dataset doesn't download the images locally by default. Instead, it exposes URLs to the images. To fetch the images, use the following code:
```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import io
import urllib.request

import PIL.Image

from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent


USER_AGENT = get_datasets_user_agent()


def fetch_single_image(image_url, timeout=None, retries=0):
    # Try up to retries + 1 times; return None if every attempt fails.
    image = None
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": USER_AGENT},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image


def fetch_images(batch, num_threads, timeout=None, retries=0):
    # Download all URLs of a batch in parallel and store the results in "image".
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
    return batch


num_threads = 20
dset = load_dataset("sbu_captions")
dset = dset.map(fetch_images, batched=True, batch_size=100, fn_kwargs={"num_threads": num_threads})
```
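Because `fetch_single_image` returns `None` when every attempt fails, some rows will end up without an image (for example when a photo has been removed from Flickr). The following is a minimal sketch of how such rows could be separated from the successful ones. The helper name `split_fetched` and the toy batch are hypothetical and not part of the dataset or the `datasets` library; the batch only mimics the dict-of-lists shape that `fetch_images` returns.

```python
def split_fetched(batch):
    """Separate rows whose image was fetched from rows where it is None."""
    keys = list(batch.keys())
    ok = {k: [] for k in keys}
    failed = {k: [] for k in keys}
    for i, image in enumerate(batch["image"]):
        # A None image marks a download that failed on every attempt.
        target = failed if image is None else ok
        for k in keys:
            target[k].append(batch[k][i])
    return ok, failed


# Toy batch standing in for the output of fetch_images. The strings are
# placeholders for PIL images and the URLs are made up for illustration.
batch = {
    "image_url": [
        "http://example.com/a.jpg",
        "http://example.com/b.jpg",
        "http://example.com/c.jpg",
    ],
    "caption": ["a", "b", "c"],
    "image": ["<image a>", None, "<image c>"],
}
ok, failed = split_fetched(batch)
print(len(ok["image"]), "fetched;", len(failed["image"]), "failed")
# prints "2 fetched; 1 failed"
```

On the loaded dataset itself, the same effect can likely be had with `dset.filter(lambda example: example["image"] is not None)`, at the cost of not keeping the failed URLs around for a later retry.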
All captions are in English.
Each instance in SBU Captioned Photo Dataset represents a single image with a caption and a user_id:
```
{
  'image_url': 'http://static.flickr.com/2723/4385058960_b0f291553e.jpg',
  'user_id': '47889917@N08',
  'caption': 'A wooden chair in the living room'
}
```
All the data is contained in the training split. The training set has 1M instances.
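Since no validation or test split is provided, anyone who needs held-out data has to carve it out of the training split. One possible approach, sketched here with the standard library only, is to hash a stable key such as the image URL so that each example always lands in the same split. The function name `assign_split`, the `validation_fraction` parameter, and the example URLs are assumptions made for illustration; none of them come from the dataset.

```python
import hashlib


def assign_split(key, validation_fraction=0.01):
    """Map a string key to 'train' or 'validation' deterministically."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and scale to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < validation_fraction else "train"


# Made-up URLs standing in for the dataset's image URLs.
urls = [f"http://example.com/{i}.jpg" for i in range(1000)]
splits = [assign_split(u, validation_fraction=0.1) for u in urls]
print(splits.count("validation"), "of", len(urls), "assigned to validation")
```

A hash-based assignment stays stable across runs and machines without storing a seed. If a random split is acceptable, `Dataset.train_test_split` from the `datasets` library is the simpler route.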
From the paper:
One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results.
The source images come from Flickr.
Initial Data Collection and Normalization

One key contribution of our paper is a novel web-scale database of photographs with associated descriptive text. To enable effective captioning of novel images, this database must be good in two ways: 1) It must be large so that image based matches to a query are reasonably similar, 2) The captions associated with the data base photographs must be visually relevant so that transferring captions between pictures is useful. To achieve the first requirement we query Flickr using a huge number of pairs of query terms (objects, attributes, actions, stuff, and scenes). This produces a very large, but noisy initial set of photographs with associated text.
Who are the source language producers?

The Flickr users.
Text descriptions associated with the images are inherited as annotations/captions.
Who are the annotators?

The Flickr users.
Vicente Ordonez, Girish Kulkarni and Tamara L. Berg.
Not specified.
```bibtex
@inproceedings{NIPS2011_5dd9db5e,
  author = {Ordonez, Vicente and Kulkarni, Girish and Berg, Tamara},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {J. Shawe-Taylor and R. Zemel and P. Bartlett and F. Pereira and K.Q. Weinberger},
  pages = {},
  publisher = {Curran Associates, Inc.},
  title = {Im2Text: Describing Images Using 1 Million Captioned Photographs},
  url = {https://proceedings.neurips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf},
  volume = {24},
  year = {2011}
}
```
Thanks to @thomasw21 for adding this dataset.