YFCC100M subset from OpenAI

Subset of YFCC100M used by OpenAI for CLIP, filtered to contain only the images that we could retrieve.

Split	train	validation
Number of samples	14,808,859	16,374
Size	1.9 TB	2.1 GB

Features:

from the original dataset: title, description, photoid, uid, unickname, datetaken, dateuploaded, capturedevice, usertags, machinetags, longitude, latitude, accuracy, pageurl, downloadurl, licensename, licenseurl, serverid, farmid, secret, secretoriginal, ext, marker, key
img: image content as raw bytes; can be loaded with PIL.Image.open(io.BytesIO(item['img']))
title_clean and description_clean: derived from title and description using the clean_text function detailed below

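import re
import urllib.parse

def clean_text(text):
    # decode URL-encoded text ("+" becomes a space, "%XX" becomes the character)
    text = urllib.parse.unquote_plus(text)
    # remove HTML tags
    text = re.sub(r'<[^<]+?>', '', text)
    # collapse runs of whitespace, including "\r", "\n" and "\t", into single spaces
    text = " ".join(text.split())
    return text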
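
The following is a minimal sketch of how one record might be processed, assuming `item` is a plain dict keyed by the feature names listed above. `clean_text` is repeated so the snippet runs on its own; `add_clean_fields` and `decode_image` are hypothetical helper names introduced here for illustration and are not part of the dataset. `decode_image` assumes Pillow is installed.

```python
import io
import re
import urllib.parse


def clean_text(text):
    # Same steps as the function above: URL-decode, strip tags, collapse whitespace.
    text = urllib.parse.unquote_plus(text)
    text = re.sub(r'<[^<]+?>', '', text)
    return " ".join(text.split())


def add_clean_fields(item):
    # Hypothetical helper: returns a copy of the record with the derived fields added.
    out = dict(item)
    out["title_clean"] = clean_text(item["title"])
    out["description_clean"] = clean_text(item["description"])
    return out


def decode_image(item):
    # Hypothetical helper: wraps the raw bytes in a file-like object for Pillow.
    from PIL import Image  # imported lazily so the text helpers work without Pillow
    return Image.open(io.BytesIO(item["img"]))


# Made-up example record, not taken from the dataset.
example = {
    "title": "Sunset+at+the+%3Cb%3Ebeach%3C%2Fb%3E%0A+2010",
    "description": "A+quiet+evening%2C%09by+the+sea",
}
cleaned = add_clean_fields(example)
print(cleaned["title_clean"])
print(cleaned["description_clean"])
```

The original title and description fields are left untouched, so the raw and cleaned text can be compared side by side.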

Author:

dalle-mini

Dataset size:

1.04 GB