Dataset:

dalle-mini/YFCC100M_OpenAI_subset

arXiv:

arxiv:1503.01817

YFCC100M subset from OpenAI

Subset of YFCC100M used by OpenAI for CLIP, filtered to contain only the images that we could retrieve.

| Split             | train      | validation |
|-------------------|------------|------------|
| Number of samples | 14,808,859 | 16,374     |
| Size              | 1.9 TB     | 2.1 GB     |

Features:

  • from the original dataset: title, description, photoid, uid, unickname, datetaken, dateuploaded, capturedevice, usertags, machinetags, longitude, latitude, accuracy, pageurl, downloadurl, licensename, licenseurl, serverid, farmid, secret, secretoriginal, ext, marker, key
  • img: image content, can be loaded with PIL.Image.open(io.BytesIO(item['img']))
  • title_clean and description_clean: derived from title and description using the clean_text function detailed below
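import re
import urllib.parse


def clean_text(text):
    # decode URL-encoded text ("+" becomes a space, "%xx" escapes are decoded)
    text = urllib.parse.unquote_plus(text)
    # remove HTML tags
    text = re.sub(r'<[^<]+?>', '', text)
    # collapse runs of whitespace, including "\r", "\n" and "\t", into single spaces
    text = " ".join(text.split())
    return text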
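
As a rough illustration of how title_clean and description_clean relate to title and description, the sketch below applies the same three steps to a made-up record. The `item` dictionary and its values are hypothetical and are not taken from the dataset; only the field names come from the feature list above.

```python
import re
import urllib.parse


def clean_text(text):
    # Same three steps as the clean_text function above.
    text = urllib.parse.unquote_plus(text)
    text = re.sub(r"<[^<]+?>", "", text)
    return " ".join(text.split())


# Hypothetical record: real values come from the dataset rows.
item = {
    "title": "Sunset+at+the+beach%21",
    "description": (
        "<b>Golden</b>+hour%0A+over++the+"
        "<a+href=%22http://example.com%22>bay</a>"
    ),
}

# Derive the cleaned fields the same way the feature list describes.
item["title_clean"] = clean_text(item["title"])
item["description_clean"] = clean_text(item["description"])

print(item["title_clean"])        # Sunset at the beach!
print(item["description_clean"])  # Golden hour over the bay
```

The order of the steps matters here: the text is URL-decoded first, so encoded tags such as `<a+href=...>` become ordinary HTML before the tag-stripping regex runs.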