Datasets:
dalle-mini/YFCC100M_OpenAI_subset
ArXiv:
arxiv:1503.01817

Subset of YFCC100M used by OpenAI for CLIP, filtered to contain only the images that we could retrieve.

| Split | train | validation |
|---|---|---|
| Number of samples | 14,808,859 | 16,374 |
| Size | 1.9 TB | 2.1 GB |

Features:

```python
import re
import urllib.parse


def clean_text(text):
    # decode url
    text = urllib.parse.unquote_plus(text)
    # remove html tags
    text = re.sub('<[^<]+?>', '', text)
    # remove multiple spaces + "\r" + "\n" + "\t"
    text = " ".join(text.split())
    return text
```
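
To make the effect of this cleaning concrete, here is a minimal sketch that applies `clean_text` to a made-up string. The `raw` value is a hypothetical example, not a record from the dataset; the function is repeated so the snippet runs on its own.

```python
import re
import urllib.parse


def clean_text(text):
    # Decode URL-encoded text ("+" becomes a space, "%0A" a newline, ...)
    text = urllib.parse.unquote_plus(text)
    # Remove HTML tags
    text = re.sub('<[^<]+?>', '', text)
    # Collapse runs of whitespace (spaces, "\r", "\n", "\t") into one space
    text = " ".join(text.split())
    return text


# Hypothetical URL-encoded input with HTML tags and a newline
raw = "Sunset+at+the+beach%0A<b>Golden+hour</b>%20%20shot"
cleaned = clean_text(raw)
print(cleaned)  # Sunset at the beach Golden hour shot
```

Note the order: the text is URL-decoded first, so tags that were themselves percent-encoded are also stripped by the regex step.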