Dataset:
dalle-mini/YFCC100M_OpenAI_subset
arXiv:
arxiv:1503.01817

Subset of YFCC100M used by OpenAI for CLIP, filtered to contain only the images that we could retrieve.
| Split | train | validation |
|---|---|---|
| Number of samples | 14,808,859 | 16,374 |
| Size | 1.9 TB | 2.1 GB |
Features:

    import re
    import urllib.parse

    def clean_text(text):
        # decode URL-encoded characters ("+" becomes a space)
        text = urllib.parse.unquote_plus(text)
        # remove HTML tags
        text = re.sub(r"<[^<]+?>", "", text)
        # collapse runs of whitespace, including "\r", "\n" and "\t"
        text = " ".join(text.split())
        return text
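
As a quick illustration of what these cleaning steps do, here is a minimal sketch that applies them to a made-up string. The `raw` value is hypothetical and not taken from the dataset; it only assumes a field that is URL-encoded and contains HTML markup. The function is repeated so the snippet runs on its own.

```python
import re
import urllib.parse


def clean_text(text):
    # Same three steps as the function above: URL-decode, strip HTML tags,
    # then collapse all whitespace runs into single spaces.
    text = urllib.parse.unquote_plus(text)
    text = re.sub(r"<[^<]+?>", "", text)
    return " ".join(text.split())


# Hypothetical raw value: URL-encoded, with a <b> tag and an encoded newline.
raw = "Sunset+at+the+%3Cb%3Ebeach%3C%2Fb%3E%0A++in+California"
cleaned = clean_text(raw)
print(cleaned)  # Sunset at the beach in California
```

Note that URL decoding happens first, so tags that were percent-encoded in the raw text are visible to the tag-stripping step.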