Chinese-CLIP-ViT-Base-Patch16

介绍

这是中国CLIP的基础版，使用ViT-B/16作为图像编码器和RoBERTa-wwm-base作为文本编码器。中国CLIP是在大约2亿个中文图像-文本对上的大规模数据集上简单实现的CLIP。有关更多详细信息，请参阅我们的技术报告 https://arxiv.org/abs/2211.01335 和我们的官方GitHub仓库 https://github.com/OFA-Sys/Chinese-CLIP (欢迎点赞！🔥🔥)

使用官方API

我们提供了一个简单的代码片段，展示如何使用中国CLIP的API计算图像和文本的嵌入向量和相似度。

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]]

但是，如果您不满意只使用API，欢迎查看我们的GitHub仓库 https://github.com/OFA-Sys/Chinese-CLIP ，了解有关训练和推理的更多详细信息。

结果

MUGE文本到图像检索 :

Setup	Zero-shot	Finetune
Metric	R@1	R@5	R@10	MR	R@1	R@5	R@10	MR
Wukong	42.7	69.0	78.0	63.2	52.7	77.9	85.6	72.1
R2D2	49.5	75.7	83.2	69.5	60.1	82.9	89.4	77.5
CN-CLIP	63.0	84.1	89.2	78.8	68.9	88.7	93.1	83.6

Flickr30K-CN 检索 :

Task	Text-to-Image	Image-to-Text
Setup	Zero-shot	Finetune	Zero-shot	Finetune
Metric	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
Wukong	51.7	78.9	86.3	77.4	94.5	97.0	76.1	94.8	97.5	92.7	99.1	99.6
R2D2	60.9	86.8	92.7	84.4	96.7	98.4	77.6	96.7	98.9	95.6	99.8	100.0
CN-CLIP	71.2	91.4	95.5	83.8	96.9	98.6	81.6	97.5	98.8	95.3	99.7	100.0

COCO-CN 检索 :

Task	Text-to-Image	Image-to-Text
Setup	Zero-shot	Finetune	Zero-shot	Finetune
Metric	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
Wukong	53.4	80.2	90.1	74.0	94.4	98.1	55.2	81.0	90.6	73.3	94.0	98.0
R2D2	56.4	85.0	93.1	79.1	96.5	98.9	63.3	89.3	95.7	79.3	97.1	98.7
CN-CLIP	69.2	89.9	96.1	81.5	96.9	99.1	63.0	86.6	92.9	83.5	97.3	99.2

零样本图像分类 :

Task	CIFAR10	CIFAR100	DTD	EuroSAT	FER	FGVC	KITTI	MNIST	PC	VOC
GIT	88.5	61.1	42.9	43.4	41.4	6.7	22.1	68.9	50.0	80.2
ALIGN	94.9	76.8	66.1	52.1	50.8	25.0	41.2	74.0	55.2	83.0
CLIP	94.9	77.0	56.0	63.0	48.3	33.3	11.5	79.0	62.3	84.0
Wukong	95.4	77.1	40.9	50.3	-	-	-	-	-	-
CN-CLIP	96.0	79.7	51.2	52.0	55.1	26.2	49.9	79.4	63.5	84.9

引用

如果您觉得中国CLIP有帮助，请随意引用我们的论文。谢谢您的支持！

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}

作者:

OFA-Sys

数据集大小:

1.4 GB