Model:

OFA-Sys/chinese-clip-vit-base-patch16

Chinese-CLIP-ViT-Base-Patch16

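Introduction

This is the base version of Chinese CLIP, with ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP trained on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official GitHub repository https://github.com/OFA-Sys/Chinese-CLIP (stars are welcome!)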

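Use with the official API

We provide a simple code snippet showing how to use the Chinese CLIP API to compute image and text embeddings and their similarities.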

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image features
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]]

However, if using the API alone is not enough for your needs, feel free to check our GitHub repository https://github.com/OFA-Sys/Chinese-CLIP for more details about training and inference.
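To make the link between the normalized features and the similarity scores in the snippet above more concrete, here is a minimal pure-Python sketch of how CLIP-style models typically turn embeddings into probabilities: a cosine similarity, multiplied by a scale factor, followed by a softmax over the candidate texts. The toy vectors, the helper names, and the `logit_scale=100.0` value are illustrative assumptions, not values taken from the released model, whose scale is a learned parameter.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def image_text_probs(image_feature, text_features, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts.

    logit_scale is an assumed illustrative value; the real model learns its own.
    """
    img = l2_normalize(image_feature)
    logits = []
    for text_feature in text_features:
        txt = l2_normalize(text_feature)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy 3-dimensional embeddings (hypothetical, for illustration only).
image_feature = [0.9, 0.1, 0.2]
text_features = [[0.1, 0.9, 0.0], [0.0, 0.2, 0.9], [0.8, 0.2, 0.1]]

probs = image_text_probs(image_feature, text_features)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, probs)
```

Because both sides are normalized first, the ranking depends only on the angle between embeddings; the scale factor controls how sharply the softmax concentrates on the best match.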

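Results

MUGE Text-to-Image Retrieval:

| Method | Zero-shot R@1 | Zero-shot R@5 | Zero-shot R@10 | Zero-shot MR | Finetune R@1 | Finetune R@5 | Finetune R@10 | Finetune MR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wukong | 42.7 | 69.0 | 78.0 | 63.2 | 52.7 | 77.9 | 85.6 | 72.1 |
| R2D2 | 49.5 | 75.7 | 83.2 | 69.5 | 60.1 | 82.9 | 89.4 | 77.5 |
| CN-CLIP | 63.0 | 84.1 | 89.2 | 78.8 | 68.9 | 88.7 | 93.1 | 83.6 |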
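The MR values in the MUGE table are consistent with the mean of R@1, R@5, and R@10 (for example, (63.0 + 84.1 + 89.2) / 3 ≈ 78.8 for zero-shot CN-CLIP). The sketch below shows one common way to compute Recall@K and this mean recall. It assumes the usual retrieval definition, where a query counts as a hit if any relevant item appears in its top K results; the function names and toy rankings are hypothetical and are not taken from the official evaluation scripts.

```python
def recall_at_k(rankings, relevant_sets, k):
    """Percentage of queries with at least one relevant item in the top k."""
    hits = 0
    for ranking, relevant in zip(rankings, relevant_sets):
        if any(item in relevant for item in ranking[:k]):
            hits += 1
    return 100.0 * hits / len(rankings)


def mean_recall(r1, r5, r10):
    """Mean of R@1, R@5 and R@10 (assumed to be how MR is derived)."""
    return (r1 + r5 + r10) / 3.0


# Toy example: three queries, each with a ranked list of candidate ids.
rankings = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]
relevant_sets = [{"a"}, {"b"}, {"a"}]

r_at_1 = recall_at_k(rankings, relevant_sets, 1)  # only the first query hits
r_at_3 = recall_at_k(rankings, relevant_sets, 3)  # every query hits

# Sanity check against the zero-shot CN-CLIP row of the MUGE table.
mr_cn_clip = round(mean_recall(63.0, 84.1, 89.2), 1)
print(r_at_1, r_at_3, mr_cn_clip)
```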

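Flickr30K-CN Retrieval (T2I: text-to-image, I2T: image-to-text):

| Method | T2I Zero-shot R@1 | T2I Zero-shot R@5 | T2I Zero-shot R@10 | T2I Finetune R@1 | T2I Finetune R@5 | T2I Finetune R@10 | I2T Zero-shot R@1 | I2T Zero-shot R@5 | I2T Zero-shot R@10 | I2T Finetune R@1 | I2T Finetune R@5 | I2T Finetune R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wukong | 51.7 | 78.9 | 86.3 | 77.4 | 94.5 | 97.0 | 76.1 | 94.8 | 97.5 | 92.7 | 99.1 | 99.6 |
| R2D2 | 60.9 | 86.8 | 92.7 | 84.4 | 96.7 | 98.4 | 77.6 | 96.7 | 98.9 | 95.6 | 99.8 | 100.0 |
| CN-CLIP | 71.2 | 91.4 | 95.5 | 83.8 | 96.9 | 98.6 | 81.6 | 97.5 | 98.8 | 95.3 | 99.7 | 100.0 |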

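COCO-CN Retrieval (T2I: text-to-image, I2T: image-to-text):

| Method | T2I Zero-shot R@1 | T2I Zero-shot R@5 | T2I Zero-shot R@10 | T2I Finetune R@1 | T2I Finetune R@5 | T2I Finetune R@10 | I2T Zero-shot R@1 | I2T Zero-shot R@5 | I2T Zero-shot R@10 | I2T Finetune R@1 | I2T Finetune R@5 | I2T Finetune R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wukong | 53.4 | 80.2 | 90.1 | 74.0 | 94.4 | 98.1 | 55.2 | 81.0 | 90.6 | 73.3 | 94.0 | 98.0 |
| R2D2 | 56.4 | 85.0 | 93.1 | 79.1 | 96.5 | 98.9 | 63.3 | 89.3 | 95.7 | 79.3 | 97.1 | 98.7 |
| CN-CLIP | 69.2 | 89.9 | 96.1 | 81.5 | 96.9 | 99.1 | 63.0 | 86.6 | 92.9 | 83.5 | 97.3 | 99.2 |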

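Zero-shot Image Classification:

| Method | CIFAR10 | CIFAR100 | DTD | EuroSAT | FER | FGVC | KITTI | MNIST | PC | VOC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GIT | 88.5 | 61.1 | 42.9 | 43.4 | 41.4 | 6.7 | 22.1 | 68.9 | 50.0 | 80.2 |
| ALIGN | 94.9 | 76.8 | 66.1 | 52.1 | 50.8 | 25.0 | 41.2 | 74.0 | 55.2 | 83.0 |
| CLIP | 94.9 | 77.0 | 56.0 | 63.0 | 48.3 | 33.3 | 11.5 | 79.0 | 62.3 | 84.0 |
| Wukong | 95.4 | 77.1 | 40.9 | 50.3 | - | - | - | - | - | - |
| CN-CLIP | 96.0 | 79.7 | 51.2 | 52.0 | 55.1 | 26.2 | 49.9 | 79.4 | 63.5 | 84.9 |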

Citation

If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}