模型:

OFA-Sys/chinese-clip-vit-large-patch14

中文

Chinese-CLIP-ViT-Large-Patch14

Introduction

This is the large-version of the Chinese CLIP, with ViT-L/14 as the image encoder and RoBERTa-wwm-base as the text encoder. Chinese CLIP is a simple implementation of CLIP on a large-scale dataset of around 200 million Chinese image-text pairs. For more details, please refer to our technical report https://arxiv.org/abs/2211.01335 and our official github repo https://github.com/OFA-Sys/Chinese-CLIP (Welcome to star! ??)

Use with the official API

We provide a simple code snippet to show how to use the API of Chinese-CLIP to compute the image & text embeddings and similarities.

from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14")
processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-large-patch14")

url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize

# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # probs: [[0.0066, 0.0211, 0.0031, 0.9692]]

However, if you are not satisfied with only using the API, feel free to check our github repo https://github.com/OFA-Sys/Chinese-CLIP for more details about training and inference.

Results

MUGE Text-to-Image Retrieval :

Setup Zero-shot Finetune
Metric R@1 R@5 R@10 MR R@1 R@5 R@10 MR
Wukong 42.7 69.0 78.0 63.2 52.7 77.9 85.6 72.1
R2D2 49.5 75.7 83.2 69.5 60.1 82.9 89.4 77.5
CN-CLIP 63.0 84.1 89.2 78.8 68.9 88.7 93.1 83.6

Flickr30K-CN Retrieval :

Task Text-to-Image Image-to-Text
Setup Zero-shot Finetune Zero-shot Finetune
Metric R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Wukong 51.7 78.9 86.3 77.4 94.5 97.0 76.1 94.8 97.5 92.7 99.1 99.6
R2D2 60.9 86.8 92.7 84.4 96.7 98.4 77.6 96.7 98.9 95.6 99.8 100.0
CN-CLIP 71.2 91.4 95.5 83.8 96.9 98.6 81.6 97.5 98.8 95.3 99.7 100.0

COCO-CN Retrieval :

Task Text-to-Image Image-to-Text
Setup Zero-shot Finetune Zero-shot Finetune
Metric R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Wukong 53.4 80.2 90.1 74.0 94.4 98.1 55.2 81.0 90.6 73.3 94.0 98.0
R2D2 56.4 85.0 93.1 79.1 96.5 98.9 63.3 89.3 95.7 79.3 97.1 98.7
CN-CLIP 69.2 89.9 96.1 81.5 96.9 99.1 63.0 86.6 92.9 83.5 97.3 99.2

Zero-shot Image Classification :

Task CIFAR10 CIFAR100 DTD EuroSAT FER FGVC KITTI MNIST PC VOC
GIT 88.5 61.1 42.9 43.4 41.4 6.7 22.1 68.9 50.0 80.2
ALIGN 94.9 76.8 66.1 52.1 50.8 25.0 41.2 74.0 55.2 83.0
CLIP 94.9 77.0 56.0 63.0 48.3 33.3 11.5 79.0 62.3 84.0
Wukong 95.4 77.1 40.9 50.3 - - - - - -
CN-CLIP 96.0 79.7 51.2 52.0 55.1 26.2 49.9 79.4 63.5 84.9

Citation

If you find Chinese CLIP helpful, feel free to cite our paper. Thanks for your support!

@article{chinese-clip,
  title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
  author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
  journal={arXiv preprint arXiv:2211.01335},
  year={2022}
}