模型:

tifa-benchmark/promptcap-coco-vqa

任务:

图生文

类库:

PyTorch Transformers

数据集:

coco textvqa VQAv2 OK-VQA A-OKVQA 3AA-OKVQA 3AOK-VQA 3AVQAv2 3Atextvqa 3Acoco

语言:

其他:

ofa 视觉问答 image-captioning

预印本库:

arxiv:2211.09699

许可:

openrail

模型介绍文件清单

中文

This is the repo for the paper PromptCap: Prompt-Guided Task-Aware Image Captioning

We introduce PromptCap, a captioning model that can be controlled by natural language instruction. The instruction may contain a question that the user is interested in. For example, "what is the boy putting on?". PromptCap also supports generic caption, using the question "what does the image describe?"

PromptCap can serve as a light-weight visual plug-in (much faster than BLIP-2) for LLM like GPT-3, ChatGPT, and other foundation models like Segment Anything and DINO. It achieves SOTA performance on COCO captioning (150 CIDEr). When paired with GPT-3, and conditioned on user question, PromptCap get SOTA performance on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA)

QuickStart

Installation

pip install promptcap

Two pipelines are included. One is for image captioning, and the other is for visual question answering.

Captioning Pipeline

Please follow the prompt format, which will give the best performance.

Generate a prompt-guided caption by following:

import torch
from promptcap import PromptCap

model = PromptCap("vqascore/promptcap-coco-vqa")  # also support OFA checkpoints. e.g. "OFA-Sys/ofa-large"

if torch.cuda.is_available():
  model.cuda()

prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))

To try generic captioning, just use "what does the image describe?"

prompt = "what does the image describe?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))

PromptCap also support taking OCR inputs:

prompt = "please describe this image according to the given question: what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(model.caption(prompt, image, ocr))

Visual Question Answering Pipeline

Different from typical VQA models, which are doing classification on VQAv2, PromptCap is open-domain and can be paired with arbitrary text-QA models. Here we provide a pipeline for combining PromptCap with UnifiedQA.

import torch
from promptcap import PromptCap_VQA

# QA model support all UnifiedQA variants. e.g. "allenai/unifiedqa-v2-t5-large-1251000"
vqa_model = PromptCap_VQA(promptcap_model="vqascore/promptcap-coco-vqa", qa_model="allenai/unifiedqa-t5-base")

if torch.cuda.is_available():
  vqa_model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(vqa_model.vqa(question, image))

Similarly, PromptCap supports OCR inputs

question = "what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(vqa_model.vqa(question, image, ocr=ocr))

Because of the flexibility of Unifiedqa, PromptCap also supports multiple-choice VQA

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
choices = ["gloves", "socks", "shoes", "coats"]
print(vqa_model.vqa_multiple_choice(question, image, choices))

Bibtex

@article{hu2022promptcap,
  title={PromptCap: Prompt-Guided Task-Aware Image Captioning},
  author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo},
  journal={arXiv preprint arXiv:2211.09699},
  year={2022}
}

作者:

TIFA

数据集大小:

2.28 GB