Model:
dandelin/vilt-b32-finetuned-coco
Vision-and-Language Transformer (ViLT) model fine-tuned on COCO. It was introduced in the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Kim et al. and first released in this repository.
Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.
You can use the model for image and text retrieval.
Here is how to use the model in PyTorch:
from transformers import ViltProcessor, ViltForImageAndTextRetrieval
import requests
from PIL import Image

# load an example image and a set of candidate captions
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-coco")
model = ViltForImageAndTextRetrieval.from_pretrained("dandelin/vilt-b32-finetuned-coco")

# forward pass: score how well each candidate text matches the image
scores = dict()
for text in texts:
    # prepare inputs
    encoding = processor(image, text, return_tensors="pt")
    outputs = model(**encoding)
    scores[text] = outputs.logits[0, :].item()
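Once the scores are computed, you can rank the candidate texts to find the best match for the image. Below is a minimal sketch that continues the snippet above; it only assumes the scores dictionary and texts list defined there.

# rank candidate captions by their image-text matching score (continues the snippet above)
best_text = max(scores, key=scores.get)
print(f"Best matching caption: {best_text} (score: {scores[best_text]:.2f})")

# optionally, list all candidates from best to worst match
for text, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.2f}\t{text}")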
Training data: (to do)
Preprocessing: (to do)
Pretraining: (to do)
Evaluation results: (to do)
@misc{kim2021vilt,
      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
      year={2021},
      eprint={2102.03334},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}