
#update (June 2023): added both soft and hard labels to visual_caption_cosine_score (th 0.2, 0.3, 0.4, and 0.5)

Introduction

Modern image captioning relies heavily on extracting knowledge from images, such as objects, to capture the concept of a static story in the image. In this paper, we propose a textual visual context dataset for captioning, in which the publicly available COCO caption dataset (Lin et al., 2014) is extended with information about the scene (such as objects in the image). Since this information has textual form, it can be used to incorporate any NLP task, such as text similarity or semantic relation methods, into captioning systems, either as an end-to-end training strategy or as a post-processing approach.

Please refer to the project page and GitHub for more information.

For a quick start, please have a look at this demo and the pre-trained model with th 0.2, 0.3, 0.4.

Overview

We enrich COCO-Caption with textual visual context information. We use ResNet152, CLIP, and Faster R-CNN to extract object information for each image. We use three filtering approaches to ensure the quality of the dataset: (1) threshold: to filter out predictions where the object classifier is not confident enough; (2) semantic alignment: to remove duplicated objects using semantic similarity; and (3) semantic relatedness score as soft-label: to guarantee that the visual context and the caption are strongly related. In particular, we use Sentence-RoBERTa-sts with cosine similarity to give a soft score, and then apply a threshold th to annotate the final hard label (1 if the score ≥ th, otherwise 0, with th = 0.2, 0.3, or 0.4). Finally, to take advantage of the visual overlap between the caption and the visual context, and to extract global information, we use BERT followed by a shallow 1D-CNN (Kim, 2014) to estimate the visual relatedness score.
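The three filtering steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 3-dimensional vectors stand in for Sentence-RoBERTa embeddings, and the classifier confidence cutoff (`min_conf`), the duplicate cutoff (`max_sim`), and all function names are assumptions made for illustration only. Only the label thresholds 0.2, 0.3, and 0.4 come from the description.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_by_confidence(predictions, min_conf):
    """Step 1: keep object labels whose classifier confidence is high enough."""
    return [label for label, conf in predictions if conf >= min_conf]

def drop_near_duplicates(labels, embed, max_sim):
    """Step 2: drop a label if it is too similar to one that was already kept."""
    kept = []
    for label in labels:
        if all(cosine(embed[label], embed[k]) < max_sim for k in kept):
            kept.append(label)
    return kept

def hard_label(score, th):
    """Step 3: turn a soft relatedness score into a 0/1 label."""
    return 1 if score >= th else 0

# Toy stand-ins for sentence embeddings (hypothetical values, not model output).
EMBED = {
    "pizza": [0.9, 0.1, 0.0],
    "pizza pie": [0.88, 0.12, 0.01],
    "umbrella": [0.0, 0.2, 0.9],
    "caption": [0.8, 0.3, 0.1],  # embedding of the human-written caption
}

# (label, classifier confidence) pairs; the values are made up for the example.
predictions = [("pizza", 0.91), ("pizza pie", 0.74), ("umbrella", 0.55), ("spatula", 0.05)]

confident = filter_by_confidence(predictions, min_conf=0.5)       # assumed cutoff
visual_context = drop_near_duplicates(confident, EMBED, max_sim=0.95)  # assumed cutoff

for label in visual_context:
    soft = cosine(EMBED[label], EMBED["caption"])
    hard = {th: hard_label(soft, th) for th in (0.2, 0.3, 0.4)}
    print(label, round(soft, 3), hard)
```

With these toy values, "pizza pie" is dropped as a near-duplicate of "pizza", and "umbrella" receives the hard label 0 at every threshold because its soft score with the caption falls below 0.2.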

Download

  • Download Raw data with ID and Visual context -> original dataset with related ID caption train2014
  • Download Data with cosine score -> soft cosine label with th 0.2, 0.3, 0.4 and 0.5 and hard label [0,1]
  • Download Overlapping visual with caption -> overlap between the visual context and the human-annotated caption
  • Download Dataset (tsv file) 0.0 -> raw data with hard label without cosine similarity, and with a threshold on the cosine similarity (the degree of relation between the visual context and the caption) = 0.2, 0.3, 0.4
  • Download Dataset GenderBias -> man/woman replaced with person class label
  • For future work, we plan to extract the visual context from the caption (without using a visual classifier) and estimate the visual relatedness score by employing unsupervised learning (i.e. contrastive learning). (work in progress)
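The GenderBias variant listed above replaces man/woman with the person class label. A minimal sketch of such a replacement follows, assuming a simple whole-word substitution; the function name `neutralize` is hypothetical, and the procedure actually used to build the dataset may differ.

```python
import re

# Whole-word match only, so words such as "fireman" or "human" are left alone.
# The source mentions only man/woman, so no other words are handled here.
GENDERED = re.compile(r"\b(man|woman)\b", flags=re.IGNORECASE)

def neutralize(caption):
    """Replace the words man/woman with the neutral class label 'person'."""
    return GENDERED.sub("person", caption)

print(neutralize("a woman holding an umbrella next to a man"))
```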

  • Download CC -> Caption dataset from Conceptual Captions (CC) 2M (2255927 captions)
  • Download CC+wiki -> CC+1M-wiki 3M (3255928)
  • Download CC+wiki+COCO -> CC+wiki+COCO-Caption 3.5M (366984)
  • Download COCO-caption+wiki -> COCO-caption +wiki 1.4M (1413915)
  • Download COCO-caption+wiki+CC+8Mwiki -> COCO-caption+wiki+CC+8Mwiki 11M (11541667)
Citation

    The details of this repo are described in the following paper. If you find this repo useful, please kindly cite it:

    @article{sabir2023visual,
      title={Visual Semantic Relatedness Dataset for Image Captioning},
      author={Sabir, Ahmed and Moreno-Noguer, Francesc and Padr{\'o}, Llu{\'\i}s},
      journal={arXiv preprint arXiv:2301.08784},
      year={2023}
    }