Dataset:
AhmedSSabir/Textual-Image-Caption-Dataset
Language:
en
Update (June 2023): added both soft and hard labels to visual_caption_cosine_score (0.2, 0.3, 0.4, and 0.5)
Modern image captioning relies heavily on extracting knowledge from images, such as objects, to capture the concept of a static story in the image. In this paper, we propose a textual visual context dataset for captioning, in which the publicly available COCO Captions dataset (Lin et al., 2014) is extended with information about the scene (such as objects in the image). Since this information has textual form, it can be used to bring any NLP task, such as text similarity or semantic relation methods, into captioning systems, either as an end-to-end training strategy or as a post-processing approach.
Please refer to the project page and GitHub for more information.
For a quick start, please have a look at this demo and the pre-trained models with thresholds (th) 0.2, 0.3, and 0.4.
We enrich COCO-Caption with textual visual context information. We use ResNet152, CLIP, and Faster R-CNN to extract object information for each image. We use three filtering approaches to ensure the quality of the dataset: (1) a threshold, to filter out predictions where the object classifier is not confident enough; (2) semantic alignment with semantic similarity, to remove duplicated objects; and (3) a semantic relatedness score as a soft label, to guarantee that the visual context and the caption are strongly related. In particular, we use Sentence-RoBERTa-sts with cosine similarity to give a soft score, and then we apply a threshold to annotate the final hard label (1 if the score ≥ th, otherwise 0, for th = 0.2, 0.3, 0.4). Finally, to take advantage of the visual overlap between the caption and the visual context, and to extract global information, we use BERT followed by a shallow 1D-CNN (Kim, 2014) to estimate the visual relatedness score.
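The soft-to-hard labelling step can be illustrated with a minimal sketch. It assumes the hard label is 1 when the cosine score is at or above the threshold and 0 otherwise. The function names `cosine_similarity` and `hard_labels` are hypothetical, and the two toy vectors stand in for sentence embeddings (in the dataset these would come from Sentence-RoBERTa-sts); this is not the authors' code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated
    return dot / (norm_u * norm_v)


def hard_labels(soft_score, thresholds=(0.2, 0.3, 0.4)):
    """Map a soft score to one 0/1 label per threshold (assumed rule: score >= th)."""
    return {th: int(soft_score >= th) for th in thresholds}


# Toy vectors standing in for caption and visual-context embeddings (assumption).
caption_vec = [1.0, 0.0]
context_vec = [1.0, 3.0]

score = cosine_similarity(caption_vec, context_vec)  # soft label
labels = hard_labels(score)                          # hard label per threshold
print(round(score, 3), labels)
```

With these toy vectors the soft score falls between 0.3 and 0.4, so the pair counts as related at the two lower thresholds and unrelated at 0.4. A stricter threshold keeps fewer but more strongly related caption and visual context pairs.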
For future work, we plan to extract the visual context from the caption (without using a visual classifier) and to estimate the visual relatedness score with unsupervised learning (i.e. contrastive learning). This is work in progress.
The details of this repo are described in the following paper. If you find this repo useful, please kindly cite it:
@article{sabir2023visual,
  title={Visual Semantic Relatedness Dataset for Image Captioning},
  author={Sabir, Ahmed and Moreno-Noguer, Francesc and Padr{\'o}, Llu{\'\i}s},
  journal={arXiv preprint arXiv:2301.08784},
  year={2023}
}