Dataset:

visual_genome

Tasks:

Sub-tasks:

image-captioning

Languages:

Multilinguality:

Size:

Language creators:

Annotation creators:

Source datasets:

Preprint:

arxiv:1602.07332

License:

Dataset Card for Visual Genome

Dataset Summary

Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language.

From the paper:

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked “What vehicle is the person riding?”, computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that “the person is riding a horse-drawn carriage.”

Visual Genome has:

108,077 Images
5.4 Million Region Descriptions
1.7 Million Visual Question Answers
3.8 Million Object Instances
2.8 Million Attributes
2.3 Million Relationships

From the paper:

Our dataset contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and questions answer pairs to WordNet synsets.
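
As a rough sanity check, the per-image averages quoted above can be recovered from the rounded totals listed earlier. The sketch below is plain arithmetic on the numbers stated in this card, not an official statistic; the variable and function names are illustrative.

```python
# Rounded totals as listed in this card (not exact counts).
NUM_IMAGES = 108_077
TOTALS = {
    "objects": 3.8e6,
    "attributes": 2.8e6,
    "relationships": 2.3e6,
}


def per_image_averages(totals, num_images):
    """Return the mean number of annotations per image for each annotation type."""
    return {name: count / num_images for name, count in totals.items()}


averages = per_image_averages(TOTALS, NUM_IMAGES)
for name, value in averages.items():
    print(f"{name}: {value:.1f} per image")
```

Rounded to whole numbers, the results agree with the 35 objects, 26 attributes, and 21 relationships per image quoted from the paper.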

Dataset Preprocessing

Supported Tasks and Leaderboards

Languages

All annotations use English as the primary language.

Dataset Structure

Data Instances

When loading a specific configuration, users have to append a version-dependent suffix:

from datasets import load_dataset
load_dataset("visual_genome", "region_descriptions_v1.2.0")

region_descriptions

An example looks as follows.

{
  "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x7F2F60698610>,
  "image_id": 1,
  "url": "https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg",
  "width": 800,
  "height": 600,
  "coco_id": null,
  "flickr_id": null,
  "regions": [
    {
      "region_id": 1382,
      "image_id": 1,
      "phrase": "the clock is green in colour",
      "x": 421,
      "y": 57,
      "width": 82,
      "height": 139
    },
    ...
  ]
}
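
A region is stored as a top-left corner plus a width and height. The sketch below converts that into a (left, upper, right, lower) box, the 4-tuple form that PIL's Image.crop accepts. The helper name region_to_crop_box and the clamping to the image bounds are assumptions made for illustration, not part of the dataset loader.

```python
def region_to_crop_box(region, image_width, image_height):
    """Convert a region's (x, y, width, height) into (left, upper, right, lower),
    clamped to the image bounds."""
    left = max(0, region["x"])
    upper = max(0, region["y"])
    right = min(image_width, region["x"] + region["width"])
    lower = min(image_height, region["y"] + region["height"])
    return (left, upper, right, lower)


# Values copied from the example record above.
region = {
    "region_id": 1382,
    "image_id": 1,
    "phrase": "the clock is green in colour",
    "x": 421,
    "y": 57,
    "width": 82,
    "height": 139,
}

box = region_to_crop_box(region, image_width=800, image_height=600)
print(box)  # (421, 57, 503, 196)
```

With a decoded record in hand, passing this box to the image's crop method should return the pixels the phrase describes.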

objects

An example looks as follows.

{
  "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x7F2F60698610>,
  "image_id": 1,
  "url": "https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg",
  "width": 800,
  "height": 600,
  "coco_id": null,
  "flickr_id": null,
  "objects": [
    {
      "object_id": 1058498,
      "x": 421,
      "y": 91,
      "w": 79,
      "h": 339,
      "names": [
        "clock"
      ],
      "synsets": [
        "clock.n.01"
      ]
    },
    ...
  ]
}

attributes

An example looks as follows.

{
  "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x7F2F60698610>,
  "image_id": 1,
  "url": "https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg",
  "width": 800,
  "height": 600,
  "coco_id": null,
  "flickr_id": null,
  "attributes": [
    {
      "object_id": 1058498,
      "x": 421,
      "y": 91,
      "w": 79,
      "h": 339,
      "names": [
        "clock"
      ],
      "synsets": [
        "clock.n.01"
      ],
      "attributes": [
        "green",
        "tall"
      ]
    },
    ...
  ]
}

relationships

An example looks as follows.

{
  "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x7F2F60698610>,
  "image_id": 1,
  "url": "https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg",
  "width": 800,
  "height": 600,
  "coco_id": null,
  "flickr_id": null,
  "relationships": [
    {
      "relationship_id": 15927,
      "predicate": "ON",
      "synsets": "['along.r.01']",
      "subject": {
        "object_id": 5045,
        "x": 119,
        "y": 338,
        "w": 274,
        "h": 192,
        "names": [
          "shade"
        ],
        "synsets": [
          "shade.n.01"
        ]
      },
      "object": {
        "object_id": 5046,
        "x": 77,
        "y": 328,
        "w": 714,
        "h": 262,
        "names": [
          "street"
        ],
        "synsets": [
          "street.n.01"
        ]
      }
    },
    ...
  ]
}

question_answers

An example looks as follows.

{
  "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x7F2F60698610>,
  "image_id": 1,
  "url": "https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg",
  "width": 800,
  "height": 600,
  "coco_id": null,
  "flickr_id": null,
  "qas": [
    {
      "qa_id": 986768,
      "image_id": 1,
      "question": "What color is the clock?",
      "answer": "Green.",
      "a_objects": [],
      "q_objects": []
    },
    ...
  ]
}

Data Fields

region_descriptions

image : A PIL.Image.Image object containing the image. Note that when accessing the image column, e.g. dataset[0]["image"], the image file is automatically decoded. Decoding a large number of image files can take a significant amount of time, so it is important to query the sample index before the "image" column: dataset[0]["image"] should always be preferred over dataset["image"][0].
image_id : Unique numeric ID of the image.
url : URL of the source image.
width : Image width.
height : Image height.
coco_id : ID mapping to MSCOCO indexing.
flickr_id : ID mapping to Flickr indexing.
regions : Holds a list of Region dataclasses:
- region_id : Unique numeric ID of the region.
- image_id : Unique numeric ID of the image.
- phrase : Natural-language description of the region.
- x : x coordinate of the bounding box's top left corner.
- y : y coordinate of the bounding box's top left corner.
- width : Bounding box width.
- height : Bounding box height.
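
The note on the image field says to index the row before the column. The toy class below is a stand-in written for this illustration only (it is not the datasets API and does not decode real images); it just counts how many decode operations each access order would trigger under the behaviour described above.

```python
class ToyImageDataset:
    """Toy stand-in (NOT the `datasets` API): counts how many images get decoded."""

    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.decode_calls = 0

    def _decode(self, idx):
        # Pretend this is an expensive image decode.
        self.decode_calls += 1
        return f"<decoded image {idx}>"

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only this row's image.
            return {"image": self._decode(key), "image_id": key}
        if key == "image":
            # Column access: decode every image in the column.
            return [self._decode(i) for i in range(self.num_rows)]
        raise KeyError(key)


row_first = ToyImageDataset(num_rows=1000)
_ = row_first[0]["image"]        # index the sample, then the column

column_first = ToyImageDataset(num_rows=1000)
_ = column_first["image"][0]     # index the column, then the sample

print(row_first.decode_calls, column_first.decode_calls)  # 1 1000
```

Both expressions yield the same first image, but the column-first form pays for every row before discarding all but one.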

objects

image : A PIL.Image.Image object containing the image. Note that when accessing the image column, e.g. dataset[0]["image"], the image file is automatically decoded. Decoding a large number of image files can take a significant amount of time, so it is important to query the sample index before the "image" column: dataset[0]["image"] should always be preferred over dataset["image"][0].
image_id : Unique numeric ID of the image.
url : URL of the source image.
width : Image width.
height : Image height.
coco_id : ID mapping to MSCOCO indexing.
flickr_id : ID mapping to Flickr indexing.
objects : Holds a list of Object dataclasses:
- object_id : Unique numeric ID of the object.
- x : x coordinate of the bounding box's top left corner.
- y : y coordinate of the bounding box's top left corner.
- w : Bounding box width.
- h : Bounding box height.
- names : List of names associated with the object. This field can hold multiple values, since several names may be considered acceptable for the same object. For example: ['monitor', 'computer'] at https://cs.stanford.edu/people/rak248/VG_100K/3.jpg
- synsets : List of WordNet synsets.

attributes

image : A PIL.Image.Image object containing the image. Note that when accessing the image column, e.g. dataset[0]["image"], the image file is automatically decoded. Decoding a large number of image files can take a significant amount of time, so it is important to query the sample index before the "image" column: dataset[0]["image"] should always be preferred over dataset["image"][0].
image_id : Unique numeric ID of the image.
url : URL of the source image.
width : Image width.
height : Image height.
coco_id : ID mapping to MSCOCO indexing.
flickr_id : ID mapping to Flickr indexing.
attributes : Holds a list of Object dataclasses:
- object_id : Unique numeric ID of the object.
- x : x coordinate of the bounding box's top left corner.
- y : y coordinate of the bounding box's top left corner.
- w : Bounding box width.
- h : Bounding box height.
- names : List of names associated with the object. This field can hold multiple values, since several names may be considered acceptable for the same object. For example: ['monitor', 'computer'] at https://cs.stanford.edu/people/rak248/VG_100K/3.jpg
- synsets : List of WordNet synsets.
- attributes : List of attributes associated with the object.
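
Because both names and attributes are lists, one entry can pair several names with several attributes. A minimal sketch of collecting the attributes seen for each name follows; the helper attributes_by_name is hypothetical, and treating a missing or null attributes field as empty is an assumption rather than documented behaviour.

```python
def attributes_by_name(entries):
    """Collect the set of attributes seen for each object name."""
    collected = {}
    for entry in entries:
        # Assumption: treat a missing or null "attributes" field as empty.
        attrs = entry.get("attributes") or []
        for name in entry["names"]:
            collected.setdefault(name, set()).update(attrs)
    return collected


# Entry copied from the attributes example earlier in this card.
entries = [
    {
        "object_id": 1058498,
        "x": 421,
        "y": 91,
        "w": 79,
        "h": 339,
        "names": ["clock"],
        "synsets": ["clock.n.01"],
        "attributes": ["green", "tall"],
    },
]

summary = attributes_by_name(entries)
print(sorted(summary["clock"]))  # ['green', 'tall']
```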

relationships

image : A PIL.Image.Image object containing the image. Note that when accessing the image column, e.g. dataset[0]["image"], the image file is automatically decoded. Decoding a large number of image files can take a significant amount of time, so it is important to query the sample index before the "image" column: dataset[0]["image"] should always be preferred over dataset["image"][0].
image_id : Unique numeric ID of the image.
url : URL of the source image.
width : Image width.
height : Image height.
coco_id : ID mapping to MSCOCO indexing.
flickr_id : ID mapping to Flickr indexing.
relationships : Holds a list of Relationship dataclasses:
- relationship_id : Unique numeric ID of the relationship.
- predicate : Predicate defining the relationship between a subject and an object.
- synsets : List of WordNet synsets.
- subject : Object dataclass. See the subsection on objects.
- object : Object dataclass. See the subsection on objects.
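
The paper excerpt in the summary writes relationships as predicate(subject, object), for example riding(man, carriage). A minimal sketch of rendering a relationship record in that notation is shown below. Using the first entry of names and lowercasing the predicate are choices made for this illustration, and relationship_to_triple is a hypothetical helper name.

```python
def relationship_to_triple(rel):
    """Render a relationship record as 'predicate(subject, object)'."""
    # Illustrative choice: use the first listed name of each object.
    subject_name = rel["subject"]["names"][0]
    object_name = rel["object"]["names"][0]
    return f'{rel["predicate"].lower()}({subject_name}, {object_name})'


# Record copied from the relationships example earlier in this card.
rel = {
    "relationship_id": 15927,
    "predicate": "ON",
    "synsets": "['along.r.01']",
    "subject": {
        "object_id": 5045, "x": 119, "y": 338, "w": 274, "h": 192,
        "names": ["shade"], "synsets": ["shade.n.01"],
    },
    "object": {
        "object_id": 5046, "x": 77, "y": 328, "w": 714, "h": 262,
        "names": ["street"], "synsets": ["street.n.01"],
    },
}

triple = relationship_to_triple(rel)
print(triple)  # on(shade, street)
```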

question_answers

image : A PIL.Image.Image object containing the image. Note that when accessing the image column, e.g. dataset[0]["image"], the image file is automatically decoded. Decoding a large number of image files can take a significant amount of time, so it is important to query the sample index before the "image" column: dataset[0]["image"] should always be preferred over dataset["image"][0].
image_id : Unique numeric ID of the image.
url : URL of the source image.
width : Image width.
height : Image height.
coco_id : ID mapping to MSCOCO indexing.
flickr_id : ID mapping to Flickr indexing.
qas : Holds a list of Question-Answering dataclasses:
- qa_id : Unique numeric ID of the question-answer pair.
- image_id : Unique numeric ID of the image.
- question : Question.
- answer : Answer.
- q_objects : List of Object dataclasses associated with the question field. See the subsection on objects.
- a_objects : List of Object dataclasses associated with the answer field. See the subsection on objects.
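
Each record nests its question-answer pairs under qas, so a flat list of pairs takes one extra loop. The sketch below flattens a record and applies a simple answer normalization (lowercasing and dropping a trailing period). That normalization is an optional, illustrative preprocessing step, not something the dataset prescribes, and both helper names are hypothetical.

```python
def normalize_answer(answer):
    """Lowercase and strip whitespace and a trailing period (illustrative choice)."""
    return answer.strip().rstrip(".").lower()


def iter_qa_pairs(example):
    """Yield (image_id, question, normalized answer) for each entry in 'qas'."""
    for qa in example["qas"]:
        yield qa["image_id"], qa["question"], normalize_answer(qa["answer"])


# Record trimmed from the question_answers example earlier in this card.
example = {
    "image_id": 1,
    "qas": [
        {
            "qa_id": 986768,
            "image_id": 1,
            "question": "What color is the clock?",
            "answer": "Green.",
            "a_objects": [],
            "q_objects": [],
        },
    ],
}

pairs = list(iter_qa_pairs(example))
print(pairs)  # [(1, 'What color is the clock?', 'green')]
```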

Data Splits

All the data is contained in the training set.

Dataset Creation

Curation Rationale

Source Data

Initial Data Collection and Normalization

Who are the source language producers?

Annotations

Annotation process

Who are the annotators?

From the paper:

We used Amazon Mechanical Turk (AMT) as our primary source of annotations. Overall, a total of over 33,000 unique workers contributed to the dataset. The dataset was collected over the course of 6 months after 15 months of experimentation and iteration on the data representation. Approximately 800,000 Human Intelligence Tasks (HITs) were launched on AMT, where each HIT involved creating descriptions, questions and answers, or region graphs. Each HIT was designed such that workers manage to earn anywhere between $6-$8 per hour if they work continuously, in line with ethical research standards on Mechanical Turk (Salehi et al., 2015). Visual Genome HITs achieved a 94.1% retention rate, meaning that 94.1% of workers who completed one of our tasks went ahead to do more. [...] 93.02% of workers contributed from the United States. The majority of our workers were between the ages of 25 and 34 years old. Our youngest contributor was 18 years and the oldest was 68 years old. We also had a near-balanced split of 54.15% male and 45.85% female workers.

Personal and Sensitive Information

Considerations for Using the Data

Social Impact of Dataset

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

Licensing Information

Visual Genome by Ranjay Krishna is licensed under a Creative Commons Attribution 4.0 International License.

Citation Information

@article{Krishna2016VisualGC,
  title={Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations},
  author={Ranjay Krishna and Yuke Zhu and Oliver Groth and Justin Johnson and Kenji Hata and Joshua Kravitz and Stephanie Chen and Yannis Kalantidis and Li-Jia Li and David A. Shamma and Michael S. Bernstein and Li Fei-Fei},
  journal={International Journal of Computer Vision},
  year={2017},
  volume={123},
  pages={32-73},
  url={https://doi.org/10.1007/s11263-016-0981-7},
  doi={10.1007/s11263-016-0981-7}
}

Contributions

Due to a limitation of the dummy_data creation, we provide a fix_generated_dummy_data.py script that fixes the dataset in place.

Thanks to @thomasw21 for adding this dataset.

Author:

Anonymous

Dataset size:

39.67 KB