数据集:
visual_genome
子任务:
image-captioning语言:
en计算机处理:
monolingual大小:
100K<n<1M语言创建人:
found批注创建人:
found源数据集:
original预印本库:
arxiv:1602.07332许可:
cc-by-4.0Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language.
From the paper:
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked “What vehicle is the person riding?”, computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that “the person is riding a horse-drawn carriage.”
Visual Genome has:
From the paper:
Our dataset contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and questions answer pairs to WordNet synsets.
All of annotations use English as primary language.
When loading a specific configuration, users has to append a version dependent suffix:
from datasets import load_dataset load_dataset("visual_genome", "region_description_v1.2.0")region_descriptions
An example of looks as follows.
{ "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x7F2F60698610>, "image_id": 1, "url": "https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg", "width": 800, "height": 600, "coco_id": null, "flickr_id": null, "regions": [ { "region_id": 1382, "image_id": 1, "phrase": "the clock is green in colour", "x": 421, "y": 57, "width": 82, "height": 139 }, ... ] }objects
An example of looks as follows.
{ "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x7F2F60698610>, "image_id": 1, "url": "https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg", "width": 800, "height": 600, "coco_id": null, "flickr_id": null, "objects": [ { "object_id": 1058498, "x": 421, "y": 91, "w": 79, "h": 339, "names": [ "clock" ], "synsets": [ "clock.n.01" ] }, ... ] }attributes
An example of looks as follows.
{ "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x7F2F60698610>, "image_id": 1, "url": "https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg", "width": 800, "height": 600, "coco_id": null, "flickr_id": null, "attributes": [ { "object_id": 1058498, "x": 421, "y": 91, "w": 79, "h": 339, "names": [ "clock" ], "synsets": [ "clock.n.01" ], "attributes": [ "green", "tall" ] }, ... } ]relationships
An example of looks as follows.
{ "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x7F2F60698610>, "image_id": 1, "url": "https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg", "width": 800, "height": 600, "coco_id": null, "flickr_id": null, "relationships": [ { "relationship_id": 15927, "predicate": "ON", "synsets": "['along.r.01']", "subject": { "object_id": 5045, "x": 119, "y": 338, "w": 274, "h": 192, "names": [ "shade" ], "synsets": [ "shade.n.01" ] }, "object": { "object_id": 5046, "x": 77, "y": 328, "w": 714, "h": 262, "names": [ "street" ], "synsets": [ "street.n.01" ] } } ... } ]question_answers
An example of looks as follows.
{ "image": <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x7F2F60698610>, "image_id": 1, "url": "https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg", "width": 800, "height": 600, "coco_id": null, "flickr_id": null, "qas": [ { "qa_id": 986768, "image_id": 1, "question": "What color is the clock?", "answer": "Green.", "a_objects": [], "q_objects": [] }, ... } ]
When loading a specific configuration, users has to append a version dependent suffix:
from datasets import load_dataset load_dataset("visual_genome", "region_description_v1.2.0")region_descriptions
All the data is contained in training set.
From the paper:
We used Amazon Mechanical Turk (AMT) as our primary source of annotations. Overall, a total of over 33, 000 unique workers contributed to the dataset. The dataset was collected over the course of 6 months after 15 months of experimentation and iteration on the data representation. Approximately 800, 000 Human Intelligence Tasks (HITs) were launched on AMT, where each HIT involved creating descriptions, questions and answers, or region graphs. Each HIT was designed such that workers manage to earn anywhere between $6-$8 per hour if they work continuously, in line with ethical research standards on Mechanical Turk (Salehi et al., 2015). Visual Genome HITs achieved a 94.1% retention rate, meaning that 94.1% of workers who completed one of our tasks went ahead to do more. [...] 93.02% of workers contributed from the United States. The majority of our workers were between the ages of 25 and 34 years old. Our youngest contributor was 18 years and the oldest was 68 years old. We also had a near-balanced split of 54.15% male and 45.85% female workers.
Visual Genome by Ranjay Krishna is licensed under a Creative Commons Attribution 4.0 International License.
@article{Krishna2016VisualGC, title={Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations}, author={Ranjay Krishna and Yuke Zhu and Oliver Groth and Justin Johnson and Kenji Hata and Joshua Kravitz and Stephanie Chen and Yannis Kalantidis and Li-Jia Li and David A. Shamma and Michael S. Bernstein and Li Fei-Fei}, journal={International Journal of Computer Vision}, year={2017}, volume={123}, pages={32-73}, url={https://doi.org/10.1007/s11263-016-0981-7}, doi={10.1007/s11263-016-0981-7} }
Due to limitation of the dummy_data creation, we provide a fix_generated_dummy_data.py script that fix the dataset in-place.
Thanks to @thomasw21 for adding this dataset.