Dataset Card for M3IT

Project Page: M3IT

Languages

English and Chinese. Versions translated into 80 languages are available at M3IT-80.

Dataset Statistics

Our dataset compiles a diverse set of classical vision-language tasks, including captioning, visual question answering (VQA), visually conditioned generation, reasoning, and classification.

Instruction Statistics

| Task | #Instructions |
|---|---|
| Image Captioning | 52 |
| Classification | 113 |
| Visual Question Answering | 95 |
| Knowledgeable Visual QA | 40 |
| Reasoning | 60 |
| Generation | 40 |
| Total | 400 |

Task Statistics

| Task | Description | #Train | #Val | #Test |
|---|---|---|---|---|
| Image Captioning | Given an image, write a description for the image. | 679,087 | 41,462 | 27,499 |
| Classification | Given an image, classify the image into pre-defined categories. | 238,303 | 100,069 | 21,206 |
| Visual Question Answering | Given an image, answer a question relevant to the image. | 177,633 | 46,314 | 10,828 |
| Knowledgeable Visual QA | Given an image, answer a question that requires outside knowledge. | 39,981 | 11,682 | 5,477 |
| Reasoning | Given an image, conduct reasoning over the image. | 99,372 | 11,500 | 10,000 |
| Generation | Given an image, make compositions with certain requirements. | 145,000 | 11,315 | 17,350 |
| Chinese | CAP, CLS, VQA, and GEN tasks in Chinese. | 192,076 | 77,306 | 4,100 |
| Video | CAP, CLS, and VQA tasks on video-language datasets. | 20,868 | 7,542 | 9,294 |
| Multi-lingual | Translated tasks in 80 languages. | 0 | 240,000 | 184,000 |

Detailed Dataset Statistics

| Task | Dataset | #Train | #Val | #Test |
|---|---|---|---|---|
| Image Captioning | coco | 566,747 | 25,010 | 25,010 |
| | textcap | 97,765 | 13,965 | 0 |
| | image-paragraph-captioning | 14,575 | 2,487 | 2,489 |
| Classification | coco-goi | 30,000 | 2,000 | 0 |
| | coco-text | 118,312 | 27,550 | 0 |
| | imagenet | 30,000 | 50,000 | 0 |
| | coco-itm | 30,000 | 5,000 | 5,000 |
| | snli-ve | 20,000 | 14,339 | 14,740 |
| | mocheg | 4,991 | 180 | 466 |
| | iqa | 5,000 | 1,000 | 1,000 |
| Visual Question Answering | vqa-v2 | 30,000 | 30,000 | 0 |
| | shapes | 13,568 | 1,024 | 1,024 |
| | docvqa | 39,463 | 5,349 | 0 |
| | ocr-vqa | 11,414 | 4,940 | 0 |
| | st-vqa | 26,074 | 0 | 4,070 |
| | text-vqa | 27,113 | 0 | 5,734 |
| | gqa | 30,001 | 5,001 | 0 |
| Knowledgeable Visual QA | okvqa | 9,009 | 5,046 | 0 |
| | a-okvqa | 17,056 | 1,145 | 0 |
| | science-qa | 12,726 | 4,241 | 4,241 |
| | viquae | 1,190 | 1,250 | 1,236 |
| Reasoning | clevr | 30,000 | 2,000 | 0 |
| | nlvr | 29,372 | 2,000 | 0 |
| | vcr | 25,000 | 5,000 | 5,000 |
| | visual-mrc | 15,000 | 2,500 | 5,000 |
| | winoground | 0 | 0 | 800 |
| Generation | vist | 5,000 | 4,315 | 4,350 |
| | visual-dialog | 50,000 | 1,000 | 1,000 |
| | multi30k | 90,000 | 6,000 | 12,000 |
| Chinese | fm-iqa | 164,735 | 75,206 | 0 |
| | coco-cn | 18,341 | 1,000 | 1,000 |
| | flickr8k-cn | 6,000 | 1,000 | 1,000 |
| | chinese-food | 0 | 0 | 1,100 |
| | mmchat | 3,000 | 1,000 | 1,000 |
| Video | ss | 2,000 | 2,000 | 2,000 |
| | ivqa | 5,994 | 2,000 | 2,000 |
| | msvd-qa | 1,161 | 245 | 504 |
| | activitynet-qa | 3,200 | 1,800 | 800 |
| | msrvtt | 6,513 | 497 | 2,990 |
| | msrvtt-qa | 2,000 | 1,000 | 1,000 |
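
The per-task totals in Task Statistics are sums over the per-dataset rows listed here. As a small consistency check, the sketch below adds up the three Image Captioning rows. The dictionary and function names are illustrative only and are not part of the dataset's API.

```python
# Per-dataset (train, val, test) counts copied from the Image Captioning rows above.
IMAGE_CAPTIONING_COUNTS = {
    "coco": (566_747, 25_010, 25_010),
    "textcap": (97_765, 13_965, 0),
    "image-paragraph-captioning": (14_575, 2_487, 2_489),
}


def sum_splits(counts):
    """Add up (train, val, test) tuples column by column."""
    train = sum(c[0] for c in counts.values())
    val = sum(c[1] for c in counts.values())
    test = sum(c[2] for c in counts.values())
    return train, val, test


captioning_totals = sum_splits(IMAGE_CAPTIONING_COUNTS)
print(captioning_totals)
```

The result agrees with the Image Captioning row of the Task Statistics table (679,087 / 41,462 / 27,499). The same helper can be pointed at the rows of any other task.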

Dataset Structure

HuggingFace Login (Optional)

# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)
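
To avoid hardcoding a token in source files, one option is to read it from an environment variable. The sketch below is a minimal helper for that; the variable name `HF_TOKEN` and the function `get_hf_token` are assumptions made for illustration, not something this dataset card prescribes.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None


# Example usage (requires huggingface_hub and network access):
#
#     from huggingface_hub import login
#     token = get_hf_token()
#     if token is not None:
#         login(token=token)
```

Since login is optional, the helper returns None rather than raising when no token is configured.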

Data Loading

from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)

Data Splits

from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
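
According to Detailed Dataset Statistics, several subsets have zero examples in one or more splits (for example, winoground only has test data). This card does not say whether such splits are exposed as empty splits or omitted entirely, so a defensive lookup may be safer than indexing all three names unconditionally. The helper below is a hypothetical sketch that works on any mapping from split name to a sized collection.

```python
def non_empty_splits(dataset, wanted=("train", "validation", "test")):
    """Return {split_name: split} for the wanted splits that exist and are non-empty."""
    return {
        name: dataset[name]
        for name in wanted
        if name in dataset and len(dataset[name]) > 0
    }


# Stand-in for a loaded dataset; plain lists play the role of splits.
toy_dataset = {"train": [], "test": ["example-0", "example-1"]}
splits = non_empty_splits(toy_dataset)
print(list(splits))
```

With a real loaded dataset, the call would be `non_empty_splits(dataset)`.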

Data Instances

from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64-encoded images)
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
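
The text fields read in the loop above are typically combined into a prompt and a target before training. This card does not specify a prompt template, so the sketch below assumes a simple one: the instruction, followed by the inputs on a new line when they are non-empty. The function names are hypothetical.

```python
def build_prompt(instruction, inputs):
    """Join the instruction and the optional inputs text into one prompt string."""
    parts = [instruction.strip()]
    if inputs and inputs.strip():
        parts.append(inputs.strip())
    return "\n".join(parts)


def to_training_pair(instance):
    """Map one dataset instance to a (prompt, target) pair of strings."""
    prompt = build_prompt(instance["instruction"], instance["inputs"])
    return prompt, instance["outputs"].strip()


# Hypothetical instance with the same text fields as the loop above.
example = {
    "instruction": "Write a short description of the image.",
    "inputs": "",
    "outputs": "A dog runs across a field. ",
}
prompt, target = to_training_pair(example)
print(repr(prompt), repr(target))
```

Keeping the template in one function makes it easy to swap in whatever format a given model expects.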

Data Fields

import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
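
A lightweight way to catch malformed records before training is to check each instance against the field types listed above. The following is a sketch using only the standard library; `validate_instance` is an illustrative helper, not part of the `datasets` API.

```python
EXPECTED_STRING_FIELDS = ("instruction", "inputs", "outputs")


def validate_instance(instance):
    """Return a list of problems found in one instance; an empty list means it is valid."""
    problems = []
    for field in EXPECTED_STRING_FIELDS:
        if not isinstance(instance.get(field), str):
            problems.append(f"{field} must be a string")
    images = instance.get("image_base64_str")
    if not isinstance(images, list) or not all(isinstance(s, str) for s in images):
        problems.append("image_base64_str must be a list of strings")
    return problems


good = {"instruction": "Describe the image.", "inputs": "", "outputs": "A cat.",
        "image_base64_str": ["aGVsbG8="]}
bad = {"instruction": "Describe the image.", "inputs": "", "outputs": "A cat.",
       "image_base64_str": "aGVsbG8="}  # a bare string instead of a list
print(validate_instance(good))
print(validate_instance(bad))
```

The check covers types only; it does not verify that the base64 strings decode to valid images.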

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

| Task | Dataset | Citation | Source |
|---|---|---|---|
| Image Captioning | coco | [1] | Source |
| | textcap | [2] | Source |
| | image-paragraph-captioning | [3] | Source |
| Classification | coco-goi | [1] | Source |
| | coco-text | [4] | Source |
| | imagenet | [5] | Source |
| | coco-itm | [1] | Source |
| | snli-ve | [6] | Source |
| | mocheg | [7] | Source |
| | iqa | [8] | Source |
| Visual Question Answering | vqa-v2 | [9] | Source |
| | shapes | [10] | Source |
| | docvqa | [11] | Source |
| | ocr-vqa | [12] | Source |
| | st-vqa | [13] | Source |
| | text-vqa | [14] | Source |
| | gqa | [15] | Source |
| Knowledgeable Visual QA | okvqa | [16] | Source |
| | a-okvqa | [17] | Source |
| | science-qa | [18] | Source |
| | viquae | [19] | Source |
| Reasoning | clevr | [20] | Source |
| | nlvr | [21] | Source |
| | vcr | [22] | Source |
| | visual-mrc | [23] | Source |
| | winoground | [24] | Source |
| Generation | vist | [25] | Source |
| | visual-dialog | [26] | Source |
| | multi30k | [27] | Source |
| Chinese | fm-iqa | [28] | Source |
| | coco-cn | [29] | Source |
| | flickr8k-cn | [30] | Source |
| | chinese-food | [31] | Source |
| | mmchat | [32] | Source |
| Video | ss | [33] | Source |
| | ivqa | [34] | Source |
| | msvd-qa | [35] | Source |
| | activitynet-qa | [36] | Source |
| | msrvtt | [37] | Source |
| | msrvtt-qa | [35] | Source |

Annotations

Annotation process

To build high-quality multimodal instruction datasets, we rewrite various datasets into a multimodal-to-text dialog format. The annotation process consists of four stages:

  • (1) Stage I: Instruction Writing: writing instructions for each task;
  • (2) Stage II: Data Format Unification: structuring images and texts into a unified schema;
  • (3) Stage III: Quality Check: checking the overall dataset quality;
  • (4) Stage IV: Key Datasets Translation: building multilingual sets.

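
To make Stage II concrete, the sketch below packs one VQA-style record into the four-field schema described under Data Fields. It is an illustration under stated assumptions, not the authors' actual pipeline: the mapping of question to `inputs` and answer to `outputs`, and the function name `unify_vqa_record`, are hypothetical.

```python
from base64 import b64decode, b64encode


def unify_vqa_record(instruction, question, answer, image_bytes_list):
    """Pack one raw VQA-style record into the four-field schema of this card."""
    return {
        "instruction": instruction,
        "inputs": question,
        "outputs": answer,
        # Each image is stored as a base64-encoded ASCII string.
        "image_base64_str": [b64encode(b).decode("ascii") for b in image_bytes_list],
    }


# Placeholder bytes stand in for real image file contents.
record = unify_vqa_record(
    instruction="Answer the question about the image.",
    question="What is on the table?",
    answer="A laptop.",
    image_bytes_list=[b"not-a-real-image"],
)
# Decoding recovers the original bytes, mirroring the loading example.
assert b64decode(record["image_base64_str"][0]) == b"not-a-real-image"
print(sorted(record))
```

Storing images as a list leaves room for multi-image and video tasks, where one instance carries several frames.
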
Who are the annotators?

Eight authors of this work served as human annotators; each is a graduate student familiar with the relevant literature.

Additional Information

Licensing Information

The content of each original dataset follows its original license. For tasks with an Unknown/Custom license, we suggest that users check the original project or contact the dataset owner for detailed license information.

Our annotated instruction data is licensed under CC BY 4.0.

Citation Information

@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}

Contributions

M3IT is an open-source, large-scale Multi-modal, Multilingual Instruction Tuning dataset, designed to enable the development of general-purpose multi-modal agents.

References

  • [1] Microsoft COCO: Common Objects in Context
  • [2] TextCaps: a dataset for image captioning with reading comprehension
  • [3] A Hierarchical Approach for Generating Descriptive Image Paragraphs
  • [4] COCO-Text: Dataset and benchmark for text detection and recognition in natural images
  • [5] ImageNet Large Scale Visual Recognition Challenge
  • [6] E-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
  • [7] End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models
  • [8] Quantifying visual image quality: A Bayesian view
  • [9] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
  • [10] Neural Module Networks
  • [11] DocVQA: A Dataset for VQA on Document Images
  • [12] OCR-VQA: Visual Question Answering by Reading Text in Images
  • [13] Scene Text Visual Question Answering
  • [14] Towards VQA Models That Can Read
  • [15] GQA: A new dataset for real-world visual reasoning and compositional question answering
  • [16] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
  • [17] A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
  • [18] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
  • [19] ViQuAE: a dataset for knowledge-based visual question answering about named entities
  • [20] CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning
  • [21] A Corpus of Natural Language for Visual Reasoning
  • [22] From recognition to cognition: Visual Commonsense Reasoning
  • [23] VisualMRC: Machine reading comprehension on document images
  • [24] Winoground: Probing vision and language models for visio-linguistic compositionality
  • [25] Visual Storytelling
  • [26] Visual Dialog
  • [27] Multi30k: Multilingual english-german image descriptions
  • [28] Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering
  • [29] COCO-CN for cross-lingual image tagging, captioning, and retrieval
  • [30] Adding Chinese Captions to Images
  • [31] ChineseFoodNet: A large-scale image dataset for chinese food recognition
  • [32] MMChat: Multi-Modal Chat Dataset on Social Media
  • [33] The "Something Something" Video Database for Learning and Evaluating Visual Common Sense
  • [34] Just Ask: Learning to answer questions from millions of narrated videos
  • [35] Video Question Answering via Gradually Refined Attention over Appearance and Motion
  • [36] ActivityNet-QA: A dataset for understanding complex web videos via question answering
  • [37] MSR-VTT: A large video description dataset for bridging video and language