Project Page: M3IT
The dataset covers English and Chinese. An 80-language translated version can be found at M3IT-80.
Our dataset compiles diverse classical vision-language tasks, including captioning, visual question answering (VQA), visually-conditioned generation, reasoning, and classification.
| Task | #Instructions |
|---|---|
| Image Captioning | 52 |
| Classification | 113 |
| Visual Question Answering | 95 |
| Knowledgeable Visual QA | 40 |
| Reasoning | 60 |
| Generation | 40 |
| Total | 400 |
| Task | Description | #Train | #Val | #Test |
|---|---|---|---|---|
| Image Captioning | Given an image, write a description for the image. | 679,087 | 41,462 | 27,499 |
| Classification | Given an image, classify the image into pre-defined categories. | 238,303 | 100,069 | 21,206 |
| Visual Question Answering | Given an image, answer a question relevant to the image. | 177,633 | 46,314 | 10,828 |
| Knowledgeable Visual QA | Given an image, answer a question that requires outside knowledge. | 39,981 | 11,682 | 5,477 |
| Reasoning | Given an image, conduct reasoning over the image. | 99,372 | 11,500 | 10,000 |
| Generation | Given an image, write a composition meeting specified requirements. | 145,000 | 11,315 | 17,350 |
| Chinese | CAP, CLS, VQA, and GEN tasks in Chinese. | 192,076 | 77,306 | 4,100 |
| Video | CAP, CLS, and VQA tasks on video-language datasets. | 20,868 | 7,542 | 9,294 |
| Multi-lingual | Tasks translated into 80 languages. | 0 | 240,000 | 184,000 |
| Task | Dataset | #Train | #Val | #Test |
|---|---|---|---|---|
| Image Captioning | coco | 566,747 | 25,010 | 25,010 |
| | textcap | 97,765 | 13,965 | 0 |
| | image-paragraph-captioning | 14,575 | 2,487 | 2,489 |
| Classification | coco-goi | 30,000 | 2,000 | 0 |
| | coco-text | 118,312 | 27,550 | 0 |
| | imagenet | 30,000 | 50,000 | 0 |
| | coco-itm | 30,000 | 5,000 | 5,000 |
| | snli-ve | 20,000 | 14,339 | 14,740 |
| | mocheg | 4,991 | 180 | 466 |
| | iqa | 5,000 | 1,000 | 1,000 |
| Visual Question Answering | vqa-v2 | 30,000 | 30,000 | 0 |
| | shapes | 13,568 | 1,024 | 1,024 |
| | docvqa | 39,463 | 5,349 | 0 |
| | ocr-vqa | 11,414 | 4,940 | 0 |
| | st-vqa | 26,074 | 0 | 4,070 |
| | text-vqa | 27,113 | 0 | 5,734 |
| | gqa | 30,001 | 5,001 | 0 |
| Knowledgeable Visual QA | okvqa | 9,009 | 5,046 | 0 |
| | a-okvqa | 17,056 | 1,145 | 0 |
| | science-qa | 12,726 | 4,241 | 4,241 |
| | viquae | 1,190 | 1,250 | 1,236 |
| Reasoning | clevr | 30,000 | 2,000 | 0 |
| | nlvr | 29,372 | 2,000 | 0 |
| | vcr | 25,000 | 5,000 | 5,000 |
| | visual-mrc | 15,000 | 2,500 | 5,000 |
| | winoground | 0 | 0 | 800 |
| Generation | vist | 5,000 | 4,315 | 4,350 |
| | visual-dialog | 50,000 | 1,000 | 1,000 |
| | multi30k | 90,000 | 6,000 | 12,000 |
| Chinese | fm-iqa | 164,735 | 75,206 | 0 |
| | coco-cn | 18,341 | 1,000 | 1,000 |
| | flickr8k-cn | 6,000 | 1,000 | 1,000 |
| | chinese-food | 0 | 0 | 1,100 |
| | mmchat | 3,000 | 1,000 | 1,000 |
| Video | ss | 2,000 | 2,000 | 2,000 |
| | ivqa | 5,994 | 2,000 | 2,000 |
| | msvd-qa | 1,161 | 245 | 504 |
| | activitynet-qa | 3,200 | 1,800 | 800 |
| | msrvtt | 6,513 | 497 | 2,990 |
| | msrvtt-qa | 2,000 | 1,000 | 1,000 |
```python
# OR run `huggingface-cli login` in a terminal
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)
```
```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
```
```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
```
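Note that some configs ship empty splits (e.g., per the statistics table above, winoground has no training or validation data), so indexing all three splits unconditionally may not work for every dataset. The following is a minimal sketch that inspects the available splits first, assuming split availability mirrors the statistics table:

```python
from datasets import load_dataset

ds_name = "winoground"  # per the table above, this config only has test data
dataset = load_dataset("MMInstruction/M3IT", ds_name)

# Iterate over whatever splits this config actually provides.
for split_name, split in dataset.items():
    print(f"{split_name}: {len(split)} instances")
```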
```python
from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64)
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
```
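Because every config shares the same schema (see the features below), multiple sub-datasets can be combined into a single training mixture. A minimal sketch, assuming the chosen configs all provide a train split; the config selection and use of `datasets.concatenate_datasets` here are illustrative, not part of the M3IT API:

```python
from datasets import load_dataset, concatenate_datasets

# Illustrative mixture of the captioning configs; per the statistics
# table above, all three provide a train split.
caption_configs = ["coco", "textcap", "image-paragraph-captioning"]
train_sets = [
    load_dataset("MMInstruction/M3IT", name, split="train")
    for name in caption_configs
]
caption_train = concatenate_datasets(train_sets)
print(len(caption_train))
```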
```python
import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
```
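To make the schema concrete, the hypothetical `build_example` helper below assembles one instance into a (prompt, images, target) triple; the prompt template is an assumption for illustration, not a format prescribed by M3IT:

```python
from base64 import b64decode
from io import BytesIO

from PIL import Image


def build_example(instance):
    """Turn one M3IT instance into (prompt, images, target).

    The prompt template here is illustrative; adapt it to your model.
    """
    prompt = instance["instruction"]
    if instance["inputs"]:  # "inputs" may be an empty string for some tasks
        prompt = f"{prompt}\n{instance['inputs']}"
    images = [
        Image.open(BytesIO(b64decode(b64_str)))
        for b64_str in instance["image_base64_str"]
    ]
    return prompt, images, instance["outputs"]
```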
[More Information Needed]
| Task | Dataset [Citation] | Source |
|---|---|---|
| Image Captioning | coco [1] | Source |
| | textcap [2] | Source |
| | image-paragraph-captioning [3] | Source |
| Classification | coco-goi [1] | Source |
| | coco-text [4] | Source |
| | imagenet [5] | Source |
| | coco-itm [1] | Source |
| | snli-ve [6] | Source |
| | mocheg [7] | Source |
| | iqa [8] | Source |
| Visual Question Answering | vqa-v2 [9] | Source |
| | shapes [10] | Source |
| | docvqa [11] | Source |
| | ocr-vqa [12] | Source |
| | st-vqa [13] | Source |
| | text-vqa [14] | Source |
| | gqa [15] | Source |
| Knowledgeable Visual QA | okvqa [16] | Source |
| | a-okvqa [17] | Source |
| | science-qa [18] | Source |
| | viquae [19] | Source |
| Reasoning | clevr [20] | Source |
| | nlvr [21] | Source |
| | vcr [22] | Source |
| | visual-mrc [23] | Source |
| | winoground [24] | Source |
| Generation | vist [25] | Source |
| | visual-dialog [26] | Source |
| | multi30k [27] | Source |
| Chinese | fm-iqa [28] | Source |
| | coco-cn [29] | Source |
| | flickr8k-cn [30] | Source |
| | chinese-food [31] | Source |
| | mmchat [32] | Source |
| Video | ss [33] | Source |
| | ivqa [34] | Source |
| | msvd-qa [35] | Source |
| | activitynet-qa [36] | Source |
| | msrvtt [35] | Source |
| | msrvtt-qa [37] | Source |
To build a high-quality multimodal instruction dataset, we rewrite various datasets into a unified multimodal-to-text dialog format. The annotation process includes four steps.
Eight authors of this work are employed as human annotators, each of whom is a graduate student familiar with relevant literature.
The content of each original dataset follows its original license. For tasks with an Unknown/Custom license, we suggest users check the original project or contact the dataset owner for detailed license information.
Our annotated instruction data is licensed under CC BY 4.0.
```bibtex
@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}
```
M3IT is an open-source, large-scale Multi-modal, Multilingual Instruction Tuning dataset, designed to enable the development of general-purpose multi-modal agents.