Dataset:

MMInstruction/M3IT

Tasks:

Image-to-Text

Image Classification

Size:

1M<n<10M

License:

other

Dataset Card for M3IT

Project Page: M3IT

Languages

English and Chinese. Versions translated into 80 languages can be found at M3IT-80.

Dataset Statistics

Our dataset compiles a diverse set of classical vision-language tasks, including captioning, visual question answering (VQA), visually conditioned generation, reasoning, and classification.

Instruction Statistics

Task	#Instructions
Image Captioning	52
Classification	113
Visual Question Answering	95
Knowledgeable Visual QA	40
Reasoning	60
Generation	40
Total	400

Task Statistics

Task	Description	#Train	#Val	#Test
Image Captioning	Given an image, write a description for the image.	679,087	41,462	27,499
Classification	Given an image, classify the image into pre-defined categories.	238,303	100,069	21,206
Visual Question Answering	Given an image, answer a question relevant to the image.	177,633	46,314	10,828
Knowledgeable Visual QA	Given an image, answer a question that requires outside knowledge.	39,981	11,682	5,477
Reasoning	Given an image, conduct reasoning over the images.	99,372	11,500	10,000
Generation	Given an image, make compositions with certain requirements.	145,000	11,315	17,350
Chinese	CAP, CLS, VQA, and GEN tasks in Chinese.	192,076	77,306	4,100
Video	CAP, CLS, and VQA tasks on video-language datasets.	20,868	7,542	9,294
Multi-lingual	Translated tasks in 80 languages	0	240,000	184,000

Detailed Dataset Statistics

Task	Dataset	#Train	#Val	#Test
Image Captioning	coco	566,747	25,010	25,010
Image Captioning	textcap	97,765	13,965	0
Image Captioning	image-paragraph-captioning	14,575	2,487	2,489
Classification	coco-goi	30,000	2,000	0
Classification	coco-text	118,312	27,550	0
Classification	imagenet	30,000	50,000	0
Classification	coco-itm	30,000	5,000	5,000
Classification	snli-ve	20,000	14,339	14,740
Classification	mocheg	4,991	180	466
Classification	iqa	5,000	1,000	1,000
Visual Question Answering	vqa-v2	30,000	30,000	0
Visual Question Answering	shapes	13,568	1,024	1,024
Visual Question Answering	docvqa	39,463	5,349	0
Visual Question Answering	ocr-vqa	11,414	4,940	0
Visual Question Answering	st-vqa	26,074	0	4,070
Visual Question Answering	text-vqa	27,113	0	5,734
Visual Question Answering	gqa	30,001	5,001	0
Knowledgeable Visual QA	okvqa	9,009	5,046	0
Knowledgeable Visual QA	a-okvqa	17,056	1,145	0
Knowledgeable Visual QA	science-qa	12,726	4,241	4,241
Knowledgeable Visual QA	viquae	1,190	1,250	1,236
Reasoning	clevr	30,000	2,000	0
Reasoning	nlvr	29,372	2,000	0
Reasoning	vcr	25,000	5,000	5,000
Reasoning	visual-mrc	15,000	2,500	5,000
Reasoning	winoground	0	0	800
Generation	vist	5,000	4,315	4,350
Generation	visual-dialog	50,000	1,000	1,000
Generation	multi30k	90,000	6,000	12,000
Chinese	fm-iqa	164,735	75,206	0
Chinese	coco-cn	18,341	1,000	1,000
Chinese	flickr8k-cn	6,000	1,000	1,000
Chinese	chinese-food	0	0	1,100
Chinese	mmchat	3,000	1,000	1,000
Video	ss	2,000	2,000	2,000
Video	ivqa	5,994	2,000	2,000
Video	msvd-qa	1,161	245	504
Video	activitynet-qa	3,200	1,800	800
Video	msrvtt	6,513	497	2,990
Video	msrvtt-qa	2,000	1,000	1,000
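
As a quick consistency check, the per-dataset counts can be summed and compared against the corresponding row in Task Statistics. The sketch below does this for Image Captioning only, with the numbers copied by hand from the tables in this card; the variable names are illustrative.

```python
# Per-dataset (train, val, test) counts for Image Captioning,
# copied from the Detailed Dataset Statistics table.
captioning = {
    "coco": (566_747, 25_010, 25_010),
    "textcap": (97_765, 13_965, 0),
    "image-paragraph-captioning": (14_575, 2_487, 2_489),
}

# Sum each column (train, val, test) across the three datasets.
totals = tuple(sum(column) for column in zip(*captioning.values()))
print(totals)  # (679087, 41462, 27499), matching the Image Captioning row in Task Statistics
```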

Dataset Structure

HuggingFace Login (Optional)

# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)

Data Loading

from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)

Data Splits

from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
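
The statistics tables list a count of 0 for some splits (for example, winoground has test data only), so indexing all three splits may not work for every dataset. The helper below is a minimal sketch that keeps only the splits that are present and non-empty. `select_splits` is a hypothetical name, not part of the M3IT loader, and whether the loader omits such splits or returns them empty is an assumption, so the helper handles both cases.

```python
def select_splits(dataset_dict, wanted=("train", "validation", "test")):
    """Return only the requested splits that exist and are non-empty."""
    return {
        name: dataset_dict[name]
        for name in wanted
        if name in dataset_dict and len(dataset_dict[name]) > 0
    }

# Stand-in for a loaded dataset that has only a test split (plain lists, for illustration).
fake_dataset = {"test": ["example_0", "example_1"]}
splits = select_splits(fake_dataset)
print(list(splits))  # ['test']
```

The helper only relies on mapping-style access, so the object returned by `load_dataset` can be passed in place of `fake_dataset`.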

Data Instances

from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64-encoded images)
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))  # decode the first image
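
Because `image_base64_str` is a list, the schema allows an instance to carry more than one image. The sketch below decodes every entry rather than only the first. The helper names are hypothetical, and the toy instance uses placeholder bytes instead of a real image so that the example runs without downloading the dataset.

```python
from base64 import b64decode, b64encode
from io import BytesIO


def decode_image_bytes(instance):
    """Decode every base64 string in an instance into raw image bytes."""
    return [b64decode(s) for s in instance["image_base64_str"]]


def to_pil_images(instance):
    """Open each decoded image with Pillow (imported lazily; requires the pillow package)."""
    from PIL import Image

    return [Image.open(BytesIO(raw)) for raw in decode_image_bytes(instance)]


# Toy instance: the payload is placeholder bytes, not a real image.
toy_instance = {"image_base64_str": [b64encode(b"not-a-real-image").decode("ascii")]}
raw_images = decode_image_bytes(toy_instance)
print(len(raw_images), raw_images[0])  # 1 b'not-a-real-image'
```

On a real instance, `to_pil_images(train_instance)` would return one `PIL.Image` object per entry in the list.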

Data Fields

import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
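
A lightweight way to check that a record matches these fields, without depending on the `datasets` library, is sketched below. `validate_instance` is a hypothetical helper and the sample records are made up for illustration.

```python
def validate_instance(instance):
    """Return a list of schema problems; an empty list means the instance looks valid."""
    problems = []
    # The three text fields must be plain strings.
    for field in ("instruction", "inputs", "outputs"):
        if not isinstance(instance.get(field), str):
            problems.append(f"{field} should be a string")
    # The image field must be a list of base64 strings.
    images = instance.get("image_base64_str")
    if not isinstance(images, list) or not all(isinstance(s, str) for s in images):
        problems.append("image_base64_str should be a list of strings")
    return problems


good = {"instruction": "Describe the image.", "inputs": "", "outputs": "A cat.", "image_base64_str": ["aGk="]}
bad = {"instruction": "Describe the image.", "inputs": "", "outputs": "A cat.", "image_base64_str": "aGk="}
print(validate_instance(good))  # []
print(validate_instance(bad))   # ['image_base64_str should be a list of strings']
```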

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Task	Dataset [Citation]	Source
Image Captioning	coco [1]	Source
Image Captioning	textcap [2]	Source
Image Captioning	image-paragraph-captioning [3]	Source
Classification	coco-goi [1]	Source
Classification	coco-text [4]	Source
Classification	imagenet [5]	Source
Classification	coco-itm [1]	Source
Classification	snli-ve [6]	Source
Classification	mocheg [7]	Source
Classification	iqa [8]	Source
Visual Question Answering	vqa-v2 [9]	Source
Visual Question Answering	shapes [10]	Source
Visual Question Answering	docvqa [11]	Source
Visual Question Answering	ocr-vqa [12]	Source
Visual Question Answering	st-vqa [13]	Source
Visual Question Answering	text-vqa [14]	Source
Visual Question Answering	gqa [15]	Source
Knowledgeable Visual QA	okvqa [16]	Source
Knowledgeable Visual QA	a-okvqa [17]	Source
Knowledgeable Visual QA	science-qa [18]	Source
Knowledgeable Visual QA	viquae [19]	Source
Reasoning	clevr [20]	Source
Reasoning	nlvr [21]	Source
Reasoning	vcr [22]	Source
Reasoning	visual-mrc [23]	Source
Reasoning	winoground [24]	Source
Generation	vist [25]	Source
Generation	visual-dialog [26]	Source
Generation	multi30k [27]	Source
Chinese	fm-iqa [28]	Source
Chinese	coco-cn [29]	Source
Chinese	flickr8k-cn [30]	Source
Chinese	chinese-food [31]	Source
Chinese	mmchat [32]	Source
Video	ss [33]	Source
Video	ivqa [34]	Source
Video	msvd-qa [35]	Source
Video	activitynet-qa [36]	Source
Video	msrvtt [37]	Source
Video	msrvtt-qa [35]	Source

Annotations

Annotation process

To build high-quality multimodal instruction datasets, we rewrite various datasets into multimodal-to-text dialog format. The annotation process includes four steps:

(1) Stage I: Instruction Writing: writing instructions for each task;
(2) Stage II: Data Format Unification: structuring images and texts into a unified schema;
(3) Stage III: Quality Check: checking the overall dataset quality;
(4) Stage IV: Key Datasets Translation: building multilingual sets.
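
Stage II can be illustrated with a small sketch that packs one raw example into the four fields listed under Data Fields. This is not the authors' actual pipeline: `to_unified_record` is a hypothetical helper, and the instruction, question, answer, and image bytes below are made up for illustration.

```python
from base64 import b64decode, b64encode


def to_unified_record(instruction, inputs, outputs, images):
    """Pack one example into the four-field schema; `images` is a list of raw image bytes."""
    return {
        "instruction": instruction,
        "inputs": inputs,
        "image_base64_str": [b64encode(img).decode("ascii") for img in images],
        "outputs": outputs,
    }


record = to_unified_record(
    instruction="Answer the question about the image.",  # made-up instruction
    inputs="What color is the car?",  # made-up question
    outputs="red",  # made-up answer
    images=[b"placeholder-bytes"],  # stands in for JPEG/PNG bytes
)
print(sorted(record))  # ['image_base64_str', 'inputs', 'instruction', 'outputs']
```

Storing images as base64 strings keeps every field a plain string or a list of strings, which matches the feature definition shown under Data Fields.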

Who are the annotators?

Eight authors of this work are employed as human annotators, each of whom is a graduate student familiar with relevant literature.

Additional Information

Licensing Information

The content of each original dataset follows its original license. For tasks with an Unknown/Custom license, we suggest checking the original project or contacting the dataset owner for detailed license information.

Our annotated instruction data is licensed under CC BY 4.0.

Citation Information

@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}

Contributions

M3IT is an open-source, large-scale Multi-modal, Multilingual Instruction Tuning dataset, designed to enable the development of general-purpose multi-modal agents.

References

[1] Microsoft COCO: Common Objects in Context
[2] TextCaps: a dataset for image captioning with reading comprehension
[3] A Hierarchical Approach for Generating Descriptive Image Paragraphs
[4] COCO-Text: Dataset and benchmark for text detection and recognition in natural images
[5] Imagenet large scale visual recognition challenge
[6] E-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks
[7] End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models
[8] Quantifying visual image quality: A Bayesian view
[9] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
[10] Neural Module Networks
[11] DocVQA: A dataset for vqa on document images
[12] OCR-VQA: Visual Question Answering by Reading Text in Images
[13] Scene Text Visual Question Answering
[14] Towards VQA Models That Can Read
[15] GQA: A new dataset for real-world visual reasoning and compositional question answering
[16] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
[17] A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
[18] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
[19] ViQuAE: a dataset for knowledge-based visual question answering about named entities
[20] CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning
[21] A Corpus of Natural Language for Visual Reasoning
[22] From recognition to cognition: Visual Commonsense Reasoning
[23] VisualMRC: Machine reading comprehension on document images
[24] Winoground: Probing vision and language models for visio-linguistic compositionality
[25] Visual Storytelling
[26] Visual Dialog
[27] Multi30k: Multilingual english-german image descriptions
[28] Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question
[29] COCO-CN for cross-lingual image tagging, captioning, and retrieval
[30] Adding Chinese Captions to Images
[31] ChineseFoodNet: A large-scale image dataset for chinese food recognition
[32] MMChat: Multi-Modal Chat Dataset on Social Media
[33] The "Something Something" Video Database for Learning and Evaluating Visual Common Sense
[34] Just Ask: Learning to answer questions from millions of narrated videos
[35] Video Question Answering via Gradually Refined Attention over Appearance and Motion
[36] ActivityNet-qa: A dataset for understanding complex web videos via question answering
[37] MSR-VTT: A large video description dataset for bridging video and language

Author:

MMInstruction

Dataset size:

241.48 GB