Dataset:
MMInstruction/M3IT-80
Project Page: https://m3-it.github.io/
Translated from English into 80 languages.
The M3IT dataset compiles a diverse set of classical vision-language tasks, including captioning, visual question answering (VQA), visually conditioned generation, reasoning, and classification. M3IT-80 is the version of M3IT translated into 80 languages.
_LAN_CODES = [ "af", "am", "ar", "as", "ast", "be", "bg", "bn", "bs", "ca", "ceb", "cs", "cy", "da", "de", "el", "es", "et", "fi", "fr", "fuv", "gl", "gu", "ha", "he", "hi", "hr", "hu", "hy", "id", "ig", "is", "it", "ja", "jv", "ka", "kk", "km", "kn", "ko", "ky", "lb", "lg", "lij", "li", "ln", "lo", "lt", "lv", "mi", "mk", "ml", "mr", "mt", "my", "nl", "ny", "oc", "pa", "pl", "pt", "ro", "ru", "sd", "sk", "sn", "so", "sr", "sv", "ta", "te", "tg", "th", "tl", "tr", "uk", "ur", "vi", "wo", "zh", ]
We report the number of train/validation/test instances of each dataset per language.
Task | Dataset | #Train | #Val | #Test |
---|---|---|---|---|
Classification | imagenet | 500 | 500 | 0 |
Visual Question Answering | vqa-v2 | 500 | 500 | 0 |
Knowledgeable Visual QA | okvqa | 500 | 500 | 0 |
Reasoning | winoground | 0 | 0 | 800 |
Generation | vist | 500 | 500 | 500 |
Video | msrvtt | 500 | 500 | 0 |
Video | msrvtt-qa | 500 | 500 | 0 |
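The per-language totals implied by the table can be checked with a few lines of arithmetic. This is a minimal sketch: the split sizes are copied from the table above, and it assumes every dataset is available in all 80 languages with the sizes shown. `SPLIT_SIZES` and `totals_per_language` are illustrative names, not part of the dataset tooling.

```python
# Split sizes per language, copied from the table above: (train, val, test).
SPLIT_SIZES = {
    "imagenet": (500, 500, 0),
    "vqa-v2": (500, 500, 0),
    "okvqa": (500, 500, 0),
    "winoground": (0, 0, 800),
    "vist": (500, 500, 500),
    "msrvtt": (500, 500, 0),
    "msrvtt-qa": (500, 500, 0),
}
NUM_LANGUAGES = 80  # stated in the dataset description


def totals_per_language(split_sizes):
    """Sum each split column across all datasets for one language."""
    train = sum(sizes[0] for sizes in split_sizes.values())
    val = sum(sizes[1] for sizes in split_sizes.values())
    test = sum(sizes[2] for sizes in split_sizes.values())
    return train, val, test


train, val, test = totals_per_language(SPLIT_SIZES)
print(f"per language: train={train}, val={val}, test={test}")
print(f"all languages: {(train + val + test) * NUM_LANGUAGES} instances")
```

Under these assumptions each language has 3,000 training, 3,000 validation, and 1,300 test instances, or 584,000 instances across all 80 languages.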
Source language: English
Task | Dataset [Citation] | Source |
---|---|---|
Classification | imagenet [1] | Source |
Visual Question Answering | vqa-v2 [2] | Source |
Knowledgeable Visual QA | okvqa [3] | Source |
Reasoning | winoground [4] | Source |
Generation | vist [5] | Source |
Video | msrvtt [6] | Source |
Video | msrvtt-qa [7] | Source |
We use the free Alibaba Translate service, a deep neural machine translation (NMT) system, to perform the translation task.
    # OR run huggingface-cli login
    from huggingface_hub import login

    hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
    login(token=hf_token)
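To avoid hard-coding the token in a script, it can be read from an environment variable instead. The sketch below is one possible approach, not part of the original instructions: `read_hf_token` and `login_from_env` are hypothetical helpers, and the variable name `HF_TOKEN` is an assumption about how your environment is configured.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_from_env(env_var="HF_TOKEN"):
    """Log in with the token from the environment (hypothetical helper)."""
    token = read_hf_token(env_var)
    if token is None:
        raise RuntimeError(f"environment variable {env_var} is not set")
    # Imported lazily so the token check works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
```

Calling `login_from_env()` then replaces the explicit `login(token=hf_token)` call.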
    from datasets import load_dataset

    ds_name = "okvqa-zh"  # change the dataset name here
    dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
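    from datasets import load_dataset

    ds_name = "okvqa-zh"  # change the dataset name here
    dataset = load_dataset("MMInstruction/M3IT-80", ds_name)

    train_set = dataset["train"]
    validation_set = dataset["validation"]
    # Note: a split may be empty or missing for datasets that report 0 instances
    # for it in the statistics table (e.g. the test split of okvqa).
    test_set = dataset["test"]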
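The example name `"okvqa-zh"` suggests that configuration names combine a dataset name and a language code with a hyphen. The following sketch builds and parses names under that assumption; the pattern is inferred from this single example rather than confirmed for every configuration, and `make_config_name` and `split_config_name` are hypothetical helpers.

```python
# Dataset names from the statistics table and a few codes from _LAN_CODES;
# extend both sets as needed.
DATASETS = {"imagenet", "vqa-v2", "okvqa", "winoground", "vist", "msrvtt", "msrvtt-qa"}
LAN_CODES_SUBSET = {"ar", "de", "fr", "ja", "zh"}


def make_config_name(dataset, lang, lang_codes=LAN_CODES_SUBSET):
    """Build a config name such as "okvqa-zh" (assumed naming pattern)."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if lang not in lang_codes:
        raise ValueError(f"unknown language code: {lang!r}")
    return f"{dataset}-{lang}"


def split_config_name(name):
    """Split on the last hyphen so that dataset names containing
    hyphens (e.g. "msrvtt-qa") are kept intact."""
    dataset, _, lang = name.rpartition("-")
    return dataset, lang
```

The result of `make_config_name("okvqa", "zh")` can be passed as `ds_name` in the loading snippets.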
    from datasets import load_dataset
    from io import BytesIO
    from base64 import b64decode
    from PIL import Image

    ds_name = "okvqa-zh"  # change the dataset name here
    dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
    train_set = dataset["train"]

    for train_instance in train_set:
        instruction = train_instance["instruction"]  # str
        inputs = train_instance["inputs"]  # str
        outputs = train_instance["outputs"]  # str
        image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64)
        # decode the first image of the instance
        image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
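Before handing a decoded payload to `Image.open`, it can be useful to check that the string is valid base64 and looks like an image. The sketch below uses only the standard library and inspects the leading bytes of the decoded data. `sniff_image_format` is a hypothetical helper, the list of formats is an assumption about what the dataset may contain, and the sample payload is synthetic.

```python
from base64 import b64decode, b64encode

# Leading "magic" bytes of a few common image formats.
_MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_format(image_base64_str):
    """Return the format guessed from the decoded header, or None."""
    try:
        # validate=True rejects any character outside the base64 alphabet,
        # including whitespace and newlines.
        raw = b64decode(image_base64_str, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    for magic, fmt in _MAGIC.items():
        if raw.startswith(magic):
            return fmt
    return None


# Usage with a synthetic payload (not real dataset content).
fake_png = b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8).decode("ascii")
print(sniff_image_format(fake_png))       # png
print(sniff_image_format("not base64!"))  # None
```

A `None` result only means the header was not recognized by this short list; it does not prove the image is unreadable.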
    import datasets

    features = datasets.Features(
        {
            "instruction": datasets.Value("string"),
            "inputs": datasets.Value("string"),
            "image_base64_str": [datasets.Value("string")],
            "outputs": datasets.Value("string"),
        }
    )
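The schema above can also be checked on plain Python dictionaries, for instance when writing custom preprocessing code. This is a minimal sketch using only the standard library; `validate_instance` is a hypothetical helper, and the field names and types are taken from the feature definition above.

```python
# String-valued fields from the feature definition.
EXPECTED_FIELDS = {
    "instruction": str,
    "inputs": str,
    "outputs": str,
}


def validate_instance(instance):
    """Return a list of problems found in one instance (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in instance:
            problems.append(f"missing field: {name}")
        elif not isinstance(instance[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # image_base64_str is a list of base64 strings, one per image.
    images = instance.get("image_base64_str")
    if not isinstance(images, list):
        problems.append("image_base64_str should be a list")
    elif not all(isinstance(item, str) for item in images):
        problems.append("image_base64_str should contain only strings")
    return problems
```

An empty list means the instance has the expected fields and types; it does not check that the base64 strings decode to valid images.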
The content of each original dataset follows its original license. For tasks with an Unknown/Custom license, we suggest that users check the original project or contact the dataset owner for detailed license information.
Our annotated instruction data is licensed under CC BY 4.0.
    @article{li2023m3it,
      title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
      author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
      journal={arXiv preprint arXiv:2306.04387},
      year={2023}
    }
M3IT-80 is the translated version of M3IT, an open-source, large-scale Multi-modal, Multilingual Instruction Tuning dataset, designed to enable the development of general-purpose multi-modal agents.