Dataset:

MMInstruction/M3IT-80

License:

other

Size:

size_categories:0.5M<n<1M

Tasks:

Image classification

Image-to-text

Dataset Card for M3IT-80

Project Page: https://m3-it.github.io/

Languages

80 languages translated from English.

Dataset Metainfo

The M3IT dataset compiles a diverse set of classical vision-language tasks, including captioning, visual question answering (VQA), visually conditioned generation, reasoning, and classification. M3IT-80 is the version of M3IT translated into 80 languages.

Languages

_LAN_CODES = [
    "af", "am", "ar", "as", "ast", "be", "bg", "bn", "bs", "ca",
    "ceb", "cs", "cy", "da", "de", "el", "es", "et", "fi", "fr",
    "fuv", "gl", "gu", "ha", "he", "hi", "hr", "hu", "hy", "id",
    "ig", "is", "it", "ja", "jv", "ka", "kk", "km", "kn", "ko",
    "ky", "lb", "lg", "lij", "li", "ln", "lo", "lt", "lv", "mi",
    "mk", "ml", "mr", "mt", "my", "nl", "ny", "oc", "pa", "pl",
    "pt", "ro", "ru", "sd", "sk", "sn", "so", "sr", "sv", "ta",
    "te", "tg", "th", "tl", "tr", "uk", "ur", "vi", "wo", "zh",
]

Dataset Statistics

We report the number of train/validation/test instances in each dataset, per language.

Task	Dataset	#Train	#Val	#Test
Classification	imagenet	500	500	0
Visual Question Answering	vqa-v2	500	500	0
Knowledgeable Visual QA	okvqa	500	500	0
Reasoning	winoground	0	0	800
Generation	vist	500	500	500
Video	msrvtt	500	500	0
Video	msrvtt-qa	500	500	0
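
As a rough sanity check on the table above, the per-language counts can be summed and scaled by the number of languages. This is only a sketch: it assumes that every dataset is available in all 80 languages with exactly the counts listed, which the card does not state explicitly.

```python
# Per-language (train, val, test) sizes copied from the statistics table.
SPLIT_SIZES = {
    "imagenet": (500, 500, 0),
    "vqa-v2": (500, 500, 0),
    "okvqa": (500, 500, 0),
    "winoground": (0, 0, 800),
    "vist": (500, 500, 500),
    "msrvtt": (500, 500, 0),
    "msrvtt-qa": (500, 500, 0),
}
NUM_LANGUAGES = 80  # assumption: every dataset exists in all 80 languages


def totals_per_language(split_sizes):
    """Sum the train/val/test counts over all datasets for one language."""
    train = sum(sizes[0] for sizes in split_sizes.values())
    val = sum(sizes[1] for sizes in split_sizes.values())
    test = sum(sizes[2] for sizes in split_sizes.values())
    return train, val, test


train, val, test = totals_per_language(SPLIT_SIZES)
per_language = train + val + test
print(train, val, test)              # 3000 3000 1300
print(per_language * NUM_LANGUAGES)  # 584000
```

Under that assumption the total is 584,000 instances, which falls inside the 0.5M<n<1M size category listed for the dataset.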

Source Data

Source language: English

Task	Dataset [Citation]	Source
Classification	imagenet [1]	Source
Visual Question Answering	vqa-v2 [2]	Source
Knowledgeable Visual QA	okvqa [3]	Source
Reasoning	winoground [4]	Source
Generation	vist [5]	Source
Video	msrvtt [6]	Source
Video	msrvtt-qa [7]	Source

Translation

We use the free Alibaba Translate service, a neural machine translation (NMT) system, to perform the translation.

Dataset Structure

HuggingFace Login (Optional)

# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)

Data Loading

from datasets import load_dataset

ds_name = "okvqa-zh"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
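
The example name `okvqa-zh` suggests that configuration names follow a `<dataset>-<language code>` pattern. The helper below is a minimal sketch under that assumption; `config_name` and `DATASET_NAMES` are hypothetical names introduced for illustration, and the card does not confirm that every dataset/language pair is published.

```python
# Dataset names as they appear in the statistics table.
DATASET_NAMES = (
    "imagenet", "vqa-v2", "okvqa", "winoground",
    "vist", "msrvtt", "msrvtt-qa",
)


def config_name(dataset: str, lang: str) -> str:
    """Build a configuration name, assuming the '<dataset>-<lang>' pattern."""
    if dataset not in DATASET_NAMES:
        raise ValueError(f"unknown dataset name: {dataset!r}")
    return f"{dataset}-{lang}"


print(config_name("okvqa", "zh"))  # okvqa-zh
```

The resulting string can be passed as `ds_name` to `load_dataset` in the same way as the hard-coded value above.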

Data Splits

from datasets import load_dataset

ds_name = "okvqa-zh"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
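
According to the statistics table, several datasets have no instances in some splits (for example, winoground only has test data). The card does not say whether the loader omits such splits or returns them empty, so a defensive sketch that handles both cases may be useful. `available_splits` is a hypothetical helper; it works on any mapping from split name to a sized collection, which includes the object returned by `load_dataset`.

```python
def available_splits(dataset, wanted=("train", "validation", "test")):
    """Return only the requested splits that exist and contain instances."""
    return {
        name: dataset[name]
        for name in wanted
        if name in dataset and len(dataset[name]) > 0
    }


# Stand-in for a loaded dataset whose test split is empty.
example = {"train": ["a", "b"], "validation": ["c"], "test": []}
print(list(available_splits(example)))  # ['train', 'validation']
```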

Data Instances

from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "okvqa-zh"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64-encoded images)
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
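
The loop above decodes only the first image. Because `image_base64_str` is a list, an instance may carry more than one image (this is an assumption for multi-image tasks such as storytelling or video; the card does not say how many entries each dataset provides). A minimal sketch that decodes every entry into an in-memory buffer, using a hypothetical helper `decode_image_buffers`:

```python
from base64 import b64decode
from io import BytesIO


def decode_image_buffers(image_base64_str_list):
    """Decode each base64 string into a BytesIO buffer of the raw image bytes."""
    return [BytesIO(b64decode(encoded)) for encoded in image_base64_str_list]
```

Each returned buffer can be passed to `Image.open` in the same way as in the loop above.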

Data Fields

import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
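
To illustrate the schema, the following sketch checks a single instance against the four fields listed above using plain Python types. `check_instance` is a hypothetical helper and not part of the `datasets` library.

```python
EXPECTED_STR_FIELDS = ("instruction", "inputs", "outputs")


def check_instance(instance):
    """Return a list of schema problems; an empty list means the instance matches."""
    problems = []
    for field in EXPECTED_STR_FIELDS:
        if not isinstance(instance.get(field), str):
            problems.append(f"{field} should be a string")
    images = instance.get("image_base64_str")
    if not isinstance(images, list) or not all(isinstance(s, str) for s in images):
        problems.append("image_base64_str should be a list of strings")
    return problems


sample = {
    "instruction": "Answer the question about the image.",  # made-up example text
    "inputs": "What is shown?",
    "outputs": "A cat.",
    "image_base64_str": ["YWJj"],  # placeholder bytes, not a real image
}
print(check_instance(sample))  # []
```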

Licensing Information

The content of each original dataset follows its original license. For tasks with an Unknown/Custom license, we suggest checking the original project or contacting the dataset owner for detailed license information.

Our annotated instruction data is licensed under CC BY 4.0.

Citation Information

@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}

Contributions

M3IT-80 is the translated version of M3IT, an open-source, large-scale Multi-modal, Multilingual Instruction Tuning dataset, designed to enable the development of general-purpose multi-modal agents.

References

[1] ImageNet Large Scale Visual Recognition Challenge
[2] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
[3] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
[4] Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
[5] Visual Storytelling
[6] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
[7] Video Question Answering via Gradually Refined Attention over Appearance and Motion

Author:

MMInstruction

Dataset size:

5.26 GB