Dataset type: LLaVA Visual Instruct CC3M Pretrain 595K is a subset of the CC-3M dataset, filtered for a more balanced concept coverage distribution. Each caption is also paired with a BLIP synthetic caption for reference. The dataset is constructed for the pretraining stage of feature alignment in visual instruction tuning, as part of our aim to build large multimodal models with GPT-4-level vision/language capability.
Dataset date: LLaVA Visual Instruct CC3M Pretrain 595K was created in April 2023.
Dataset structure:
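As a rough illustration of how the annotations might be consumed, the sketch below loads an annotation file and prints one record. The file name (chat.json) and the field names (image, conversations, from, value) are assumptions based on the common LLaVA conversation format, not specifications from this card.

```python
import json

# Minimal sketch for inspecting the pretraining annotations.
# Assumption: the release ships a JSON file (here called chat.json) in which
# each record pairs a CC-3M image filename with a short caption-style
# conversation using "from"/"value" turns (LLaVA conversation format).
with open("chat.json", "r") as f:
    records = json.load(f)

print(len(records))          # expected to be on the order of 595K records
sample = records[0]
print(sample["image"])       # image filename from the CC-3M subset (assumed field)
for turn in sample["conversations"]:
    print(turn["from"], ":", turn["value"])
```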
Paper or resources for more information: https://llava-vl.github.io/
License: Must comply with the licenses of CC-3M and BLIP (if you use their synthetic captions).
CC-3M: The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Where to send questions or comments about the dataset: https://github.com/haotian-liu/LLaVA/issues
Primary intended uses: The primary use of LLaVA is research on large multimodal models and chatbots.
Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.