Dataset: kakaobrain/coyo-700m
Sub-tasks: image-captioning
Language: en
Multilinguality: monolingual
Size: 100M<n<1B
Language creators: other
Annotation creators: no-annotation
Source datasets: original
License: cc-by-4.0

COYO-700M is a large-scale dataset containing 747M image-text pairs, along with many meta-attributes that make it easier to use for training various models. Like previous vision-and-language datasets, it was built by collecting informative pairs of alt-text and the associated image from HTML documents. We expect COYO to be used to train popular large-scale foundation models, complementing other similar datasets. For more details on the data acquisition process, please refer to the technical paper to be released later.
We empirically validated the quality of the COYO dataset by re-implementing popular models such as ALIGN, unCLIP, and ViT. We trained these models from scratch on COYO-700M or its subsets, achieving performance competitive with the numbers reported, or the samples generated, in the original papers. Our pre-trained models and training code will be released soon along with the technical paper.
The texts in the COYO-700M dataset are in English.
Each instance in COYO-700M represents a single image-text pair with its meta-attributes:
{
  'id': 841814333321,
  'url': 'https://blog.dogsof.com/wp-content/uploads/2021/03/Image-from-iOS-5-e1614711641382.jpg',
  'text': 'A Pomsky dog sitting and smiling in field of orange flowers',
  'width': 1000,
  'height': 988,
  'image_phash': 'c9b6a7d8469c1959',
  'text_length': 59,
  'word_count': 11,
  'num_tokens_bert': 13,
  'num_tokens_gpt': 12,
  'num_faces': 0,
  'clip_similarity_vitb32': 0.4296875,
  'clip_similarity_vitl14': 0.35205078125,
  'nsfw_score_opennsfw2': 0.00031447410583496094,
  'nsfw_score_gantman': 0.03298913687467575,
  'watermark_score': 0.1014641746878624,
  'aesthetic_score_laion_v2': 5.435476303100586
}
name | type | description |
---|---|---|
id | long | Unique 64-bit integer ID generated by monotonically_increasing_id() |
url | string | The image URL extracted from the src attribute of the <img> tag |
text | string | The text extracted from the alt attribute of the <img> tag |
width | integer | The width of the image |
height | integer | The height of the image |
image_phash | string | The perceptual hash (pHash) of the image |
text_length | integer | The length of the text |
word_count | integer | The number of words separated by spaces. |
num_tokens_bert | integer | The number of tokens using BertTokenizer |
num_tokens_gpt | integer | The number of tokens using GPT2TokenizerFast |
num_faces | integer | The number of faces in the image detected by SCRFD |
clip_similarity_vitb32 | float | The cosine similarity between the text and image embeddings from OpenAI CLIP (ViT-B/32) |
clip_similarity_vitl14 | float | The cosine similarity between the text and image embeddings from OpenAI CLIP (ViT-L/14) |
nsfw_score_opennsfw2 | float | The NSFW score of the image by OpenNSFW2 |
nsfw_score_gantman | float | The NSFW score of the image by GantMan/NSFW |
watermark_score | float | The watermark probability of the image by our internal model |
aesthetic_score_laion_v2 | float | The aesthetic score of the image by LAION-Aesthetics-Predictor-V2 |
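As a rough illustration of the two simplest text-derived attributes, the sketch below recomputes `text_length` and `word_count` for the sample instance shown earlier. It assumes that `text_length` counts characters and that words are split on whitespace. Both assumptions agree with the sample values (59 and 11), but the exact procedure used to build the dataset is not documented here, and the helper name `text_stats` is hypothetical.

```python
def text_stats(text: str) -> dict:
    """Recompute two text-level meta-attributes under assumed definitions."""
    return {
        # Assumption: the length is measured in characters.
        "text_length": len(text),
        # Assumption: words are the whitespace-separated pieces of the text.
        "word_count": len(text.split()),
    }


# The 'text' field of the sample instance in this card.
sample_text = "A Pomsky dog sitting and smiling in field of orange flowers"
stats = text_stats(sample_text)
print(stats)
```

The tokenizer-based counts (`num_tokens_bert`, `num_tokens_gpt`) depend on the specific tokenizer vocabularies and are not reproduced in this sketch.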
Data was not split, since the evaluation was expected to be performed on more widely used downstream task(s).
As with most vision-and-language datasets, our primary goal in the data creation process was to collect many pairs of alt-text and image sources from HTML documents crawled from the web. We therefore attempted to eliminate uninformative images or texts at minimal cost, and to improve the dataset's usability by adding various meta-attributes. Users can use these meta-attributes to sample a subset of COYO-700M and train the desired model on it. For instance, the num_faces attribute could be used to build a subset such as COYO-Faces and develop a privacy-preserving generative model.
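The subset sampling described above could look something like the following minimal sketch, which filters an iterable of instances by threshold on a few meta-attributes. The function name `select_subset`, its parameters, and the threshold values are hypothetical and chosen only for illustration. They are not recommended settings, and the second record is synthetic rather than taken from the dataset.

```python
from typing import Iterable, Iterator


def select_subset(
    records: Iterable[dict],
    min_faces: int = 0,
    max_nsfw: float = 1.0,
    max_watermark: float = 1.0,
    min_clip: float = -1.0,
) -> Iterator[dict]:
    """Yield only the records whose meta-attributes pass every threshold."""
    for rec in records:
        if rec["num_faces"] < min_faces:
            continue
        if rec["nsfw_score_opennsfw2"] > max_nsfw:
            continue
        if rec["watermark_score"] > max_watermark:
            continue
        if rec["clip_similarity_vitb32"] < min_clip:
            continue
        yield rec


# The sample instance from this card, reduced to the fields used here.
sample = {
    "id": 841814333321,
    "num_faces": 0,
    "clip_similarity_vitb32": 0.4296875,
    "nsfw_score_opennsfw2": 0.00031447410583496094,
    "watermark_score": 0.1014641746878624,
}
# Synthetic record for illustration only; it is NOT part of the dataset.
synthetic = dict(sample, id=-1, num_faces=2)

# Keep only instances with at least one detected face and a low NSFW score.
faces_subset = list(select_subset([sample, synthetic], min_faces=1, max_nsfw=0.1))
print([rec["id"] for rec in faces_subset])
```

Because the function is a generator, the same filter can be applied lazily to a streamed copy of the metadata without loading all instances into memory.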
We collected about 10 billion pairs of alt-text and image sources from HTML documents in CommonCrawl, covering Oct. 2020 to Aug. 2021, and eliminated uninformative pairs through image-level and/or text-level filtering at minimal cost.
Image Level
Text Level
Image-Text Level
Common Crawl is the data source for COYO-700M.
The dataset was built in a fully automated process that did not require human annotation.
Who are the annotators?
No human annotation
The COYO dataset is recommended for research purposes. Kakao Brain tried to construct a "safe" dataset when building COYO (see the Data Filtering section) and continues to work on creating safer datasets. Despite these efforts, because of its very large size (over 700M pairs), the dataset was not hand-curated by humans to remove such risks. Keep in mind that, because the dataset is unscreened, the collected images may include content that is strongly discomforting or disturbing. The COYO dataset may contain some inappropriate data, and any problems resulting from such data are the full responsibility of the user. We therefore strongly recommend that this dataset be used only for research, and Kakao Brain does not recommend using it as-is, without additional processing to remove inappropriate data, to create commercial products.
It will be described in a paper to be released soon.
The COYO dataset was released as open source in the hope that it will be helpful to many research institutes and startups for research purposes. We look forward to hearing from anyone who wishes to collaborate with us.
coyo@kakaobrain.com
The COYO dataset of Kakao Brain is licensed under the CC-BY-4.0 License. The full license can be found in the LICENSE.cc-by-4.0 file. The dataset includes "Image URL" and "Text" collected from various sites by analyzing Common Crawl data, an open data web crawling project. The collected data (images and text) remains subject to the license of the content it belongs to.
Obligation to use
While Open Source may be free to use, that does not mean it is free of obligation. To determine whether your intended use of the COYO dataset is suitable for the CC-BY-4.0 license, please consider the license guide. If you violate the license, you may be subject to legal action, such as a prohibition of use or a claim for damages, depending on the use.
If you apply this dataset to any project or research, please cite our code:
@misc{kakaobrain2022coyo-700m,
  title = {COYO-700M: Image-Text Pair Dataset},
  author = {Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, Saehoon Kim},
  year = {2022},
  howpublished = {\url{https://github.com/kakaobrain/coyo-dataset}},
}