Dataset: wikimedia/wit_base
Sub-tasks: image-captioning
Multilinguality: multilingual
Size: 1M<n<10M
Language creators: found
Annotation creators: machine-generated
License: cc-by-sa-4.0
Wikimedia's version of the Wikipedia-based Image Text (WIT) Dataset, a large multimodal multilingual dataset.
From the official blog post:
The core training data is taken from the Wikipedia Image-Text (WIT) Dataset, a large curated set of more than 37 million image-text associations extracted from Wikipedia articles in 108 languages that was recently released by Google Research.
The WIT dataset offers extremely valuable data about the pieces of text associated with Wikipedia images. However, due to licensing and data volume issues, the Google dataset only provides the image name and corresponding URL for download and not the raw image files.
Getting easy access to the image files is crucial for participants to successfully develop competitive models. Therefore, today, the Wikimedia Research team is releasing its first large image dataset. It contains more than six million image files from Wikipedia articles in 100+ languages, which correspond to almost [1] all captioned images in the WIT dataset. Image files are provided at a 300-px resolution, a size that is suitable for most of the learning frameworks used to classify and analyze images.
[1] We are publishing all images having a non-null “reference description” in the WIT dataset. For privacy reasons, we are not publishing images where a person is the primary subject, i.e., where a person’s face covers more than 10% of the image surface. To identify faces and their bounding boxes, we use the RetinaFace detector. In addition, to avoid the inclusion of inappropriate images or images that violate copyright constraints, we have removed all images that are candidate for deletion on Commons from the dataset.
Note: Compared to Google's version, which has the contents of one Wikipedia page per data sample, this version groups the contents of all Wikipedia pages available in different languages for the image into one single data sample, to avoid duplicating the image bytes.
image-captioning: This dataset can be used to train a model for image captioning, where the goal is to predict a caption given the image.
text-retrieval: The goal in this task is to build a model that retrieves the text (caption_title_and_reference_description) closest to an image. The leaderboard for this task can be found here. This task also has a competition on Kaggle.
In these tasks, any combination of the caption_reference_description, caption_attribution_description and caption_alt_text_description fields can be used as the input text/caption.
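As a concrete illustration, here is a minimal sketch (not an official baseline) that streams the dataset with the Hugging Face datasets library and picks one usable caption per image by falling back across these fields; the field layout follows the data-instance example shown further below.

```python
from itertools import islice
from datasets import load_dataset

# Stream the dataset so that image bytes are fetched lazily instead of
# downloading the full dump up front.
ds = load_dataset("wikimedia/wit_base", split="train", streaming=True)

def pick_caption(example):
    """Return one caption string, preferring per-language reference descriptions."""
    feats = example["wit_features"]  # parallel lists, one entry per language page
    for ref, alt in zip(feats["caption_reference_description"],
                        feats["caption_alt_text_description"]):
        if ref:
            return ref
        if alt:
            return alt
    # Fall back to the image-level attribution description.
    return example["caption_attribution_description"]

for example in islice(ds, 5):
    print(example["image_url"], "->", pick_caption(example))
```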
The dataset contains examples from all Wikipedia languages.
Each instance is an image, its representation in bytes, a pre-computed embedding, and the set of captions attached to the image in Wikipedia.
{
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=300x225 at 0x7F88F3876358>,
  'image_url': 'https://upload.wikimedia.org/wikipedia/commons/8/8b/Scolopendra_gigantea.jpg',
  'embedding': [1.4784087, 2.8710432, 0.0, 0.51603067, ..., 10.266883, 0.51142216, 0.0, 2.3464653],
  'metadata_url': 'http://commons.wikimedia.org/wiki/File:Scolopendra_gigantea.jpg',
  'original_height': 3000,
  'original_width': 4000,
  'mime_type': 'image/jpeg',
  'caption_attribution_description': 'English: Puerto Rican Giant Centipede, Scolopendra gigantea; Vieques, Puerto Rico Slovenčina: Stonožka obrovská, Scolopendra gigantea; Vieques, Portoriko',
  'wit_features': {
    'language': ['ro', 'vi', 'sk', ..., 'nl', 'th', 'lv'],
    'page_url': ['https://ro.wikipedia.org/wiki/Scolopendra_gigantea', 'https://vi.wikipedia.org/wiki/Scolopendra_gigantea', 'https://sk.wikipedia.org/wiki/Scolopendra_gigantea', ..., 'https://nl.wikipedia.org/wiki/Scolopendra_gigantea', 'https://th.wikipedia.org/wiki/%E0%B8%95%E0%B8%B0%E0%B8%82%E0%B8%B2%E0%B8%9A%E0%B8%A2%E0%B8%B1%E0%B8%81%E0%B8%A9%E0%B9%8C%E0%B8%82%E0%B8%B2%E0%B9%80%E0%B8%AB%E0%B8%A5%E0%B8%B7%E0%B8%AD%E0%B8%87%E0%B9%80%E0%B8%9B%E0%B8%A3%E0%B8%B9', 'https://lv.wikipedia.org/wiki/Skolopendru_dzimta'],
    'attribution_passes_lang_id': [True, True, True, ..., True, True, True],
    'caption_alt_text_description': [None, None, None, ..., 'Scolopendra gigantea', None, 'Milzu skolopendra (Scolopendra gigantea)'],
    'caption_reference_description': [None, None, None, ..., None, None, 'Milzu skolopendra (Scolopendra gigantea)'],
    'caption_title_and_reference_description': [None, 'Scolopendra gigantea [SEP] ', None, ..., 'Scolopendra gigantea [SEP] ', None, 'Skolopendru dzimta [SEP] Milzu skolopendra (Scolopendra gigantea)'],
    'context_page_description': ['Scolopendra gigantea este un miriapod din clasa Chilopoda, fiind cel mai mare reprezentant al genului Scolopendra. Adultul poate atinge o lungime de 26 cm, uneori depășind 30 cm. Această specie habitează în regiunile de nord și de vest a Americii de Sud, pe insulele Trinidad, insulele Virgine, Jamaica Hispaniola ș.a. Localnicii denumesc scolopendra chilopodul gigant galben și chilopodul gigant amazonian.', 'Scolopendra gigantea là đại diện lớn nhất của chi Scolopendra nói riêng và cả lớp rết nói chung, thường đạt độ dài 26 cm và có thể vượt quá 30 cm. Sinh sống ở khu vực phía bắc và tây của Nam Mỹ và các đảo Trinidad, Puerto Rico, Saint Thomas, U.S. Virgin Islands, Jamaica, và Hispaniola.', 'Scolopendra gigantea, starší slovenský nazov: štípavica veľká, je živočích z rodu Scolopendra, s veľkosťou do 30 cm.', ..., 'Scolopendra gigantea is een tijgerduizendpoot uit Zuid-Amerika. De soort jaagt onder andere op grote geleedpotigen, amfibieën, reptielen en kleine zoogdieren. Het is voor zover bekend de grootste niet uitgestorven duizendpoot ter wereld.', 'ตะขาบยักษ์ขาเหลืองเปรู หรือ ตะขาบยักษ์อเมซอน เป็นตะขาบชนิดที่มีขนาดใหญ่ที่สุดในสกุล Scolopendra โดยปกติเมื่อโตเต็มที่จะยาว 26 เซนติเมตร แต่บางครั้งก็สามารถโตได้ถึง 30 เซนติเมตร ตะขาบชนิดนี้อาศัยอยู่ทางแถบเหนือและตะวันตกของทวีปอเมริกาใต้ และตามเกาะแก่งของประเทศตรินิแดดและจาไมกา เป็นสัตว์กินเนื้อ โดยกินจิ้งจก, กบ, นก, หนู และแม้แต่ค้างคาวเป็นอาหาร และขึ้นชื่อในเรื่องความดุร้าย', 'Skolpendru dzimta pieder pie simtkāju kārtas. Ap 400 dzimtas sugas sastopamas visā pasaulē, īpaši subtropu un tropu apgabalos. Mitinās augsnē, nobirušās lapās, plaisās, spraugās.'],
    'context_section_description': [None, 'Scolopendra gigantea (còn được gọi là Rết chân vàng khổng lồ Peru và Rết khổng lồ Amazon) là đại diện lớn nhất của chi Scolopendra nói riêng và cả lớp rết nói chung, thường đạt độ dài 26\xa0cm (10\xa0in) và có thể vượt quá 30\xa0cm (12\xa0in). Sinh sống ở khu vực phía bắc và tây của Nam Mỹ và các đảo Trinidad, Puerto Rico, Saint Thomas, U.S. Virgin Islands, Jamaica, và Hispaniola.', None, ..., 'Scolopendra gigantea is een tijgerduizendpoot uit Zuid-Amerika. De soort jaagt onder andere op grote geleedpotigen, amfibieën, reptielen en kleine zoogdieren. Het is voor zover bekend de grootste niet uitgestorven duizendpoot ter wereld.', None, 'Skolpendru dzimta (Scolopendridae) pieder pie simtkāju kārtas. Ap 400 dzimtas sugas sastopamas visā pasaulē, īpaši subtropu un tropu apgabalos. Mitinās augsnē, nobirušās lapās, plaisās, spraugās.'],
    'hierarchical_section_title': ['Scolopendra gigantea', 'Scolopendra gigantea', 'Scolopendra gigantea', ..., 'Scolopendra gigantea', 'ตะขาบยักษ์ขาเหลืองเปรู', 'Skolopendru dzimta'],
    'is_main_image': [True, True, True, ..., True, True, True],
    'page_title': ['Scolopendra gigantea', 'Scolopendra gigantea', 'Scolopendra gigantea', ..., 'Scolopendra gigantea', 'ตะขาบยักษ์ขาเหลืองเปรู', 'Skolopendru dzimta'],
    'section_title': [None, None, None, ..., None, None, None]
  }
}
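A minimal sketch of how such an instance can be loaded and its fields inspected, assuming the Hugging Face datasets library (streaming, so the full image dump does not have to be downloaded first):

```python
from datasets import load_dataset

ds = load_dataset("wikimedia/wit_base", split="train", streaming=True)
ex = next(iter(ds))  # first example only

img = ex["image"]            # decoded PIL image (provided at 300-px resolution)
emb = ex["embedding"]        # pre-computed float vector for the image
feats = ex["wit_features"]   # dict of parallel lists, one entry per language page

print(img.size, len(emb), len(feats["language"]))
for lang, title in zip(feats["language"], feats["page_title"]):
    print(lang, title)
```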
Note: The dataset is stored in Parquet for better performance. This dataset was generated from the original files using this script. Additionally, 120 examples from the original files have one or more of the following fields incorrectly formatted: original_height, original_width, mime_type and caption_attribution_description. The fixed versions of these examples, which were used by the generation script, can be found here.
Figure: WIT annotation example.
Details on the field content can be found directly in the paper, figure 5 and table 12.
All data is held in the train split, with a total of 6,477,255 examples.
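Since only a train split is shipped, any validation or test set has to be carved out by the user. A possible sketch using datasets' train_test_split (split size and seed are illustrative):

```python
from datasets import load_dataset

# Non-streaming load: downloads all examples locally (a very large download).
ds = load_dataset("wikimedia/wit_base", split="train")

# Hold out 1% of the data as a validation set.
splits = ds.train_test_split(test_size=0.01, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))
```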
From the official blog post:
The WIT dataset offers extremely valuable data about the pieces of text associated with Wikipedia images.
Getting easy access to the image files is crucial for participants to successfully develop competitive models.
With this large release of visual data, we aim to help the competition participants—as well as researchers and practitioners who are interested in working with Wikipedia images—find and download the large number of image files associated with the challenge, in a compact form.
From the paper, section 3.1:
We started with all Wikipedia content pages (i.e., ignoring other pages that have discussions, comments and such). These number about ~124M pages across 279 languages.
Who are the source language producers?
Text was extracted from Wikipedia.
WIT was constructed using an automatic process. However, it was human-validated.
From the paper, section 3.7:
To further verify the quality of the WIT dataset we performed a study using (crowd-sourced) human annotators. As seen in Fig. 3, we asked raters to answer 3 questions. Given an image and the page title, raters first evaluate the quality of the attribution description and reference description in the first two questions (order randomized). The third question understands the contextual quality of these text descriptions given the page description and caption. Each response is on a 3-point scale: "Yes" if the text perfectly describes the image, "Maybe" if it is sufficiently explanatory and "No" if it is irrelevant or the image is inappropriate.
Who are the annotators?
[More Information Needed]
From the official blog post:
For privacy reasons, we are not publishing images where a person is the primary subject, i.e., where a person’s face covers more than 10% of the image surface. To identify faces and their bounding boxes, we use the RetinaFace detector. In addition, to avoid the inclusion of inappropriate images or images that violate copyright constraints, we have removed all images that are candidate for deletion on Commons from the dataset.
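The 10%-of-image-surface criterion can be illustrated with a short sketch. This is not Wikimedia's actual filtering code; the bounding boxes are assumed to come from some external face detector (such as RetinaFace) in pixel coordinates:

```python
from typing import Iterable, Tuple
from PIL import Image

FACE_AREA_THRESHOLD = 0.10  # "more than 10% of the image surface"

def face_too_large(image: Image.Image,
                   boxes: Iterable[Tuple[int, int, int, int]]) -> bool:
    """Return True if any face bounding box (x1, y1, x2, y2) covers more than
    10% of the image area; such images would be excluded for privacy reasons."""
    image_area = image.width * image.height
    for x1, y1, x2, y2 in boxes:
        face_area = max(0, x2 - x1) * max(0, y2 - y1)
        if face_area > FACE_AREA_THRESHOLD * image_area:
            return True
    return False
```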
[More Information Needed]
From the paper, section 3.4:
Lastly we found that certain image-text pairs occurred very frequently. These were often generic images that did not have much to do with the main article page. Common examples included flags, logos, maps, insignia and such. To prevent biasing the data, we heavily under-sampled all such images.
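The paper does not give the exact under-sampling rule; the sketch below illustrates one simple frequency-capping scheme of the kind described (the cap value is an arbitrary illustrative choice):

```python
import random
from collections import Counter

def undersample(pairs, max_per_pair=5, seed=0):
    """Keep at most `max_per_pair` copies of each (image, text) pair.
    `pairs` is an iterable of hashable pairs, e.g. (image_url, caption)."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)          # so the retained copies are not order-biased
    counts = Counter()
    kept = []
    for pair in pairs:
        if counts[pair] < max_per_pair:
            kept.append(pair)
            counts[pair] += 1
    return kept
```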
[More Information Needed]
Miriam Redi, Fabian Kaelin and Tiziano Piccardi.
CC BY-SA 4.0 International license
@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}
Thanks to @nateraw, yjernite and mariosasko for adding this dataset.