Dataset: textvqa
Task: visual question answering
Language: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: original
License: cc-by-4.0

TextVQA requires models to read and reason about text in images in order to answer questions about them. Specifically, models need to treat the text present in an image as an additional modality and reason over it to answer TextVQA questions. The TextVQA dataset contains 45,336 questions over 28,408 images from the OpenImages dataset. Evaluation uses the VQA accuracy metric.
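As a rough illustration of the VQA accuracy metric mentioned above, the sketch below uses the commonly cited simplified form, min(number of matching human answers / 3, 1). The function name `vqa_accuracy` and the lowercase/strip normalization are assumptions made for this example; the official evaluation applies its own answer normalization and averages over subsets of annotators, so scores may differ slightly.

```python
from __future__ import annotations


def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(#matching human answers / 3, 1).

    Assumption: answers are compared after lowercasing and stripping
    whitespace only; the official scorer normalizes more aggressively.
    """
    normalized = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


if __name__ == "__main__":
    answers = ["simon clancy"] * 8 + ["simon ciancy", "the brand is bayard"]
    print(vqa_accuracy("Simon Clancy", answers))  # agrees with 8 annotators
    print(vqa_accuracy("simon ciancy", answers))  # agrees with 1 annotator
```

Under this form, a prediction gets full credit once at least three annotators gave the same answer, and partial credit otherwise.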
The questions in the dataset are in English.
A typical sample contains the question in the question field, an image object in the image field, the OpenImages image ID in image_id, and a lot of other useful metadata. The answers field holds 10 answers per question. For the test set, the answers field holds 10 empty strings, because the answers are not available for it.
An example looks like this:
{
    'question': 'who is this copyrighted by?',
    'image_id': '00685bc495504d61',
    'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x276021C5EB8>,
    'image_classes': ['Vehicle', 'Tower', 'Airplane', 'Aircraft'],
    'flickr_original_url': 'https://farm2.staticflickr.com/5067/5620759429_4ea686e643_o.jpg',
    'flickr_300k_url': 'https://c5.staticflickr.com/6/5067/5620759429_f43a649fb5_z.jpg',
    'image_width': 786,
    'image_height': 1024,
    'answers': ['simon clancy', 'simon ciancy', 'simon clancy', 'simon clancy', 'the brand is bayard', 'simon clancy', 'simon clancy', 'simon clancy', 'simon clancy', 'simon clancy'],
    'question_tokens': ['who', 'is', 'this', 'copyrighted', 'by'],
    'question_id': 3,
    'set_name': 'train'
}
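Because each sample carries ten human answers, it can be useful to see how strongly the annotators agree. The sketch below is only an illustration: the helper name `majority_answer` is hypothetical, and it assumes a sample is a plain dict with an `answers` list as in the example above.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional


def majority_answer(sample: dict) -> Optional[tuple[str, int]]:
    """Return (most common answer, vote count), or None if no answers exist.

    Empty strings are ignored, so unlabeled samples yield None.
    """
    counts = Counter(a for a in sample["answers"] if a)
    if not counts:
        return None
    return counts.most_common(1)[0]


if __name__ == "__main__":
    sample = {
        "question": "who is this copyrighted by?",
        "answers": ["simon clancy", "simon ciancy", "simon clancy",
                    "simon clancy", "the brand is bayard", "simon clancy",
                    "simon clancy", "simon clancy", "simon clancy",
                    "simon clancy"],
    }
    print(majority_answer(sample))
```

For the sample shown, eight of the ten annotators gave the same answer, and the two outliers are a misspelling and an unrelated response.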
There are three splits: train, validation and test. The train and validation sets share images with the OpenImages train set and have their answers available. For test set answers, we return a list of ten empty strings. To get inference results and numbers on the test set, you need to go to the EvalAI leaderboard and upload your predictions there. Please see the instructions at https://textvqa.org/challenge/.
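Since test samples carry ten empty strings instead of real answers, code that mixes splits may need to tell labeled and unlabeled samples apart. The following is a minimal sketch under that assumption; the helper names `has_answers` and `partition_by_labels` are hypothetical, and samples are assumed to be plain dicts with an `answers` list.

```python
from __future__ import annotations


def has_answers(sample: dict) -> bool:
    """True if at least one answer is a non-empty, non-whitespace string."""
    return any(a.strip() for a in sample.get("answers", []))


def partition_by_labels(samples: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split samples into (labeled, unlabeled) lists, preserving order."""
    labeled = [s for s in samples if has_answers(s)]
    unlabeled = [s for s in samples if not has_answers(s)]
    return labeled, unlabeled


if __name__ == "__main__":
    samples = [
        {"question_id": 3, "set_name": "train", "answers": ["simon clancy"] * 10},
        {"question_id": 7, "set_name": "test", "answers": [""] * 10},
    ]
    labeled, unlabeled = partition_by_labels(samples)
    print(len(labeled), len(unlabeled))
```

Only the labeled samples can be scored locally; predictions for the unlabeled ones have to be evaluated through the leaderboard.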
From the paper:
Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today’s VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new “TextVQA” dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer.
The initial images were sourced from the OpenImages v4 dataset. They were first filtered with automatic heuristics based on an OCR system, keeping only images in which at least some text was detected. See the annotation process section for the subsequent stages.
Who are the source language producers?
English crowdsourced annotators.
After the automatic filtering of images that contain text, human annotators manually verified that the images did contain text. In the next stage, annotators were asked to write questions involving the scene text in each image. For some images, two questions were collected at this stage whenever possible. Finally, in the last stage, ten different human annotators answered the questions written in the previous stage.
Who are the annotators?
Annotators come from one of the major data collection platforms, such as AMT. Exact details are not mentioned in the paper.
The dataset has PII issues similar to those of OpenImages and may sometimes contain human faces, license plates, and documents. Using the provided image_classes data field is one option for filtering out some of this content.
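One way to apply the image_classes field for this purpose is sketched below. This is an illustration only: the blocklist `SENSITIVE_CLASSES` and the helper names are assumptions chosen for the example, not a list provided by the dataset, and an image whose classes do not match the blocklist may still contain personal information.

```python
from __future__ import annotations

# Illustrative blocklist (an assumption, not part of the dataset);
# adjust it to the class labels that matter for your use case.
SENSITIVE_CLASSES = frozenset({"human face", "vehicle registration plate"})


def is_potentially_sensitive(sample: dict,
                             blocked: frozenset = SENSITIVE_CLASSES) -> bool:
    """True if any of the sample's image classes is on the blocklist."""
    return any(c.lower() in blocked for c in sample.get("image_classes", []))


def drop_sensitive(samples: list[dict]) -> list[dict]:
    """Keep only samples with no blocklisted image class."""
    return [s for s in samples if not is_potentially_sensitive(s)]


if __name__ == "__main__":
    samples = [
        {"image_id": "a", "image_classes": ["Vehicle", "Tower", "Airplane"]},
        {"image_id": "b", "image_classes": ["Human face", "Tree"]},
    ]
    print([s["image_id"] for s in drop_sensitive(samples)])
```

Matching is case-insensitive and exact, so a broad class such as "Vehicle" is kept while "Vehicle registration plate" would be dropped.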
The paper helped establish the importance of scene text recognition and reasoning in general-purpose machine learning applications and has led to many follow-up works, including TextCaps and TextOCR. Similar datasets have been introduced over time, some focusing on visually impaired users, such as VizWiz, and others focusing on the same problem as TextVQA, such as STVQA, DocVQA and OCRVQA. Currently, most methods train on a combination of TextVQA and STVQA to achieve state-of-the-art performance on both datasets.
The paper discusses question-only bias, in which a model can answer a question without even looking at the image; this was a major issue with the original VQA dataset. Outlier bias in the answers is mitigated by collecting 10 different answers per question, all of which are taken into consideration by the evaluation metric.
CC BY 4.0
@inproceedings{singh2019towards,
    title={Towards VQA Models That Can Read},
    author={Singh, Amanpreet and Natarajan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Batra, Dhruv and Parikh, Devi and Rohrbach, Marcus},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    pages={8317-8326},
    year={2019}
}
Thanks to @apsdehal for adding this dataset.