数据集:
derek-thomas/ScienceQA
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Multi-modal Multiple Choice
English
Explore more samples here .
{'image': Image, 'question': 'Which of these states is farthest north?', 'choices': ['West Virginia', 'Louisiana', 'Arizona', 'Oklahoma'], 'answer': 0, 'hint': '', 'task': 'closed choice', 'grade': 'grade2', 'subject': 'social science', 'topic': 'geography', 'category': 'Geography', 'skill': 'Read a map: cardinal directions', 'lecture': 'Maps have four cardinal directions, or main directions. Those directions are north, south, east, and west.\nA compass rose is a set of arrows that point to the cardinal directions. A compass rose usually shows only the first letter of each cardinal direction.\nThe north arrow points to the North Pole. On most maps, north is at the top of the map.', 'solution': 'To find the answer, look at the compass rose. Look at which way the north arrow is pointing. West Virginia is farthest north.'}
Some records might be missing any or all of image, lecture, solution.
Note that the descriptions can be initialized with the Show Markdown Data Fields output of the Datasets Tagging app , you will then only need to refine the generated descriptions.
When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA).
ScienceQA is collected from elementary and high school science curricula.
Initial Data Collection and NormalizationSee Below
Who are the source language producers?See Below
Questions in the ScienceQA dataset are sourced from open resources managed by IXL Learning, an online learning platform curated by experts in the field of K-12 education. The dataset includes problems that align with California Common Core Content Standards. To construct ScienceQA, we downloaded the original science problems and then extracted individual components (e.g. questions, hints, images, options, answers, lectures, and solutions) from them based on heuristic rules. We manually removed invalid questions, such as questions that have only one choice, questions that contain faulty data, and questions that are duplicated, to comply with fair use and transformative use of the law. If there were multiple correct answers that applied, we kept only one correct answer. Also, we shuffled the answer options of each question to ensure the choices do not follow any specific pattern. To make the dataset easy to use, we then used semi-automated scripts to reformat the lectures and solutions. Therefore, special structures in the texts, such as tables and lists, are easily distinguishable from simple text passages. Similar to ImageNet, ReClor, and PMR datasets, ScienceQA is available for non-commercial research purposes only and the copyright belongs to the original authors. To ensure data quality, we developed a data exploration tool to review examples in the collected dataset, and incorrect annotations were further manually revised by experts. The tool can be accessed at https://scienceqa.github.io/explore.html .
Annotation processSee above
Who are the annotators?See above
From:
Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Provide the BibTex -formatted reference for the dataset. For example:
@inproceedings{lu2022learn, title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering}, author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Ashwin Kalyan}, booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)}, year={2022} }
Thanks to Derek Thomas @datavistics for adding this dataset.