Dataset:

cais/mmlu

Tasks:

Question Answering

Sub-tasks:

multiple-choice-qa

Languages:

English

Multilinguality:

monolingual

Size categories:

10K<n<100K

Language creators:

expert-generated

Annotations creators:

no-annotation

Source datasets:

original

ArXiv:

arxiv:2009.03300 arxiv:2005.00700 arxiv:2005.14165

License:

mit

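Dataset Card for MMLU

Dataset Summary

Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).

This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. It covers 57 tasks, including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, a model must possess extensive world knowledge and problem-solving ability.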

A complete list of tasks: ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']

Supported Tasks and Leaderboards

Model	Authors	Humanities	Social Science	STEM	Other	Average
UnifiedQA	Khashabi et al., 2020	45.6	56.6	40.2	54.6	48.9
GPT-3 (few-shot)	Brown et al., 2020	40.8	50.4	36.7	48.8	43.9
GPT-2	Radford et al., 2019	32.8	33.3	30.2	33.1	32.4
Random Baseline	N/A	25.0	25.0	25.0	25.0	25.0
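
The scores above read as accuracies in percent, and the 25.0 random baseline follows from each question having four choices. A minimal Python sketch of that arithmetic; `accuracy_percent` and `random_baseline_percent` are hypothetical helper names introduced here for illustration, not part of any released evaluation code:

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of predictions that match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def random_baseline_percent(num_choices: int = 4) -> float:
    """Expected accuracy of guessing uniformly among num_choices options."""
    return 100.0 / num_choices


print(random_baseline_percent())  # 25.0
print(accuracy_percent(["A", "B", "D", "C"], ["A", "C", "D", "D"]))  # 50.0
```

This scores a flat list of questions only; how the per-category and Average columns are aggregated is not stated in this card, so the sketch does not attempt to reproduce them.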

Languages

English

Dataset Structure

Data Instances

An example from the anatomy subtask looks as follows:

{
  "question": "What is the embryological origin of the hyoid bone?",
  "choices": ["The first pharyngeal arch", "The first and second pharyngeal arches", "The second pharyngeal arch", "The second and third pharyngeal arches"],
  "answer": "D"
}
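
A record like this is usually turned into a lettered prompt before it is shown to a model. The sketch below is one plausible layout, written under the assumption that choices are lettered A to D in list order and that the prompt ends with `Answer:`; it is not necessarily the exact prompt used by the benchmark authors, and `format_question` is a hypothetical helper name.

```python
from typing import Sequence

LETTERS = ("A", "B", "C", "D")  # assumed lettering, one per choice, in list order


def format_question(question: str, choices: Sequence[str]) -> str:
    """Render one record as a lettered multiple-choice prompt ending in 'Answer:'."""
    if len(choices) != len(LETTERS):
        raise ValueError(f"expected {len(LETTERS)} choices, got {len(choices)}")
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


record = {
    "question": "What is the embryological origin of the hyoid bone?",
    "choices": [
        "The first pharyngeal arch",
        "The first and second pharyngeal arches",
        "The second pharyngeal arch",
        "The second and third pharyngeal arches",
    ],
    "answer": "D",
}
print(format_question(record["question"], record["choices"]))
```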

Data Fields

question: a string feature
choices: a list of 4 string features
answer: a ClassLabel feature
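
Because `answer` is a ClassLabel, a loaded record typically carries it as an integer index, even though the example above displays it as a letter. The sketch below normalizes both forms, assuming the label names are A, B, C, D in index order (an assumption not spelled out in this card); `answer_to_index` and `answer_text` are hypothetical helper names.

```python
from typing import Sequence, Union

ANSWER_NAMES = ("A", "B", "C", "D")  # assumed ClassLabel names, in index order


def answer_to_index(answer: Union[int, str]) -> int:
    """Normalize an answer given as a letter or as an integer index to an index."""
    if isinstance(answer, str):
        if answer not in ANSWER_NAMES:
            raise ValueError(f"unknown answer label: {answer!r}")
        return ANSWER_NAMES.index(answer)
    if not 0 <= answer < len(ANSWER_NAMES):
        raise ValueError(f"answer index out of range: {answer}")
    return answer


def answer_text(choices: Sequence[str], answer: Union[int, str]) -> str:
    """Return the text of the correct choice for a record."""
    return choices[answer_to_index(answer)]


choices = [
    "The first pharyngeal arch",
    "The first and second pharyngeal arches",
    "The second pharyngeal arch",
    "The second and third pharyngeal arches",
]
print(answer_to_index("D"))      # 3
print(answer_text(choices, 3))   # The second and third pharyngeal arches
```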

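Data Splits

auxiliary_train: auxiliary multiple-choice training questions from ARC, MC_TEST, OBQA, RACE, etc.
dev: 5 examples per subtask, meant for the few-shot setting
test: at least 100 examples per subtask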

	auxiliary_train	dev	val	test
TOTAL	99842	285	1531	14042
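
The totals can be cross-checked against the split descriptions: 57 subtasks with 5 dev examples each should give 285, and the test total should be at least 57 × 100. A short Python sketch of that check, using only figures stated in this card (the variable names are illustrative):

```python
NUM_SUBTASKS = 57            # length of the task list in this card
DEV_PER_SUBTASK = 5          # dev: 5 examples per subtask
MIN_TEST_PER_SUBTASK = 100   # test: at least 100 examples per subtask

# Totals copied from the table above.
SPLIT_TOTALS = {"auxiliary_train": 99842, "dev": 285, "val": 1531, "test": 14042}

expected_dev = NUM_SUBTASKS * DEV_PER_SUBTASK
min_test = NUM_SUBTASKS * MIN_TEST_PER_SUBTASK
mean_test_per_subtask = SPLIT_TOTALS["test"] / NUM_SUBTASKS

print(expected_dev == SPLIT_TOTALS["dev"])   # True
print(SPLIT_TOTALS["test"] >= min_test)      # True
print(round(mean_test_per_subtask, 1))       # 246.4
```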

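Dataset Creation

Curation Rationale

Transformer models have driven recent progress by pretraining on massive text corpora, including all of Wikipedia, thousands of books, and numerous websites. As a result, these models see extensive information about specialized topics, most of which is not assessed by existing NLP benchmarks. To bridge the gap between the wide-ranging knowledge that models see during pretraining and the existing measures of success, we introduce a new benchmark for assessing models across a diverse set of subjects that humans learn.

Source Data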

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

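Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

MIT License

Citation Information

If you find this dataset useful in your research, please consider citing the test and also the ETHICS dataset it draws from: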

    @article{hendryckstest2021,
      title={Measuring Massive Multitask Language Understanding},
      author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
      journal={Proceedings of the International Conference on Learning Representations (ICLR)},
      year={2021}
    }

    @article{hendrycks2021ethics,
      title={Aligning AI With Shared Human Values},
      author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
      journal={Proceedings of the International Conference on Learning Representations (ICLR)},
      year={2021}
    }

Contributions

Thanks to @andyzoujm for adding this dataset.

Author:

cais

Dataset size:

158.64 MB