Dataset:
NLPCoreTeam/mmlu_ru
MMLU dataset for EN/RU, without auxiliary train. The dataset contains dev/val/test splits for both the English and the Russian languages. Note that it does not include the auxiliary_train split, which was not translated. In total, the dataset has ~16k samples per language: 285 dev, 1531 val, and 14042 test.
The MMLU dataset covers 57 different tasks. Each task requires choosing the right answer out of four options for a given question. Paper: "Measuring Massive Multitask Language Understanding", https://arxiv.org/abs/2009.03300v3. The benchmark is also known as the "hendrycks_test".
The translation was made via the Yandex.Translate API. There are some translation mistakes, especially with terms and formulas; no fixes were applied. The original dataset was taken from: https://people.eecs.berkeley.edu/~hendrycks/data.tar.
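For a quick look at the data, a single subject can be loaded on its own. This is a minimal sketch using the `datasets` library; the subject name `astronomy` is chosen just for illustration:

```python
import datasets

# Each of the 57 subjects is a separate configuration
# shipping dev / val / test splits.
dataset = datasets.load_dataset("NLPCoreTeam/mmlu_ru", name="astronomy")

# Print the split sizes for this subject.
for split in ["dev", "val", "test"]:
    print(split, len(dataset[split]))

# Inspect one record.
print(dataset["test"][0])
```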
{ "question_en": "Why doesn't Venus have seasons like Mars and Earth do?", "choices_en": [ "Its rotation axis is nearly perpendicular to the plane of the Solar System.", "It does not have an ozone layer.", "It does not rotate fast enough.", "It is too close to the Sun." ], "answer": 0, "question_ru": "Почему на Венере нет времен года, как на Марсе и Земле?", "choices_ru": [ "Ось его вращения почти перпендикулярна плоскости Солнечной системы.", "У него нет озонового слоя.", "Он вращается недостаточно быстро.", "Это слишком близко к Солнцу." ] }
To merge all subject subsets into one dataframe per split:
```python
from collections import defaultdict

import datasets
import pandas as pd

subjects = [
    "abstract_algebra", "anatomy", "astronomy", "business_ethics",
    "clinical_knowledge", "college_biology", "college_chemistry",
    "college_computer_science", "college_mathematics", "college_medicine",
    "college_physics", "computer_security", "conceptual_physics",
    "econometrics", "electrical_engineering", "elementary_mathematics",
    "formal_logic", "global_facts", "high_school_biology",
    "high_school_chemistry", "high_school_computer_science",
    "high_school_european_history", "high_school_geography",
    "high_school_government_and_politics", "high_school_macroeconomics",
    "high_school_mathematics", "high_school_microeconomics",
    "high_school_physics", "high_school_psychology", "high_school_statistics",
    "high_school_us_history", "high_school_world_history", "human_aging",
    "human_sexuality", "international_law", "jurisprudence",
    "logical_fallacies", "machine_learning", "management", "marketing",
    "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios",
    "nutrition", "philosophy", "prehistory", "professional_accounting",
    "professional_law", "professional_medicine", "professional_psychology",
    "public_relations", "security_studies", "sociology", "us_foreign_policy",
    "virology", "world_religions",
]
splits = ["dev", "val", "test"]

# Load every subject; each subject is a separate dataset configuration.
all_datasets = {x: datasets.load_dataset("NLPCoreTeam/mmlu_ru", name=x) for x in subjects}

res = defaultdict(list)
for subject in subjects:
    for split in splits:
        dataset = all_datasets[subject][split]
        df = dataset.to_pandas()
        # Map the integer answer label to its string class name.
        int2str = dataset.features["answer"].int2str
        df["answer"] = df["answer"].map(int2str)
        df.insert(loc=0, column="subject_en", value=subject)
        res[split].append(df)

# Concatenate all subjects into one dataframe per split.
res = {k: pd.concat(v) for k, v in res.items()}
df_dev = res["dev"]
df_val = res["val"]
df_test = res["test"]
```
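As a quick sanity check, the merged dataframes should reproduce the per-split counts given above (a small sketch, assuming the merge code has run):

```python
# Expected per-split sizes: 285 dev, 1531 val, 14042 test.
for name, df in [("dev", df_dev), ("val", df_val), ("test", df_test)]:
    print(name, len(df))
```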
This dataset is intended for evaluating LLMs in a few-shot or zero-shot setup.
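For illustration, below is a minimal sketch of how a k-shot prompt could be assembled from the merged dataframes above. The prompt template and the helpers `format_example` / `build_kshot_prompt` are assumptions made for this sketch, not necessarily the exact format used by the evaluation code linked below; it also assumes the `answer` column has already been mapped to the letters A-D via `int2str` as in the merge snippet:

```python
CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_example(row, include_answer: bool) -> str:
    """Render one question with lettered options (hypothetical template)."""
    lines = [row["question_en"]]
    lines += [f"{letter}. {choice}"
              for letter, choice in zip(CHOICE_LETTERS, row["choices_en"])]
    # After the merge snippet, `answer` is assumed to be a letter like "A".
    lines.append(f"Answer: {row['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_kshot_prompt(df_dev, test_row, subject: str, k: int = 5) -> str:
    """k solved dev examples for the subject, then the unanswered test question."""
    shots = df_dev[df_dev["subject_en"] == subject].head(k)
    parts = [format_example(row, include_answer=True) for _, row in shots.iterrows()]
    parts.append(format_example(test_row, include_answer=False))
    return "\n\n".join(parts)

# Usage, e.g.: print(build_kshot_prompt(df_dev, df_test.iloc[0], subject="astronomy"))
```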
Evaluation code: https://github.com/NLP-Core-Team/mmlu_ru
These resources might also be helpful:

- NLP core team RnD Telegram channel

Dataset added by the NLP core team.