Dataset:

NLPCoreTeam/mmlu_ru

MMLU in Russian (Massive Multitask Language Understanding)

Overview of the Dataset

MMLU dataset for EN/RU. The dataset contains dev/val/test splits for both English and Russian. Note that it does not include the auxiliary_train split, which was not translated. In total the dataset has ~16k samples per language: 285 dev, 1531 val, and 14042 test.
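The per-language total can be checked directly from the split sizes quoted above:

```python
# Split sizes per language, as listed on this card.
split_sizes = {"dev": 285, "val": 1531, "test": 14042}

total = sum(split_sizes.values())
print(total)  # 15858, i.e. ~16k samples per language
```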

Description of original MMLU

The MMLU dataset covers 57 different tasks. Each task requires choosing the right answer out of four options for a given question. Paper: "Measuring Massive Multitask Language Understanding", https://arxiv.org/abs/2009.03300v3. It is also known as the "hendrycks_test".

Dataset Creation

The translation was made via the Yandex.Translate API. There are some translation mistakes, especially for terms and formulas; no fixes were applied. The initial dataset was taken from https://people.eecs.berkeley.edu/~hendrycks/data.tar .

Sample example

{
    "question_en": "Why doesn't Venus have seasons like Mars and Earth do?",
    "choices_en": [
        "Its rotation axis is nearly perpendicular to the plane of the Solar System.",
        "It does not have an ozone layer.",
        "It does not rotate fast enough.",
        "It is too close to the Sun."
    ],
    "answer": 0,
    "question_ru": "Почему на Венере нет времен года, как на Марсе и Земле?",
    "choices_ru": [
        "Ось его вращения почти перпендикулярна плоскости Солнечной системы.",
        "У него нет озонового слоя.",
        "Он вращается недостаточно быстро.",
        "Это слишком близко к Солнцу."
    ]
}
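The `answer` field is an integer index into the choices. A minimal sketch of resolving it to the option letter and the correct choice text, using the sample record above:

```python
# Sample record from the dataset card (English fields only, for brevity).
sample = {
    "question_en": "Why doesn't Venus have seasons like Mars and Earth do?",
    "choices_en": [
        "Its rotation axis is nearly perpendicular to the plane of the Solar System.",
        "It does not have an ozone layer.",
        "It does not rotate fast enough.",
        "It is too close to the Sun.",
    ],
    "answer": 0,
}

# Map the integer index 0-3 to the letter A-D and to the choice text.
letter = "ABCD"[sample["answer"]]
correct_text = sample["choices_en"][sample["answer"]]
print(letter)        # A
print(correct_text)  # Its rotation axis is nearly perpendicular ...
```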

Usage

To merge all subjects into one DataFrame per split:

from collections import defaultdict

import datasets
import pandas as pd


subjects = ["abstract_algebra", "anatomy", "astronomy", "business_ethics", "clinical_knowledge", "college_biology", "college_chemistry", "college_computer_science", "college_mathematics", "college_medicine", "college_physics", "computer_security", "conceptual_physics", "econometrics", "electrical_engineering", "elementary_mathematics", "formal_logic", "global_facts", "high_school_biology", "high_school_chemistry", "high_school_computer_science", "high_school_european_history", "high_school_geography", "high_school_government_and_politics", "high_school_macroeconomics", "high_school_mathematics", "high_school_microeconomics", "high_school_physics", "high_school_psychology", "high_school_statistics", "high_school_us_history", "high_school_world_history", "human_aging", "human_sexuality", "international_law", "jurisprudence", "logical_fallacies", "machine_learning", "management", "marketing", "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios", "nutrition", "philosophy", "prehistory", "professional_accounting", "professional_law", "professional_medicine", "professional_psychology", "public_relations", "security_studies", "sociology", "us_foreign_policy", "virology", "world_religions"]

splits = ["dev", "val", "test"]

# Load every subject configuration once.
all_datasets = {x: datasets.load_dataset("NLPCoreTeam/mmlu_ru", name=x) for x in subjects}

res = defaultdict(list)
for subject in subjects:
    for split in splits:
        dataset = all_datasets[subject][split]
        df = dataset.to_pandas()
        # Convert the integer answer label to its class-name string.
        int2str = dataset.features['answer'].int2str
        df['answer'] = df['answer'].map(int2str)
        df.insert(loc=0, column='subject_en', value=subject)
        res[split].append(df)

# Concatenate the per-subject frames into one DataFrame per split.
res = {k: pd.concat(v) for k, v in res.items()}

df_dev = res['dev']
df_val = res['val']
df_test = res['test']

Evaluation

This dataset is intended for evaluating LLMs in few-shot/zero-shot setups.

Evaluation code: https://github.com/NLP-Core-Team/mmlu_ru
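Below is a hedged sketch of a typical MMLU-style few-shot prompt, built from dev-split examples; the exact template used by the linked evaluation code may differ. The `format_example` and `build_prompt` helpers are illustrative names, not part of the dataset or the evaluation repository.

```python
LETTERS = "ABCD"

def format_example(question, choices, answer=None):
    """Render one question with lettered options; append the answer letter if known."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:" + (f" {LETTERS[answer]}" if answer is not None else ""))
    return "\n".join(lines)

def build_prompt(subject, shots, target):
    """Join k answered demonstrations with the unanswered target question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    demos = "\n\n".join(format_example(q, c, a) for q, c, a in shots)
    return header + demos + "\n\n" + format_example(*target)

# Toy illustration with made-up questions (not from the dataset).
shots = [("2 + 2 = ?", ["3", "4", "5", "6"], 1)]
target = ("3 * 3 = ?", ["6", "9", "12", "8"])
print(build_prompt("elementary mathematics", shots, target))
```

The model's completion after the final "Answer:" is then compared against the gold letter.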

These resources might also be helpful:

  • https://github.com/hendrycks/test
  • https://github.com/openai/evals/blob/main/examples/mmlu.ipynb
  • https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hendrycks_test.py
Contributions

Dataset added by the NLP Core Team RnD Telegram channel