The field of Korean Language Processing is experiencing a surge in interest, illustrated by the introduction of open-source models such as Polyglot-Ko and proprietary models like HyperClova. Yet, as the development of larger and more capable language models accelerates, evaluation methods are not keeping pace. Recognizing this gap, we at HAE-RAE are dedicated to creating tailored benchmarks for the rigorous evaluation of these models.
CSAT-QA is a comprehensive collection of 936 multiple-choice question answering (MCQA) questions, manually collected from the College Scholastic Ability Test (CSAT), a rigorous Korean university entrance exam. CSAT-QA is divided into two subsets: a complete version encompassing all 936 questions, and a smaller, specialized version used for targeted evaluations.
The smaller subset is further divided into six categories: Writing (WR), Grammar (GR), Reading Comprehension: Science (RCS), Reading Comprehension: Social Science (RCSS), Reading Comprehension: Humanities (RCH), and Literature (LI). In addition, the smaller subset includes the recorded accuracy of South Korean students, providing a valuable real-world performance benchmark.
For a detailed explanation of how CSAT-QA was created, please check out the accompanying blog post, and for evaluation, check out LM-Eval-Harness on GitHub.
Evaluation results (accuracy, %):

| Models | GR | LI | RCH | RCS | RCSS | WR | Average |
|---|---|---|---|---|---|---|---|
| polyglot-ko-12.8B | 16.00 | 10.81 | 8.57 | 32.43 | 14.29 | 0.00 | 13.68 |
| gpt-3.5-wo-token | 16.00 | 32.43 | 42.86 | 18.92 | 35.71 | 0.00 | 24.32 |
| gpt-3.5-w-token | 16.00 | 35.14 | 42.86 | 18.92 | 35.71 | 9.09 | 26.29 |
| gpt-4-wo-token | 40.00 | 54.05 | 68.57 | 59.46 | 69.05 | 36.36 | 54.58 |
| gpt-4-w-token | 36.00 | 56.76 | 68.57 | 59.46 | 69.05 | 36.36 | 54.37 |
| Human Performance | 45.41 | 54.38 | 48.70 | 39.93 | 44.54 | 54.00 | 47.83 |
The CSAT-QA includes two subsets. The full version with 936 questions can be downloaded using the following code:
```python
from datasets import load_dataset

dataset = load_dataset("EleutherAI/CSAT-QA", "full")
```
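Once loaded, you can inspect the splits and a sample row. A minimal sketch that prints the split and field names rather than assuming them:

```python
# Print the available splits, then peek at the first example
# (split and field names vary by dataset, so inspect rather than assume)
print(dataset)
split = next(iter(dataset))
print(dataset[split][0])
```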
A more condensed version, which includes human accuracy data, can be downloaded using the following code:
```python
from datasets import load_dataset

# Choose from WR, GR, LI, RCH, RCS, or RCSS
dataset = load_dataset("EleutherAI/CSAT-QA", "GR")
```
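If you want all six condensed subsets at once, one option is to load them in a loop; this is a sketch assuming the subset names listed above:

```python
from datasets import load_dataset

# Load every condensed subset into a dict keyed by its category code
subset_names = ["WR", "GR", "LI", "RCH", "RCS", "RCSS"]
csatqa = {name: load_dataset("EleutherAI/CSAT-QA", name) for name in subset_names}

for name, subset in csatqa.items():
    print(name, subset)
```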
To evaluate your model using the LM-Eval-Harness by EleutherAI, follow the steps below.
First, install the harness:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```

Then install the multilingual dependencies:

```bash
pip install -e ".[multilingual]"
```
Finally, run the evaluation on all six CSAT-QA tasks:

```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/polyglot-ko-1.3b \
    --tasks csatqa_wr,csatqa_gr,csatqa_rcs,csatqa_rcss,csatqa_rch,csatqa_li \
    --device cuda:0
```
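The harness prints per-task accuracy to stdout. Most versions can also dump a JSON results file via an `--output_path` flag (check `python main.py --help`); assuming the usual `{"results": {task: {"acc": ...}}}` layout, a small script can average the scores:

```python
import json

# Load a results file produced by lm-evaluation-harness via --output_path
# (the flag and the {"results": {task: {"acc": ...}}} layout are assumptions;
# inspect your own output file to confirm the schema)
with open("results/csatqa.json") as f:
    results = json.load(f)["results"]

accs = {task: scores["acc"] for task, scores in results.items()}
for task, acc in sorted(accs.items()):
    print(f"{task}: {acc:.2%}")
print(f"Average: {sum(accs.values()) / len(accs):.2%}")
```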
The copyright of this material belongs to the Korea Institute for Curriculum and Evaluation (한국교육과정평가원), and it may be used for research purposes only.