Dataset: wellesley-easel/StudentEval
StudentEval is a dataset of 1,749 prompts for 48 problems, authored by 80 students who have completed only a one-semester Python programming class. To the best of our knowledge, it is the first dataset that has multiple prompts per problem and multiple attempts by the same participant. We identify four key disjoint subsets of StudentEval for each problem-participant pair:

- First Success: the participant's first attempt at the problem produced working code
- First Failure: the first attempt failed, and the participant kept trying
- Last Success: the participant's final attempt produced working code
- Last Failure: the final attempt failed, and the participant moved on
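The prompts can be loaded directly from the Hugging Face Hub with the `datasets` library. A minimal sketch; we print the splits and columns rather than assume their names:

```python
from datasets import load_dataset

# Fetch StudentEval from the Hugging Face Hub.
ds = load_dataset("wellesley-easel/StudentEval")

print(ds)  # splits, row counts, and column names
first_split = next(iter(ds.values()))
print(first_split[0])  # inspect one prompt record
```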
During the experiment, we produced one completion per attempt. However, we can treat these prompts as a benchmark by repeatedly sampling completions to calculate pass@k rates for each subset.
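pass@k is typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021): draw n completions per prompt, count the c correct ones, and estimate the probability that at least one of k completions passes. A minimal sketch of that estimator:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n completions sampled, c of them correct.

    Probability that a random size-k subset of the n completions
    contains at least one correct one: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct completion
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 20 samples per prompt, 6 of which pass the tests
print(pass_at_k(20, 6, 1))  # 0.3
```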
For the total_bill problem, we showed students the following signature and input/output examples:
```python
def total_bill(grocery_list, sales_tax):
```
| Input | Output |
|---|---|
| `[['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.07` | 15.44 |
| `[['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.0` | 14.43 |
| `[['bread', 2, 3.50]], 0.5` | 10.5 |
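For reference, here is a minimal solution consistent with these examples (not part of the dataset; the rounding to two decimal places is inferred from the expected outputs):

```python
def total_bill(grocery_list, sales_tax):
    # Each item is [name, quantity, unit_price].
    subtotal = sum(quantity * price for _, quantity, price in grocery_list)
    # Apply the tax rate, then round to cents to match the examples.
    return round(subtotal * (1 + sales_tax), 2)

assert total_bill([['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.07) == 15.44
assert total_bill([['bread', 2, 3.50]], 0.5) == 10.5
```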
StudentEval contains examples of both successful and unsuccessful prompts.
The code to run StudentEval is based on the BigCode Evaluation Harness.
Download our branch of the BigCode Evaluation Harness:

```bash
git clone https://github.com/arjunguha/bigcode-evaluation-harness/
```
Install its dependencies (see the README file).
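For a typical checkout this amounts to the following (a sketch; the editable install is an assumption about the harness's packaging, so defer to its README):

```bash
cd bigcode-evaluation-harness
pip install -e .
```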
Run the studenteval task. The following command evaluates SantaCoder:

```bash
python3 main.py --model bigcode/gpt_bigcode-santacoder --tasks studenteval --max_length_generation 512 --n_samples 20 --batch_size 20 --precision bf16 --allow_code_execution
```
This will produce output similar to the following:
```
Selected Tasks: ['studenteval']
Loading tokenizer and model (in bf16)
number of problems for this task is 1027
100%|██████████| 1027/1027 [32:51<00:00,  1.92s/it]
generations were saved at generations.json
Evaluating generations...
100%|██████████| 20540/20540 [01:21<00:00, 252.84it/s]
{
  "studenteval": [
    {
      "group": "First Failure",
      "pass1": 0.022333333333333334
    },
    {
      "group": "First Success",
      "pass1": 0.3195187165775401
    },
    {
      "group": "Last Failure",
      "pass1": 0.02195121951219512
    },
    {
      "group": "Last Success",
      "pass1": 0.21405405405405406
    }
  ],
  "config": {
    "model": "bigcode/gpt_bigcode-santacoder",
    "temperature": 0.2,
    "n_samples": 20
  }
}
```
This command uses about 5 GB of VRAM on an Ampere-series GPU and takes roughly 30 minutes to generate completions and 10 minutes to execute them on 8 CPU cores.
Paper: https://arxiv.org/abs/2306.04556
```bibtex
@misc{studenteval,
  title={StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code},
  author={Hannah McLean Babe and Sydney Nguyen and Yangtian Zi and Arjun Guha and Molly Q Feldman and Carolyn Jane Anderson},
  year={2023},
}
```