Dataset:
codeparrot/apps
APPS is a benchmark for code generation with 10,000 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications. The APPS metric is also available on the Hub at codeparrot/apps_metric.
The dataset contains questions in English and code solutions in Python.
```python
from datasets import load_dataset

load_dataset("codeparrot/apps")
```

Output:

```
DatasetDict({
    train: Dataset({
        features: ['problem_id', 'question', 'solutions', 'input_output', 'difficulty', 'url', 'starter_code'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['problem_id', 'question', 'solutions', 'input_output', 'difficulty', 'url', 'starter_code'],
        num_rows: 5000
    })
})
```
You can load the train split, take its first sample, and parse the JSON-encoded fields with the following code:
```python
from datasets import load_dataset
import json

ds = load_dataset("codeparrot/apps", split="train")
sample = next(iter(ds))
# non-empty solutions and input_output features can be parsed from text format this way:
sample["solutions"] = json.loads(sample["solutions"])
sample["input_output"] = json.loads(sample["input_output"])
print(sample)
```

Output:

```
{
 'problem_id': 0,
 'question': 'Polycarp has $n$ different binary words. A word called binary if it contains only characters \'0\' and \'1\'. For example...',
 'solutions': ["for _ in range(int(input())):\n n = int(input())\n mass = []\n zo = 0\n oz = 0\n zz = 0\n oo = 0\n...",...],
 'input_output': {'inputs': ['4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n'],
                  'outputs': ['1\n3 \n-1\n0\n\n2\n1 2 \n']},
 'difficulty': 'interview',
 'url': 'https://codeforces.com/problemset/problem/1259/D',
 'starter_code': ''
}
```
Each sample consists of a programming problem statement in English, some ground-truth Python solutions, test cases defined by their inputs and outputs (plus a function name, if provided), and metadata about the problem's difficulty level and source.
If a sample has a non-empty input_output feature, you can read it as a dictionary with the keys inputs and outputs, and fn_name if it exists. Similarly, you can parse the solutions feature into a list of solutions, as shown in the code above.
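Because the solutions and input_output features are only parseable when non-empty, it can help to wrap the parsing in a small helper. The following is a minimal sketch: `parse_sample` is a hypothetical helper name (not part of the dataset or the `datasets` library), it assumes that an empty feature is stored as an empty string, and the record is a hand-written stand-in rather than real data.

```python
import json


def parse_sample(sample):
    """Return a copy of `sample` with its JSON-encoded fields decoded.

    Hypothetical helper; assumes empty fields are stored as empty strings.
    """
    parsed = dict(sample)
    # `solutions` decodes to a list of solution strings (empty list if missing).
    parsed["solutions"] = json.loads(sample["solutions"]) if sample["solutions"] else []
    # `input_output` decodes to a dict with "inputs", "outputs" and, optionally, "fn_name".
    parsed["input_output"] = json.loads(sample["input_output"]) if sample["input_output"] else {}
    # `fn_name` exists only for some problems, so read it with a default of None.
    parsed["fn_name"] = parsed["input_output"].get("fn_name")
    return parsed


# Hand-written record that mimics the dataset's schema (not real data).
example = {
    "problem_id": 0,
    "question": "Print the sum of two integers.",
    "solutions": json.dumps(["a, b = map(int, input().split())\nprint(a + b)"]),
    "input_output": json.dumps({"inputs": ["1 2\n"], "outputs": ["3\n"]}),
    "difficulty": "introductory",
    "url": "",
    "starter_code": "",
}

parsed = parse_sample(example)
print(len(parsed["solutions"]), parsed["input_output"]["inputs"], parsed["fn_name"])
```

With a loaded split, the same helper could be applied per sample, for example `parse_sample(next(iter(ds)))`.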
You can also filter the dataset by difficulty level: introductory, interview, or competition. Pass the levels you want as a list through the difficulties argument. For example, to get only the most challenging problems, select the competition level:
```python
ds = load_dataset("codeparrot/apps", split="train", difficulties=["competition"])
print(next(iter(ds))["question"])
```

Output:

```
Codefortia is a small island country located somewhere in the West Pacific. It consists of $n$ settlements connected by
...

For each settlement $p = 1, 2, \dots, n$, can you tell what is the minimum time required to travel between the king's residence and the parliament house (located in settlement $p$) after some roads are abandoned?

-----Input-----

The first line of the input contains four integers $n$, $m$, $a$ and $b$
...

-----Output-----

Output a single line containing $n$ integers
...

-----Examples-----
Input
5 5 20 25
1 2 25
...

Output
0 25 60 40 20
...
```
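The same selection can also be made after loading, by checking each sample's difficulty field. The sketch below is an alternative to the difficulties argument, not a description of how the loader implements it; `select_by_difficulty` is a hypothetical helper and the rows are hand-written stand-ins for dataset samples.

```python
from collections import Counter


def select_by_difficulty(samples, difficulties):
    """Keep only samples whose `difficulty` field is one of `difficulties`."""
    wanted = set(difficulties)
    return [sample for sample in samples if sample["difficulty"] in wanted]


# Hand-written stand-ins for dataset rows (not real data).
rows = [
    {"problem_id": 0, "difficulty": "interview"},
    {"problem_id": 1, "difficulty": "competition"},
    {"problem_id": 2, "difficulty": "introductory"},
    {"problem_id": 3, "difficulty": "competition"},
]

hardest = select_by_difficulty(rows, ["competition"])
counts = Counter(row["difficulty"] for row in rows)

print([row["problem_id"] for row in hardest])  # [1, 3]
print(counts["competition"], counts["interview"], counts["introductory"])  # 2 1 1
```

Since the helper accepts any iterable of dictionaries, it should also work on a loaded split, at the cost of iterating over every sample.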
| Field | Type | Description |
|---|---|---|
| problem_id | int | Problem ID |
| question | string | Problem description |
| solutions | string | JSON string containing a list of Python solutions |
| input_output | string | JSON string with the "inputs" and "outputs" of the test cases; may also include "fn_name", the name of the function |
| difficulty | string | Difficulty level of the problem |
| url | string | URL of the problem's source |
| starter_code | string | Starter code to include in prompts |
Note that only a few samples have fn_name and starter_code specified.
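Since starter_code and fn_name are present for only a few samples, prompt-building code has to handle both cases. The sketch below shows one way to do that. The prompt template and the `build_prompt` name are assumptions made for illustration, not a format prescribed by this card, and the record is hand-written rather than real data.

```python
import json


def build_prompt(sample):
    """Assemble a text prompt from a sample (illustrative template, an assumption)."""
    io_spec = json.loads(sample["input_output"]) if sample["input_output"] else {}
    prompt = "QUESTION:\n" + sample["question"].strip() + "\n"
    # Include the starter code only when the sample provides one.
    if sample["starter_code"]:
        prompt += sample["starter_code"].rstrip() + "\n"
    # Samples with `fn_name` are tested by calling a function; the rest read stdin.
    if io_spec.get("fn_name"):
        prompt += "Use Call-Based format\n"
    else:
        prompt += "Use Standard Input format\n"
    return prompt + "ANSWER:\n"


# Hand-written record that mimics the dataset's schema (not real data).
example = {
    "question": "Write a function that returns the sum of two integers.",
    "input_output": json.dumps({"fn_name": "add", "inputs": [[1, 2]], "outputs": [[3]]}),
    "starter_code": "def add(a, b):\n",
}

print(build_prompt(example))
```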
The dataset contains train and test splits with 5,000 samples each.
To create the APPS dataset, the authors manually curated problems from open-access sites where programmers share problems with each other, including Codewars, AtCoder, Kattis, and Codeforces. For more details, please refer to the original paper.
The AlphaCode authors found that this dataset can produce many false positives during evaluation: incorrect submissions are marked as correct because the test cases do not cover enough behavior.
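To make the false-positive issue concrete, the sketch below checks a deliberately wrong solution against a sparse test suite and then against a slightly denser one. Everything here is hypothetical and hand-written for illustration (the `passes_all_tests` helper, the primality task, and the test cases); none of it is taken from APPS or from the AlphaCode evaluation.

```python
def passes_all_tests(candidate, inputs, outputs):
    """Return True if `candidate(*args)` equals the expected output for every test."""
    return all(candidate(*args) == expected for args, expected in zip(inputs, outputs))


def is_prime_reference(n):
    """Correct primality check by trial division."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def is_prime_wrong(n):
    """Incorrect: treats every odd number greater than 1 as prime."""
    return n == 2 or (n > 1 and n % 2 == 1)


# A sparse, hand-written test suite with no odd composite number.
sparse_inputs = [[2], [3], [4], [7]]
sparse_outputs = [True, True, False, True]

# Adding a single odd composite (9) exposes the bug.
dense_inputs = sparse_inputs + [[9]]
dense_outputs = sparse_outputs + [False]

print(passes_all_tests(is_prime_wrong, sparse_inputs, sparse_outputs))  # True (false positive)
print(passes_all_tests(is_prime_wrong, dense_inputs, dense_outputs))    # False
print(passes_all_tests(is_prime_reference, dense_inputs, dense_outputs))  # True
```

The wrong solution is accepted by the sparse suite only because no test probes the case where it fails, which is the kind of gap that limited test coverage leaves open.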
```bibtex
@article{hendrycksapps2021,
  title={Measuring Coding Challenge Competence With APPS},
  author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}
```