Dataset: mbpp
Tasks: Text2Text Generation
Languages: en
Multilinguality: monolingual
Size: n<1K
Source datasets: original
ArXiv: arxiv:2108.07732
Other: code-generation
License: cc-by-4.0

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and three automated test cases. As described in the paper, a subset of the data has been hand-verified by us.
Released here as part of Program Synthesis with Large Language Models, Austin et al., 2021.
This dataset is used to evaluate code generations.
English - Python code
from datasets import load_dataset

dataset_full = load_dataset("mbpp")
DatasetDict({
    test: Dataset({
        features: ['task_id', 'text', 'code', 'test_list', 'test_setup_code', 'challenge_test_list'],
        num_rows: 974
    })
})

dataset_sanitized = load_dataset("mbpp", "sanitized")
DatasetDict({
    test: Dataset({
        features: ['source_file', 'task_id', 'prompt', 'code', 'test_imports', 'test_list'],
        num_rows: 427
    })
})
mbpp - full
{
  'task_id': 1,
  'text': 'Write a function to find the minimum cost path to reach (m, n) from (0, 0) for the given cost matrix cost[][] and a position (m, n) in cost[][].',
  'code': 'R = 3\r\nC = 3\r\ndef min_cost(cost, m, n): \r\n\ttc = [[0 for x in range(C)] for x in range(R)] \r\n\ttc[0][0] = cost[0][0] \r\n\tfor i in range(1, m+1): \r\n\t\ttc[i][0] = tc[i-1][0] + cost[i][0] \r\n\tfor j in range(1, n+1): \r\n\t\ttc[0][j] = tc[0][j-1] + cost[0][j] \r\n\tfor i in range(1, m+1): \r\n\t\tfor j in range(1, n+1): \r\n\t\t\ttc[i][j] = min(tc[i-1][j-1], tc[i-1][j], tc[i][j-1]) + cost[i][j] \r\n\treturn tc[m][n]',
  'test_list': [
    'assert min_cost([[1, 2, 3], [4, 8, 2], [1, 5, 3]], 2, 2) == 8',
    'assert min_cost([[2, 3, 4], [5, 9, 3], [2, 6, 4]], 2, 2) == 12',
    'assert min_cost([[3, 4, 5], [6, 10, 4], [3, 7, 5]], 2, 2) == 16'
  ],
  'test_setup_code': '',
  'challenge_test_list': []
}

mbpp - sanitized
{
  'source_file': 'Benchmark Questions Verification V2.ipynb',
  'task_id': 2,
  'prompt': 'Write a function to find the shared elements from the given two lists.',
  'code': 'def similar_elements(test_tup1, test_tup2):\n res = tuple(set(test_tup1) & set(test_tup2))\n return (res) ',
  'test_imports': [],
  'test_list': [
    'assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))',
    'assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))',
    'assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))'
  ]
}
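As a rough illustration of how the `code` and `test_list` fields fit together, the sketch below executes a record's reference solution and then runs each assert string against it. The helper name `passes_tests` is hypothetical and not part of the dataset or the `datasets` library; the record is copied from the sanitized example above. It runs everything in the current process, so it is only suitable for trusted code such as the reference solutions.

```python
# Minimal sketch: check a reference solution against its assert-based tests.
# `passes_tests` is a hypothetical helper, not an API of the dataset.

def passes_tests(code, tests, setup=""):
    """Return True if every assert string in `tests` holds after running `code`."""
    namespace = {}
    try:
        exec(setup, namespace)      # e.g. test_setup_code, or joined test_imports
        exec(code, namespace)       # defines the function under test
        for test in tests:
            exec(test, namespace)   # each test is a bare `assert ...` statement
    except Exception:               # AssertionError, SyntaxError, NameError, ...
        return False
    return True


# Fields copied from the sanitized example record (task_id 2).
record = {
    "code": "def similar_elements(test_tup1, test_tup2):\n res = tuple(set(test_tup1) & set(test_tup2))\n return (res) ",
    "test_imports": [],
    "test_list": [
        "assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))",
        "assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))",
        "assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))",
    ],
}

ok = passes_tests(record["code"], record["test_list"], "\n".join(record["test_imports"]))
print(ok)
```

For the full version, the same helper could be called with `test_setup_code` as the `setup` argument.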
There are two versions of the dataset (full and sanitized), each with four splits.
The prompt split corresponds to samples used for few-shot prompting and not for training.
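One way few-shot prompting with such samples might look is sketched below. The layout of the prompt, including the `[BEGIN]` and `[DONE]` delimiters, is an assumption made for illustration and is not specified by this card; `build_few_shot_prompt` is a hypothetical helper, and the two toy records are made up rather than taken from the dataset. The field names follow the full version (`text`, `code`, `test_list`).

```python
# Minimal sketch of assembling a few-shot prompt from solved examples.
# The prompt layout and delimiters are illustrative assumptions.

def format_task(text, tests):
    """Render a task description followed by the tests the code must pass."""
    return text + "\nYour code should pass these tests:\n" + "\n".join(tests)


def build_few_shot_prompt(examples, target):
    """Concatenate solved examples, then the unsolved target task."""
    parts = []
    for ex in examples:
        solved = format_task(ex["text"], ex["test_list"])
        parts.append(solved + "\n[BEGIN]\n" + ex["code"] + "\n[DONE]")
    # The target task ends with an open [BEGIN] for the model to complete.
    parts.append(format_task(target["text"], target["test_list"]) + "\n[BEGIN]\n")
    return "\n\n".join(parts)


# Toy records for illustration only (not real dataset rows).
few_shot = [{
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
    "test_list": ["assert add(1, 2) == 3"],
}]
target = {
    "text": "Write a function to square a number.",
    "test_list": ["assert square(3) == 9"],
}

prompt = build_few_shot_prompt(few_shot, target)
print(prompt)
```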
See Section 2.1 of the original paper.
Evaluating code generation models requires a set of simple programming tasks together with reference solutions, which this dataset provides.
The dataset was manually created from scratch.
Who are the source language producers?
The dataset was created with an internal crowdsourcing effort at Google.
The full dataset was created first and a subset then underwent a second round to improve the task descriptions.
Who are the annotators?
The dataset was created with an internal crowdsourcing effort at Google.
None.
Make sure you execute generated Python code in a safe environment when evaluating against this dataset, as generated code could be harmful.
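As a partial illustration of this advice, the sketch below runs a candidate solution and its tests in a separate Python process with a timeout. The helper `run_isolated` is hypothetical. A child process with a timeout only guards against crashes and infinite loops; it is not a security sandbox, so untrusted code should additionally be confined in a container, virtual machine, or similar environment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_isolated(code, tests, timeout=5.0):
    """Run `code` plus its assert strings in a child process.

    Returns True only if the child exits with status 0 within `timeout` seconds.
    This is NOT a sandbox: the child has the same privileges as the caller.
    """
    program = code + "\n\n" + "\n".join(tests) + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # e.g. an infinite loop in the generated code
    return result.returncode == 0


if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b"
    print(run_isolated(good, ["assert add(1, 2) == 3"]))
```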
With this dataset, code-generating models can be evaluated more reliably, which should lead to fewer issues being introduced when such models are used.
The task descriptions might not be expressive enough to solve the task. The sanitized version aims to address this issue by having a second round of annotators improve the dataset.
Google Research
CC-BY-4.0
@article{austin2021program,
  title={Program Synthesis with Large Language Models},
  author={Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and others},
  journal={arXiv preprint arXiv:2108.07732},
  year={2021}
}
Thanks to @lvwerra for adding this dataset.