Dataset: codeparrot/instructhumaneval
InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, we extracted its signature, its docstring as well as its header to create a flexible setting that allows the evaluation of instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that lets the model elicit its best capabilities. Here is an example of use.
The prompt can be built as follows, depending on the model's instruction-tuning delimiters:
```python
from datasets import load_dataset

ds = load_dataset("codeparrot/instructhumaneval", split="test", use_auth_token=True)

prompt_0 = "Human\n" + ds[0]["instruction"] + "\nAssistant\n" + ds[0]["context"]
print(prompt_0)
```
Output
```
Human:
Write a function has_close_elements(numbers: List[float], threshold: float) -> bool to solve the following problem:
Check if in given list of numbers, are any two numbers closer to each other than given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
Assistant:
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
```
The model can then complete the function and tends to yield better results because the prompt format matches its training procedure.

You can also find the code to evaluate models on this dataset in the BigCode-evaluation-harness. The following sections provide more details on the dataset.
This dataset is a modified version of OpenAI HumanEval, designed to adapt the benchmark to instruction fine-tuned models. Indeed, HumanEval evaluates the ability to complete a piece of code given its signature, its docstring and potentially some auxiliary functions.
In order to build an instruction version of HumanEval, we extracted the relevant information from the prompt column of the original version:
<context> <signature> <docstring>
and built an instruction of the form:
Write a function <signature> to solve the following problem: <docstring>
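As an illustration, this template could be materialized from the signature and docstring columns roughly as follows (a minimal sketch with a hypothetical helper name; the dataset already ships the resulting instruction column, so nothing needs to be rebuilt in practice):

```python
def build_instruction(signature: str, docstring: str) -> str:
    # Mirrors the template above: the signature names the function to write,
    # the docstring states the problem it must solve.
    return f"Write a function {signature} to solve the following problem:\n{docstring}"
```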
From this instruction, we can design an evaluation pipeline for instruction fine-tuned language models.
Instruction fine-tuned LLMs are built by fine-tuning a base LLM on an instruction dataset. This instruction dataset contains several <input, output> pairs, where each pair represents an instruction submitted by a user together with the right answer to it. These pairs are framed into a multi-turn conversation with the help of special tokens which designate each member of the interaction, e.g. a user_token (Human:), an assistant_token (Assistant:) and an end_token (\n) that marks the end of each turn.
In the first setting, code completion, the LLM is provided with the following prompt:
user_token + <instruction> + <end_token> + <assistant_token> + <context>
It is then expected to complete the function to solve the problem formulated by the instruction. This is very similar to the original evaluation, with the advantage that it puts the model in the best condition to understand the task it is asked to solve. The evaluation is done on the part generated after <assistant_token>.
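A minimal sketch of this prompt construction is given below; the delimiter values are placeholders and must be replaced by the ones used to instruction-tune the model under evaluation:

```python
from datasets import load_dataset

# Hypothetical delimiters: substitute the tokens your model was instruction-tuned with.
USER_TOKEN = "Human: "
END_TOKEN = "\n"
ASSISTANT_TOKEN = "Assistant: "

def build_completion_prompt(sample: dict) -> str:
    # user_token + <instruction> + <end_token> + <assistant_token> + <context>
    return USER_TOKEN + sample["instruction"] + END_TOKEN + ASSISTANT_TOKEN + sample["context"]

ds = load_dataset("codeparrot/instructhumaneval", split="test")
print(build_completion_prompt(ds[0]))
```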
The second setting, docstring to code, is more complicated, as it requires the model to account for the information contained in the instruction, such as the function signature. The LLM is provided with the following prompt:
user_token + <instruction> + <end_token> + <assistant_token>
The model has to generate a function with the correct signature that adequately solves the problem. The evaluation is done by identifying the content of the function in the generation (by searching for the right entry_point / function_name) and concatenating it with the <context> provided.
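The sketch below shows one possible version of that post-processing step; the helper names are hypothetical, and the BigCode-evaluation-harness contains the reference implementation of this logic:

```python
def extract_function(generation: str, entry_point: str) -> str:
    # Keep the generation from the definition of the target function onward;
    # anything the model produced before it (chatter, markdown) is dropped.
    marker = f"def {entry_point}"
    start = generation.find(marker)
    return generation[start:] if start != -1 else generation

def build_program(sample: dict, generation: str) -> str:
    # <context> (imports, auxiliary functions) + the recovered function definition,
    # ready to be executed against the dataset's test column.
    return sample["context"] + "\n" + extract_function(generation, sample["entry_point"])
```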
```python
from datasets import load_dataset

ds = load_dataset("codeparrot/instructhumaneval")
```
```
ds
DatasetDict({
    test: Dataset({
        features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point', 'signature', 'docstring', 'context', 'instruction'],
        num_rows: 164
    })
})
```
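Each of the 164 rows corresponds to one HumanEval problem. For example, accessing the first row (the inline comments are illustrative):

```python
sample = ds["test"][0]
print(sample["task_id"])      # e.g. HumanEval/0
print(sample["entry_point"])  # e.g. has_close_elements
print(sample["instruction"])  # the natural-language task built from signature + docstring
print(sample["context"])      # the code preceding the signature (imports, helpers)
```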