Dataset:
codeparrot/self-instruct-starcoder
Self-instruct-starcoder is a dataset generated by prompting StarCoder to produce new instructions from a set of human-written seed instructions. The underlying process is described in the Self-Instruct paper. The same algorithm produced well-known machine-generated datasets such as Alpaca and Code Alpaca, both obtained by prompting OpenAI's text-davinci-003 engine.
While our method is similar to Self-Instruct and Stanford Alpaca, we made some modifications to the pipeline to suit our needs.
Generating the dataset was time-consuming, and we chose our parameters to limit the computational cost of our method.
StarCoder, while a great model, is not as capable as text-davinci-003. During generation, the model quickly reaches a ceiling in terms of creativity. Many instructions are similar to each other, but this should not be a problem since they are not phrased the same way.
Post-processing is an important part of the pipeline: it improves the quality of the dataset, even though it means discarding some examples. First, we need to identify what we want to avoid:
We devised a process that we call self-consistency. The idea is to reverse-prompt the model to see whether it can generate a sound instruction that corresponds to the solution (output) it is prompted with. This is a particularly difficult few-shot task, and unfortunately StarCoder does not perform very well on it. With 4 few-shot examples (all of them seed tasks), the model recovers an instruction for 1135 of the 5003 solutions, which amounts to about 22.7% of the raw dataset. Fortunately, StarCoder's inability to generate instructions for some solutions does not mean we should discard them.

For the solutions (outputs) with generated instructions, we can compare the generated instruction with the ground truth. For that we can use Sentence-BERT, because the comparison should focus on meaning rather than word-for-word similarity. 771 instructions (~68% of the 1135) have a similarity score >= 0.5 with their ground truth. These can be seen as high-quality examples; they form the curated set.
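
The thresholding step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `embed` is a toy bag-of-words stand-in for a Sentence-BERT encoder, and the names `curate`, `examples`, and `regenerated` are hypothetical. In practice, real sentence embeddings would replace `embed`.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def curate(examples, regenerated, threshold=0.5):
    """Keep examples whose regenerated instruction is close to the original.

    `examples` is a list of dicts with "instruction" and "output" keys.
    `regenerated` maps an example index to the instruction recovered by
    reverse-prompting; indices with no recovered instruction are absent.
    """
    kept = []
    for index, example in enumerate(examples):
        candidate = regenerated.get(index)
        if candidate is None:
            continue  # no recovered instruction, so nothing to compare
        score = cosine(embed(example["instruction"]), embed(candidate))
        if score >= threshold:
            kept.append(example)
    return kept
```

Examples without a recovered instruction are simply skipped here, which matches the curated set being drawn only from the solutions for which an instruction was generated.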
 
Another way to clean the raw dataset is to focus on distinct instructions. For a given instruction, we go through all the instructions generated before it to see whether any of them has a similarity score >= 0.5 with it. If so, we remove that instruction. This process removes about 94% of the raw dataset; the remaining instructions form the unique set.
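
A minimal sketch of this deduplication pass, under stated assumptions: the `similarity` function is pluggable, and the default `jaccard` token overlap is only a cheap stand-in for the sentence-embedding score; the function names are hypothetical. Each instruction is compared against every instruction generated before it, whether or not that earlier instruction was itself kept, which is one reading of the description above.

```python
def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for an embedding score."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def unique_instructions(instructions, similarity=jaccard, threshold=0.5):
    """Drop every instruction that is too similar to an earlier one."""
    kept = []
    for position, current in enumerate(instructions):
        earlier = instructions[:position]
        if any(similarity(current, previous) >= threshold for previous in earlier):
            continue  # a close match was generated earlier, so remove this one
        kept.append(current)
    return kept
```

The pass is quadratic in the number of instructions, which is manageable for a few thousand rows.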
We also built a set that contains only the examples whose code is written in Python 3 and does not cause a compilation error. These examples form the compile set.
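
The exact check used to build this set is not stated here. One plausible way to test whether a snippet compiles as Python 3, without executing it, is the built-in `compile` function. The helper name `compiles` is hypothetical.

```python
def compiles(source):
    """Return True if `source` parses and compiles as Python 3 code.

    The code is only compiled, never executed, so runtime errors such as
    undefined names are not detected.
    """
    try:
        compile(source, "<generated>", "exec")
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code (including Python 2 syntax);
        # ValueError can be raised for sources containing null bytes.
        return False
    return True
```

Filtering a list of outputs is then `[code for code in outputs if compiles(code)]`.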
```python
from datasets import load_dataset

dataset = load_dataset("codeparrot/self-instruct-starcoder")
```

```
DatasetDict({
    compile: Dataset({
        features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
        num_rows: 3549
    })
    curated: Dataset({
        features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
        num_rows: 771
    })
    raw: Dataset({
        features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
        num_rows: 5003
    })
    unique: Dataset({
        features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
        num_rows: 308
    })
})
```
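
The row counts printed above can be related back to the percentages quoted in the text. A small sketch using only those counts (the helper name `share_of_raw` is hypothetical):

```python
# Row counts copied from the DatasetDict output above.
split_sizes = {"compile": 3549, "curated": 771, "raw": 5003, "unique": 308}


def share_of_raw(sizes):
    """Express each split's size as a percentage of the raw split."""
    raw = sizes["raw"]
    return {name: round(100 * count / raw, 1) for name, count in sizes.items()}


shares = share_of_raw(split_sizes)
# The unique split keeps about 6.2% of the raw rows, so roughly 94% are removed.
removed_by_dedup = round(100 - shares["unique"], 1)
```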
| Field | Type | Description |
|---|---|---|
| instruction | string | Instruction | 
| output | string | Answer to the instruction | 
| most_similar | string | Dictionary containing the 10 most similar instructions generated before the current instruction, along with their similarity scores |
| avg_similarity_score | float64 | Average similarity score | 
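
Because `most_similar` is stored as a string, it has to be decoded before use. The sketch below assumes the string is a JSON object mapping each earlier instruction to its score; that encoding is an assumption, not something stated in the table, and should be checked against a real row. The helper name and the sample row are hypothetical.

```python
import json


def parse_most_similar(row):
    """Decode `most_similar` into (instruction, score) pairs, best match first.

    Assumes the field is a JSON object of the form {instruction: score}.
    """
    neighbours = json.loads(row["most_similar"])
    return sorted(neighbours.items(), key=lambda item: item[1], reverse=True)


# Hypothetical row, for illustration only (not taken from the dataset).
sample_row = {
    "instruction": "Reverse a string.",
    "most_similar": '{"Reverse a list.": 0.62, "Sort a list.": 0.31}',
}
ranked = parse_most_similar(sample_row)
```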
@misc{
      title={Self-Instruct-StarCoder},
      author={Zebaze, Armel Randy},
      doi={https://doi.org/10.57967/hf/0790},
}