Dataset:
codeparrot/conala-mined-curated
DOI:
10.57967/hf/0755
Conala-mined-curated is a dataset based on the mined subset of the CoNaLa dataset. CoNaLa is a dataset crawled from Stack Overflow. Part of it is filtered and curated to form a training set and a test set. However, the mined part is not post-processed to a comparable degree. It is a set of about 600K examples, and it is the part we decided to work on.
The CoNaLa datasets have three columns of interest: intent, rewritten_intent, and snippet. We give their descriptions as provided by the authors.
For instruction fine-tuning, we want to train a model to map the rewritten_intent to the snippet. However, the mined subset does not have the column rewritten_intent. The intent column is too vague to be described as an instruction, so we have to find a way to build the column rewritten_intent for the mined subset. That is exactly what was done in order to build this dataset.
The most valuable information we have for recovering the column rewritten_intent is the pair of columns intent and snippet. Fortunately, we also have the training set and the test set of CoNaLa, which are labeled. This means we can see what a high-quality triplet (intent, rewritten_intent, snippet) looks like. We had the idea to build a Seq2Seq model whose role would be to reconstruct the rewritten_intent from the concatenation [intent, snippet].
More precisely, we fine-tuned Google's UL2 to solve this task.
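The reconstruction setup described above can be sketched as a small preprocessing step that turns a labeled triplet into a (source, target) pair for a Seq2Seq model. This is a minimal sketch, not the authors' preprocessing: the source does not say how intent and snippet were concatenated, so the `intent:` prefix, the `SEPARATOR` string, the function name `build_seq2seq_pair`, and the sample record are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical separator: the source does not state how the two fields were joined.
SEPARATOR = " | snippet: "


def build_seq2seq_pair(record: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Turn one CoNaLa-style record into a (source, target) pair.

    The source text is the concatenation [intent, snippet]; the target is the
    rewritten_intent when the record is labeled, otherwise None.
    """
    source = f"intent: {record['intent'].strip()}{SEPARATOR}{record['snippet'].strip()}"
    target = record.get("rewritten_intent")
    return {"source": source, "target": target.strip() if target else None}


# Made-up labeled record, in the shape of a curated CoNaLa training example.
labeled = {
    "intent": "How to sort a list in Python?",
    "rewritten_intent": "sort list `xs` in ascending order",
    "snippet": "xs.sort()",
}

# Made-up mined record: no rewritten_intent yet, so there is no target.
mined = {"intent": "Sorting a list", "snippet": "sorted(xs)"}

pair = build_seq2seq_pair(labeled)
mined_pair = build_seq2seq_pair(mined)
```

Labeled pairs would serve as fine-tuning data, while pairs built from mined records (with `target` set to `None`) would be the inputs the fine-tuned model is asked to complete.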
from datasets import load_dataset

dataset = load_dataset("codeparrot/conala-mined-curated")
dataset

DatasetDict({
    train: Dataset({
        features: ['question_id', 'parent_answer_post_id', 'prob', 'snippet', 'intent', 'rewritten_intent', 'id'],
        num_rows: 593891
    })
})
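Once loaded, the rows can be turned into instruction fine-tuning pairs that map rewritten_intent to snippet. The sketch below works on plain dictionaries so it runs without downloading anything. It assumes that `prob` is a confidence score for the mined pair, as in the original CoNaLa mined release; the card text does not describe this column. The threshold `MIN_PROB`, the function name `to_instruction_pairs`, the output keys, and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical cut-off; the dataset card does not recommend any threshold.
MIN_PROB = 0.5


def to_instruction_pairs(
    records: Iterable[Dict], min_prob: float = MIN_PROB
) -> Iterator[Dict[str, str]]:
    """Yield instruction/response pairs for rows that pass the prob filter."""
    for row in records:
        # Skip low-confidence rows and rows without a usable instruction.
        if row["prob"] >= min_prob and row["rewritten_intent"]:
            yield {
                "instruction": row["rewritten_intent"],
                "response": row["snippet"],
            }


# Made-up rows using the column names listed in the features above.
sample = [
    {"prob": 0.9, "rewritten_intent": "reverse string `s`", "snippet": "s[::-1]"},
    {"prob": 0.1, "rewritten_intent": "print variable `x`", "snippet": "print(x)"},
]

pairs = list(to_instruction_pairs(sample))
```

With the real data, the same function could be applied as `to_instruction_pairs(dataset["train"])`, since iterating over a `Dataset` yields one dictionary per row.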