Dataset:
Nan-Do/code-search-net-python
This dataset is the Python portion of CodeSearchNet, annotated with a summary column. The CodeSearchNet dataset consists of open-source functions with comments, collected from GitHub. The summary is a short description of what the function does.
The dataset's comments are in English and the functions are written in Python.
Train, test, and validation labels are included in the dataset as a column.
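Because the split is stored as a column rather than as separate files, rows can be grouped by that column after loading. The following is a minimal sketch, assuming the column is named `partition` and holds values such as `train`, `test`, and `valid`; both the column name and its values are assumptions and should be checked against the actual dataset. The sample rows are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for dataset records.
# The "partition" column name and its values are assumptions.
SAMPLE_ROWS = [
    {"code": "def add(a, b):\n    return a + b", "summary": "Add two numbers.", "partition": "train"},
    {"code": "def neg(x):\n    return -x", "summary": "Negate a number.", "partition": "train"},
    {"code": "def is_even(n):\n    return n % 2 == 0", "summary": "Check if a number is even.", "partition": "test"},
    {"code": "def first(xs):\n    return xs[0]", "summary": "Return the first element.", "partition": "valid"},
]


def group_by_split(rows, split_column="partition"):
    """Group rows into lists keyed by the value of the split column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[split_column]].append(row)
    return dict(groups)


splits = group_by_split(SAMPLE_ROWS)
for name, rows in splits.items():
    print(name, len(rows))
```

The same grouping applies to any row-oriented representation of the data, for example a list of dictionaries produced from a data frame.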
May of 2023
This dataset can be used to generate instructional (or many other interesting) datasets that are useful for training LLMs.
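One way to derive an instructional dataset is to turn each summary into a prompt and use the function source as the response. The sketch below illustrates the idea under stated assumptions: the `code` column name, the prompt wording, and the `instruction`/`response` output keys are all hypothetical choices, not something the dataset prescribes. The sample rows are made up.

```python
# Hypothetical rows; the "code" column name is an assumption,
# while "summary" is the column described in this card.
SAMPLE_ROWS = [
    {"code": "def add(a, b):\n    return a + b", "summary": " Add two numbers. "},
    {"code": "def is_even(n):\n    return n % 2 == 0", "summary": "Check if a number is even."},
]


def to_instruction_pair(row, summary_column="summary", code_column="code"):
    """Build one instruction/response record from a dataset row."""
    summary = row[summary_column].strip()
    return {
        # The prompt template is an illustrative choice.
        "instruction": f"Write a Python function that does the following: {summary}",
        "response": row[code_column],
    }


pairs = [to_instruction_pair(row) for row in SAMPLE_ROWS]
for pair in pairs:
    print(pair["instruction"])
```

Other templates are equally possible, for example asking for a summary given the code, which reverses the roles of the two columns.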
The CodeSearchNet dataset can be found at https://www.kaggle.com/datasets/omduggineni/codesearchnet
This dataset includes a summary column containing a short description of each function.
Annotation process
The annotation was done using Salesforce T5 summarization models. A sample notebook of the process can be found at https://github.com/Nan-Do/OpenAssistantInstructionResponsePython. The annotations have been cleaned to remove repetitions and meaningless summaries, although some may still be present in the dataset.
Apache 2.0