Dataset:
shibing624/source_code
The Source Code dataset is a collection of code from GitHub awesome repos; it contains Python, Java, C++, and other programming languages. The dataset can be used for NLP tasks such as language modeling and text generation.
Data source: GitHub awesome code repos.
An example of 'train' looks as follows.
This example was too long and was cropped:
{
    "text": """
import json
import argparse


def _parse_args():
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        '--model-file',
        required=True,
        help=(
            'A pt file from '
            'https://github.com/pytorch/fairseq/tree/main/examples/hubert'
        )
    )
    return parser.parse_args()
"""
}
The data fields are the same among all splits: a single text field holding a string of source code.
$ wc -l python/*
   10000 python/test.txt
 5215412 python/train.txt
   10000 python/valid.txt
 5235412 total

$ wc -l java/*
  950083 java/test.txt
 2802880 java/train.txt
  940803 java/valid.txt
 4693766 total

$ wc -l cpp/*
 1060014 cpp/test.txt
 3119241 cpp/train.txt
 1099124 cpp/valid.txt
 5278379 total
As a code generation dataset, I uploaded it to Hugging Face Datasets.
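A minimal sketch of loading the data with the Hugging Face datasets library; the config names (python, java, cpp) are assumed from the file layout above, and the exact split names may differ:

from datasets import load_dataset

# "python" config name assumed from the python/train.txt, python/valid.txt,
# python/test.txt layout shown above; "java" and "cpp" are assumed analogues.
dataset = load_dataset("shibing624/source_code", "python")

print(dataset)                      # DatasetDict with train/validation/test splits (assumed)
print(dataset["train"][0]["text"])  # each record exposes a single "text" field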
Citation:
APA:
Xu, M. code-autocomplete: Code AutoComplete with GPT2 model (Version 0.0.4) [Computer software]. https://github.com/shibing624/code-autocomplete
BibTeX:
@software{Xu_code-autocomplete_Code_AutoComplete,
author = {Xu, Ming},
title = {code-autocomplete: Code AutoComplete with GPT2 model},
url = {https://github.com/shibing624/code-autocomplete},
version = {0.0.4}
}
This dataset was developed as a benchmark for evaluating code generation models.
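As an illustration of how such an evaluation might look, here is a minimal sketch that scores a code snippet by perplexity under a causal language model using the transformers library; the "gpt2" checkpoint is a stand-in, not the code-autocomplete model itself:

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# "gpt2" is a placeholder checkpoint; swap in any causal LM trained on this dataset.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

code = "import argparse\n\nparser = argparse.ArgumentParser()\n"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    # Passing labels equal to input_ids returns the average cross-entropy over tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")

Lower perplexity on held-out code indicates the model assigns higher probability to real source code from the test split.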
Source data: GitHub awesome programming code repos.
License: GNU Free Documentation License v1.3 or later. For research use only.
Thanks to @shibing624 for adding this dataset.