Dataset:
shibing624/source_code
The source code dataset is a collection of GitHub awesome repos; it contains Python, Java, C++, and other programming languages. It can be used for NLP tasks such as language modeling and text generation.
Data source:
An example of 'train' looks as follows.
This example was too long and was cropped:

{
  "text": """
import json
import argparse


def _parse_args():
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        '--model-file',
        required=True,
        help=(
            'A pt file from '
            'https://github.com/pytorch/fairseq/tree/main/examples/hubert'
        )
    )
    return parser.parse_args()
"""
}
The data fields are the same among all splits.
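A minimal sketch of loading the dataset and reading the single "text" field with the Hugging Face datasets library; the "python" configuration name is an assumption inferred from the split statistics listed below.

from datasets import load_dataset

# Load the Python subset; the configuration name "python" is assumed
# from the split statistics listed below.
dataset = load_dataset("shibing624/source_code", "python")

# Each record has a single "text" field holding a snippet of source code.
print(dataset["train"][0]["text"])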
python
$ wc -l python/*
   10000 python/test.txt
 5215412 python/train.txt
   10000 python/valid.txt
 5235412 total

java
$ wc -l java/*
  950083 java/test.txt
 2802880 java/train.txt
  940803 java/valid.txt
 4693766 total

cpp
$ wc -l cpp/*
 1060014 cpp/test.txt
 3119241 cpp/train.txt
 1099124 cpp/valid.txt
 5278379 total
As a code generation dataset, I uploaded it to Hugging Face Datasets.
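Because the corpus is plain text, it plugs directly into a causal language-modeling pipeline. The sketch below fine-tunes a GPT-2 checkpoint with the transformers Trainer; the checkpoint name, sequence length, and hyperparameters are assumptions for illustration, not the settings used by the code-autocomplete project.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed base checkpoint and hyperparameters; illustrative only, not the
# configuration used by the code-autocomplete project.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Configuration name "python" assumed from the split statistics above.
dataset = load_dataset("shibing624/source_code", "python")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-source-code",
        per_device_train_batch_size=8,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()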
Citation:
APA:
Xu, M. code-autocomplete: Code AutoComplete with GPT2 model (Version 0.0.4) [Computer software]. https://github.com/shibing624/code-autocomplete
BibTeX:
@software{Xu_code-autocomplete_Code_AutoComplete,
  author = {Xu, Ming},
  title = {code-autocomplete: Code AutoComplete with GPT2 model},
  url = {https://github.com/shibing624/code-autocomplete},
  version = {0.0.4}
}
This dataset was developed as a benchmark for evaluating code generation models.
GitHub awesome programming code repos.
GNU Free Documentation License v1.3 or later.
For research use only.
Thanks to @shibing624 for adding this dataset.