Dataset:
shibing624/source_code
The Source Code dataset is a collection of code from GitHub awesome repos; it contains Python, Java, C++, and other programming languages. The dataset can be used for NLP tasks such as language modeling and text generation.
Data source: GitHub awesome code repos.
An example of 'train' looks as follows.
This example was too long and was cropped:
{
    "text": """
import json
import argparse


def _parse_args():
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        '--model-file',
        required=True,
        help=(
            'A pt file from '
            'https://github.com/pytorch/fairseq/tree/main/examples/hubert'
        )
    )
    return parser.parse_args()
"""
}
The data fields are the same among all splits: a single text field holding a string of source code.
$ wc -l python/*
   10000 python/test.txt
 5215412 python/train.txt
   10000 python/valid.txt
 5235412 total

$ wc -l java/*
  950083 java/test.txt
 2802880 java/train.txt
  940803 java/valid.txt
 4693766 total

$ wc -l cpp/*
 1060014 cpp/test.txt
 3119241 cpp/train.txt
 1099124 cpp/valid.txt
 5278379 total
As a code generation dataset, I uploaded it to Hugging Face Datasets.
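A minimal sketch of loading the data with the Hugging Face datasets library; the config names (python, java, cpp) are assumed from the file layout above, and the exact split names may differ:

from datasets import load_dataset

# "python" config name assumed from the python/train.txt, python/valid.txt,
# python/test.txt layout shown above; "java" and "cpp" are assumed analogues.
dataset = load_dataset("shibing624/source_code", "python")

print(dataset)                      # DatasetDict with train/validation/test splits (assumed)
print(dataset["train"][0]["text"])  # each record exposes a single "text" field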
Citation:
APA:
Xu, M. code-autocomplete: Code AutoComplete with GPT2 model (Version 0.0.4) [Computer software]. https://github.com/shibing624/code-autocomplete
BibTeX:
@software{Xu_code-autocomplete_Code_AutoComplete,
author = {Xu, Ming},
title = {code-autocomplete: Code AutoComplete with GPT2 model},
url = {https://github.com/shibing624/code-autocomplete},
version = {0.0.4}
}
This dataset was developed as a benchmark for evaluating code generation models.
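As an illustration of how such an evaluation might look, here is a minimal sketch that scores a code snippet by perplexity under a causal language model using the transformers library; the "gpt2" checkpoint is a stand-in, not the code-autocomplete model itself:

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# "gpt2" is a placeholder checkpoint; swap in any causal LM trained on this dataset.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

code = "import argparse\n\nparser = argparse.ArgumentParser()\n"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    # Passing labels equal to input_ids returns the average cross-entropy over tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")

Lower perplexity on held-out code indicates the model assigns higher probability to real source code from the test split.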
Source data: GitHub awesome programming code repos.
License: GNU Free Documentation License v1.3 or later. For research use only.
Thanks to @shibing624 for adding this dataset.