Dataset:

shibing624/source_code

Dataset Card for "SourceCode"

Dataset Summary

The SourceCode dataset is a collection of source code from GitHub awesome repos. It contains Python, Java, C++, and other programming languages, and can be used for NLP tasks such as language modeling and text generation.

data source:

Supported Tasks and Leaderboards

Languages

  • programming languages: Python, Java, C++
  • natural language: English

Dataset Structure

Data Instances

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "text": """
import json
import argparse


def _parse_args():
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        '--model-file',
        required=True,
        help=(
            'A pt file from '
            'https://github.com/pytorch/fairseq/tree/main/examples/hubert'
        )
    )
    return parser.parse_args()
    """
}

Data Fields

The data fields are the same among all splits.

  • text: a string feature.
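
As a minimal sketch of how the single text field might be consumed, the snippet below wraps each line of a local plain-text file in a {"text": ...} record. The one-record-per-line layout, the helper name iter_records, and the sample file are assumptions made for illustration; this card does not confirm how the files are parsed.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def iter_records(path: Path) -> Iterator[Dict[str, str]]:
    """Yield one {"text": ...} record per line of a plain-text file.

    Assumption: each line of the file is one example. This layout is
    hypothetical and is not confirmed by the dataset card.
    """
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            # Drop only the trailing newline so code indentation is preserved.
            yield {"text": line.rstrip("\n")}


# Usage on a small temporary file standing in for a split such as train.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text(
        "import json\n    return parser.parse_args()\n", encoding="utf-8"
    )
    records = list(iter_records(sample_path))

print(records[0])
```

Leading whitespace is kept on purpose, since indentation is significant in Python source.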

Data Splits

python

$ wc -l python/*
   10000 python/test.txt
 5215412 python/train.txt
   10000 python/valid.txt
 5235412 total

java

$ wc -l java/*
  950083 java/test.txt
 2802880 java/train.txt
  940803 java/valid.txt
 4693766 total

cpp

$ wc -l cpp/*
 1060014 cpp/test.txt
 3119241 cpp/train.txt
 1099124 cpp/valid.txt
 5278379 total
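
The line counts above can be cross-checked with a short script. The sketch below copies the numbers from the wc -l listings, confirms that each reported total matches the sum of its splits, and computes the share of lines that falls in the train split. The names SPLIT_LINES, REPORTED_TOTALS, and train_fraction are introduced here for illustration only. Note that these are line counts, which may not equal the number of examples.

```python
# Line counts copied from the `wc -l` listings in this section.
SPLIT_LINES = {
    "python": {"train": 5215412, "valid": 10000, "test": 10000},
    "java": {"train": 2802880, "valid": 940803, "test": 950083},
    "cpp": {"train": 3119241, "valid": 1099124, "test": 1060014},
}

# Totals as reported by `wc -l` for each language directory.
REPORTED_TOTALS = {"python": 5235412, "java": 4693766, "cpp": 5278379}


def train_fraction(counts: dict) -> float:
    """Return the fraction of all lines that belong to the train split."""
    return counts["train"] / sum(counts.values())


for language, counts in SPLIT_LINES.items():
    total = sum(counts.values())
    # Each reported total should equal the sum of its three splits.
    assert total == REPORTED_TOTALS[language], language
    print(f"{language}: {total} lines, train share {train_fraction(counts):.1%}")
```

The Python split keeps almost all lines for training, while the Java and C++ splits reserve a much larger share for validation and test.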

Dataset Creation

Curation Rationale

This dataset was uploaded to Hugging Face Datasets to serve as a code generation dataset.

Source Data

Initial Data Collection and Normalization

Who are the source language producers?

Citation:

APA:

Xu, M. code-autocomplete: Code AutoComplete with GPT2 model (Version 0.0.4) [Computer software]. https://github.com/shibing624/code-autocomplete

BibTeX:

@software{Xu_code-autocomplete_Code_AutoComplete,
author = {Xu, Ming},
title = {code-autocomplete: Code AutoComplete with GPT2 model},
url = {https://github.com/shibing624/code-autocomplete},
version = {0.0.4}
}

Annotations

Annotation process

Who are the annotators?

Nobody.

Personal and Sensitive Information

Considerations for Using the Data

Social Impact of Dataset

This dataset was developed as a benchmark for evaluating code generation models.

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

GitHub awesome programming code repos.

Licensing Information

GNU Free Documentation License v1.3 or later.

For research use only.

Contributions

Thanks to @shibing624 for adding this dataset.