Dataset:

Fsoft-AIC/the-vault-function

Task:

Text generation

Language:

code

Multilinguality:

multiprogramming languages

Preprint:

arxiv:2305.06156

License:

mit

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Dataset Summary

The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.

The Vault contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.

Supported Tasks

The Vault can be used for pretraining LLMs or for downstream code-text interaction tasks. A number of tasks related to code understanding and generation can be constructed from The Vault, such as code summarization, text-to-code generation, and code search.

Languages

The natural language text (docstring) is in English.

The Vault supports 10 programming languages: Python, Java, JavaScript, PHP, C, C#, C++, Go, Ruby, and Rust.

Dataset Structure

Data Instances

{

    "hexsha": "5c47f0b4c173a8fd03e4e633d9b3dd8211e67ad0",
    "repo": "neumanna94/beepboop",
    "path": "js/scripts.js",
    "license": [
        "MIT"
    ],
    "language": "JavaScript",
    "identifier": "beepBoopSelector",
    "return_type": "<not_specific>",
    "original_string": "function beepBoopSelector(inputString, bbFunction){\n  if(bbFunction==1){\n    return beepBoop(inputString);\n  } else if(bbFunction==2){\n    return beepBoop2(inputString);\n  } else if(bbFunction==3){\n    return beepBoop3(inputString);\n  } else {\n  }\n}",
    "original_docstring": "//Determines what beepBoop function to use",
    "docstring": "Determines what beepBoop function to use",
    "docstring_tokens": [
        "Determines",
        "what",
        "beepBoop",
        "function",
        "to",
        "use"
    ],
    "code": "function beepBoopSelector(inputString, bbFunction){\n  if(bbFunction==1){\n    return beepBoop(inputString);\n  } else if(bbFunction==2){\n    return beepBoop2(inputString);\n  } else if(bbFunction==3){\n    return beepBoop3(inputString);\n  } else {\n  }\n}",
    "code_tokens": [
        "function",
        "beepBoopSelector",
        "(",
        "inputString",
        ",",
        "bbFunction",
        ")",
        "{",
        "if",
        "(",
        "bbFunction",
        "==",
        "1",
        ")",
        "{",
        "return",
        "beepBoop",
        "(",
        "inputString",
        ")",
        ";",
        "}",
        "else",
        "if",
        "(",
        "bbFunction",
        "==",
        "2",
        ")",
        "{",
        "return",
        "beepBoop2",
        "(",
        "inputString",
        ")",
        ";",
        "}",
        "else",
        "if",
        "(",
        "bbFunction",
        "==",
        "3",
        ")",
        "{",
        "return",
        "beepBoop3",
        "(",
        "inputString",
        ")",
        ";",
        "}",
        "else",
        "{",
        "}",
        "}"
    ],

    "short_docstring": "Determines what beepBoop function to use",
    "short_docstring_tokens": [
        "Determines",
        "what",
        "beepBoop",
        "function",
        "to",
        "use"
    ],
    "comment": [],
    "parameters": [
        {
            "param": "inputString",
            "type": null
        },
        {
            "param": "bbFunction",
            "type": null
        }
    ],
    "docstring_params": {
        "returns": [],
        "raises": [],
        "params": [
            {
                "identifier": "inputString",
                "type": null,
                "docstring": null,
                "docstring_tokens": [],
                "default": null,
                "is_optional": null
            },
            {
                "identifier": "bbFunction",
                "type": null,
                "docstring": null,
                "docstring_tokens": [],
                "default": null,
                "is_optional": null
            }
        ],
        "outlier_params": [],
        "others": []
    }
}
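
As a small illustration of the nested docstring_params field, the sketch below lists the parameters that carry no description in the docstring. The record is a trimmed copy of the instance above, and the helper name undocumented_params is hypothetical; it is not part of the dataset's tooling.

```python
# Trimmed copy of the instance shown above; only the fields used here are kept.
sample = {
    "identifier": "beepBoopSelector",
    "docstring_params": {
        "returns": [],
        "raises": [],
        "params": [
            {"identifier": "inputString", "type": None, "docstring": None},
            {"identifier": "bbFunction", "type": None, "docstring": None},
        ],
        "outlier_params": [],
        "others": [],
    },
}


def undocumented_params(record):
    """Return the names of parameters whose docstring entry is missing or empty.

    Hypothetical helper for illustration only.
    """
    params = record.get("docstring_params", {}).get("params", [])
    return [p["identifier"] for p in params if not p.get("docstring")]


missing = undocumented_params(sample)
print(missing)  # ['inputString', 'bbFunction']
```

For this instance both parameters are reported, since the original comment describes only the function as a whole.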

Data Fields

Data fields for function level:

hexsha (string): the unique git hash of the file
repo (string): the owner/repo
path (string): the full path to the original file
license (list): licenses in the repo
language (string): the programming language
identifier (string): the function or method name
return_type (string): the type returned by the function
original_string (string): original version of function/class node
original_docstring (string): the raw string before tokenization or parsing
code (string): the part of the original that is code
code_tokens (list): tokenized version of code
short_docstring (string): short, brief summarization (first line of the docstring)
short_docstring_tokens (list): tokenized version of short_docstring
docstring (string): the top-level comment or docstring (the docstring without parameter, return, and exception fields, etc.)
docstring_tokens (list): tokenized version of docstring
comment (list): list of comments (line) inside the function/class
parameters (list): list of parameters and their types (a type can be None)
docstring_params (dict): Dictionary of the parsed information from docstring

See here for more details and examples.
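
One way these fields might be combined into task examples is sketched below: the code and short_docstring fields form a code summarization pair, and the docstring and code fields form a text-to-code generation pair. The function names and the toy record are hypothetical and only illustrate the field layout; the card does not prescribe this pairing.

```python
def to_summarization_pair(record):
    """Code summarization: code is the input, the short docstring is the target."""
    return record["code"], record["short_docstring"]


def to_generation_pair(record):
    """Text-to-code generation: the docstring is the input, code is the target."""
    return record["docstring"], record["code"]


# Toy record (not taken from the dataset) with the three fields used here.
record = {
    "code": "function add(a, b){\n  return a + b;\n}",
    "docstring": "Adds two numbers",
    "short_docstring": "Adds two numbers",
}

source, target = to_summarization_pair(record)
prompt, completion = to_generation_pair(record)
print(target)
print(completion)
```

The same (docstring, code) pair could also serve as a query and candidate for code search.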

Data Splits

In this repo, The Vault is divided into 5 subsets: three training sets of different sizes drawn from the full training set, plus a validation set and a test set (approximately 20,000 samples each). Per-language statistics for each split are given in the following section.

The dataset is deduplicated before splitting. The training set comes in 3 versions: small (5%), medium (20%), and large (100%).

Dataset Statistics

Comparison with other benchmarks

Dataset	#Languages	#Code-text pairs
PyMT5	1	≈ 7,700,000
CoDesc	1	4,211,516
CodeSearchNet	6	2,326,976
CodeSearchNet (CodeXGLUE)	6	1,005,474
Deepcom	1	424,028
CONCODE	1	2,184,310
Funcom	1	2,149,121
CodeT5	8	3,158,313
The Vault	10	34,098,775

Statistics for split sets

Language	train/small	train/medium	train/full	validation	test	total
Python	370,657	1,952,110	7,772,647	30,992	21,652	7,825,291
Java	351,213	1,612,366	6,629,193	22,677	15,552	6,667,422
JavaScript	82,931	404,729	1,640,416	22,044	21,108	1,683,568
PHP	236,638	1,155,476	4,656,371	21,375	19,010	4,696,756
C	105,978	381,207	1,639,319	27,525	19,122	1,685,966
C#	141,090	783,166	3,305,891	24,787	19,638	3,350,316
C++	87,420	410,907	1,671,268	20,011	18,169	1,709,448
Go	267,535	1,319,547	5,109,020	19,102	25,314	5,153,436
Ruby	23,921	112,574	424,339	17,338	19,908	461,585
Rust	35,367	224,015	825,130	16,716	23,141	864,987
TOTAL	1,702,750	8,356,097	33,673,594	222,567	202,614	34,098,775
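
Reading the table, the total column appears to be the sum of train/full, validation, and test, with train/small and train/medium counted inside train/full. That reading is inferred from the numbers rather than stated outright, so the sketch below checks it on three rows copied from the table.

```python
# (train/full, validation, test, total) copied from the table above.
rows = {
    "Python": (7_772_647, 30_992, 21_652, 7_825_291),
    "Ruby": (424_339, 17_338, 19_908, 461_585),
    "TOTAL": (33_673_594, 222_567, 202_614, 34_098_775),
}

# A row is consistent if train/full + validation + test equals its total.
consistent = {
    name: train_full + validation + test == total
    for name, (train_full, validation, test, total) in rows.items()
}
print(consistent)  # {'Python': True, 'Ruby': True, 'TOTAL': True}
```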

Usage

You can load The Vault with the Hugging Face datasets library (install it with pip install datasets):

from datasets import load_dataset

# Load full function level dataset (34M samples)
dataset = load_dataset("Fsoft-AIC/the-vault-function")

# Load function level train/validation/test set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"])

# Load "small" (or "medium", "full") version of function level training set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train/small"])

# Load only a specific language (e.g. Python)
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], languages=["Python"])

# Stream the dataset instead of downloading it in full
data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
for sample in data["train"]:
    print(sample)
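
When streaming, it is often convenient to take only a handful of matching samples rather than iterate over the whole split. The sketch below shows one possible way to do that with the standard library. The helper take_documented and the stand-in records are hypothetical; with the real dataset, the streamed split from the snippet above would be passed in place of fake_stream.

```python
from itertools import islice


def take_documented(samples, n, language=None):
    """Return an iterator over up to n samples that have a non-empty docstring.

    Hypothetical helper. `samples` is any iterable of dicts shaped like the
    data instance shown earlier; `language` optionally restricts the result
    to one value of the "language" field.
    """
    matching = (
        s for s in samples
        if s.get("docstring") and (language is None or s.get("language") == language)
    )
    return islice(matching, n)


# Stand-in records so the sketch runs offline (not taken from the dataset).
fake_stream = [
    {"language": "Python", "docstring": "Adds two numbers", "code": "def add(a, b): return a + b"},
    {"language": "Ruby", "docstring": "", "code": "def noop; end"},
    {"language": "Python", "docstring": "Returns x squared", "code": "def sq(x): return x * x"},
]

first = list(take_documented(fake_stream, n=1, language="Python"))
print(first[0]["docstring"])  # Adds two numbers
```

Because islice stops after n matches, the generator never reads more of the stream than it needs.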

A backup copy of the dataset can be downloaded from Azure Blob Storage. See Download The Vault from Azure blob storage.

Additional information

Licensing Information

MIT License

Citation Information

@article{manh2023vault,
  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2305.06156},
  year={2023}
}

Contributions

This dataset was developed by the FSOFT AI4Code team.

Author:

Fsoft-AIC

Dataset size:

24.65 GB