Dataset:
Fsoft-AIC/the-vault-function
The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.
The Vault contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.
The Vault can be used for pretraining LLMs or for downstream code-text interaction tasks. Several tasks related to code understanding and generation can be constructed from The Vault, such as code summarization, text-to-code generation, and code search.
The natural language text (docstring) is in English.
The Vault supports 10 programming languages: Python, Java, JavaScript, PHP, C, C#, C++, Go, Ruby, and Rust.
    {
      "hexsha": "5c47f0b4c173a8fd03e4e633d9b3dd8211e67ad0",
      "repo": "neumanna94/beepboop",
      "path": "js/scripts.js",
      "license": ["MIT"],
      "language": "JavaScript",
      "identifier": "beepBoopSelector",
      "return_type": "<not_specific>",
      "original_string": "function beepBoopSelector(inputString, bbFunction){\n if(bbFunction==1){\n return beepBoop(inputString);\n } else if(bbFunction==2){\n return beepBoop2(inputString);\n } else if(bbFunction==3){\n return beepBoop3(inputString);\n } else {\n }\n}",
      "original_docstring": "//Determines what beepBoop function to use",
      "docstring": "Determines what beepBoop function to use",
      "docstring_tokens": ["Determines", "what", "beepBoop", "function", "to", "use"],
      "code": "function beepBoopSelector(inputString, bbFunction){\n if(bbFunction==1){\n return beepBoop(inputString);\n } else if(bbFunction==2){\n return beepBoop2(inputString);\n } else if(bbFunction==3){\n return beepBoop3(inputString);\n } else {\n }\n}",
      "code_tokens": ["function", "beepBoopSelector", "(", "inputString", ",", "bbFunction", ")", "{", "if", "(", "bbFunction", "==", "1", ")", "{", "return", "beepBoop", "(", "inputString", ")", ";", "}", "else", "if", "(", "bbFunction", "==", "2", ")", "{", "return", "beepBoop2", "(", "inputString", ")", ";", "}", "else", "if", "(", "bbFunction", "==", "3", ")", "{", "return", "beepBoop3", "(", "inputString", ")", ";", "}", "else", "{", "}", "}"],
      "short_docstring": "Determines what beepBoop function to use",
      "short_docstring_tokens": ["Determines", "what", "beepBoop", "function", "to", "use"],
      "comment": [],
      "parameters": [
        {"param": "inputString", "type": null},
        {"param": "bbFunction", "type": null}
      ],
      "docstring_params": {
        "returns": [],
        "raises": [],
        "params": [
          {"identifier": "inputString", "type": null, "docstring": null, "docstring_tokens": [], "default": null, "is_optional": null},
          {"identifier": "bbFunction", "type": null, "docstring": null, "docstring_tokens": [], "default": null, "is_optional": null}
        ],
        "outlier_params": [],
        "others": []
      }
    }
Data fields for function level:
See here for more details and examples.
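The fields visible in the sample record can be turned into input/target pairs for the tasks mentioned earlier. The following is a minimal sketch, assuming a record is available as a Python dict with the keys shown in the sample (`code`, `docstring`, `language`, `identifier`). The helper names `to_summarization_pair` and `to_text_to_code_pair` are hypothetical and not part of The Vault tooling, and the function body in `code` is shortened here for brevity.

```python
from typing import Dict, Tuple

# A trimmed copy of the sample record (only a few fields kept,
# and the function body in "code" shortened for illustration).
record = {
    "repo": "neumanna94/beepboop",
    "path": "js/scripts.js",
    "language": "JavaScript",
    "identifier": "beepBoopSelector",
    "docstring": "Determines what beepBoop function to use",
    "docstring_tokens": ["Determines", "what", "beepBoop", "function", "to", "use"],
    "code": "function beepBoopSelector(inputString, bbFunction){ /* shortened */ }",
}


def to_summarization_pair(rec: Dict) -> Tuple[str, str]:
    """Code summarization: the code is the input, the docstring is the target."""
    return rec["code"], rec["docstring"]


def to_text_to_code_pair(rec: Dict) -> Tuple[str, str]:
    """Text-to-code generation: the docstring is the input, the code is the target."""
    return rec["docstring"], rec["code"]


source, target = to_summarization_pair(record)
print(f"[{record['language']}] {record['identifier']}: {target}")
```

For code search, the same two fields could serve as the query (docstring) and the candidate document (code).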
In this repo, The Vault is divided into 5 subsets: three training sets that differ in size, plus a validation set and a test set (approximately 20,000 samples per language in each). The statistics for each language in each split are given in the tables below.
The dataset is deduplicated before splitting. The training set comes in 3 versions: small (5%), medium (20%), and large (100%).
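To make the deduplicate-then-subsample idea concrete, here is a minimal sketch. It assumes exact-match deduplication on whitespace-normalized code and uniform random sampling for the smaller subsets. Both are illustrative assumptions, since this section does not state the exact procedure used for The Vault, and the function names `dedup_exact` and `subsample` are hypothetical.

```python
import hashlib
import random
from typing import Dict, Iterable, List


def dedup_exact(samples: Iterable[Dict]) -> List[Dict]:
    """Keep the first sample for each distinct whitespace-normalized code string."""
    seen = set()
    unique = []
    for sample in samples:
        normalized = " ".join(sample["code"].split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def subsample(samples: List[Dict], fraction: float, seed: int = 0) -> List[Dict]:
    """Draw a uniform random subset holding roughly `fraction` of the samples."""
    rng = random.Random(seed)
    k = max(1, round(len(samples) * fraction))
    return rng.sample(samples, k)


# Toy input, made up for illustration only.
toy = [
    {"code": "def f(x):\n    return x"},
    {"code": "def f(x):  return x"},  # same code up to whitespace
    {"code": "def g(x):\n    return x + 1"},
]
unique = dedup_exact(toy)
small = subsample(unique, fraction=0.05)
print(len(unique), len(small))
```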
Dataset | #Languages | #Code-text pairs |
---|---|---|
PyMT5 | 1 | ≈ 7,700,000 |
CoDesc | 1 | 4,211,516 |
CodeSearchNet | 6 | 2,326,976 |
CodeSearchNet (CodeXGLUE) | 6 | 1,005,474 |
Deepcom | 1 | 424,028 |
CONCODE | 1 | 2,184,310 |
Funcom | 1 | 2,149,121 |
CodeT5 | 8 | 3,158,313 |
The Vault | 10 | 34,098,775 |
Language | train/small | train/medium | train/full | validation | test | total |
---|---|---|---|---|---|---|
Python | 370,657 | 1,952,110 | 7,772,647 | 30,992 | 21,652 | 7,825,291 |
Java | 351,213 | 1,612,366 | 6,629,193 | 22,677 | 15,552 | 6,667,422 |
JavaScript | 82,931 | 404,729 | 1,640,416 | 22,044 | 21,108 | 1,683,568 |
PHP | 236,638 | 1,155,476 | 4,656,371 | 21,375 | 19,010 | 4,696,756 |
C | 105,978 | 381,207 | 1,639,319 | 27,525 | 19,122 | 1,685,966 |
C# | 141,090 | 783,166 | 3,305,891 | 24,787 | 19,638 | 3,350,316 |
C++ | 87,420 | 410,907 | 1,671,268 | 20,011 | 18,169 | 1,709,448 |
Go | 267,535 | 1,319,547 | 5,109,020 | 19,102 | 25,314 | 5,153,436 |
Ruby | 23,921 | 112,574 | 424,339 | 17,338 | 19,908 | 461,585 |
Rust | 35,367 | 224,015 | 825,130 | 16,716 | 23,141 | 864,987 |
TOTAL | 1,702,750 | 8,356,097 | 33,673,594 | 222,567 | 202,614 | 34,098,775 |
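In the table above, the `total` column equals `train/full` plus `validation` plus `test`; the small and medium training sets are not counted again. The short sketch below checks this arithmetic for three rows copied from the table.

```python
# Rows copied from the table above: (train/full, validation, test, total).
rows = {
    "Python": (7_772_647, 30_992, 21_652, 7_825_291),
    "Ruby": (424_339, 17_338, 19_908, 461_585),
    "TOTAL": (33_673_594, 222_567, 202_614, 34_098_775),
}

for name, (train_full, validation, test, total) in rows.items():
    computed = train_full + validation + test
    status = "ok" if computed == total else "mismatch"
    print(f"{name}: {computed:,} ({status})")
```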
You can load The Vault dataset using the datasets library. Install it with:

    pip install datasets
    from datasets import load_dataset

    # Load full function level dataset (34M samples)
    dataset = load_dataset("Fsoft-AIC/the-vault-function")

    # Load function level train/validation/test set
    dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"])

    # Load "small" (or "medium", "full") version of function level training set
    dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train/small"])

    # Load a specific language (e.g. Python)
    dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], languages=["Python"])

    # Dataset streaming
    data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
    for sample in iter(data["train"]):
        print(sample)
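When streaming, it can be convenient to filter and preview samples lazily rather than printing the whole split. The sketch below is one way to do that, assuming each sample is a dict with the `language` and `docstring_tokens` fields shown in the sample record. `filter_samples` and `preview` are hypothetical helper names, the toy samples are made up for illustration, and the real `load_dataset` call (copied from the loading example) is left as a comment because it needs network access.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def filter_samples(
    samples: Iterable[Dict], language: str, min_doc_tokens: int = 3
) -> Iterator[Dict]:
    """Lazily yield samples in `language` whose docstring has enough tokens."""
    for sample in samples:
        if (
            sample["language"] == language
            and len(sample["docstring_tokens"]) >= min_doc_tokens
        ):
            yield sample


def preview(samples: Iterable[Dict], n: int = 5) -> List[Dict]:
    """Take at most `n` samples without consuming the rest of the stream."""
    return list(islice(samples, n))


# Toy stand-in for a streamed split. With the real dataset this could be:
#   from datasets import load_dataset
#   data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
#   stream = data["train"]
stream = iter([
    {"language": "Python", "docstring_tokens": ["Add", "two", "numbers"]},
    {"language": "Ruby", "docstring_tokens": ["Parse", "the", "config", "file"]},
    {"language": "Python", "docstring_tokens": ["TODO"]},
    {"language": "Python", "docstring_tokens": ["Return", "the", "user", "name"]},
])

for sample in preview(filter_samples(stream, "Python"), n=2):
    print(" ".join(sample["docstring_tokens"]))
```

Because both helpers work on any iterable of dicts, the same code applies to a streamed split and to an in-memory list.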
A backup copy of the dataset can be downloaded from Azure storage. See "Download The Vault from Azure blob storage".
MIT License
    @article{manh2023vault,
      title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
      author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
      journal={arXiv preprint arXiv:2305.06156},
      year={2023}
    }

This dataset is developed by the FSOFT AI4Code team.