Dataset: code_search_net
Languages: code
Multilinguality: multilingual
Language creators: machine-generated
Annotations creators: no-annotation
Source datasets: original
Preprint: arxiv:1909.09436
License: other

The CodeSearchNet corpus is a dataset of 2 million (comment, code) pairs from open-source libraries hosted on GitHub. It contains code and documentation for several programming languages.
The CodeSearchNet corpus was gathered to support the CodeSearchNet challenge, which explores the problem of code retrieval using natural language.
A data point consists of a function's code along with its documentation. Each data point also contains metadata on the function, such as the repository it was extracted from.
{
  'id': '0',
  'repository_name': 'organisation/repository',
  'func_path_in_repository': 'src/path/to/file.py',
  'func_name': 'func',
  'whole_func_string': 'def func(args):\n"""Docstring"""\n [...]',
  'language': 'python',
  'func_code_string': '[...]',
  'func_code_tokens': ['def', 'func', '(', 'args', ')', ...],
  'func_documentation_string': 'Docstring',
  'func_documentation_string_tokens': ['Docstring'],
  'split_name': 'train',
  'func_code_url': 'https://github.com/<org>/<repo>/blob/<hash>/src/path/to/file.py#L111-L150'
}
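As a minimal sketch of how the metadata fields might be used, the snippet below reads the line range of a function from the `#L<start>-L<end>` fragment of `func_code_url`. The `record` dictionary is a hypothetical stand-in that mirrors the example data point, and `line_range` is a helper introduced here for illustration, not part of the dataset or any library.

```python
import re
from typing import Optional, Tuple

# Hypothetical record mirroring a subset of the example data point's fields.
record = {
    "func_name": "func",
    "language": "python",
    "func_code_url": "https://github.com/<org>/<repo>/blob/<hash>/src/path/to/file.py#L111-L150",
}


def line_range(func_code_url: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) parsed from a trailing '#L<start>-L<end>' fragment, or None."""
    match = re.search(r"#L(\d+)-L(\d+)$", func_code_url)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


span = line_range(record["func_code_url"])
if span is not None:
    start, end = span
    print(f"{record['func_name']} spans lines {start}-{end}")
```

The helper returns `None` rather than raising when the URL has no line fragment, since it is an assumption that every record follows the pattern shown in the example.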
Three splits are available:
[More Information Needed]
All of this information can be found in the original technical review (arXiv:1909.09436).
Corpus collection:
The corpus was collected from publicly available, open-source, non-fork GitHub repositories, using libraries.io to identify all projects that are used by at least one other project and to sort them by “popularity”, as indicated by the number of stars and forks.
Then, any projects that do not have a license, or whose license does not explicitly permit the redistribution of parts of the project, were removed. TreeSitter, GitHub's universal parser, was then used to tokenize all Go, Java, JavaScript, Python, PHP and Ruby functions (or methods); where available, their respective documentation text was tokenized using a heuristic regular expression.
Corpus filtering:
Functions without documentation are removed from the corpus. This yields a set of pairs ($c_i$, $d_i$), where $c_i$ is some function documented by $d_i$. Each pair ($c_i$, $d_i$) is then passed through further preprocessing tasks, which are detailed in the original technical review.
Open-source contributors produced the code and documentation.
The dataset was gathered and preprocessed automatically.
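The first filtering step described under corpus filtering (dropping functions that have no documentation) could be sketched as follows. This is an illustration only, not the pipeline actually used to build the corpus: `drop_undocumented` is a hypothetical helper, and treating whitespace-only documentation as missing is an assumption made here.

```python
from typing import Dict, Iterable, List


def drop_undocumented(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records whose documentation string is present and non-blank."""
    # Assumption: a missing key or whitespace-only string counts as "no documentation".
    return [r for r in records if r.get("func_documentation_string", "").strip()]


# Hypothetical sample records using field names from the example data point.
sample = [
    {"func_name": "documented", "func_documentation_string": "Docstring"},
    {"func_name": "blank", "func_documentation_string": "   "},
    {"func_name": "missing"},
]

kept = drop_undocumented(sample)
print([r["func_name"] for r in kept])
```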
[More Information Needed]
Who are the annotators? [More Information Needed]
Each example in the dataset is extracted from a GitHub repository, and each repository has its own license. Per-example license information is not (yet) included in this dataset: you will need to determine for yourself which license applies to the code.
@article{husain2019codesearchnet,
  title={{CodeSearchNet} challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
Thanks to @SBrandeis for adding this dataset.