Datasets:

code_search_net

Tasks:

Text Generation

Fill-Mask

Sub-tasks:

language-modeling masked-language-modeling

Languages:

code

Multilinguality:

multilingual

Size categories:

100K<n<1M 10K<n<100K 1M<n<10M

Language creators:

machine-generated

Annotations creators:

no-annotation

Source datasets:

original

ArXiv:

arxiv:1909.09436

License:

other

Dataset Card for CodeSearchNet corpus

Dataset Summary

The CodeSearchNet corpus is a dataset of 2 million (comment, code) pairs from open-source libraries hosted on GitHub. It contains code and documentation for several programming languages.

The CodeSearchNet corpus was gathered to support the CodeSearchNet challenge, which explores the problem of code retrieval using natural language.

Supported Tasks and Leaderboards

language-modeling : The dataset can be used to train language models for programming languages.

Languages

Go programming language
Java programming language
JavaScript programming language
PHP programming language
Python programming language
Ruby programming language

Dataset Structure

Data Instances

A data point consists of a function's code along with its documentation. Each data point also contains metadata on the function, such as the repository it was extracted from.

{
  'id': '0',
  'repository_name': 'organisation/repository',
  'func_path_in_repository': 'src/path/to/file.py',
  'func_name': 'func',
  'whole_func_string': 'def func(args):\n"""Docstring"""\n [...]',
  'language': 'python', 
  'func_code_string': '[...]',
  'func_code_tokens': ['def', 'func', '(', 'args', ')', ...],
  'func_documentation_string': 'Docstring',
  'func_documentation_string_tokens': ['Docstring'],
  'split_name': 'train',
  'func_code_url': 'https://github.com/<org>/<repo>/blob/<hash>/src/path/to/file.py#L111-L150'
}
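The `func_code_url` value ends with a GitHub line-range fragment (`#L111-L150` in the instance above). The following is a minimal sketch of splitting such a URL into its parts, assuming every URL follows the `https://github.com/<org>/<repo>/blob/<hash>/<path>#L<start>-L<end>` shape of the example. The names `FuncLocation` and `parse_func_code_url` are illustrative and not part of the dataset or any library.

```python
import re
from typing import NamedTuple, Optional


class FuncLocation(NamedTuple):
    """Hypothetical container for the parts of a func_code_url."""
    repository_name: str
    commit_hash: str
    path: str
    start_line: int
    end_line: int


# Assumed URL shape, taken from the example instance above.
_URL_RE = re.compile(
    r"^https://github\.com/(?P<repo>[^/]+/[^/]+)/blob/(?P<sha>[^/]+)/"
    r"(?P<path>[^#]+)#L(?P<start>\d+)-L(?P<end>\d+)$"
)


def parse_func_code_url(url: str) -> Optional[FuncLocation]:
    """Return the parts of a func_code_url, or None if it does not match."""
    match = _URL_RE.match(url)
    if match is None:
        return None
    return FuncLocation(
        repository_name=match["repo"],
        commit_hash=match["sha"],
        path=match["path"],
        start_line=int(match["start"]),
        end_line=int(match["end"]),
    )
```

Returning `None` for non-matching URLs, rather than raising, makes it easy to skip records whose URL deviates from the assumed shape.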

Data Fields

id : Arbitrary number
repository_name : name of the GitHub repository
func_path_in_repository : path to the file which holds the function in the repository
func_name : name of the function in the file
whole_func_string : Code + documentation of the function
language : Programming language in which the function is written
func_code_string : Function code
func_code_tokens : Tokens yielded by Tree-sitter
func_documentation_string : Function documentation
func_documentation_string_tokens : Tokens yielded by Tree-sitter
split_name : Name of the split to which the example belongs (one of train, test or valid)
func_code_url : URL to the function code on GitHub
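The field list above can be written down as a typed record, which helps when checking that a loaded example has the expected shape. This is a sketch only: the field types are inferred from the example instance (for instance, `id` is shown as a string), and `CodeSearchNetRecord`, `EXPECTED_FIELDS`, and `missing_fields` are hypothetical names introduced here.

```python
from typing import List, TypedDict


class CodeSearchNetRecord(TypedDict):
    """Field names from the card; types are inferred from the example instance."""
    id: str
    repository_name: str
    func_path_in_repository: str
    func_name: str
    whole_func_string: str
    language: str
    func_code_string: str
    func_code_tokens: List[str]
    func_documentation_string: str
    func_documentation_string_tokens: List[str]
    split_name: str  # one of "train", "test" or "valid"
    func_code_url: str


# Field names in declaration order.
EXPECTED_FIELDS = tuple(CodeSearchNetRecord.__annotations__)


def missing_fields(record: dict) -> List[str]:
    """List the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]
```

An empty list from `missing_fields` means all twelve fields are present; it does not check the value types.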

Data Splits

Three splits are available:

train
test
valid

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

All information can be found in the original technical report (arXiv:1909.09436).

Corpus collection:

The corpus was collected from publicly available, open-source, non-fork GitHub repositories, using libraries.io to identify all projects that are used by at least one other project, sorted by “popularity” as indicated by the number of stars and forks.

Then, any projects that do not have a license or whose license does not explicitly permit the redistribution of parts of the project were removed. All Go, Java, JavaScript, Python, PHP and Ruby functions (or methods) were then tokenized using Tree-sitter, GitHub's universal parser, and, where available, their respective documentation text was tokenized using a heuristic regular expression.

Corpus filtering:

Functions without documentation are removed from the corpus. This yields a set of pairs ($c_i$, $d_i$) where $c_i$ is some function documented by $d_i$. Pairs ($c_i$, $d_i$) are passed through the following preprocessing tasks:

Documentation $d_i$ is truncated to the first full paragraph to remove in-depth discussion of function arguments and return values
Pairs in which $d_i$ is shorter than three tokens are removed
Functions $c_i$ whose implementation is shorter than three lines are removed
Functions whose name contains the substring “test” are removed
Constructors and standard extension methods (e.g. __str__ in Python or toString in Java) are removed
Duplicate and near-duplicate functions are removed, so that only one version of each function is kept
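The filtering steps listed above can be sketched as a single pass over records. This is an illustration under several assumptions, not the pipeline used to build the corpus: documentation is tokenized on whitespace, a paragraph ends at the first blank line, the "test" check is case-insensitive, `STANDARD_METHODS` is a small made-up subset, and near-duplicate detection is replaced by exact matching on the code tokens.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Illustrative subset only; the full list of excluded methods is not given here.
STANDARD_METHODS = {"__init__", "__str__", "toString"}


def first_paragraph(doc: str) -> str:
    """Truncate documentation to the text before the first blank line."""
    return doc.strip().split("\n\n", 1)[0]


def filter_pairs(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield records that survive the simplified filtering rules."""
    seen: Set[Tuple[str, ...]] = set()
    for rec in records:
        doc = first_paragraph(rec["func_documentation_string"])
        # Documentation shorter than three (whitespace) tokens.
        if len(doc.split()) < 3:
            continue
        # Implementation shorter than three lines.
        if len(rec["func_code_string"].strip().splitlines()) < 3:
            continue
        name = rec["func_name"]
        # Name contains "test" (case-insensitive here, by assumption).
        if "test" in name.lower():
            continue
        # Constructors and standard extension methods.
        if name.split(".")[-1] in STANDARD_METHODS:
            continue
        # Exact-duplicate check on tokens; stands in for near-duplicate detection.
        key = tuple(rec["func_code_tokens"])
        if key in seen:
            continue
        seen.add(key)
        yield {**rec, "func_documentation_string": doc}
```

The generator keeps the first occurrence of each token sequence, so the order of the input determines which duplicate survives.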

Who are the source language producers?

Open-source contributors produced the code and documentation.

The dataset was gathered and preprocessed automatically.

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Each example in the dataset is extracted from a GitHub repository, and each repository has its own license. Example-wise license information is not (yet) included in this dataset: you will need to find out yourself which license the code is using.
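One possible way to look up a repository's license is the GitHub REST API endpoint `GET /repos/{owner}/{repo}/license`. The sketch below assumes `repository_name` has the `organisation/repository` form shown in the data instance; the helper names are hypothetical, and the fetch function needs network access and is subject to GitHub's rate limits.

```python
import json
import urllib.request
from typing import Optional

GITHUB_API = "https://api.github.com"


def license_api_url(repository_name: str) -> str:
    """Build the GitHub REST API URL for a repository's license."""
    owner, sep, repo = repository_name.partition("/")
    if not sep or not owner or not repo or "/" in repo:
        raise ValueError(f"expected 'owner/repo', got {repository_name!r}")
    return f"{GITHUB_API}/repos/{owner}/{repo}/license"


def fetch_license_spdx_id(repository_name: str) -> Optional[str]:
    """Query GitHub for the detected license identifier (needs network).

    A repository without a detected license answers with an HTTP error,
    which urllib raises as an exception.
    """
    request = urllib.request.Request(
        license_api_url(repository_name),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return (payload.get("license") or {}).get("spdx_id")
```

The API reports the license of the repository as it is today, which may differ from the license at the commit referenced in `func_code_url`, so the result is a starting point rather than a definitive answer.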

Citation Information

@article{husain2019codesearchnet,
  title={{CodeSearchNet} challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}

Contributions

Thanks to @SBrandeis for adding this dataset.

Author:

Anonymous

Dataset size:

4.77 GB