Datasets:

neural_code_search

ArXiv:

arxiv:1908.09804

Language creators:

crowdsourced

Annotations creators:

expert-generated

Source datasets:

original

Tasks:

question-answering

Sub-tasks:

extractive-qa

Languages:

en

Multilinguality:

monolingual

Dataset Card for Neural Code Search

Dataset Summary

Neural-Code-Search-Evaluation-Dataset presents an evaluation dataset consisting of natural language query and code snippet pairs, with the hope that future work in this area can use this dataset as a common benchmark. We also provide the results of two code search models (NCS, UNIF) from recent work.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

EN - English

Dataset Structure

Data Instances

Search Corpus

The search corpus is indexed using all method bodies parsed from 24,549 GitHub repositories. In total, there are 4,716,814 methods in this corpus. Given a natural language query, a code search model retrieves relevant code snippets (i.e. method bodies) from this corpus. For each method in the corpus, this data release provides the information listed under Data Fields.

Evaluation Dataset

The evaluation dataset is composed of 287 Stack Overflow question and answer pairs.
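To make the roles of the two parts concrete, here is a minimal sketch of the retrieval task: a natural language query goes in, and a ranked list of method IDs comes out. The scoring is plain token overlap (Jaccard similarity), swapped in purely for illustration; it is not the NCS or UNIF model. The names `tokenize`, `rank_methods`, and `toy_corpus`, as well as the two method bodies, are hypothetical and not part of the dataset.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens, breaking camelCase."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return set(re.findall(r"[a-z0-9]+", spaced.lower()))


def rank_methods(
    query: str, corpus: Dict[int, str], top_k: int = 3
) -> List[Tuple[int, float]]:
    """Rank method bodies by Jaccard overlap with the query (toy baseline)."""
    query_tokens = tokenize(query)
    scored = []
    for method_id, body in corpus.items():
        body_tokens = tokenize(body)
        union = query_tokens | body_tokens
        score = len(query_tokens & body_tokens) / len(union) if union else 0.0
        scored.append((method_id, score))
    # Highest score first; ties are broken by the smaller method ID.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical two-method corpus keyed by method ID (illustration only).
toy_corpus = {
    1: "void hideKeyboard(View view) { imm.hideSoftInputFromWindow(view.getWindowToken(), 0); }",
    2: "Bitmap loadBitmap(String path) { return BitmapFactory.decodeFile(path); }",
}

results = rank_methods("how to hide soft keyboard", toy_corpus)
for method_id, score in results:
    print(method_id, round(score, 3))
```

A real code search model would replace the overlap score with a learned similarity between the query and each method body, but the input and output shapes would stay roughly the same.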

Data Fields

Search Corpus
  • id: Each method in the corpus has a unique numeric identifier. This ID number will also be referenced in our evaluation dataset.
  • filepath: The file path, in the format :owner/:repo/relative-file-path-to-the-repo
  • method_name: Name of the method.
  • start_line: Starting line number of the method in the file.
  • end_line: Ending line number of the method in the file.
  • url: GitHub link to the method body with commit ID and line numbers encoded.
Evaluation Dataset
  • stackoverflow_id: Stack Overflow post ID.
  • question: Title of the Stack Overflow post.
  • question_url: URL of the Stack Overflow post.
  • answer: Code snippet answer to the question.
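As a rough illustration of how these fields might be handled in code, the sketch below models the core fields of a search corpus record and splits `filepath` into its owner, repository, and relative path components. The class and function names (`CorpusMethod`, `split_filepath`, `method_length`) and the example record are hypothetical, and treating `end_line` as inclusive is an assumption rather than something the dataset documents.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CorpusMethod:
    """Core fields of one search corpus record (hypothetical container)."""
    id: int
    filepath: str
    start_line: int
    end_line: int
    url: str


def split_filepath(filepath: str) -> Tuple[str, str, str]:
    """Split ':owner/:repo/relative-path' into (owner, repo, relative_path)."""
    parts = filepath.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected filepath format: {filepath!r}")
    owner, repo, relative_path = parts
    return owner, repo, relative_path


def method_length(method: CorpusMethod) -> int:
    """Lines spanned by the method, assuming end_line is inclusive."""
    return method.end_line - method.start_line + 1


# Hypothetical record; every value here is made up for illustration,
# and "<commit>" is a placeholder, not a real commit ID.
example = CorpusMethod(
    id=1,
    filepath="example-owner/example-repo/app/src/main/java/Foo.java",
    start_line=10,
    end_line=24,
    url="https://github.com/example-owner/example-repo/blob/<commit>/app/src/main/java/Foo.java#L10-L24",
)

owner, repo, relative_path = split_filepath(example.filepath)
print(owner, repo, relative_path, method_length(example))
```

Splitting on only the first two slashes keeps the relative path intact even when it contains nested directories.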

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

The most popular Android repositories on GitHub (ranked by the number of stars) are used to create the search corpus. For each repository that we indexed, we provide the link, specific to the commit that was used. In total, there are 24,549 repositories.

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

The dataset is provided for research purposes only. Please check the dataset license for additional information.

Additional Information

Dataset Curators

Hongyu Li, Seohyun Kim and Satish Chandra

Licensing Information

CC-BY-NC 4.0 (Attribution-NonCommercial 4.0 International)

Citation Information

arXiv:1908.09804 [cs.SE]

Contributions

Thanks to @vinaykudari for adding this dataset.