数据集:
code_x_glue_cc_code_refinement
任务:
文生文语言:
code计算机处理:
other-programming-languages大小:
10K<n<100K语言创建人:
found批注创建人:
expert-generated源数据集:
original预印本库:
arxiv:1812.08693其他:
debugging许可:
c-udaCodeXGLUE code-refinement dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement
We use the dataset released by this paper( https://arxiv.org/pdf/1812.08693.pdf ). The source side is a Java function with bugs and the target side is the refined one. All the function and variable names are normalized. Their dataset contains two subsets ( i.e.small and medium) based on the function length.
An example of 'train' looks as follows.
{ "buggy": "public static TYPE_1 init ( java.lang.String name , java.util.Date date ) { TYPE_1 VAR_1 = new TYPE_1 ( ) ; VAR_1 . METHOD_1 ( name ) ; java.util.Calendar VAR_2 = java.util.Calendar.getInstance ( ) ; VAR_2 . METHOD_2 ( date ) ; VAR_1 . METHOD_3 ( VAR_2 ) ; return VAR_1 ; }\n", "fixed": "public static TYPE_1 init ( java.lang.String name , java.util.Date date ) { TYPE_1 VAR_1 = new TYPE_1 ( ) ; VAR_1 . METHOD_1 ( name ) ; java.util.Calendar VAR_2 = null ; if ( date != null ) { VAR_2 = java.util.Calendar.getInstance ( ) ; VAR_2 . METHOD_2 ( date ) ; } VAR_1 . METHOD_3 ( VAR_2 ) ; return VAR_1 ; }\n", "id": 0 }small
An example of 'validation' looks as follows.
{ "buggy": "public java.util.List < TYPE_1 > METHOD_1 ( ) { java.util.ArrayList < TYPE_1 > VAR_1 = new java.util.ArrayList < TYPE_1 > ( ) ; for ( TYPE_2 VAR_2 : VAR_3 ) { VAR_1 . METHOD_2 ( VAR_2 . METHOD_1 ( ) ) ; } return VAR_1 ; } \n", "fixed": "public java.util.List < TYPE_1 > METHOD_1 ( ) { return VAR_1 ; } \n", "id": 0 }
In the following each data field in go is explained for each config. The data fields are the same among all splits.
medium, smallfield name | type | description |
---|---|---|
id | int32 | Index of the sample |
buggy | string | The buggy version of the code |
fixed | string | The correct version of the code |
name | train | validation | test |
---|---|---|---|
medium | 52364 | 6546 | 6545 |
small | 46680 | 5835 | 5835 |
[More Information Needed]
Downloaded from GitHub Archive every public GitHub event between March 2011 and October 2017 and used the Google BigQuery APIs. [More Information Needed]
Who are the source language producers?Software Engineering developers.
Automatically annotated by filtering commit messages containing the pattern: ("fix" or "solve") and ("bug" or "issue" or "problem" or "error"). A statistically significant amount of samples (95% confidence level with 5% confidence interval) were manually evaluated by two authors to check if the filtered bug/fix pairs were correct. After all disagreements were settled, authors conclude that 97.6% were true positives.
Who are the annotators?Heuristics and the authors of the paper.
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
https://github.com/microsoft , https://github.com/madlag
Computational Use of Data Agreement (C-UDA) License.
@article{CodeXGLUE, title={CodeXGLUE: A Benchmark Dataset and Open Challenge for Code Intelligence}, year={2020},}
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.