CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench
Given two code snippets as input, the task is binary classification (0/1), where 1 means the two functions are semantically equivalent and 0 means they are not. Models are evaluated by F1 score. The dataset is BigCloneBench, filtered as described in the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree (Wang et al., 2020).
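For a quick look at the data, here is a minimal loading sketch. The Hugging Face Hub dataset id used below is an assumption; the raw train/valid/test files can also be taken from the CodeXGLUE repository linked above.

```python
# Minimal sketch: load the test split and inspect one function pair.
# The dataset id below is an assumption; adjust it if you load the raw
# files from the CodeXGLUE repository instead.
from datasets import load_dataset

ds = load_dataset("code_x_glue_cc_clone_detection_big_clone_bench", split="test")
example = ds[0]
print(example["id1"], example["id2"], example["label"])  # function ids and clone label
print(example["func1"][:80])                             # start of the first function body
```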
An example of 'test' looks as follows.
```json
{
  "func1": " @Test(expected = GadgetException.class)\n public void malformedGadgetSpecIsCachedAndThrows() throws Exception {\n HttpRequest request = createCacheableRequest();\n expect(pipeline.execute(request)).andReturn(new HttpResponse(\"malformed junk\")).once();\n replay(pipeline);\n try {\n specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n fail(\"No exception thrown on bad parse\");\n } catch (GadgetException e) {\n }\n specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n }\n",
  "func2": " public InputStream getInputStream() throws TGBrowserException {\n try {\n if (!this.isFolder()) {\n URL url = new URL(this.url);\n InputStream stream = url.openStream();\n return stream;\n }\n } catch (Throwable throwable) {\n throw new TGBrowserException(throwable);\n }\n return null;\n }\n",
  "id": 0,
  "id1": 2381663,
  "id2": 4458076,
  "label": false
}
```
In the following, each data field is explained for each configuration. The data fields are the same among all splits.

default

| field name | type | description |
|---|---|---|
| id | int32 | Index of the sample |
| id1 | int32 | The first function id |
| id2 | int32 | The second function id |
| func1 | string | The full text of the first function |
| func2 | string | The full text of the second function |
| label | bool | True if the two functions are semantically equivalent, False otherwise |
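As a hedged illustration of how these fields might feed a model, the sketch below encodes a (func1, func2) pair as a single sequence-pair input. The `microsoft/codebert-base` tokenizer is a common baseline choice for this benchmark, not something the dataset mandates, and the two function strings are made up for the example.

```python
# Sketch: encode a (func1, func2) pair as one sequence-pair input.
# The tokenizer choice is illustrative; any pair-capable tokenizer works.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
func1 = "public int add(int a, int b) { return a + b; }"
func2 = "public int sum(int x, int y) { return x + y; }"
enc = tokenizer(func1, func2, truncation=True, max_length=512, return_tensors="pt")
label = 1  # 1 = semantically equivalent, 0 = not
```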
The number of examples per split:

| name | train | validation | test |
|---|---|---|---|
| default | 901028 | 415416 | 415416 |
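Because models are ranked by F1, a minimal evaluation sketch follows. It assumes gold labels and predictions are aligned 0/1 lists and uses scikit-learn only for illustration; the CodeXGLUE repository provides the official evaluator script.

```python
# Sketch of the F1 evaluation; sklearn is used here only for illustration.
# gold/preds are hypothetical aligned 0/1 lists (1 = semantically equivalent).
from sklearn.metrics import f1_score, precision_score, recall_score

gold  = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 0]

print("precision:", precision_score(gold, preds))
print("recall:   ", recall_score(gold, preds))
print("F1:       ", f1_score(gold, preds))
```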
The source data was mined from the IJaDataset 2.0 corpus of open-source Java projects.

Who are the source language producers? [More Information Needed]

Candidate clone pairs were first identified automatically using search heuristics and then manually labeled by three judges.

Who are the annotators? [More Information Needed]
Most of the clones are of Type-1 and Type-2; Type-3 and especially Type-4 clones are rare.
Dataset curators: https://github.com/microsoft, https://github.com/madlag
Computational Use of Data Agreement (C-UDA) License.
```bibtex
@inproceedings{svajlenko2014towards,
  title={Towards a big data curated benchmark of inter-project code clones},
  author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
  booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
  pages={476--480},
  year={2014},
  organization={IEEE}
}

@inproceedings{wang2020detecting,
  title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
  author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
  booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={261--271},
  year={2020},
  organization={IEEE}
}
```
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.