Dataset: code_x_glue_cc_defect_detection
Tasks: text classification
Languages: code
Multilinguality: other-programming-languages
Size: 10K<n<100K
Language creators: found
Annotation creators: found
Source datasets: original
License: c-uda
CodeXGLUE Defect-detection dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection
Given a piece of source code, the task is to identify whether it is insecure code that may expose software systems to attack, for example through resource leaks, use-after-free vulnerabilities, or DoS attacks. We treat the task as binary classification (0/1), where 1 stands for insecure code and 0 for secure code. The dataset we use comes from the paper Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. We combine all projects and split them 80%/10%/10% into training/dev/test sets.
An example from the 'validation' split looks as follows.
{ "commit_id": "aa1530dec499f7525d2ccaa0e3a876dc8089ed1e", "func": "static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n MirrorState *s = FILTER_MIRROR(nf);\n Chardev *chr;\n chr = qemu_chr_find(s->outdev);\n if (chr == NULL) {\n error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,\n \"Device '%s' not found\", s->outdev);\n qemu_chr_fe_init(&s->chr_out, chr, errp);", "id": 8, "project": "qemu", "target": true }
Each data field is explained below for each config. The data fields are the same among all splits.
default
field name | type | description |
---|---|---|
id | int32 | Index of the sample |
func | string | The source code |
target | bool | 1 (true) if the code is insecure, 0 (false) if it is secure |
project | string | Original project that contains this code |
commit_id | string | Commit identifier in the original project |
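The field table above can be mirrored as a small typed record. The following is a minimal sketch, not the dataset's own loader: the names `DefectExample` and `parse_example` are hypothetical, storing one JSON object per record is an assumption, and the `func` body in the sample is abbreviated for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DefectExample:
    """One record, with the five fields listed in the table above."""
    id: int
    func: str
    target: bool
    project: str
    commit_id: str

    @property
    def label(self) -> int:
        # 1 = insecure code, 0 = secure code, per the task definition.
        return int(self.target)


def parse_example(line: str) -> DefectExample:
    """Parse one JSON object (assumed storage format) into a typed record."""
    raw = json.loads(line)
    return DefectExample(
        id=int(raw["id"]),
        func=str(raw["func"]),
        target=bool(raw["target"]),
        project=str(raw["project"]),
        commit_id=str(raw["commit_id"]),
    )


# Stand-in for the validation example; the func body is shortened here.
sample = json.dumps({
    "commit_id": "aa1530dec499f7525d2ccaa0e3a876dc8089ed1e",
    "func": "static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n}",
    "id": 8,
    "project": "qemu",
    "target": True,
})
example = parse_example(sample)
print(example.project, example.label)
```

Exposing `label` as an integer keeps the boolean `target` field intact while giving classifiers the 0/1 encoding described in the task definition.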
name | train | validation | test |
---|---|---|---|
default | 21854 | 2732 | 2732 |
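The split sizes in the table are consistent with an 80%/10%/10% partition of the combined data. The sketch below only checks the arithmetic; the rounding rule (floor for the training set, then halving the remainder) is an assumption, since the exact splitting procedure is not stated here, and `split_sizes` is a hypothetical helper.

```python
from typing import Tuple


def split_sizes(n: int) -> Tuple[int, int, int]:
    """Return (train, dev, test) sizes for an 80/10/10 split of n items.

    Assumed rounding: floor 80% for train, then split the rest in half,
    giving any leftover item to the test set.
    """
    train = n * 8 // 10
    dev = (n - train) // 2
    test = n - train - dev
    return train, dev, test


# Total number of examples, summed from the split table.
total = 21854 + 2732 + 2732
print(total, split_sizes(total))
```

Under this assumed rounding the computed sizes match the table: 21854 training, 2732 validation, and 2732 test examples out of 27318.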
Dataset curators: https://github.com/microsoft, https://github.com/madlag
Computational Use of Data Agreement (C-UDA) License.
@inproceedings{zhou2019devign, title={Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks}, author={Zhou, Yaqin and Liu, Shangqing and Siow, Jingkai and Du, Xiaoning and Liu, Yang}, booktitle={Advances in Neural Information Processing Systems}, pages={10197--10207}, year={2019} }
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.