Dataset: code_x_glue_tc_text_to_code
Task: translation
Multilinguality: other-programming-languages
Size: 100K<n<1M
Language creators: found
Annotation creators: found
Source datasets: original
Other: text-to-code
License: c-uda

CodeXGLUE text-to-code dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code
The dataset we use is crawled and filtered from Microsoft Documentation, whose documents are located at https://github.com/MicrosoftDocs/.
An example of 'train' looks as follows.
{ "code": "boolean function ( ) { return isParsed ; }", "id": 0, "nl": "check if details are parsed . concode_field_sep Container parent concode_elem_sep boolean isParsed concode_elem_sep long offset concode_elem_sep long contentStartPosition concode_elem_sep ByteBuffer deadBytes concode_elem_sep boolean isRead concode_elem_sep long memMapSize concode_elem_sep Logger LOG concode_elem_sep byte[] userType concode_elem_sep String type concode_elem_sep ByteBuffer content concode_elem_sep FileChannel fileChannel concode_field_sep Container getParent concode_elem_sep byte[] getUserType concode_elem_sep void readContent concode_elem_sep long getOffset concode_elem_sep long getContentSize concode_elem_sep void getContent concode_elem_sep void setDeadBytes concode_elem_sep void parse concode_elem_sep void getHeader concode_elem_sep long getSize concode_elem_sep void parseDetails concode_elem_sep String getType concode_elem_sep void _parseDetails concode_elem_sep String getPath concode_elem_sep boolean verify concode_elem_sep void setParent concode_elem_sep void getBox concode_elem_sep boolean isSmallBox" }
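The `nl` string in the example above packs several pieces of information into one field, separated by the marker tokens `concode_field_sep` and `concode_elem_sep`. The following is a minimal sketch of how it might be split apart. It assumes, based only on the example shown, that the text before the first `concode_field_sep` is the description, that the first delimited segment lists class member variables, and that the second lists class methods. The names `ParsedNL` and `parse_nl` are hypothetical and not part of the dataset or any library.

```python
from dataclasses import dataclass
from typing import List, Tuple

FIELD_SEP = "concode_field_sep"
ELEM_SEP = "concode_elem_sep"


@dataclass
class ParsedNL:
    description: str
    fields: List[Tuple[str, str]]   # (type, name), assumed to be member variables
    methods: List[Tuple[str, str]]  # (return type, name), assumed to be methods


def _split_elems(segment: str) -> List[Tuple[str, str]]:
    """Split one segment into (type, name) pairs."""
    pairs = []
    for elem in segment.split(ELEM_SEP):
        tokens = elem.split()
        if len(tokens) >= 2:
            # Treat the last token as the name and everything before it as the type.
            pairs.append((" ".join(tokens[:-1]), tokens[-1]))
    return pairs


def parse_nl(nl: str) -> ParsedNL:
    """Parse an `nl` value into description, fields, and methods."""
    parts = nl.split(FIELD_SEP)
    description = parts[0].strip()
    # Assumption: segment 1 holds variables and segment 2 holds methods.
    fields = _split_elems(parts[1]) if len(parts) > 1 else []
    methods = _split_elems(parts[2]) if len(parts) > 2 else []
    return ParsedNL(description, fields, methods)
```

Called on a shortened version of the sample, `parse_nl` returns the description `check if details are parsed .` together with pairs such as `("boolean", "isParsed")` for the variables and `("boolean", "verify")` for the methods.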
In the following, each data field is explained for each config. The data fields are the same among all splits.
default

field name | type | description |
---|---|---|
id | int32 | Index of the sample |
nl | string | The natural language description of the task |
code | string | The programming source code for the task |
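The three fields in the table map naturally onto a small typed record. The sketch below shows one way to read a single JSON-encoded sample and check it against the listed types. The class `TextToCodeSample` and the function `sample_from_json` are hypothetical names introduced for illustration, and the int32 range check is an assumption derived from the `int32` type in the table.

```python
import json
from dataclasses import dataclass

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


@dataclass(frozen=True)
class TextToCodeSample:
    id: int    # index of the sample (int32 in the table)
    nl: str    # natural language description of the task
    code: str  # programming source code for the task


def sample_from_json(line: str) -> TextToCodeSample:
    """Parse one JSON object and validate its fields against the table."""
    raw = json.loads(line)
    sample_id, nl, code = raw["id"], raw["nl"], raw["code"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(sample_id, bool) or not isinstance(sample_id, int):
        raise TypeError("id must be an integer")
    if not INT32_MIN <= sample_id <= INT32_MAX:
        raise ValueError("id does not fit in int32")
    if not isinstance(nl, str) or not isinstance(code, str):
        raise TypeError("nl and code must be strings")
    return TextToCodeSample(sample_id, nl, code)
```

A record that lacks one of the three keys raises `KeyError`, which keeps malformed rows from passing silently.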
name | train | validation | test |
---|---|---|---|
default | 100000 | 2000 | 2000 |
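The split sizes in the table can serve as a quick sanity check after loading. The sketch below compares any mapping of split names to sized collections against those numbers. The helper names are hypothetical. `load_and_check` additionally assumes that the Hugging Face `datasets` library is installed and that the dataset resolves under the identifier given at the top of this card, which may differ depending on the library version and hub layout.

```python
from typing import Dict, Mapping, Optional, Sized, Tuple

# Sizes taken from the table above.
EXPECTED_SPLIT_SIZES = {"train": 100000, "validation": 2000, "test": 2000}


def split_size_mismatches(
    splits: Mapping[str, Sized],
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {split: (expected, actual)} for every split whose size differs.

    A missing split is reported with an actual size of None.
    """
    mismatches = {}
    for name, expected in EXPECTED_SPLIT_SIZES.items():
        actual = len(splits[name]) if name in splits else None
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


def load_and_check():
    """Load the dataset and compare its split sizes (needs network access)."""
    # Assumption: the `datasets` package is available and this id resolves.
    from datasets import load_dataset

    ds = load_dataset("code_x_glue_tc_text_to_code")
    return split_size_mismatches(ds)
```

An empty dictionary from `split_size_mismatches` means every split has the size listed in the table.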
Dataset curators: https://github.com/microsoft, https://github.com/madlag
Computational Use of Data Agreement (C-UDA) License.
@article{iyer2018mapping,
  title={Mapping language to code in programmatic context},
  author={Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:1808.09588},
  year={2018}
}
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.