Dataset:
py_ast
The dataset consists of parsed ASTs that were used to train and evaluate the DeepSyn tool. The Python programs were collected from GitHub repositories by removing duplicate files, removing project forks (copies of another existing repository), and keeping only programs that parse and have at most 30,000 nodes in the AST. An attempt was also made to remove obfuscated files.
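The parse and size criteria above can be approximated with a minimal sketch like the following. It is not the authors' pipeline: it assumes Python 3's standard `ast` module (whose node counts will differ from the dataset's own AST format), uses a content hash as a stand-in for duplicate detection, and does not attempt fork removal or obfuscation detection. The names `count_nodes` and `filter_programs` are hypothetical.

```python
import ast
import hashlib

MAX_NODES = 30_000  # node limit stated in the dataset description


def count_nodes(source: str) -> int:
    """Parse the source and return the number of nodes in its AST."""
    tree = ast.parse(source)  # raises SyntaxError if the source does not parse
    return sum(1 for _ in ast.walk(tree))


def filter_programs(sources):
    """Yield unique sources that parse and stay within MAX_NODES (assumed criteria)."""
    seen = set()
    for source in sources:
        # Assumption: exact-content hashing stands in for duplicate-file removal.
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        try:
            if count_nodes(source) <= MAX_NODES:
                yield source
        except (SyntaxError, ValueError):
            continue  # skip programs that do not parse
```

For instance, `list(filter_programs(["x = 7\n", "x = 7\n", "def (:\n"]))` keeps only one copy of the first program and drops the one that fails to parse.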
Code Representation, Unsupervised Learning
Python
A typical data point contains the parsed AST of a Python program. The main key is "ast", under which each program's AST is stored as a list of nodes. Each node has a "type", which gives the type of the node. A node that has children also has "children", a non-empty list of the indices of its child nodes in the list. A node that carries a hardcoded value has "value" (else "N/A"). An example would be:
'''
[ {"type":"Module","children":[1,4]}, {"type":"Assign","children":[2,3]}, {"type":"NameStore","value":"x"}, {"type":"Num","value":"7"}, {"type":"Print","children":[5]}, {"type":"BinOpAdd","children":[6,7]}, {"type":"NameLoad","value":"x"}, {"type":"Num","value":"1"} ]
'''
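To make the flat node list easier to read, the following sketch walks the example above from the root (index 0) and prints it as an indented tree. It assumes that "children" holds 0-based indices into the list, as the example suggests, and that a missing or "N/A" value means the node has no value. The helper name `render` is hypothetical.

```python
import json

# The example AST from the description above, as a JSON string.
EXAMPLE = """
[ {"type":"Module","children":[1,4]}, {"type":"Assign","children":[2,3]},
  {"type":"NameStore","value":"x"}, {"type":"Num","value":"7"},
  {"type":"Print","children":[5]}, {"type":"BinOpAdd","children":[6,7]},
  {"type":"NameLoad","value":"x"}, {"type":"Num","value":"1"} ]
"""


def render(nodes, index=0, depth=0):
    """Return indented text lines for the subtree rooted at nodes[index]."""
    node = nodes[index]
    label = node["type"]
    value = node.get("value", "N/A")
    if value != "N/A":  # assumption: "N/A" or a missing key means no value
        label += f" = {value}"
    lines = ["  " * depth + label]
    for child in node.get("children") or []:  # child entries are list indices
        lines.extend(render(nodes, child, depth + 1))
    return lines


nodes = json.loads(EXAMPLE)
print("\n".join(render(nodes)))
# Module
#   Assign
#     NameStore = x
#     Num = 7
#   Print
#     BinOpAdd
#       NameLoad = x
#       Num = 1
```

The recursion is fine for small trees like this one; for ASTs near the 30,000-node limit an explicit stack may be safer than recursive calls.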
The data is split into a training and test set. The final split sizes are as follows:
| | train | validation |
|---|---|---|
| py_ast examples | 100000 | 50000 |
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
Raychev, V., Bielik, P., and Vechev, M.
MIT, BSD and Apache
@InProceedings{OOPSLA16, title = {Probabilistic Model for Code with Decision Trees}, author = {Raychev, V. and Bielik, P. and Vechev, M.}, booktitle = {OOPSLA '16}, publisher = {ACM}, year = {2016} }
@inproceedings{10.1145/2983990.2984041, author = {Raychev, Veselin and Bielik, Pavol and Vechev, Martin}, title = {Probabilistic Model for Code with Decision Trees}, year = {2016}, isbn = {9781450344449}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/2983990.2984041}, doi = {10.1145/2983990.2984041}, booktitle = {Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications}, pages = {731–747}, numpages = {17}, keywords = {Code Completion, Decision Trees, Probabilistic Models of Code}, location = {Amsterdam, Netherlands}, series = {OOPSLA 2016} }
Thanks to @reshinthadithyan for adding this dataset.