Dataset:
koutch/staqc
StaQC (Stack Overflow Question-Code pairs) is a large dataset of around 148K Python and 120K SQL domain question-code pairs, which are automatically mined from Stack Overflow using a Bi-View Hierarchical Neural Network. StaQC is collected from three sources: multi-code answer posts, single-code answer posts, and manual annotations on multi-code answer posts.
The dataset was originally released by the main authors on GitHub. This version is an unmodified copy, redistributed as the license permits and made available on the hub for easier access.
Standalone solutions
As noted in the paper, the authors define a code snippet as a code solution when the questioner can solve the problem solely based on it (also called a "standalone" solution).
Manual annotations
The manual annotations are the collection of multi-code answer posts in which each code snippet was annotated with a boolean indicating whether the snippet is a standalone solution to the question.
Multi-code answer posts
A multi-code answer post is an (accepted) answer post that contains multiple code snippets, some of which may not be a standalone code solution to the question (see Section 1 in the paper). For example, in this multi-code answer post, the third code snippet is not a code solution to the question "How to limit a number to be within a specified range? (Python)".
Note: the multi-code answer posts also contain the manual annotations.
Single-code answer posts
A single-code answer post is an (accepted) answer post that contains only one code snippet. We pair such code snippets with the question title as a question-code pair.
This dataset can be used for Natural Language to Code Generation tasks.
Python, SQL, English
Each configuration corresponds to one of the three parts, in a given programming language.
There are three parts for the dataset (multi-code answer posts, single-code answer posts, and manual annotations) and two programming/query languages (Python and SQL).
A configuration name combines a part with a programming language. For instance, one can obtain the automatically mined multi-code answers in Python using:
from datasets import load_dataset

dataset = load_dataset("koutch/staqc", 'mca_python')
DatasetDict({
    train: Dataset({
        features: ['id', 'question_id', 'question', 'snippet'],
        num_rows: 40391
    })
})
or the manual annotations using:
dataset = load_dataset("koutch/staqc", 'man_sql')
DatasetDict({
    train: Dataset({
        features: ['id', 'question_id', 'question', 'snippet'],
        num_rows: 1587
    })
})
Manual annotations
For a given Stack Overflow question, the manual annotations record, for each individual code block in the accepted answer, whether that code block is a standalone solution to the question asked (the question title).
{
  'question_id': 5947137,
  'question': 'How can I use a list comprehension to extend a list in python?',
  'snippet': {
    'text': [
      'import itertools as it\n\nreturn sum(it.imap(doSomething, originalList), [])\n',
      'return sum(map(doSomething, originalList), [])\n',
      'return sum((doSomething(x) for x in originalList), [])\n',
      'accumulationList = []\nfor x in originalList:\n accumulationList.extend(doSomething(x))\nreturn accumulationList\n'
    ],
    'is_sda': [True, True, True, True]
  }
}
Multi-code answer posts
{
  'question_id': 35349290,
  'question': 'Python: Generating YYMM string between two dates',
  'snippet': [
    'start_year = 2005\nend_year = 2007\nstart_month = 3\nend_month = 2\nyymm = [(yy, mm) for yy in range(start_year, end_year + 1) for mm in range(1, 13)\n if (start_year, start_month) <= (yy, mm) <= (end_year, end_month)]\n',
    "formatted_yymm = ['{:>02}{:>02}.mat'.format(yy % 100, mm) for yy, mm in yymm]\n"
  ]
}
Single-code answer posts
{
  'question_id': 19387200,
  'question': 'Python: get OS language',
  'snippet': "import locale\nloc = locale.getlocale() # get current locale\nlocale.getdefaultlocale() # Tries to determine the default locale settings and returns them as a tuple of the form (language code, encoding); e.g, ('en_US', 'UTF-8').\n"
}
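The three record shapes shown above differ only in the `snippet` field: a dict with parallel `text` and `is_sda` lists for manual annotations, a list of strings for multi-code answer posts, and a single string for single-code answer posts. A minimal sketch of flattening any of them into (question, code) pairs might look like the following. The helper name `to_pairs`, its `standalone_only` flag, and the toy records are hypothetical and not part of the dataset or of the `datasets` library; the field layout is inferred from the examples, not from an official schema.

```python
def to_pairs(record, standalone_only=True):
    """Flatten one StaQC-style record into a list of (question, code) tuples.

    Hypothetical helper: the layout of `snippet` is inferred from the
    example records, so check it against the loaded data before relying on it.
    """
    question = record["question"]
    snippet = record["snippet"]

    if isinstance(snippet, dict):
        # Manual annotations: parallel lists of code blocks and boolean labels.
        labelled = zip(snippet["text"], snippet["is_sda"])
        return [(question, code) for code, is_sda in labelled
                if is_sda or not standalone_only]
    if isinstance(snippet, list):
        # Multi-code answer posts: one pair per code block.
        return [(question, code) for code in snippet]
    # Single-code answer posts: the snippet is a single string.
    return [(question, snippet)]


# Toy records (made up for illustration) mirroring the three shapes.
manual = {
    "question_id": 1,
    "question": "toy question",
    "snippet": {"text": ["a = 1\n", "b = 2\n"], "is_sda": [True, False]},
}
multi = {"question_id": 2, "question": "toy question", "snippet": ["x\n", "y\n"]}
single = {"question_id": 3, "question": "toy question", "snippet": "z\n"}

print(to_pairs(manual))                         # only the block labelled standalone
print(to_pairs(manual, standalone_only=False))  # every annotated block
print(to_pairs(multi))
print(to_pairs(single))
```

Keeping the filter on `is_sda` optional lets the same helper serve both uses of the manual annotations: as clean question-code pairs, or as labelled data for training a standalone-solution classifier.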
Each configuration of the dataset contains only a training split.
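Because every configuration name joins a part and a language, and each configuration exposes only a `train` split, the names can be generated rather than typed out. The sketch below assumes the `<part>_<language>` pattern seen in `mca_python` and `man_sql` also holds for the remaining combinations. The `sca` prefix for single-code answer posts is an assumption made by analogy, since only `mca` and `man` appear in the loading examples; verify it against the hub before use.

```python
from itertools import product

# "mca" and "man" appear in the loading examples of this card;
# "sca" (single-code answer posts) is ASSUMED by analogy.
PARTS = ["mca", "sca", "man"]
LANGUAGES = ["python", "sql"]


def config_names(parts=PARTS, languages=LANGUAGES):
    """Return every '<part>_<language>' configuration name."""
    return [f"{part}_{language}" for part, language in product(parts, languages)]


names = config_names()
print(names)

# Loading requires network access and the `datasets` package, so it is
# left commented out here:
# from datasets import load_dataset
# train_sets = {name: load_dataset("koutch/staqc", name)["train"] for name in names}
```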
Stack Overflow data dump.
See Section 2.3, "Annotating QC Pairs for Model Training", of the paper.
This work is licensed under a Creative Commons Attribution 4.0 International License.
If you use the dataset or the code in your research, please cite the following paper:
@inproceedings{yao2018staqc,
  title={StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow},
  author={Yao, Ziyu and Weld, Daniel S and Chen, Wei-Peng and Sun, Huan},
  booktitle={Proceedings of the 2018 World Wide Web Conference on World Wide Web},
  pages={1693--1703},
  year={2018},
  organization={International World Wide Web Conferences Steering Committee}
}
I did not contribute to the creation of this dataset, only to its redistribution. All credit belongs to the original authors.