IntroProg is a collection of students' submissions to assignments in introductory programming courses offered at different universities. Currently, the dataset contains submissions collected from Dublin City University and the National University of Singapore (NUS).
Dublin
The Dublin programming dataset is composed of students' submissions to introductory programming assignments at Dublin City University. Students submitted these programs for multiple programming courses over three academic years.
Singapore
The Singapore dataset contains 2442 correct and 1783 buggy program attempts by 361 undergraduate students taking an introductory Python programming course for credit at NUS (National University of Singapore).
As with Mostly Basic Python Problems (MBPP), the "data" configuration can be used to evaluate code generation models.
"Data"
The "data" configuration contains all the submissions, along with an indicator of whether each one passed the required tests.
"repair": Program refinement/repair
The "repair" configuration of each dataset is a subset of the "data" configuration augmented with educators' annotations on the corrections to the buggy programs. This configuration can be used for the task of program refinement. In Computing Education Research (CER), methods for automatically repairing student programs are used to provide students with feedback and help them debug their code.
"bug": Bug classification
[Coming soon]
The assignments were written in Python.
Each configuration is defined by one source dataset (dublin or singapore) and one subconfiguration ("metadata", "data", or "repair").
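The pairing of source dataset and subconfiguration can be sketched as follows. This is a minimal illustration, not the dataset's documented API: joining the two parts with an underscore is an assumption, `config_name` and `load_config` are hypothetical helpers, and the repository identifier is left as a parameter because it is not stated here. Only `datasets.load_dataset(path, name)` is a real library call.

```python
from itertools import product

SOURCES = ("dublin", "singapore")
SUBCONFIGS = ("metadata", "data", "repair")

def config_name(source, subconfig, sep="_"):
    """Build a configuration name; the separator is an assumed convention."""
    if source not in SOURCES:
        raise ValueError(f"unknown source dataset: {source!r}")
    if subconfig not in SUBCONFIGS:
        raise ValueError(f"unknown subconfiguration: {subconfig!r}")
    return f"{source}{sep}{subconfig}"

# Every source/subconfiguration pair described above.
ALL_CONFIGS = [config_name(s, c) for s, c in product(SOURCES, SUBCONFIGS)]

def load_config(repo_id, source, subconfig):
    """Hypothetical loader; requires the `datasets` package and network access."""
    from datasets import load_dataset
    return load_dataset(repo_id, config_name(source, subconfig))

print(ALL_CONFIGS)
```

Validating both parts before building the name keeps a typo from turning into a confusing "configuration not found" error at load time.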
[More Information Needed]
[More Information Needed]
Some of the fields are configuration-specific.
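To make the idea of configuration-specific fields concrete, the sketch below uses toy rows: one set shaped like the "data" configuration (with a pass indicator) and one shaped like the "repair" configuration (with an educator's correction). All field names (`user`, `code`, `correct`, `annotation`) are assumptions for illustration; the real schema may differ.

```python
import difflib

# Hypothetical rows; every field name here is an assumption for illustration.
data_rows = [
    {"user": "s1", "code": "def is_even(n):\n    return n % 2 == 1", "correct": False},
    {"user": "s2", "code": "def is_even(n):\n    return n % 2 == 0", "correct": True},
]
repair_row = {
    "code": "def is_even(n):\n    return n % 2 == 1",
    "annotation": "def is_even(n):\n    return n % 2 == 0",
}

def buggy_only(rows):
    """Keep the submissions whose pass indicator is False."""
    return [r for r in rows if not r["correct"]]

def correction_diff(row):
    """Unified diff from the buggy program to its annotated correction."""
    return list(difflib.unified_diff(
        row["code"].splitlines(),
        row["annotation"].splitlines(),
        fromfile="buggy",
        tofile="corrected",
        lineterm="",
    ))

buggy = buggy_only(data_rows)
diff_lines = correction_diff(repair_row)
print(len(buggy), "buggy submission(s)")
print("\n".join(diff_lines))
```

A line-level diff is only one possible way to present a correction as feedback; it is used here because it needs nothing beyond the standard library.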
The Dublin dataset is split into a training set and a test set. The training set contains the submissions to assignments written during the academic years 2015-2016 and 2016-2017, while the test set contains programs written during the academic year 2017-2018.
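The year-based split described above can be expressed as a small rule. This is a sketch under stated assumptions: the `academic_year` field and its "YYYY-YYYY" string format are hypothetical, and the rows are toy data rather than real submissions.

```python
from collections import Counter

TRAIN_YEARS = {"2015-2016", "2016-2017"}
TEST_YEARS = {"2017-2018"}

def assign_split(academic_year):
    """Map an academic year to the split it belongs to."""
    if academic_year in TRAIN_YEARS:
        return "train"
    if academic_year in TEST_YEARS:
        return "test"
    raise ValueError(f"academic year outside the dataset: {academic_year!r}")

# Toy rows; "academic_year" is an assumed field name.
rows = [
    {"id": 1, "academic_year": "2015-2016"},
    {"id": 2, "academic_year": "2016-2017"},
    {"id": 3, "academic_year": "2017-2018"},
]
split_counts = Counter(assign_split(r["academic_year"]) for r in rows)
print(split_counts)
```

Splitting by time rather than at random means the held-out programs come from a later cohort than any seen during training.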
Singapore
The Singapore dataset contains only a training split, which can serve as a test split for evaluating how feedback methods perform on an unseen dataset (for instance, after training them on the Dublin dataset).
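One way to picture such a held-out evaluation is to measure how many buggy programs a repair method fixes. The sketch below is illustrative only: the `code` and `tests` fields, the assert-style test format, and the deliberately naive `naive_repair` function are all assumptions, and the rows are toy data.

```python
def passes(code, tests):
    """Run `code`, then each assert-style test, in a fresh namespace.

    Executing untrusted student code like this is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True

def repair_rate(repair_fn, buggy_rows):
    """Fraction of buggy programs that pass their tests after repair."""
    fixed = sum(passes(repair_fn(r["code"]), r["tests"]) for r in buggy_rows)
    return fixed / len(buggy_rows)

def naive_repair(code):
    """Placeholder 'method': swaps subtraction for addition."""
    return code.replace(" - ", " + ")

# Toy held-out rows; field names are assumptions.
held_out = [
    {"code": "def add(a, b):\n    return a - b", "tests": ["assert add(2, 3) == 5"]},
    {"code": "def double(x):\n    return x * 3", "tests": ["assert double(2) == 4"]},
]
rate = repair_rate(naive_repair, held_out)
print(f"repair rate: {rate:.2f}")
```

Any repair method with the same signature as `naive_repair` could be dropped in, which keeps the metric independent of how the method was trained.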
[More Information Needed]
[More Information Needed]
Who are the source language producers?
[More Information Needed]
[More Information Needed]
Who are the annotators?
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
The data was released under the GNU Lesser General Public License v3.0.
@inproceedings{azcona2019user2code2vec,
  title={user2code2vec: Embeddings for Profiling Students Based on Distributional Representations of Source Code},
  author={Azcona, David and Arora, Piyush and Hsiao, I-Han and Smeaton, Alan},
  booktitle={Proceedings of the 9th International Learning Analytics & Knowledge Conference (LAK’19)},
  year={2019},
  organization={ACM}
}
@inproceedings{DBLP:conf/edm/CleuziouF21,
  author = {Guillaume Cleuziou and Fr{\'{e}}d{\'{e}}ric Flouvat},
  editor = {Sharon I{-}Han Hsiao and Shaghayegh (Sherry) Sahebi and Fran{\c{c}}ois Bouchet and Jill{-}J{\^{e}}nn Vie},
  title = {Learning student program embeddings using abstract execution traces},
  booktitle = {Proceedings of the 14th International Conference on Educational Data Mining, {EDM} 2021, virtual, June 29 - July 2, 2021},
  publisher = {International Educational Data Mining Society},
  year = {2021},
  timestamp = {Wed, 09 Mar 2022 16:47:22 +0100},
  biburl = {https://dblp.org/rec/conf/edm/CleuziouF21.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
[More Information Needed]