Dataset:
codeparrot/codecomplex
CodeComplex consists of 4,200 Java programs submitted to programming competitions by human programmers, together with time-complexity labels annotated by a group of algorithm experts.
You can load the dataset and print its first example with the following lines of code:
from datasets import load_dataset
ds = load_dataset("codeparrot/codecomplex", split="train")
print(next(iter(ds)))
DatasetDict({
    train: Dataset({
        features: ['src', 'complexity', 'problem', 'from'],
        num_rows: 4517
    })
})
{'src': 'import java.io.*;\nimport java.math.BigInteger;\nimport java.util.InputMismatchException;...', 'complexity': 'quadratic', 'problem': '1179_B. Tolik and His Uncle', 'from': 'CODEFORCES'}
The complexity field has 7 classes, each with around 500 codes. The seven classes are constant, linear, quadratic, cubic, log(n), nlog(n) and NP-hard.
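To check the class balance yourself, you can tally the complexity field. The snippet below is a minimal sketch: `count_complexity` is a hypothetical helper, and the `sample` rows are made-up stand-ins so the code runs without downloading anything. The exact label strings other than 'quadratic' (which appears in the example record above) are assumptions.

```python
from collections import Counter


def count_complexity(records):
    """Count how many records fall into each complexity class.

    `records` is any iterable of dict-like rows with a 'complexity' key,
    which is the shape of the rows shown in the example record.
    """
    return Counter(record["complexity"] for record in records)


# Made-up stand-in rows (not real dataset entries); the label string
# "linear" is an assumption about how that class is spelled.
sample = [
    {"src": "...", "complexity": "quadratic"},
    {"src": "...", "complexity": "linear"},
    {"src": "...", "complexity": "quadratic"},
]

counts = count_complexity(sample)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

Passing the loaded train split instead of `sample` should work the same way, since iterating over it yields one dict per example.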
The dataset only contains a train split.
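Because there is no predefined validation or test split, you may want to hold out part of the data yourself. The following is a sketch of one possible approach, a per-class (stratified) split in plain Python. It is not the authors' evaluation protocol; `stratified_split`, its parameters, and the toy `rows` are all hypothetical and only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, label_key="complexity", test_fraction=0.2, seed=0):
    """Split records into (train, test), holding out the same fraction per class."""
    rng = random.Random(seed)

    # Group the rows by their label so each class is split separately.
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_test = int(round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy rows standing in for dataset examples (label strings are assumptions).
rows = [
    {"src": f"code {i}", "complexity": label}
    for label in ("linear", "quadratic")
    for i in range(10)
]

train_rows, test_rows = stratified_split(rows, test_fraction=0.2, seed=0)
print(len(train_rows), len(test_rows))
```

Splitting per class keeps the roughly balanced label distribution in both parts, which a plain random split does not guarantee on smaller subsets.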
The authors first collected problems and their Java solutions from CodeForces. Experienced human annotators then inspected each solution and labelled it with its time complexity. After the labelling, a separate group of programming experts verified the class that the annotators had assigned to each item.
@article{JeonBHHK22,
  author = {Mingi Jeon and Seung-Yeop Baik and Joonghyuk Hahn and Yo-Sub Han and Sang-Ki Ko},
  title  = {{Deep Learning-based Code Complexity Prediction}},
  year   = {2022},
}