Datasets:

notional/notional-python

Languages:

py

Multilinguality:

monolingual

Size Categories:

10K<n<100K

Language Creators:

found

Annotations Creators:

no-annotation

Source Datasets:

original

Dataset Card for notional-python

Dataset Summary

The notional-python dataset contains Python code files from 100 well-known repositories, gathered from the Google BigQuery GitHub dataset. It was created to evaluate programming language models. See our repo to run model evaluation with the notional-python dataset.

Languages

Python

Dataset Creation

Curation Rationale

Notional-python was built to provide a dataset for testing the ability of models to generate Python code.

Source Data

Initial Data Collection and Normalization

The data was obtained by filtering code from the Google BigQuery GitHub dataset. To improve the quality of the dataset, only Python code files that meet the conditions below were added:

  • Code with more than 60% of executable lines
  • Code with logic, not config files or comment-only files
  • Code with no more than 30% of attribute declaration lines (e.g., files that contain only class names and their class attributes, usually used for project configuration, were not selected)
  • Code without TODO or FIXME markers.
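
The conditions above could be approximated with a filter like the following. This is a minimal sketch, not the authors' actual pipeline: the function name `keep_file`, the regular expressions, the way "executable" and "attribute declaration" lines are counted, and the choice of AST nodes that count as "logic" are all hypothetical. The attribute rule is read here as excluding files dominated by declaration lines, following the parenthetical in the list; that reading is also an assumption.

```python
import ast
import re

# Hypothetical thresholds and heuristics; the card does not publish exact rules.
MIN_EXECUTABLE_RATIO = 0.6   # share of non-blank lines that are not comments
MAX_ATTRIBUTE_RATIO = 0.3    # assumed upper bound on declaration-like lines

MARKER_RE = re.compile(r"\b(?:TODO|FIXME)\b")
# Crude stand-in for an "attribute declaration": a bare `name = ...` line.
ATTRIBUTE_RE = re.compile(r"^\s*(?:self\.)?[A-Za-z_]\w*\s*=(?!=)")
# Node types taken as evidence that a file contains logic (an assumption).
LOGIC_NODES = (
    ast.FunctionDef, ast.AsyncFunctionDef,
    ast.If, ast.For, ast.While, ast.Try, ast.With,
)


def keep_file(source: str) -> bool:
    """Return True if `source` passes all four heuristic checks."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines:
        return False
    # Condition: no TODO or FIXME markers anywhere in the file.
    if MARKER_RE.search(source):
        return False
    # Condition: the file contains logic, not only configuration or comments.
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    if not any(isinstance(node, LOGIC_NODES) for node in ast.walk(tree)):
        return False
    # Conditions: executable-line ratio and attribute-declaration ratio.
    executable = [ln for ln in lines if not ln.lstrip().startswith("#")]
    attributes = [ln for ln in executable if ATTRIBUTE_RE.match(ln)]
    return (
        len(executable) / len(lines) > MIN_EXECUTABLE_RATIO
        and len(attributes) / len(lines) <= MAX_ATTRIBUTE_RATIO
    )
```

Calling `keep_file(text)` on the contents of each candidate file keeps only files for which it returns True. The line-based counts are deliberately simple (docstrings count as executable, and any plain assignment counts as a declaration), so a real pipeline would likely need finer rules.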

Who are the source language producers?

The source code was written by GitHub users.