A dataset of Python files from GitHub. This is the deduplicated version of codeparrot.
The original dataset contains a lot of duplicated and noisy data. Therefore, it was cleaned with the following steps (reflected in the `hash`, `line_mean`, `line_max`, `alpha_frac`, and `autogenerated` columns of the schema below):

- Deduplication: remove exact duplicate files (matched by content hash)
- Filtering:
  - average line length < 100
  - maximum line length < 1000
  - fraction of alphanumeric characters > 0.25
  - remove auto-generated files (keyword search)
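The line-length and alphanumeric-fraction thresholds can be expressed as a short heuristic. The function below is a minimal sketch based on the thresholds listed above, not the actual preprocessing script; the function name and defaults are illustrative assumptions:

```python
def passes_filters(content: str,
                   max_mean_line: float = 100,
                   max_line_len: int = 1000,
                   min_alpha_frac: float = 0.25) -> bool:
    """Rough sketch of the cleaning heuristics described above."""
    lines = content.splitlines()
    if not lines or not content:
        return False
    lengths = [len(line) for line in lines]
    # Reject files whose average line length is >= 100 characters.
    if sum(lengths) / len(lengths) >= max_mean_line:
        return False
    # Reject files containing any line of 1000 characters or more.
    if max(lengths) >= max_line_len:
        return False
    # Reject files where <= 25% of characters are alphanumeric.
    if sum(c.isalnum() for c in content) / len(content) <= min_alpha_frac:
        return False
    return True
```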
For more details see the preprocessing script in the transformers repository here.
The dataset is split into train and validation splits, used for training and evaluation respectively.
This dataset contains ~50 GB of code across 5,361,373 files.
```
DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
        num_rows: 5361373
    })
})
```
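As a usage sketch, the dataset can be loaded with the 🤗 `datasets` library. Streaming avoids downloading the full ~50 GB up front; the dataset identifier `codeparrot/codeparrot-clean` is an assumption based on this card:

```python
from datasets import load_dataset

# Stream the train split so the full ~50 GB is not downloaded at once.
# The dataset identifier below is an assumption based on this card.
ds = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)

# Peek at a few examples using the schema columns shown above.
for example in ds.take(3):
    print(example["repo_name"], example["path"], len(example["content"]))
```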