A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.
The original dataset contains a lot of duplicated and noisy data. It was therefore cleaned with the following steps (a sketch of the filters is shown after this list):

- deduplication (removal of exact matches)
- filtering:
  - average line length below 100 (recorded in the `line_mean` column)
  - maximum line length below 1000 (`line_max`)
  - fraction of alphanumeric characters above 0.25 (`alpha_frac`)
  - removal of auto-generated files, identified by keyword search (`autogenerated`)
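For illustration, here is a minimal Python sketch of these per-file filters. The threshold values and the keyword list are assumptions mirroring the criteria above; the authoritative logic lives in the preprocessing script linked below.

```python
# Assumed thresholds, taken from the filtering criteria listed above.
MAX_LINE_MEAN = 100
MAX_LINE_MAX = 1000
MIN_ALPHA_FRAC = 0.25

# Hypothetical keyword list for spotting auto-generated files.
AUTOGEN_KEYWORDS = ["auto-generated", "autogenerated", "automatically generated"]


def keep_file(content: str) -> bool:
    """Return True if a file passes the line-length, alphanumeric-fraction,
    and auto-generation filters described above."""
    lines = content.splitlines()
    if not lines or not content:
        return False
    line_lengths = [len(line) for line in lines]
    line_mean = sum(line_lengths) / len(line_lengths)
    line_max = max(line_lengths)
    alpha_frac = sum(c.isalnum() for c in content) / len(content)
    autogenerated = any(kw in content.lower() for kw in AUTOGEN_KEYWORDS)
    return (
        line_mean < MAX_LINE_MEAN
        and line_max < MAX_LINE_MAX
        and alpha_frac > MIN_ALPHA_FRAC
        and not autogenerated
    )
```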
For more details, see the preprocessing script in the transformers repository here.
The dataset is split into a train and a validation split, used for training and evaluation respectively.
This dataset has ~50GB of code across 5,361,373 files.
```
DatasetDict({
    train: Dataset({
        features: ['repo_name', 'path', 'copies', 'size', 'content', 'license', 'hash', 'line_mean', 'line_max', 'alpha_frac', 'autogenerated'],
        num_rows: 5361373
    })
})
```
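As a usage sketch, the train split can be loaded with the 🤗 `datasets` library. The Hub ID `codeparrot/codeparrot-clean` is an assumption based on the dataset name; streaming avoids downloading the full ~50GB up front.

```python
from datasets import load_dataset

# Hub ID is an assumption; adjust if the dataset lives under a different name.
ds = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)

# Inspect the first file without materializing the whole dataset.
first = next(iter(ds))
print(first["repo_name"], first["path"], first["size"])
```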