Dataset:
codeparrot/codeparrot-valid-near-deduplication
A dataset of Python files from GitHub. We performed near-deduplication of this dataset split, codeparrot-clean-train from codeparrot-clean. Exact deduplication can miss a fair number of nearly identical files. We used MinHash with a Jaccard threshold (default=0.85) to create duplicate clusters. These clusters are then reduced to unique files based on the exact Jaccard similarity. For more details, please refer to this repo.
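The two-stage idea described above (MinHash to find candidate duplicates, then exact Jaccard similarity to confirm them) can be sketched roughly as follows. This is a minimal illustration and not the code from the referenced repo: the tokenization regex, the signature length `NUM_PERM`, the helper names, and the greedy pairwise comparison are all assumptions made for the example. A real pipeline would typically use locality-sensitive hashing instead of comparing every pair.

```python
import hashlib
import re

NUM_PERM = 128    # assumed signature length, not taken from the source
THRESHOLD = 0.85  # the default Jaccard threshold quoted in the description


def tokens(code):
    """Split source text into a set of identifier-like tokens (assumed tokenization)."""
    return set(re.findall(r"[A-Za-z0-9_]+", code))


def jaccard(a, b):
    """Exact Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(token_set, num_perm=NUM_PERM):
    """For each seeded hash function, keep the smallest hash over all tokens."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        hashes = (
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in token_set
        )
        signature.append(min(hashes, default=0))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, which approximates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def deduplicate(files, threshold=THRESHOLD):
    """Keep the first file of each near-duplicate group (greedy, pairwise)."""
    kept = []  # list of (text, token set, signature) for files kept so far
    for text in files:
        toks = tokens(text)
        sig = minhash_signature(toks)
        is_duplicate = False
        for _, kept_toks, kept_sig in kept:
            # Stage 1: cheap MinHash estimate selects candidate duplicates.
            if estimate_similarity(sig, kept_sig) >= threshold:
                # Stage 2: confirm the candidate with the exact Jaccard similarity.
                if jaccard(toks, kept_toks) >= threshold:
                    is_duplicate = True
                    break
        if not is_duplicate:
            kept.append((text, toks, sig))
    return [text for text, _, _ in kept]
```

Because the MinHash estimate is only an approximation, pairs whose true similarity lies close to the threshold may be missed or flagged at the first stage; the exact Jaccard check removes false positives but does not recover candidates that were never flagged.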