Dataset:
codeparrot/github-jupyter-text-code-pairs
This is a parsed version of github-jupyter-parsed, organized as pairs of markdown text and code. The preprocessing script is provided in preprocessing.py. The data is deduplicated and consists of 451662 examples.
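To make the idea of markdown and code pairs concrete, here is a minimal sketch of one way such pairs could be built from a notebook. It is not the logic of preprocessing.py, which is not shown here. It assumes each markdown cell is paired with the code cell that immediately follows it, and that deduplication means dropping exact repeats. The function names `extract_pairs` and `deduplicate` and the sample notebook are hypothetical.

```python
def _source(cell):
    """Return a cell's source as one string (nbformat allows str or list of str)."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def extract_pairs(notebook):
    """Pair each markdown cell with the code cell directly after it (assumed rule)."""
    cells = notebook.get("cells", [])
    pairs = []
    for current, following in zip(cells, cells[1:]):
        if current.get("cell_type") == "markdown" and following.get("cell_type") == "code":
            pairs.append((_source(current), _source(following)))
    return pairs


def deduplicate(pairs):
    """Drop exact duplicate pairs, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


# Hypothetical notebook in the nbformat v4 cell layout.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Load data"]},
        {"cell_type": "code", "source": ["import pandas as pd"]},
        {"cell_type": "code", "source": ["df = pd.read_csv('x.csv')"]},
        {"cell_type": "markdown", "source": "Plot the frame"},
        {"cell_type": "code", "source": "df.plot()"},
        {"cell_type": "markdown", "source": ["# Load data"]},
        {"cell_type": "code", "source": ["import pandas as pd"]},
    ]
}

pairs = deduplicate(extract_pairs(notebook))
for text, code in pairs:
    print(repr(text), "->", repr(code))
```

In this sketch a code cell that follows another code cell has no markdown partner and is skipped. The real script may handle that case differently.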
For a similar dataset pairing text with Python code, see the CoNaLa benchmark, which is built from StackOverflow and includes some samples curated by annotators.