Dataset:
philschmid/flanv2
Copy kept just in case the original gets deleted.
This is a processed version of the Flan V2 dataset.
I'm not affiliated with the creators; I'm just releasing the files in an easier-to-access format after processing.
The authors of the Flan Collection recommend experimenting with different mixing ratios of tasks to get optimal results downstream.
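As a rough illustration of what experimenting with mixing ratios could look like, here is a minimal sketch that draws a training mixture by sampling a task according to a weight and then picking a random example from that task. The function name `mix_tasks`, the ratio values, and the toy examples are all hypothetical; the sketch does not reproduce the ratios used by the Flan Collection authors.

```python
import random


def mix_tasks(examples_by_task, ratios, n_samples, seed=0):
    """Sample n_samples examples, choosing each task in proportion to its ratio.

    examples_by_task: dict mapping a task name to a list of examples.
    ratios: dict mapping a task name to a non-negative weight (hypothetical values).
    """
    rng = random.Random(seed)
    # Only keep tasks that have a ratio and at least one example.
    tasks = [task for task in ratios if examples_by_task.get(task)]
    weights = [ratios[task] for task in tasks]
    mixture = []
    for _ in range(n_samples):
        task = rng.choices(tasks, weights=weights, k=1)[0]
        mixture.append(rng.choice(examples_by_task[task]))
    return mixture


# Toy data for illustration only; these are not real rows from the dataset.
examples_by_task = {
    "flan": [{"input": "a", "target": "b", "task": "flan"}],
    "cot": [{"input": "c", "target": "d", "task": "cot"}],
    "dialog": [{"input": "e", "target": "f", "task": "dialog"}],
}
# Hypothetical weights: weight 0.0 means the task is never sampled.
ratios = {"flan": 0.6, "cot": 0.4, "dialog": 0.0}
mixture = mix_tasks(examples_by_task, ratios, n_samples=100)
```

Sampling with replacement keeps the sketch short; for a real run you would probably want to compare several ratio settings against a downstream evaluation rather than trust any single choice.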
The current version I've processed is missing a few datasets compared to the main branch of the Flan v2 repo.
The task collections covered are Flan 2021 (flan), P3 (t0), Super-Natural Instructions (niv2), Chain-of-thought (cot), and Dialog (dialog).
Instruction data comes in a few formats:
Each combination of the above tasks and formats is saved as a JSONL file with the following schema: {"input": ..., "target": ..., "task": ...}
Everything is saved as a train split.
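To make the schema concrete, here is a minimal sketch of reading such a file with only the Python standard library. The helper name `read_jsonl` and the two sample rows are made up for illustration; an in-memory stream stands in for a real file so the snippet runs on its own.

```python
import io
import json

# Keys taken from the schema described above.
REQUIRED_KEYS = {"input", "target", "task"}


def read_jsonl(stream):
    """Yield one dict per non-empty line, checking that the schema keys exist."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        yield record


# Made-up sample rows standing in for one of the JSONL files.
sample = io.StringIO(
    '{"input": "Translate to French: cat", "target": "chat", "task": "flan"}\n'
    '\n'
    '{"input": "What is 2 + 2?", "target": "4", "task": "cot"}\n'
)
records = list(read_jsonl(sample))
```

For a file on disk, the same function can be used by passing an open file handle, for example `with open(path, encoding="utf-8") as f: records = list(read_jsonl(f))`, where `path` is whichever JSONL file you downloaded.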