Dataset:
JeanKaddour/minipile
The MiniPile Challenge for Data-Efficient Language Models
MiniPile is a 6GB subset of the deduplicated Pile corpus. To curate MiniPile, we perform a simple, three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters.
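The three steps above can be illustrated with a minimal, self-contained sketch. It is not the MiniPile pipeline itself. The hashed bag-of-words `embed` function is a toy stand-in for a real embedding model, `kmeans` is a plain Lloyd's iteration with Euclidean distance, and the choice of which cluster to exclude is hard-coded, where a real run would rely on some quality judgment. All function names, parameters, and example documents here are assumptions made for illustration.

```python
import math
import random
import zlib


def embed(text, dim=32):
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is used instead of hash() so results are stable across runs.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def filter_clusters(docs, labels, excluded):
    """Drop every document whose cluster label is in `excluded`."""
    return [d for d, lab in zip(docs, labels) if lab not in excluded]


# Hypothetical mini-corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "a cat and a dog sat on the rug",
    "gradient descent minimises the loss function",
    "the optimizer updates the model weights",
    "click here click here buy now buy now",
    "buy now click here free free free",
]

embeddings = [embed(d) for d in docs]   # step 1: embed every document
labels = kmeans(embeddings, k=2)        # step 2: cluster the embeddings
excluded = {labels[-1]}                 # step 3: assume a reviewer flagged the
                                        # cluster containing the last document
kept = filter_clusters(docs, labels, excluded)
```

Filtering at the cluster level means the quality decision is made once per cluster rather than once per document, which is what keeps the procedure cheap for a corpus the size of the Pile.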
The primary motivation for curating MiniPile is that (i) diverse pre-training datasets (like the Pile) are often too large for academic budgets and (ii) most smaller-scale datasets are fairly homogeneous and thereby unrepresentative of contemporary general-purpose language models. MiniPile aims to fill this gap and thereby facilitate data-efficient research on model architectures, training procedures, optimizers, etc.
More details on the MiniPile curation procedure and some pre-training results can be found in the MiniPile paper.
For more details on the Pile corpus, we refer the reader to the Pile datasheet.
English (EN)
MiniPile is a subset of the Pile, curated by Jean Kaddour. The Pile was created by Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy.
Since MiniPile is a subset of the Pile, the same MIT License holds.
@article{kaddour2023minipile,
  title={The MiniPile Challenge for Data-Efficient Language Models},
  author={Kaddour, Jean},
  journal={arXiv preprint arXiv:2304.08442},
  year={2023}
}

@article{gao2020pile,
  title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}