Dataset Card for MiniPile

Dataset Description

The MiniPile Challenge for Data-Efficient Language Models

Dataset Summary

MiniPile is a 6GB subset of the deduplicated Pile corpus. To curate MiniPile, we perform a simple three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters.
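The three steps above can be illustrated with a small, self-contained sketch. This is not the code used to build MiniPile: the embedding function `toy_embed` is a hypothetical stand-in (a hashed bag-of-words vector) for a real document embedding model, the k-means routine is a plain NumPy implementation of Lloyd's algorithm, and the set of excluded clusters is chosen by hand purely for illustration. All function names and parameters here are assumptions.

```python
import zlib
import numpy as np


def toy_embed(docs, dim=16):
    """Hypothetical stand-in for step (1): hashed bag-of-words vectors, L2-normalized."""
    vectors = np.zeros((len(docs), dim), dtype=float)
    for i, doc in enumerate(docs):
        for token in doc.lower().split():
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            vectors[i, zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


def kmeans(x, k, n_iter=20, seed=0):
    """Step (2): plain Lloyd's k-means. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points (requires k <= len(x)).
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid: shape (n, k).
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


def filter_clusters(docs, labels, excluded_clusters):
    """Step (3): drop every document whose cluster id is in excluded_clusters."""
    keep = ~np.isin(labels, list(excluded_clusters))
    return [doc for doc, flag in zip(docs, keep) if flag]


docs = [
    "the cat sat on the mat",
    "a cat and a dog",
    "buy cheap pills now",
    "cheap pills buy now now",
]
embeddings = toy_embed(docs)
_, labels = kmeans(embeddings, k=2)
# Assumption for illustration: a reviewer flagged the cluster containing docs[2].
excluded = {int(labels[2])}
kept = filter_clusters(docs, labels, excluded)
```

In a setup like this, the decision about which clusters count as low quality is made outside the code and passed in as `excluded_clusters`; the sketch only shows how documents are dropped once that decision exists.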

The primary motivation for curating MiniPile is that (i) diverse pre-training datasets (like the Pile) are often too large for academic budgets and (ii) most smaller-scale datasets are fairly homogeneous and thereby unrepresentative of contemporary general-purpose language models. MiniPile aims to fill this gap and thereby facilitate data-efficient research on model architectures, training procedures, optimizers, etc.

More details on the MiniPile curation procedure and some pre-training results can be found in the MiniPile paper.

For more details on the Pile corpus, we refer the reader to the Pile datasheet.

Languages

English (EN)

Additional Information

Dataset Curators

MiniPile is a subset of the Pile, curated by Jean Kaddour. The Pile was created by Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy.

Licensing Information

Since MiniPile is a subset of the Pile, the same MIT License applies.

Citation Information

@article{kaddour2023minipile,
  title={The MiniPile Challenge for Data-Efficient Language Models},
  author={Kaddour, Jean},
  journal={arXiv preprint arXiv:2304.08442},
  year={2023}
}

@article{gao2020pile,
  title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

Author:

JeanKaddour

Dataset size:

2.96 GB