数据集:
bertin-project/mc4-es-sampled
This dataset is the result of applying perplexity sampling to the Spanish portion of mC4 using mc4-sampling . Please, refer to BERTIN Project .
You can load the mC4 Spanish sampled like this:
from datasets import load_dataset for config in ("random", "stepwise", "gaussian"): mc4es = load_dataset( "bertin-project/mc4-es-sampled", config, split="train", streaming=True ).shuffle(buffer_size=1000) for sample in mc4es: print(config, sample) break
Alternatively, you can bypass the datasets library and quickly download (~1.5hrs, depending on connection) a specific config in the same order used to pre-train BERTIN models in a massive (~200GB) JSON-lines files:
import io import gzip import json import sys import requests from tqdm import tqdm _DATA_URL_TRAIN = "https://huggingface.co/datasets/bertin-project/mc4-es-sampled/resolve/main/mc4-es-train-50M-{config}-shard-{index:04d}-of-{n_shards:04d}.json.gz" def main(config="stepwise"): data_urls = [ _DATA_URL_TRAIN.format( config=config, index=index + 1, n_shards=1024, ) for index in range(1024) ] with open(f"mc4-es-train-50M-{config}.jsonl", "w") as f: for dara_url in tqdm(data_urls): response = requests.get(dara_url) bio = io.BytesIO(response.content) with gzip.open(bio, "rt", encoding="utf8") as g: for line in g: json_line = json.loads(line.strip()) f.write(json.dumps(json_line) + "\ ") if __name__ == "__main__": main(sys.argv[1])
mC4-es-sampled is mainly intended for reproducibility purposes of the BERTIN Project and to pretrain language models and word representations on medium budgets.
The dataset only supports the Spanish language.
An example form the Gaussian config:
{'timestamp': '2018-10-20T06:20:53Z', 'text': 'Ortho HyaluroTop 200 aporta el colágeno y ácido hialurónico que, con la edad, se producen en menor cantidad. La vitamina C promueve la producción de colágeno para mantener la piel sana y protege a las células contra los radicales libres causados ??por la contaminación ambiental y los rayos UV.', 'url': 'https://www.farmaciagaleno.com/orthonat-hyalurotop-200-30-capsulas'}
The data have several fields:
The resulting mC4 subsets for Spanish are reported in this table:
config | train |
---|---|
stepwise | 50M |
random | 50M |
gaussian | 50M |
The split validation is exactly the same as the original mc4 dataset.
This dataset was built from the original mc4 by applying perplexity-sampling via mc4-sampling for Spanish.
Original data by Common Crawl .
AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
To cite this dataset ( arXiv ):
@article{BERTIN, author = {Javier De la Rosa y Eduardo G. Ponferrada y Manu Romero y Paulo Villegas y Pablo González de Prado Salas y María Grandury}, title = {{BERTIN}: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling}, journal = {Procesamiento del Lenguaje Natural}, volume = {68}, number = {0}, year = {2022}, keywords = {}, abstract = {The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pretraining sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.}, issn = {1989-7553}, url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403}, pages = {13--23} }
If you use this dataset, we would love to hear about it! Reach out on twitter, GitHub, Discord, or shoot us an email.
To cite the original mc4 dataset:
@article{2019t5, author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu}, title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}, journal = {arXiv e-prints}, year = {2019}, archivePrefix = {arXiv}, eprint = {1910.10683}, }
Dataset contributed by @versae for BERTIN Project.
Thanks to @dirkgr and @lhoestq for adding the original mC4 dataset.