Dataset:
tiiuae/falcon-refinedweb
Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license.
See the paper on arXiv for more details.
RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in line with or better than models trained on curated datasets, while relying on web data only.
RefinedWeb is also "multimodal-friendly": it contains links and alt texts for images in processed samples.
This public extract should contain 500-650 gigatokens (GT) depending on the tokenizer you use, and can be enhanced with the curated corpora of your choosing. It is about 500GB to download, and requires 2.8TB of local storage once unpacked.
```python
from datasets import load_dataset

# Downloads the full public extract (~500GB to download, 2.8TB once unpacked).
rw = load_dataset("tiiuae/falcon-refinedweb")
```
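If downloading the full extract is impractical, a minimal sketch using the streaming mode of the `datasets` library (the `content` field holds the processed text; adjust the field name if your schema differs):

```python
from datasets import load_dataset

# Stream samples over the network instead of downloading ~500GB upfront.
rw = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

# Inspect a few documents without materializing the dataset on disk.
for sample in rw.take(3):
    print(sample["url"])
    print(sample["content"][:200])
```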
RefinedWeb is the main dataset we have used for training the Falcon LLM models.
Falcon RefinedWeb was created to serve as an English large-scale dataset for the pretraining of large language models. It may be used on its own, or augmented with curated sources (e.g., Wikipedia, StackOverflow).
It was built on top of CommonCrawl, leveraging stringent filtering and extensive deduplication.
RefinedWeb is intended to be used primarily as a pretraining dataset for large language models. Practitioners may leverage it for upstream evaluation with a validation loss, but we do not provide any canonical split.
RefinedWeb primarily contains English.
Each data instance corresponds to an individual web page which has been crawled, processed, and deduplicated against all other instances.
This public extract of RefinedWeb contains about 1B instances (968M individual web pages), for a total of 2.8TB of clean text data.
We do not provide any canonical splits for RefinedWeb.
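Since no canonical split ships with the dataset, practitioners wanting an upstream validation loss can hold out their own slice. A minimal sketch in streaming mode; the 10,000-document split size is an arbitrary choice for illustration, not a recommendation:

```python
from datasets import load_dataset

rw = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

# Hold out the first 10,000 documents for validation loss;
# the remainder streams as training data. Both are IterableDatasets.
validation = rw.take(10_000)
train = rw.skip(10_000)
```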
Falcon RefinedWeb is built on top of CommonCrawl, using the Macrodata Refinement Pipeline, which combines content extraction, filtering heuristics, and deduplication.
In designing RefinedWeb, we abided by the following philosophy:
During its development, we iterated on RefinedWeb by measuring the zero-shot performance of models trained on development versions of the dataset. Our main goal was to maximize the performance obtained, bridging the gap between curated and web data. We also manually audited samples to identify potential filtering improvements.
RefinedWeb is built from CommonCrawl dumps. These dumps are constructed from crawling publicly available web pages.
We applied extensive preprocessing and cleaning of the data, using our Macrodata Refinement Pipeline.
We first filter URLs to remove adult content using a blocklist and a scoring system. We then use trafilatura to extract content from pages, and perform language identification with the fastText classifier from CCNet (Wenzek et al., 2019). After this first preprocessing stage, we filter data using heuristics from MassiveWeb (Rae et al., 2021) and our own line-wise corrections.
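For illustration only, a minimal sketch of the extraction and language-identification steps with the same off-the-shelf tools; the 0.65 confidence threshold is a hypothetical value for this example, not the one used to build RefinedWeb:

```python
import trafilatura
import fasttext

# CCNet-style language ID model (download lid.176.bin from fastText).
lid_model = fasttext.load_model("lid.176.bin")

def extract_english(html: str) -> str | None:
    # Extract the main text content, dropping boilerplate and navigation.
    text = trafilatura.extract(html)
    if not text:
        return None
    # fastText predicts labels like "__label__en" with a confidence score;
    # its predict() rejects newlines, so flatten the text first.
    labels, scores = lid_model.predict(text.replace("\n", " "))
    if labels[0] == "__label__en" and scores[0] >= 0.65:  # hypothetical threshold
        return text
    return None
```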
Finally, we run extensive deduplication, removing URLs revisited across dumps and subsequently performing fuzzy deduplication and exact substring deduplication.
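To make the fuzzy stage concrete, here is a toy MinHash sketch using the datasketch library; the word 5-gram shingling, the 0.7 similarity threshold, and 128 permutations are illustrative choices for this example, not the parameters of the actual pipeline:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.split()
    # Shingle the document into word 5-grams before hashing.
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

# Pairs whose estimated Jaccard similarity exceeds the threshold are flagged.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
docs = {
    "a": "the small cat sat quietly on the warm mat in the sunny kitchen",
    "b": "the small cat sat quietly on the warm mat in the sunny kitchen today",
}
for key, text in docs.items():
    m = minhash(text)
    duplicates = lsh.query(m)
    if duplicates:
        print(f"{key} is a near-duplicate of {duplicates}")
    else:
        lsh.insert(key, m)
```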
We provide automatically collected annotations for the source `url`, the `timestamp` of the crawl, the original CommonCrawl `dump` and `segment` in which the document was found, and the `image_urls` contained in the page.
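A minimal sketch of reading these annotations from a streamed sample; the field names follow the list above, while the per-element structure of `image_urls` (assumed here to be `[image_url, alt_text]` pairs) is an assumption to verify against the dataset features:

```python
from datasets import load_dataset

rw = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for sample in rw.take(1):
    print(sample["url"], sample["timestamp"])
    print(sample["dump"], sample["segment"])
    # Links and alt texts for images found in the page, useful for
    # multimodal pipelines; assumed structure: [url, alt] pairs.
    for image in sample["image_urls"]:
        print(image)
```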
As RefinedWeb is built upon publicly available web pages, it may contain sensitive information such as emails, phone numbers, or IP addresses. We believe that deduplication may have helped reduce the prevalence of PII in the dataset, but practitioners working with RefinedWeb should take care.
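As a starting point for such care, a crude sketch of flagging common PII patterns with regular expressions; the patterns here are simplistic illustrations, and production PII detection requires far more robust tooling:

```python
import re

# Deliberately simple patterns; real PII detection is much harder.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def flag_pii(text: str) -> dict[str, list[str]]:
    """Return the matches found in `text` for each PII pattern."""
    found = {name: p.findall(text) for name, p in PII_PATTERNS.items()}
    return {name: matches for name, matches in found.items() if matches}

print(flag_pii("Contact jane.doe@example.com or +1 (555) 010-9999"))
```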
With the open-source release of Falcon RefinedWeb, we aim to increase access to high-quality web data, which has typically been held private by model developers. We believe this release will in turn improve the accessibility and the spread of performant large language models.
As toxic or biased data is prevalent on the internet, it is likely that our dataset contains such content. Notably, using the Perspective API, we estimated the prevalence of toxic content in the dataset to be similar to that of The Pile.
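For practitioners who want to run a similar estimate on their own samples, a minimal sketch of scoring one document with the Perspective API; the request shape follows Google's public commentanalyzer v1alpha1 endpoint, and the API key is a placeholder you must supply:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; obtain from the Google Cloud console
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    """Return the Perspective TOXICITY summary score in [0, 1]."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=30)
    response.raise_for_status()
    scores = response.json()["attributeScores"]
    return scores["TOXICITY"]["summaryScore"]["value"]
```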
Despite our best efforts to filter content that does not qualify as natural language and to deduplicate documents, our pipeline may let through documents that could be considered erroneous or redundant.
This public extract is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU.
```bibtex
@article{refinedweb,
  title      = {The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author     = {Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal    = {arXiv preprint arXiv:2306.01116},
  eprint     = {2306.01116},
  eprinttype = {arXiv},
  url        = {https://arxiv.org/abs/2306.01116},
  year       = {2023}
}
```
RefinedWeb is based on CommonCrawl. Their crawler honors opt-out requests in robots.txt; see the CC FAQ for details.
To remove a document from RefinedWeb, please message falconllm@tii.ae.