Dataset:
tiiuae/falcon-refinedweb
Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license.
See the paper on arXiv for more details.
RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in line with or better than models trained on curated datasets, while relying on web data only.
RefinedWeb is also "multimodal-friendly": it contains links and alt texts for images in processed samples.
This public extract should contain 500-650 gigatokens (GT) depending on the tokenizer you use, and can be enhanced with the curated corpora of your choosing. It is about 500GB to download, and requires 2.8TB of local storage once unpacked.
```python
from datasets import load_dataset

# Downloads the full public extract (~500GB to download, 2.8TB once unpacked).
rw = load_dataset("tiiuae/falcon-refinedweb")
```
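If downloading the full extract is impractical, a minimal sketch using the streaming mode of the `datasets` library (the `content` field holds the processed text; adjust the field name if your schema differs):

```python
from datasets import load_dataset

# Stream samples over the network instead of downloading ~500GB upfront.
rw = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

# Inspect a few documents without materializing the dataset on disk.
for sample in rw.take(3):
    print(sample["url"])
    print(sample["content"][:200])
```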
RefinedWeb is the main dataset we have used for training the Falcon LLM models.
Falcon RefinedWeb was created to serve as an English large-scale dataset for the pretraining of large language models. It may be used on its own, or augmented with curated sources (e.g., Wikipedia, StackOverflow).
It was built on top of CommonCrawl, leveraging stringent filtering and extensive deduplication.
RefinedWeb is intended to be used primarily as a pretraining dataset for large language models. Practitioners may leverage it for upstream evaluation with a validation loss, but we do not provide any canonical split.
RefinedWeb primarily contains English.
Each data instance corresponds to an individual web page which has been crawled, processed, and deduplicated against all other instances.
This public extract of RefinedWeb contains about 1B instances (968M individual web pages), for a total of 2.8TB of clean text data.
We do not provide any canonical splits for RefinedWeb.
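Since no canonical split ships with the dataset, practitioners wanting an upstream validation loss can hold out their own slice. A minimal sketch in streaming mode; the 10,000-document split size is an arbitrary choice for illustration, not a recommendation:

```python
from datasets import load_dataset

rw = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

# Hold out the first 10,000 documents for validation loss;
# the remainder streams as training data. Both are IterableDatasets.
validation = rw.take(10_000)
train = rw.skip(10_000)
```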
Falcon RefinedWeb is built on top of CommonCrawl, using the Macrodata Refinement Pipeline, which combines content extraction, filtering heuristics, and deduplication.
In designing RefinedWeb, we abided by the following philosophy:
During its development, we iterated on RefinedWeb by measuring the zero-shot performance of models trained on development versions of the dataset. Our main goal was to maximize the performance obtained, bridging the gap between curated and web data. We also manually audited samples to identify potential filtering improvements.
RefinedWeb is built from CommonCrawl dumps. These dumps are constructed from crawling publicly available web pages.
We applied extensive preprocessing and cleaning of the data, using our Macrodata Refinement Pipeline.
We first filter URLs to remove adult content using a blocklist and a scoring system. We then use trafilatura to extract content from pages, and perform language identification with the fastText classifier from CCNet (Wenzek et al., 2019). After this first preprocessing stage, we filter data using heuristics from MassiveWeb (Rae et al., 2021) and our own line-wise corrections.
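For illustration only, a minimal sketch of the extraction and language-identification steps with the same off-the-shelf tools; the 0.65 confidence threshold is a hypothetical value for this example, not the one used to build RefinedWeb:

```python
import trafilatura
import fasttext

# CCNet-style language ID model (download lid.176.bin from fastText).
lid_model = fasttext.load_model("lid.176.bin")

def extract_english(html: str) -> str | None:
    # Extract the main text content, dropping boilerplate and navigation.
    text = trafilatura.extract(html)
    if not text:
        return None
    # fastText predicts labels like "__label__en" with a confidence score;
    # its predict() rejects newlines, so flatten the text first.
    labels, scores = lid_model.predict(text.replace("\n", " "))
    if labels[0] == "__label__en" and scores[0] >= 0.65:  # hypothetical threshold
        return text
    return None
```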
Finally, we run extensive deduplication, removing URLs revisited across dumps and subsequently performing fuzzy deduplication and exact substring deduplication.
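To make the fuzzy stage concrete, here is a toy MinHash sketch using the datasketch library; the word 5-gram shingling, the 0.7 similarity threshold, and 128 permutations are illustrative choices for this example, not the parameters of the actual pipeline:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.split()
    # Shingle the document into word 5-grams before hashing.
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

# Pairs whose estimated Jaccard similarity exceeds the threshold are flagged.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
docs = {
    "a": "the small cat sat quietly on the warm mat in the sunny kitchen",
    "b": "the small cat sat quietly on the warm mat in the sunny kitchen today",
}
for key, text in docs.items():
    m = minhash(text)
    duplicates = lsh.query(m)
    if duplicates:
        print(f"{key} is a near-duplicate of {duplicates}")
    else:
        lsh.insert(key, m)
```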
We provide automatically collected annotations for the source `url`, the `timestamp` of the crawl, the original CommonCrawl `dump` and `segment` in which the document was found, and the `image_urls` contained in the page.
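A minimal sketch of reading these annotations from a streamed sample; the field names follow the list above, while the per-element structure of `image_urls` (assumed here to be `[image_url, alt_text]` pairs) is an assumption to verify against the dataset features:

```python
from datasets import load_dataset

rw = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for sample in rw.take(1):
    print(sample["url"], sample["timestamp"])
    print(sample["dump"], sample["segment"])
    # Links and alt texts for images found in the page, useful for
    # multimodal pipelines; assumed structure: [url, alt] pairs.
    for image in sample["image_urls"]:
        print(image)
```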
As RefinedWeb is built upon publicly available web pages, it may contain sensitive information such as emails, phone numbers, or IP addresses. We believe that deduplication may have helped reduce the prevalence of PII in the dataset, but practitioners working with RefinedWeb should take care.
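As a starting point for such care, a crude sketch of flagging common PII patterns with regular expressions; the patterns here are simplistic illustrations, and production PII detection requires far more robust tooling:

```python
import re

# Deliberately simple patterns; real PII detection is much harder.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def flag_pii(text: str) -> dict[str, list[str]]:
    """Return the matches found in `text` for each PII pattern."""
    found = {name: p.findall(text) for name, p in PII_PATTERNS.items()}
    return {name: matches for name, matches in found.items() if matches}

print(flag_pii("Contact jane.doe@example.com or +1 (555) 010-9999"))
```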
With the open-source release of Falcon RefinedWeb, we aim to increase access to high-quality web data, which has typically been held private by model developers. We believe this release will in turn improve the accessibility and the spread of performant large language models.
As toxic or biased data is prevalent on the internet, it is likely that our dataset contains such content. Notably, using the Perspective API, we estimated the prevalence of toxic content in the dataset to be similar to that of The Pile.
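For practitioners who want to run a similar estimate on their own samples, a minimal sketch of scoring one document with the Perspective API; the request shape follows Google's public commentanalyzer v1alpha1 endpoint, and the API key is a placeholder you must supply:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; obtain from the Google Cloud console
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    """Return the Perspective TOXICITY summary score in [0, 1]."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=30)
    response.raise_for_status()
    scores = response.json()["attributeScores"]
    return scores["TOXICITY"]["summaryScore"]["value"]
```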
Despite our best efforts to filter content that does not qualify as natural language and to deduplicate documents, our pipeline may let through documents that could be considered erroneous or redundant.
This public extract is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU.
```bibtex
@article{refinedweb,
  title      = {The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author     = {Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal    = {arXiv preprint arXiv:2306.01116},
  eprint     = {2306.01116},
  eprinttype = {arXiv},
  url        = {https://arxiv.org/abs/2306.01116},
  year       = {2023}
}
```
RefinedWeb is based on CommonCrawl. Their crawler honors opt-out requests in robots.txt; see the CC FAQ for details.
To remove a document from RefinedWeb, please message falconllm@tii.ae.