数据集:
bertin-project/mc4-sampling
子任务:
language-modeling计算机处理:
multilingual语言创建人:
found批注创建人:
no-annotation源数据集:
original预印本库:
arxiv:1910.10683许可:
odc-byThis dataset builds upon the AllenAI version of the original mC4 and adds sampling methods to perform perplexity-based filtering on the fly. Please, refer to BERTIN Project .
The original dataset is mC4, the multilingual colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: " https://commoncrawl.org" .
108 languages are available and are reported in the mc4 dataset .
You can load the mC4 subset of any language like this:
from datasets import load_dataset en_mc4 = load_dataset("mc4", "en")
And if you can even specify a list of languages:
from datasets import load_dataset mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
There are 3 main different ways of getting sampled versions of mc4 using this dataset.
RandomArguably, the simplest of methods. It keeps a document based on a probability threshold we called factor . It defaults to 0.5 for random sampling:
def _should_keep_doc_random(self, doc, factor=None, **kwargs): factor = 0.5 if factor is None else factor return self.rng.uniform() <= factor
The way to use this sampling method is by adding an extra parameter to the instantiation of the dataset:
from datasets import load_dataset mc4random = load_dataset( "bertin-project/mc4-sampling", "es", split="train", streaming=True, sampling_method="random", factor=0.5, ) for sample in mc4random: print(sample) breakGaussian
This sampling method tries to adjust to the underlying distribution while oversampling the central quartiles of the perplexity distribution of the documents in mC4 for a given language. Two parameters control the shape of the approximation, factor (peakness of the exponential function) and width (spread). Default values are selected for Spanish.
def _should_keep_doc_gaussian(self, doc, factor=None, width=None, boundaries=None, **kwargs): perplexity = self.get_perplexity(doc) width = (9 / 2) if width is None else width factor = 0.78 if factor is None else factor median = 662247.50212365 if boundaries is None else boundaries[1] exponential = np.exp((-1 / width) * ((perplexity - median) / median) ** 2) weighted_perplexity = factor * exponential return self.rng.uniform() < weighted_perplexity
In order to use this sampling methods, information about the quartile boundaries of the underlying distribution need to be calculated beforehand and passed in to the instantiation of the dataset. Moreover, the path to a KenLM model (5-gram language model) or an object with a method .score(text:str) -> float need to also be passed in for the calculation of the perplexity value of a document. KenLM can be installed with pip:
pip install https://github.com/kpu/kenlm/archive/master.zip
from datasets import load_dataset mc4gaussian = load_dataset( "bertin-project/mc4-sampling", "es", split="train", streaming=True, sampling_method="gaussian", perplexity_model="./es.arpa.bin", boundaries=[536394.99320948, 662247.50212365, 919250.87225178], factor=0.78, width=9/2, ) for sample in mc4gaussian: print(sample) break
Facebook has created and released 5-gram Kneser-Ney models for 100 languages available to download and use within the KenLM library. To download your own Kneser-Ney language model, chose a language code from the next list:
af,ar,az,be,bg,bn,ca,cs,da,de,el,en,es,et,fa,fi,fr,gu,he,hi,hr,hu,hy,id,is,it,ja,ka,kk,km,kn,ko,lt,lv,mk,ml,mn,mr,my,ne,nl,no,pl,pt,ro,ru,uk,zh
And run the next download command replacing lang with your own language code:
wget http://dl.fbaipublicfiles.com/cc_net/lm/lang.arpa.bin
The stepwise sampling method uses a simple criteria by oversampling from the central quartiles inversely proportionally their range. Only boundaries , factor (strength of the oversampling), and perplexity_model are needed:
def _should_keep_doc_step(self, doc, factor=None, boundaries=None, **kwargs): perplexity = self.get_perplexity(doc) factor = 1.5e5 if factor is None else factor if boundaries is None: boundaries = [536394.99320948, 662247.50212365, 919250.87225178] if perplexity <= boundaries[0]: quartile_range = boundaries[0] elif boundaries[0] < perplexity < boundaries[1]: quartile_range = boundaries[1] - boundaries[0] elif boundaries[1] < perplexity < boundaries[2]: quartile_range = boundaries[2] - boundaries[1] elif perplexity >= boundaries[2]: quartile_range = 10 * boundaries[2] probability = factor / quartile_range return self.rng.uniform() < probability
In order to use this sampling method, a similar invocation is needed:
mc4stepwsie = load_dataset( "bertin-project/mc4-sampling", "es", split="train", streaming=True, sampling_method="stepwise", perplexity_model="./es.arpa.bin", boundaries=[536394.99320948, 662247.50212365, 919250.87225178], factor=1.5e5, ) for sample in mc4stepwsie: print(sample) break
mC4-sampling is mainly intended to pretrain language models and word representations on a budget.
The dataset supports 108 languages.
An example form the en config is:
{'timestamp': '2018-06-24T01:32:39Z', 'text': 'Farm Resources in Plumas County\ Show Beginning Farmer Organizations & Professionals (304)\ There are 304 resources serving Plumas County in the following categories:\ Map of Beginning Farmer Organizations & Professionals serving Plumas County\ Victoria Fisher - Office Manager - Loyalton, CA\ Amy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\ Show Farm Income Opportunities Organizations & Professionals (353)\ There are 353 resources serving Plumas County in the following categories:\ Farm Ranch And Forest Retailers (18)\ Map of Farm Income Opportunities Organizations & Professionals serving Plumas County\ Warner Valley Wildlife Area - Plumas County\ Show Farm Resources Organizations & Professionals (297)\ There are 297 resources serving Plumas County in the following categories:\ Map of Farm Resources Organizations & Professionals serving Plumas County\ There are 57 resources serving Plumas County in the following categories:\ Map of Organic Certification Organizations & Professionals serving Plumas County', 'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}
The data have several fields:
The same splits as in mC4 are available .
BERTIN Project is releasing this dataset under the same terms AllenAI released mC4, that is, those of the ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
To cite this dataset:
@article{BERTIN, author = {Javier De la Rosa y Eduardo G. Ponferrada y Manu Romero y Paulo Villegas y Pablo González de Prado Salas y María Grandury}, title = {{BERTIN}: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling}, journal = {Procesamiento del Lenguaje Natural}, volume = {68}, number = {0}, year = {2022}, keywords = {}, abstract = {The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pretraining sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.}, issn = {1989-7553}, url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403}, pages = {13--23} }
If you use this dataset, we would love to hear about it! Reach out on twitter, GitHub, Discord, or shoot us an email.
To cite the original mc4 dataset:
@article{2019t5, author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu}, title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}, journal = {arXiv e-prints}, year = {2019}, archivePrefix = {arXiv}, eprint = {1910.10683}, }
Dataset contributed by @versae .
Thanks to @dirkgr and @lhoestq for adding the original mC4 dataset.