Dataset: wikitext_tl39
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: found
Annotation creators: no-annotation
Source datasets: original
ArXiv: arxiv:1907.00409
License: gpl-3.0

A large-scale, unlabeled text dataset with 39 million tokens in the training set, inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016). "TL" stands for "Tagalog." The dataset was published in Cruz & Cheng (2019).
[More Information Needed]
Filipino/Tagalog
[More Information Needed]
The dataset is plain text and has only one field ("text"), as it is compiled for language modeling.
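Since each record carries only a "text" field, turning the data into a token stream for language modeling can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' preprocessing: it assumes records arrive as dictionaries with a "text" key, uses plain whitespace splitting as a stand-in tokenizer, and the helper names `iter_tokens` and `build_vocab` as well as the sample sentences are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def iter_tokens(records: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield whitespace-separated tokens from each record's "text" field."""
    for record in records:
        yield from record["text"].split()


def build_vocab(records: Iterable[Dict[str, str]], min_count: int = 1) -> List[str]:
    """Return the sorted list of tokens that occur at least min_count times."""
    counts = Counter(iter_tokens(records))
    return sorted(tok for tok, n in counts.items() if n >= min_count)


# Hypothetical sample records (not taken from the dataset), shaped like
# the single-field layout described above.
sample = [
    {"text": "ang aso ay tumakbo"},
    {"text": "ang pusa ay natulog"},
]

tokens = list(iter_tokens(sample))
vocab = build_vocab(sample, min_count=2)
print(len(tokens), vocab)
```

Any iterable of such dictionaries can be passed in place of `sample`; a real pipeline would likely swap the whitespace split for the tokenization described in the paper.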
| Split | Documents | Tokens |
|---|---|---|
| Train | 120,975 | 39M |
| Valid | 25,919 | 8M |
| Test | 25,921 | 8M |
Please see the paper for more details on the dataset splits.
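As a quick sanity check on the split sizes, the share of each split can be computed directly from the figures in the table. This is only an arithmetic sketch: the token counts are the rounded values shown (in millions), so the token shares are approximate, and the helper name `split_shares` is hypothetical.

```python
# Figures copied from the splits table; token counts are rounded to millions.
splits = {
    "train": {"documents": 120_975, "tokens_millions": 39},
    "valid": {"documents": 25_919, "tokens_millions": 8},
    "test": {"documents": 25_921, "tokens_millions": 8},
}


def split_shares(table: dict, key: str) -> dict:
    """Return each split's fraction of the total for the given column."""
    total = sum(row[key] for row in table.values())
    return {name: row[key] / total for name, row in table.items()}


doc_shares = split_shares(splits, "documents")
token_shares = split_shares(splits, "tokens_millions")

for name in splits:
    print(f"{name}: {doc_shares[name]:.1%} of documents, "
          f"{token_shares[name]:.1%} of tokens")
```

By document count this works out to roughly a 70/15/15 division between train, validation, and test.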
[More Information Needed]
Tagalog Wikipedia
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
Thanks to @jcblaisecruz02 for adding this dataset.