Datasets:

wikitext_tl39

Tasks:

Text Generation

Fill-Mask

Sub-tasks:

language-modeling

masked-language-modeling

Languages:

language:fil

Multilinguality:

monolingual

Size Categories:

1M<n<10M

Language Creators:

found

Annotations Creators:

no-annotation

Source Datasets:

original

ArXiv:

arxiv:1907.00409

License:

gpl-3.0

Dataset Card for WikiText-TL-39

Dataset Summary

A large-scale, unlabeled text dataset with 39 million tokens in the training set. It was inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016). "TL" stands for "Tagalog." The dataset was published in Cruz & Cheng (2019).

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Filipino/Tagalog

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

text (str)

The dataset is plain text and has a single field ("text"), because it was compiled for language modeling.
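As a minimal sketch of what the single-field layout means in practice, the snippet below reads the "text" field from a list of records and counts whitespace-separated tokens. The helper names (`iter_texts`, `count_whitespace_tokens`) and the sample records are hypothetical placeholders, not actual dataset content, and whitespace splitting is only an assumed stand-in for whatever tokenization the paper used.

```python
from typing import Dict, Iterable, Iterator, List


def iter_texts(records: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield the stripped, non-empty contents of each record's "text" field."""
    for record in records:
        text = record["text"].strip()
        if text:
            yield text


def count_whitespace_tokens(texts: Iterable[str]) -> int:
    """Count tokens by splitting on whitespace (an assumed, simplistic tokenizer)."""
    return sum(len(text.split()) for text in texts)


# Made-up placeholder records that mimic the one-field schema;
# they are NOT taken from WikiText-TL-39.
sample_records: List[Dict[str, str]] = [
    {"text": "unang halimbawa ng teksto"},
    {"text": ""},
    {"text": "ikalawang halimbawa"},
]

texts = list(iter_texts(sample_records))
token_count = count_whitespace_tokens(texts)
print(texts, token_count)
```

If the dataset is obtained through the Hugging Face `datasets` library, records of this shape would presumably come from `load_dataset` called with the identifier listed at the top of this card; that call is left out here so the sketch runs offline.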

Data Splits

Split	Documents	Tokens
Train	120,975	39M
Valid	25,919	8M
Test	25,921	8M

Please see the paper for more details on the dataset splits.
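The document counts in the table work out to roughly a 70/15/15 split. The short calculation below shows this; it uses only the counts listed above, and the variable names are illustrative. It says nothing about how the splits were actually drawn, which is described in the paper.

```python
# Document counts copied from the Data Splits table above.
splits = {"train": 120_975, "valid": 25_919, "test": 25_921}

total = sum(splits.values())

# Share of all documents that falls into each split.
shares = {name: count / total for name, count in splits.items()}

print(f"total documents: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Note that these proportions are by document count. The token counts in the table (39M, 8M, 8M) are rounded, so token-level shares can only be estimated from them.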

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Tagalog Wikipedia

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

Thanks to @jcblaisecruz02 for adding this dataset.

Author:

Anonymous

Dataset size:

10.48 KB