Dataset: enwik8
The enwik8 dataset comprises the first 100,000,000 (100M) bytes of the English Wikipedia XML dump from March 3, 2006, and is typically used to measure a model's ability to compress data.
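As a rough illustration of the compression framing, the benchmark metric is commonly reported as bits per byte (or bits per character). The sketch below computes this with zlib as a naive baseline over a toy string; this is not the leaderboard metric, and the sample text is a hypothetical stand-in for the real 100M-byte file.

```python
import zlib

def bits_per_byte(data: bytes) -> float:
    """Naive compression baseline: bits of zlib output per input byte."""
    compressed = zlib.compress(data, 9)
    return 8 * len(compressed) / len(data)

# Hypothetical stand-in for the dump contents; the real benchmark
# runs over all 100,000,000 bytes of enwik8.
sample = b"In [[Denmark]], the [[Freetown Christiania]] was created." * 100
print(f"{bits_per_byte(sample):.3f} bits/byte")
```

A language model that predicts the next byte well achieves a lower bits-per-byte figure than such a general-purpose compressor.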
A leaderboard for byte-level causal language modelling can be found on Papers with Code.
The text is in English (`en`).
{ "text": "In [[Denmark]], the [[Freetown Christiania]] was created in downtown [[Copenhagen]]...." }
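Since each instance is a JSON object with a single `text` field, it can be parsed directly. A minimal sketch using the example instance above (ellipsis kept verbatim):

```python
import json

# The example instance shown above.
raw = '{ "text": "In [[Denmark]], the [[Freetown Christiania]] was created in downtown [[Copenhagen]]...." }'

instance = json.loads(raw)
assert set(instance) == {"text"}  # only one field per instance
print(instance["text"][:11])      # -> "In [[Denmar"
```

Note that wiki markup such as `[[...]]` links is preserved in the raw text rather than stripped.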
The data fields are the same among all configurations: `text` (`string`): the raw text.
| dataset | train |
|---|---|
| enwik8 | 1128024 |
| enwik8-raw | 1 |
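The difference between the two configurations can be sketched as follows: enwik8 yields one example per line of the dump, while enwik8-raw yields the entire text as a single example. The string below is a toy stand-in for the real file.

```python
# Toy stand-in for the dump contents (hypothetical, three lines).
dump = "line one\nline two\nline three"

# enwik8: split by line -> many small examples.
enwik8_examples = dump.split("\n")

# enwik8-raw: not split -> one example holding the entire text.
enwik8_raw_examples = [dump]

print(len(enwik8_examples))      # 3 for this toy text; 1128024 for the real dump
print(len(enwik8_raw_examples))  # 1, the enwik8-raw train split count
```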
## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

The data is the English Wikipedia XML dump from March 3, 2006, split by line for enwik8 and not split by line for enwik8-raw.

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]
### Citation Information

The dataset is not part of a publication and therefore cannot be cited.
### Contributions

Thanks to @HallerPatrick for adding this dataset and @mtanghu for updating it.