Dataset: enwik8
Languages: en
Multilinguality: monolingual
Size categories: 10K<n<100K
Language creators: found
Annotation creators: no-annotation
Source datasets: original
License: mit

Dataset Card for enwik8

Dataset Summary

The enwik8 dataset is the first 100,000,000 (100M) bytes of the English Wikipedia XML dump from March 3, 2006, and is typically used to measure a model's ability to compress data.
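Compression ability on enwik8 is conventionally reported in bits per character (bpc): the model's average negative log-likelihood per byte, expressed in base 2. A minimal sketch of the conversion (the helper name and example numbers are illustrative, not part of the dataset):

import math

def bits_per_character(total_nll_nats: float, num_bytes: int) -> float:
    # bpc = total negative log-likelihood in nats / (number of bytes * ln 2)
    return total_nll_nats / (num_bytes * math.log(2))

# Example: an average cross-entropy of 0.8 nats per byte is about 1.15 bpc.
print(bits_per_character(total_nll_nats=0.8 * 1_000, num_bytes=1_000))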

Supported Tasks and Leaderboards

A leaderboard for byte-level causal language modelling can be found on Papers with Code.

Languages

en

Dataset Structure

Data Instances

  • Size of downloaded dataset files: 36.45 MB
  • Size of the generated dataset: 102.38 MB
  • Total amount of disk used: 138.83 MB
An example from the train split looks as follows:

{
    "text": "In [[Denmark]], the [[Freetown Christiania]] was created in downtown [[Copenhagen]]...."
}
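Instances like this can be loaded with the Hugging Face datasets library. A minimal sketch; the configuration names ("enwik8" for the line-split view, "enwik8-raw" for the unsplit view) follow the field names in this card, and the exact loader arguments are an assumption:

from datasets import load_dataset

# Load the line-split configuration (config name assumed from this card).
ds = load_dataset("enwik8", "enwik8")

print(ds)                      # splits and row counts
print(ds["train"][0]["text"])  # first line of the 2006 dump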

Data Fields

The data fields are the same across both configurations.

enwik8
  • text : a string feature.
enwik8-raw
  • text : a string feature.

Data Splits

dataset      train
enwik8       1,128,024
enwik8-raw   1

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

The data is simply the English Wikipedia XML dump from March 3, 2006: enwik8 splits the text by line, while enwik8-raw keeps it as a single record.
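Both views can be reproduced from the raw file itself. A minimal sketch, assuming the original 100MB enwik8 file has been downloaded and unzipped locally (e.g. from http://mattmahoney.net/dc/enwik8.zip; the local filename is an assumption):

# The raw bytes form the single enwik8-raw record; splitting on
# newlines yields the line-level enwik8 records.
with open("enwik8", "rb") as f:
    raw = f.read()  # first 100,000,000 bytes of the 2006 dump

assert len(raw) == 100_000_000

# The dump is UTF-8; the 100M-byte cut may land mid-character,
# so decode leniently.
text = raw.decode("utf-8", errors="replace")
lines = text.split("\n")
print(len(lines))  # on the order of the 1,128,024 train rows above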

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

[Needs More Information]

Citation Information

The dataset is not part of a publication and therefore cannot be cited.

Contributions

Thanks to @HallerPatrick for adding this dataset and @mtanghu for updating it.