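Dataset: wikitext
Language:
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: crowdsourced
Annotation creators: no-annotation
Source datasets: original
Preprint: arxiv:1609.07843

WikiText is a language modeling dataset containing over 100 million tokens extracted from the Good and Featured articles on Wikipedia. The dataset is released under a Creative Commons license.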
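Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText datasets also have a larger vocabulary and retain the original case, punctuation, and numbers, all of which are removed in PTB. Because it is composed of full articles, the dataset is well suited to models that can take advantage of long-term dependencies.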
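Each subset comes in two variants: a raw variant (the `-raw-v1` configurations), which keeps the original tokens, and a non-raw variant (the `-v1` configurations), in which out-of-vocabulary tokens are replaced with `<unk>`.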
An example of 'validation' looks as follows.
This example was too long and was cropped:
{
    "text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..."
}
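The sample above contains the token `@-@`, which stands in for a hyphen that was split off as its own token (`one @-@ dollar`). A minimal sketch of undoing that for display follows. It assumes the analogous `@,@` and `@.@` markers used inside numbers should be handled the same way; the helper name `rejoin_markers` is hypothetical, and this is a cleanup heuristic rather than part of the dataset tooling.

```python
import re

# Matches " @-@ ", " @,@ " or " @.@ ", including the surrounding spaces.
_MARKER = re.compile(r" @([-,.])@ ")


def rejoin_markers(text: str) -> str:
    """Replace spaced-out intra-word markers with the bare character."""
    return _MARKER.sub(r"\1", text)


sample = "The gold dollar or gold one @-@ dollar piece was a coin"
print(rejoin_markers(sample))
```

Other tokenization spacing, such as the space before commas and periods, is left untouched by this sketch.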
An example of 'train' from wikitext-103-v1 looks as follows.
This example was too long and was cropped:
{
    "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}
An example of 'train' from wikitext-2-raw-v1 looks as follows.
This example was too long and was cropped:
{
    "text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..."
}
An example of 'train' from wikitext-2-v1 looks as follows.
This example was too long and was cropped:
{
    "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}
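The non-raw samples contain the `<unk>` placeholder. Because the text is already space-separated, a plain whitespace split may be enough to measure how often the placeholder occurs. A minimal sketch follows; `unk_rate` is a hypothetical helper, and the sample string is a prefix of the cropped record shown above.

```python
def unk_rate(text: str, unk: str = "<unk>") -> float:
    """Fraction of whitespace-separated tokens equal to the placeholder."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return tokens.count(unk) / len(tokens)


sample = (
    "Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : "
    "戦場のヴァルキュリア3 , lit ."
)
print(f"{unk_rate(sample):.3f}")
```

On a raw configuration the same function would be expected to return 0.0 for most records, since those variants keep the original tokens.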
The data fields are the same among all splits: each example has a single `text` string, as in the samples above.

| name | train | validation | test |
|---|---|---|---|
| wikitext-103-raw-v1 | 1801350 | 3760 | 4358 | 
| wikitext-103-v1 | 1801350 | 3760 | 4358 | 
| wikitext-2-raw-v1 | 36718 | 3760 | 4358 | 
| wikitext-2-v1 | 36718 | 3760 | 4358 | 
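
The row counts in the table can be kept as a small lookup for sanity-checking a download. The sketch below assumes the Hugging Face `datasets` library and that the dataset id `wikitext` named at the top of this card still resolves; `load_checked` is a hypothetical helper, and the import is deferred so the lookup itself works without the library installed.

```python
# Row counts per configuration, copied from the table above.
SPLIT_ROWS = {
    "wikitext-103-raw-v1": {"train": 1801350, "validation": 3760, "test": 4358},
    "wikitext-103-v1": {"train": 1801350, "validation": 3760, "test": 4358},
    "wikitext-2-raw-v1": {"train": 36718, "validation": 3760, "test": 4358},
    "wikitext-2-v1": {"train": 36718, "validation": 3760, "test": 4358},
}


def load_checked(config: str, split: str):
    """Load one split and verify its row count against the table."""
    if config not in SPLIT_ROWS:
        raise ValueError(f"unknown configuration: {config!r}")
    if split not in SPLIT_ROWS[config]:
        raise ValueError(f"unknown split: {split!r}")

    # Deferred import: requires the `datasets` package and network access.
    from datasets import load_dataset

    ds = load_dataset("wikitext", config, split=split)
    expected = SPLIT_ROWS[config][split]
    if ds.num_rows != expected:
        raise RuntimeError(f"expected {expected} rows, got {ds.num_rows}")
    return ds
```

A call such as `load_checked("wikitext-2-raw-v1", "validation")` would download the data on first use. The validation and test counts are identical in all four rows of the table; only the train split differs between the WikiText-2 and WikiText-103 configurations.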
The dataset is available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).
@misc{merity2016pointer,
      title={Pointer Sentinel Mixture Models},
      author={Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher},
      year={2016},
      eprint={1609.07843},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Thanks to @thomwolf, @lewtun, @patrickvonplaten, and @mariamabarham for adding this dataset.