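Dataset:
wikitext
Languages:
Multilinguality:
monolingual
Size:
1M<n<10M
Language creators:
crowdsourced
Annotation creators:
no-annotation
Source datasets:
original
Preprint:
arxiv:1609.07843

WikiText is a language modeling dataset containing over 100 million tokens extracted from Good and Featured articles on Wikipedia. The dataset is released under a Creative Commons license.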
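Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is more than 2 times larger and WikiText-103 is more than 110 times larger. WikiText also has a much larger vocabulary and retains the original case, punctuation, and numbers, all of which are removed in PTB. Because it is composed of full articles, the dataset is well suited to models that can exploit long-term dependencies.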
Each subset comes in two different variants, listed in the table below as the -raw-v1 and -v1 configurations.
An example from 'validation' looks as follows.
This example was too long and was cropped:
{
"text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..."
}
An example from 'train' looks as follows.
This example was too long and was cropped:
{
"text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}
An example from 'train' looks as follows.
This example was too long and was cropped:
{
"text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..."
}
An example from 'train' looks as follows.
This example was too long and was cropped:
{
"text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}
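Each record in the examples above is a JSON object with a single "text" field, and the text is pre-tokenized: hyphenated words appear as "one @-@ dollar". The sketch below shows one way to read such a record and rejoin those tokens. The helper name detokenize_joiners is hypothetical, and treating "@,@" and "@.@" as the matching joiners inside numbers is an assumption, since only "@-@" appears in the examples here.

```python
import json

# Hypothetical helper (not part of any library): rejoin WikiText joiner tokens.
# Only "@-@" is shown in the examples above; "@,@" and "@.@" are assumed to be
# the corresponding joiners used inside numbers.
JOINERS = ((" @-@ ", "-"), (" @,@ ", ","), (" @.@ ", "."))


def detokenize_joiners(text: str) -> str:
    """Replace each spaced joiner token with the character it stands for."""
    for token, char in JOINERS:
        text = text.replace(token, char)
    return text


# A shortened record in the same shape as the examples above.
record = json.loads('{"text": "The gold dollar or gold one @-@ dollar piece was a coin"}')
print(detokenize_joiners(record["text"]))
```

This only undoes the joiner tokens. The spacing around other punctuation and any <unk> placeholders are left untouched.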
The data fields are the same among all splits.
name | train | validation | test |
---|---|---|---|
wikitext-103-raw-v1 | 1801350 | 3760 | 4358 |
wikitext-103-v1 | 1801350 | 3760 | 4358 |
wikitext-2-raw-v1 | 36718 | 3760 | 4358 |
wikitext-2-v1 | 36718 | 3760 | 4358 |
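
As a minimal sketch, the split sizes in the table can be kept in a small lookup and used to validate a configuration name before loading. The helper load_wikitext is hypothetical. It assumes the third-party Hugging Face datasets package is installed and that the dataset is reachable under the repository ID "wikitext", which may differ on the Hub. The import is deferred so the lookup works without that package.

```python
# Row counts copied from the table above.
SPLIT_SIZES = {
    "wikitext-103-raw-v1": {"train": 1801350, "validation": 3760, "test": 4358},
    "wikitext-103-v1": {"train": 1801350, "validation": 3760, "test": 4358},
    "wikitext-2-raw-v1": {"train": 36718, "validation": 3760, "test": 4358},
    "wikitext-2-v1": {"train": 36718, "validation": 3760, "test": 4358},
}


def load_wikitext(config: str = "wikitext-2-raw-v1", split: str = "validation"):
    """Hypothetical wrapper: check the names, then load one split."""
    if config not in SPLIT_SIZES:
        raise ValueError(f"unknown configuration: {config!r}")
    if split not in SPLIT_SIZES[config]:
        raise ValueError(f"unknown split: {split!r}")
    # Assumption: the `datasets` package is installed and "wikitext" resolves
    # to this dataset; imported lazily so the table lookup has no dependency.
    from datasets import load_dataset

    return load_dataset("wikitext", config, split=split)


# Ratio of training rows between the two subsets, computed from the table.
ratio = SPLIT_SIZES["wikitext-103-v1"]["train"] / SPLIT_SIZES["wikitext-2-v1"]["train"]
print(f"WikiText-103 has about {ratio:.0f} times as many training rows as WikiText-2")
```

Calling load_wikitext("wikitext-2-raw-v1", "validation") would download data on first use. The counts above are rows, not tokens, so this ratio is not the same measure as the token-based comparison with PTB.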
The dataset is available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).
@misc{merity2016pointer,
title={Pointer Sentinel Mixture Models},
author={Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher},
year={2016},
eprint={1609.07843},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Thanks to @thomwolf, @lewtun, @patrickvonplaten, and @mariamabarham for adding this dataset.