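Dataset: wikitext
Language: en
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: crowdsourced
Annotations creators: no-annotation
Source datasets: original
Preprint: arxiv:1609.07843

WikiText is a language modeling dataset containing over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is released under a Creative Commons license.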
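
Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText datasets also have a far larger vocabulary and retain the original case, punctuation, and numbers, all of which are removed in PTB. Because it is composed of full articles, the dataset is well suited to models that can take advantage of long-term dependencies.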
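
Each subset comes in two variants: a raw variant (configuration names ending in -raw-v1), which keeps the raw tokens, and a non-raw variant (names ending in -v1), in which out-of-vocabulary tokens are replaced with <unk>.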

wikitext-103-raw-v1

An example of 'validation' looks as follows.

This example was too long and was cropped: { "text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..." }

wikitext-103-v1

An example of 'train' looks as follows.

This example was too long and was cropped: { "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..." }

wikitext-2-raw-v1

An example of 'train' looks as follows.

This example was too long and was cropped: { "text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..." }

wikitext-2-v1

An example of 'train' looks as follows.

This example was too long and was cropped: { "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..." }
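
The cropped examples above show two tokenization markers: `<unk>` and `@-@`. The following is a minimal sketch of how such text might be inspected once loaded. The helper names `unk_rate` and `join_hyphens` are hypothetical, and the sketch assumes that `@-@` stands for a hyphen inside a word; it is not part of any official tooling for the dataset.

```python
import re

UNK = "<unk>"  # placeholder token visible in the non-raw examples


def unk_rate(text: str) -> float:
    """Return the fraction of whitespace-separated tokens equal to <unk>."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(tok == UNK for tok in tokens) / len(tokens)


def join_hyphens(text: str) -> str:
    """Rejoin words split around '@-@' (assumed to mark an intra-word hyphen)."""
    return re.sub(r"\s*@-@\s*", "-", text)


# Fragments copied from the cropped examples in this card.
sample = "The gold dollar or gold one @-@ dollar piece was a coin"
cropped = "Senjō no Valkyria 3 : <unk> Chronicles ( Japanese"

print(join_hyphens(sample))
print(round(unk_rate(cropped), 3))
```

Splitting on whitespace is enough here because the text is already tokenized, with punctuation separated by spaces.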

The data fields are the same among all splits. As the examples show, each record carries a single string field, text.

name | train | validation | test |
---|---|---|---|
wikitext-103-raw-v1 | 1801350 | 3760 | 4358 |
wikitext-103-v1 | 1801350 | 3760 | 4358 |
wikitext-2-raw-v1 | 36718 | 3760 | 4358 |
wikitext-2-v1 | 36718 | 3760 | 4358 |
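
The row counts in the table can serve as a sanity check after downloading. The sketch below assumes the Hugging Face `datasets` library and the Hub id `wikitext` taken from the metadata; the id may differ depending on where the repository is hosted. The helper names `config_name`, `mismatched_splits`, and `load_and_check` are hypothetical.

```python
# Row counts copied from the table above.
EXPECTED_ROWS = {
    "wikitext-103-raw-v1": {"train": 1801350, "validation": 3760, "test": 4358},
    "wikitext-103-v1": {"train": 1801350, "validation": 3760, "test": 4358},
    "wikitext-2-raw-v1": {"train": 36718, "validation": 3760, "test": 4358},
    "wikitext-2-v1": {"train": 36718, "validation": 3760, "test": 4358},
}


def config_name(size: str, raw: bool) -> str:
    """Build a configuration name such as 'wikitext-2-raw-v1'."""
    if size not in ("2", "103"):
        raise ValueError("size must be '2' or '103'")
    return f"wikitext-{size}{'-raw' if raw else ''}-v1"


def mismatched_splits(config: str, actual_rows: dict) -> dict:
    """Return {split: (expected, actual)} for each split whose row count differs."""
    expected = EXPECTED_ROWS[config]
    return {
        split: (count, actual_rows.get(split))
        for split, count in expected.items()
        if actual_rows.get(split) != count
    }


def load_and_check(size: str = "2", raw: bool = True):
    """Download one configuration and compare its split sizes with the table."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    name = config_name(size, raw)
    dataset = load_dataset("wikitext", name)  # assumed Hub id
    rows = {split: dataset[split].num_rows for split in dataset}
    return dataset, mismatched_splits(name, rows)
```

Calling `load_and_check("2", raw=True)` requires network access; an empty second return value would indicate that every split matches the table.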

The dataset is available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).

@misc{merity2016pointer,
  title={Pointer Sentinel Mixture Models},
  author={Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher},
  year={2016},
  eprint={1609.07843},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Thanks to @thomwolf, @lewtun, @patrickvonplaten, and @mariamabarham for adding this dataset.