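Datasets:

imdb

Tasks:

text-classification

Sub-tasks:

sentiment-classification

Languages:

Multilinguality:

monolingual

Size categories:

10K<n<100K

Language creators:

expert-generated

Annotations creators:

expert-generated

Source datasets:

original

License:

other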

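Dataset Card for "imdb"

Dataset Summary

Large Movie Review Dataset. This is a dataset for binary sentiment classification that contains substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and another 25,000 for testing. Additional unlabeled data is also available.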

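Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

plain_text

Size of downloaded dataset files: 84.13 MB
Size of the generated dataset: 133.23 MB
Total amount of disk used: 217.35 MB

An example of 'train' looks as follows.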

{
    "label": 0,
    "text": "Goodbye world2\n"
}

Data Fields

The data fields are the same among all splits.

plain_text

text : a string feature.
label : a classification label, with possible values neg (0) and pos (1).
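
The label encoding described above can be sketched in plain Python. This is a minimal illustration, not part of the dataset tooling: the helper names `label_to_name` and `name_to_label` and the `decoded` record layout are assumptions made for this example. Only the mapping neg (0), pos (1) and the sample record come from this card.

```python
# Label ids and names as listed in the field description above.
LABEL_NAMES = ["neg", "pos"]  # position in the list == integer label id


def label_to_name(label_id: int) -> str:
    """Map an integer label (0 or 1) to its name (hypothetical helper)."""
    if not 0 <= label_id < len(LABEL_NAMES):
        raise ValueError(f"unknown label id: {label_id}")
    return LABEL_NAMES[label_id]


def name_to_label(name: str) -> int:
    """Map a label name ("neg" or "pos") to its integer id (hypothetical helper)."""
    return LABEL_NAMES.index(name)  # raises ValueError for unknown names


# The 'train' example shown in this card.
example = {"label": 0, "text": "Goodbye world2\n"}

# Attach a readable sentiment name and drop the trailing newline.
decoded = {
    "text": example["text"].strip(),
    "sentiment": label_to_name(example["label"]),
}
print(decoded)
```

Keeping the names in a list indexed by label id means the two directions of the mapping cannot drift apart.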

Data Splits

name	train	unsupervised	test
plain_text	25000	50000	25000
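
As a quick sanity check on the table above, the split sizes can be tallied in a few lines of Python. This is only a sketch: the variable names are made up for the example, and treating `unsupervised` as the unlabeled portion mentioned in the summary is an assumption, since the card does not say so explicitly.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {"train": 25000, "unsupervised": 50000, "test": 25000}

# Assumption: only 'train' and 'test' carry sentiment labels.
LABELED_SPLITS = ("train", "test")

labeled_total = sum(SPLIT_SIZES[name] for name in LABELED_SPLITS)
grand_total = sum(SPLIT_SIZES.values())

# Print each split with its share of all examples.
for name, size in SPLIT_SIZES.items():
    share = size / grand_total
    print(f"{name:<13}{size:>7}  {share:.0%}")

print(f"labeled examples: {labeled_total}")
print(f"all examples:     {grand_total}")
```

Under that assumption, the labeled data is split evenly between training and testing.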

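Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information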

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

Contributions

Thanks to @ghazi-f, @patrickvonplaten, @lhoestq, @thomwolf for adding this dataset.

Author:

Anonymous

Dataset size:

14.89 KB