数据集:

ptb_text_only

任务:

文本生成

填充掩码

子任务:

language-modeling masked-language-modeling

语言:

计算机处理:

monolingual

大小:

10K<n<100K

语言创建人:

found

批注创建人:

expert-generated

源数据集:

original

许可:

other

数据集介绍文件清单

中文

Dataset Card for Penn Treebank

Dataset Summary

This is the Penn Treebank Project: Release 2 CDROM, featuring a million words of 1989 Wall Street Journal material. The rare words in this version are already replaced with token. The numbers are replaced with token.

Supported Tasks and Leaderboards

Language Modelling

Languages

The text in the dataset is in American English

Dataset Structure

Data Instances

[Needs More Information]

Data Fields

[Needs More Information]

Data Splits

[Needs More Information]

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

Dataset provided for research purposes only. Please check dataset license for additional information.

Citation Information

@article{marcus-etal-1993-building, title = "Building a Large Annotated Corpus of {E}nglish: The {P}enn {T}reebank", author = "Marcus, Mitchell P. and Santorini, Beatrice and Marcinkiewicz, Mary Ann", journal = "Computational Linguistics", volume = "19", number = "2", year = "1993", url = " https://www.aclweb.org/anthology/J93-2004" , pages = "313--330", }

Contributions

Thanks to @harshalmittal4 for adding this dataset.

作者:

佚名

数据集大小:

13.71 KB