数据集:
gigaword
任务:
摘要生成语言:
en计算机处理:
monolingual大小:
100K<n<1M语言创建人:
found批注创建人:
found源数据集:
extended|gigaword_2003预印本库:
arxiv:1509.00685许可:
mitHeadline-generation on a corpus of article pairs from Gigaword consisting of around 4 million articles. Use the 'org_data' provided by https://github.com/microsoft/unilm/ which is identical to https://github.com/harvardnlp/sent-summary but with better format.
English.
An example of 'train' looks as follows.
{ 'document': "australia 's current account deficit shrunk by a record #.## billion dollars -lrb- #.## billion us -rrb- in the june quarter due to soaring commodity prices , figures released monday showed .", 'summary': 'australian current account deficit narrows sharply' }
The data fields are the same among all splits.
name | train | validation | test |
---|---|---|---|
default | 3803957 | 189651 | 1951 |
From the paper:
For our training set, we pair the headline of each article with its first sentence to create an inputsummary pair. While the model could in theory be trained on any pair, Gigaword contains many spurious headline-article pairs. We therefore prune training based on the following heuristic filters: (1) Are there no non-stop-words in common? (2) Does the title contain a byline or other extraneous editing marks? (3) Does the title have a question mark or colon? After applying these filters, the training set consists of roughly J = 4 million title-article pairs. We apply a minimal preprocessing step using PTB tokenization, lower-casing, replacing all digit characters with #, and replacing of word types seen less than 5 times with UNK. We also remove all articles from the time-period of the DUC evaluation. release. The complete input training vocabulary consists of 119 million word tokens and 110K unique word types with an average sentence size of 31.3 words. The headline vocabulary consists of 31 million tokens and 69K word types with the average title of length 8.3 words (note that this is significantly shorter than the DUC summaries). On average there are 4.6 overlapping word types between the headline and the input; although only 2.6 in the first 75-characters of the input.
Who are the source language producers?From the paper:
For training data for both tasks, we utilize the annotated Gigaword data set (Graff et al., 2003; Napoles et al., 2012), which consists of standard Gigaword, preprocessed with Stanford CoreNLP tools (Manning et al., 2014).
Annotations are inherited from the annotatated Gigaword data set.
Additional information from the paper:
Our model only uses annotations for tokenization and sentence separation, although several of the baselines use parsing and tagging as well.
Who are the annotators?@article{graff2003english, title={English gigaword}, author={Graff, David and Kong, Junbo and Chen, Ke and Maeda, Kazuaki}, journal={Linguistic Data Consortium, Philadelphia}, volume={4}, number={1}, pages={34}, year={2003} } @article{Rush_2015, title={A Neural Attention Model for Abstractive Sentence Summarization}, url={http://dx.doi.org/10.18653/v1/D15-1044}, DOI={10.18653/v1/d15-1044}, journal={Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing}, publisher={Association for Computational Linguistics}, author={Rush, Alexander M. and Chopra, Sumit and Weston, Jason}, year={2015} }
Thanks to @lewtun , @lhoestq , @thomwolf for adding this dataset.