数据集:
newsroom
任务:
摘要生成语言:
en计算机处理:
monolingual语言创建人:
expert-generated批注创建人:
expert-generated源数据集:
original许可:
otherNEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications.
Dataset features includes:
This dataset can be downloaded upon requests. Unzip all the contents "train.jsonl, dev.josnl, test.jsonl" to the tfds folder.
English ( en ).
An example of 'train' looks as follows.
{ "compression": 33.880001068115234, "compression_bin": "medium", "coverage": 1.0, "coverage_bin": "high", "date": "200600000", "density": 11.720000267028809, "density_bin": "extractive", "summary": "some summary 1", "text": "some text 1", "title": "news title 1", "url": "url.html" }
The data fields are the same among all splits.
defaultname | train | validation | test |
---|---|---|---|
default | 995041 | 108837 | 108862 |
https://cornell.qualtrics.com/jfe/form/SV_6YA3HQ2p75XH4IR
This Dataset Usage Agreement ("Agreement") is a legal agreement with the Cornell Newsroom Summaries Team ("Newsroom") for the Dataset made available to the individual or entity ("Researcher") exercising rights under this Agreement. "Dataset" includes all text, data, information, source code, and any related materials, documentation, files, media, updates or revisions.
The Dataset is intended for non-commercial research and educational purposes only, and is made available free of charge without extending any license or other intellectual property rights. By downloading or using the Dataset, the Researcher acknowledges that they agree to the terms in this Agreement, and represent and warrant that they have authority to do so on behalf of any entity exercising rights under this Agreement. The Researcher accepts and agrees to be bound by the terms and conditions of this Agreement. If the Researcher does not agree to this Agreement, they may not download or use the Dataset.
By sharing content with Newsroom, such as by submitting content to this site or by corresponding with Newsroom contributors, the Researcher grants Newsroom the right to use, reproduce, display, perform, adapt, modify, distribute, have distributed, and promote the content in any form, anywhere and for any purpose, such as for evaluating and comparing summarization systems. Nothing in this Agreement shall obligate Newsroom to provide any support for the Dataset. Any feedback, suggestions, ideas, comments, improvements given by the Researcher related to the Dataset is voluntarily given, and may be used by Newsroom without obligation or restriction of any kind.
The Researcher accepts full responsibility for their use of the Dataset and shall defend indemnify, and hold harmless Newsroom, including their employees, trustees, officers, and agents, against any and all claims arising from the Researcher's use of the Dataset. The Researcher agrees to comply with all laws and regulations as they relate to access to and use of the Dataset and Service including U.S. export jurisdiction and other U.S. and international regulations.
THE DATASET IS PROVIDED "AS IS." NEWSROOM DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. WITHOUT LIMITATION OF THE ABOVE, NEWSROOM DISCLAIMS ANY WARRANTY THAT DATASET IS BUG OR ERROR-FREE, AND GRANTS NO WARRANTY REGARDING ITS USE OR THE RESULTS THEREFROM INCLUDING, WITHOUT LIMITATION, ITS CORRECTNESS, ACCURACY, OR RELIABILITY. THE DATASET IS NOT WARRANTIED TO FULFILL ANY PARTICULAR PURPOSES OR NEEDS.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT SHALL NEWSROOM BE LIABLE FOR ANY LOSS, DAMAGE OR INJURY, DIRECT AND INDIRECT, INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER FOR BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE) OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, INCLUDING BUT NOT LIMITED TO LOSS OF PROFITS, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS SHALL APPLY NOTWITHSTANDING ANY FAILURE OF ESSENTIAL PURPOSE OF ANY LIMITED REMEDY.
This Agreement is effective until terminated. Newsroom reserves the right to terminate the Researcher's access to the Dataset at any time. If the Researcher breaches this Agreement, the Researcher's rights to use the Dataset shall terminate automatically. The Researcher will immediately cease all use and distribution of the Dataset and destroy any copies or portions of the Dataset in their possession.
This Agreement is governed by the laws of the State of New York, without regard to conflict of law principles. All terms and provisions of this Agreement shall, if possible, be construed in a manner which makes them valid, but in the event any term or provision of this Agreement is found by a court of competent jurisdiction to be illegal or unenforceable, the validity or enforceability of the remainder of this Agreement shall not be affected.
This Agreement is the complete and exclusive agreement between the parties with respect to its subject matter and supersedes all prior or contemporaneous oral or written agreements or understandings relating to the subject matter.
@inproceedings{N18-1065, author = {Grusky, Max and Naaman, Mor and Artzi, Yoav}, title = {NEWSROOM: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies}, booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies}, year = {2018}, }
Thanks to @lewtun , @patrickvonplaten , @yoavartzi , @thomwolf for adding this dataset.