The tldr_news dataset was constructed by collecting a daily tech newsletter (available here). Then, for every piece of news, the headline and its corresponding content were extracted. The newsletter is also divided into different sections, so the section each piece of news belongs to was added as extra information.
Such a dataset can be used to train a model to generate a headline from an input piece of text.
There are no officially supported tasks or leaderboards for this dataset. However, it could be used for the following tasks:
en
A data point comprises a "headline", its corresponding "content", and the "category" (newsletter section) it was published under. An example is as follows:
{ "headline": "Cana Unveils Molecular Beverage Printer, a ‘Netflix for Drinks’ That Can Make Nearly Any Type of Beverage ", "content": "Cana has unveiled a drink machine that can synthesize almost any drink. The machine uses a cartridge that contains flavor compounds that can be combined to create the flavor of nearly any type of drink. It is about the size of a toaster and could potentially save people from throwing hundreds of containers away every month by allowing people to create whatever drinks they want at home. Around $30 million was spent building Cana’s proprietary hardware platform and chemistry system. Cana plans to start full production of the device and will release pricing by the end of February.", "category": "Science and Futuristic Technology" }
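As a minimal sketch of how such a data point could be prepared for headline generation, the snippet below parses a record and turns it into an (input, target) pair. The helper name `to_headline_pair` and the `"summarize: "` prefix are assumptions made for illustration and are not part of the dataset; the record is an abbreviated version of the example above.

```python
import json


def to_headline_pair(record, prefix="summarize: "):
    """Build (model_input, target) from one data point.

    Hypothetical helper: the field names follow the example record,
    and the prefix is an arbitrary choice, not something the dataset defines.
    """
    source = prefix + record["content"].strip()
    # Headlines may carry stray whitespace, so strip it from the target.
    target = record["headline"].strip()
    return source, target


# Abbreviated copy of the example data point, used only for illustration.
raw = (
    '{"headline": "Cana Unveils Molecular Beverage Printer ", '
    '"content": "Cana has unveiled a drink machine that can synthesize almost any drink.", '
    '"category": "Science and Futuristic Technology"}'
)

record = json.loads(raw)
source, target = to_headline_pair(record)
print(source)
print(target)
```

The "category" field is left unused here; it could be prepended to the input or used as a filter, depending on the task.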
This dataset was obtained by scraping all the existing newsletters available here.
Every newsletter was then processed to extract its individual pieces of news. For every collected piece of news, the headline and the news content were extracted.
The dataset was collected from https://tldr.tech/newsletter .
In order to clean up the samples and to construct a dataset better suited for headline generation, we have applied a couple of normalization steps:
The people (or person) behind the https://tldr.tech/ newsletter.
Disclaimers: The dataset was generated from a daily newsletter. The author had no intention for those newsletters to be used as such.
Who are the annotators? The newsletters were written by the people behind TLDR tech.
[Needs More Information]
[Needs More Information]
This dataset only contains tech news. A model trained on it might not generalize to other domains.
[Needs More Information]
The dataset was obtained by collecting newsletters from this website: https://tldr.tech/newsletter
Thanks to @JulesBelveze for adding this dataset.