Dataset Card for Dutch CNN Dailymail Dataset

Dataset Summary

The Dutch CNN / DailyMail Dataset is a machine-translated version of the English CNN / Dailymail dataset containing just over 300k unique news aticles as written by journalists at CNN and the Daily Mail.

Most information about the dataset can be found on the HuggingFace page of the original English version.

These are the basic steps used to create this dataset (+ some chunking):

load_dataset("cnn_dailymail", '3.0.0')

And this is the HuggingFace translation pipeline:

pipeline(
    task='translation_en_to_nl',
    model='Helsinki-NLP/opus-mt-en-nl',
    tokenizer='Helsinki-NLP/opus-mt-en-nl')

Data Fields

id : a string containing the heximal formated SHA1 hash of the url where the story was retrieved from
article : a string containing the body of the news article
highlights : a string containing the highlight of the article as written by the article author

Data Splits

The Dutch CNN/DailyMail dataset follows the same splits as the original English version and has 3 splits: train , validation , and test .

Dataset Split	Number of Instances in Split
Train	287,113
Validation	13,368
Test	11,490

作者:

ml6team

数据集大小:

524.07 MB