Dataset:
ml6team/cnn_dailymail_nl
The Dutch CNN / DailyMail Dataset is a machine-translated version of the English CNN / DailyMail dataset, containing just over 300k unique news articles written by journalists at CNN and the Daily Mail.
Most information about the dataset can be found on the HuggingFace page of the original English version.
These are the basic steps used to create this dataset (+ some chunking):
load_dataset("cnn_dailymail", '3.0.0')
And this is the HuggingFace translation pipeline:
pipeline(task='translation_en_to_nl', model='Helsinki-NLP/opus-mt-en-nl', tokenizer='Helsinki-NLP/opus-mt-en-nl')
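
The chunking step is not spelled out above, so the following is only a minimal sketch of how long articles could be split before translation, not the exact procedure used for this dataset. The helper names (`chunk_text`, `translate_text`, `build_translator`) and the 400-character limit are assumptions made for illustration; only the pipeline arguments come from the description above.

```python
import re
from typing import Callable, List


def chunk_text(text: str, max_chars: int = 400) -> List[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    The 400-character default is an illustrative assumption.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_text(
    text: str,
    translate: Callable[[List[str]], List[str]],
    max_chars: int = 400,
) -> str:
    """Translate text chunk by chunk and join the translated chunks with spaces."""
    chunks = chunk_text(text, max_chars)
    return " ".join(translate(chunks)) if chunks else ""


def build_translator() -> Callable[[List[str]], List[str]]:
    """Wrap the Hugging Face pipeline described above as a batch translator.

    Imported lazily because it downloads the model on first use.
    """
    from transformers import pipeline

    pipe = pipeline(
        task="translation_en_to_nl",
        model="Helsinki-NLP/opus-mt-en-nl",
        tokenizer="Helsinki-NLP/opus-mt-en-nl",
    )

    def translate(batch: List[str]) -> List[str]:
        return [out["translation_text"] for out in pipe(batch)]

    return translate
```

With the model available, `translate_text(article, build_translator())` would translate one article. Passing the translator as a plain callable keeps the chunking logic testable without downloading the model.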
The Dutch CNN / DailyMail dataset follows the original English version and has the same three splits: train, validation, and test.
Dataset Split | Number of Instances in Split |
---|---|
Train | 287,113 |
Validation | 13,368 |
Test | 11,490 |
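
As a quick sanity check on the table above, the sketch below sums the split sizes and computes each split's share of the total. The function names are hypothetical. `check_split_sizes` assumes the dataset can be loaded by its Hub ID with the `datasets` library and that the split names match the table, so it is wrapped in a function and not run by default.

```python
from typing import Dict

# Split sizes copied from the table above.
EXPECTED_SPLIT_SIZES: Dict[str, int] = {
    "train": 287_113,
    "validation": 13_368,
    "test": 11_490,
}


def total_instances(sizes: Dict[str, int]) -> int:
    """Total number of instances across all splits."""
    return sum(sizes.values())


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Share of the whole dataset that each split represents."""
    total = total_instances(sizes)
    return {name: count / total for name, count in sizes.items()}


def check_split_sizes(dataset_id: str = "ml6team/cnn_dailymail_nl") -> Dict[str, bool]:
    """Compare the downloaded split sizes with the table (requires network access)."""
    from datasets import load_dataset

    dataset = load_dataset(dataset_id)
    return {
        name: dataset[name].num_rows == expected
        for name, expected in EXPECTED_SPLIT_SIZES.items()
    }


if __name__ == "__main__":
    print(total_instances(EXPECTED_SPLIT_SIZES))
    for name, fraction in split_fractions(EXPECTED_SPLIT_SIZES).items():
        print(f"{name}: {fraction:.1%}")
```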