数据集:
TurkuNLP/xlsum-fi
This dataset is a DeepL -based machine translation of a part of the English section of the XLSum dataset: https://github.com/csebuetnlp/xl-sum In the present version, only examples where the full version is at most 10x the summary in length are included. We might translate more later.
One example from the Finnish dataset is given below in JSON format.
{ "id": "technology-17657859", "url": "https://www.bbc.com/news/technology-17657859", "title": "Walesin myrskytuulien vuoksi annettu säävaroitus", "summary": "Tuulet voivat yltyä Walesissa myrskytuuliin, ja myrskysää on luvassa koko maahan tällä viikolla.", "text": "Met Office on antanut Walesin ja Englannin kattavan keltaisen tuulivaroituksen keskiviikkoillasta kello 21.00 GMT alkaen. Matkustaminen ja sähkönjakelu todennäköisesti häiriintyvät, ja varoitus on voimassa torstaihin kello 15:00 asti. Puuskat ovat todennäköisesti nopeudeltaan 88 kilometriä tunnissa, ja rannikoilla ja kukkuloilla puuskat voivat nousta jopa 70 kilometriin tunnissa, ja lisäksi voi esiintyä rankkasateita ja myrskyisiä sadekuuroja." }
Follows the XLSum dataset.
Detailed in the paper For this present dataset, only English was used as the source and only examples where the full text is at maximum 10x in length compared to the summary are preserved. This 10x cutoff is naturally measured on English.
Who are the source language producers?Detailed in the paper DeepL was used to machine-translate from English to Finnish
Annotation process Who are the annotators?Due to DeepL terms and conditions, this dataset must not be used for any machine translation work , namely machine translation system development and evaluation of any kind. In general, we wish you do not pair the original English data with the translations except when working on research unrelated to machine translation, so as not to infringe on the terms and conditions.
Contents of this repository are restricted to only non-commercial research purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) . Copyright of the dataset contents belongs to the original copyright holders.
If you use any of the datasets, models or code modules, please cite the original XL-Sum paper below as well as acknowledge Filip Ginter and the TurkuNLP group for the Finnish machine translated version.
@inproceedings{hasan-etal-2021-xl, title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages", author = "Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat", booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021", month = aug, year = "2021", address = "Online", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.findings-acl.413", pages = "4693--4703", }
Thanks to the creators of the XLSum dataset!