Dataset:

TurkuNLP/xlsum-fi

Dataset Card for "XL-Sum-FI"

Dataset Summary

This dataset is a DeepL-based machine translation of part of the English section of the XLSum dataset: https://github.com/csebuetnlp/xl-sum. In the present version, only examples where the full text is at most 10x the length of the summary are included. We might translate more later.

Supported Tasks and Leaderboards

Languages

  • Finnish

Dataset Structure

Data Instances

One example from the Finnish dataset is given below in JSON format.

{
  "id": "technology-17657859",
  "url": "https://www.bbc.com/news/technology-17657859",
  "title": "Walesin myrskytuulien vuoksi annettu säävaroitus",
  "summary": "Tuulet voivat yltyä Walesissa myrskytuuliin, ja myrskysää on luvassa koko maahan tällä viikolla.",
  "text": "Met Office on antanut Walesin ja Englannin kattavan keltaisen tuulivaroituksen keskiviikkoillasta kello 21.00 GMT alkaen. Matkustaminen ja sähkönjakelu todennäköisesti häiriintyvät, ja varoitus on voimassa torstaihin kello 15:00 asti. Puuskat ovat todennäköisesti nopeudeltaan 88 kilometriä tunnissa, ja rannikoilla ja kukkuloilla puuskat voivat nousta jopa 70 kilometriin tunnissa, ja lisäksi voi esiintyä rankkasateita ja myrskyisiä sadekuuroja."
}

Data Fields

  • 'id': A string representing the article ID, matching the ID in the original XLSum dataset
  • 'url': A string representing the article URL, as in the original XLSum dataset
  • 'title': A string containing the article title, machine-translated to Finnish
  • 'summary': A string containing the article summary, machine-translated to Finnish
  • 'text': A string containing the article text, machine-translated to Finnish
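
As a minimal sketch of how the five fields above might be checked when reading a record, the snippet below parses one JSON object and verifies that each field is present and is a string. The names `XLSumFiRecord`, `REQUIRED_FIELDS`, and `parse_record` are hypothetical and not part of the dataset or any library; only the field names come from this card.

```python
import json
from typing import TypedDict


class XLSumFiRecord(TypedDict):
    """Hypothetical type mirroring the five fields listed in this card."""
    id: str
    url: str
    title: str
    summary: str
    text: str


# Field names as documented in the Data Fields section.
REQUIRED_FIELDS = ("id", "url", "title", "summary", "text")


def parse_record(raw: str) -> XLSumFiRecord:
    """Parse one JSON object and check that every documented field is a string."""
    obj = json.loads(raw)
    for name in REQUIRED_FIELDS:
        if not isinstance(obj.get(name), str):
            raise ValueError(f"missing or non-string field: {name}")
    # Keep only the documented fields; any extra keys are dropped.
    return XLSumFiRecord(**{name: obj[name] for name in REQUIRED_FIELDS})
```

Dropping undocumented keys is a design choice of this sketch, not a property of the dataset.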

Data Splits

The splits follow those of the original XLSum dataset.

Dataset Creation

Curation Rationale

Source Data

BBC News

Initial Data Collection and Normalization

Detailed in the XL-Sum paper. For the present dataset, only English was used as the source, and only examples where the full text is at most 10x the length of the summary are preserved. This 10x cutoff is measured on the English originals.
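
The length cutoff described here can be sketched as a simple filter. This is an illustration under stated assumptions, not the curators' actual script: the card does not say whether length was counted in characters, words, or tokens, so the sketch assumes characters, and the function names `within_length_ratio` and `filter_examples` are hypothetical.

```python
from typing import Dict, Iterable, List


def within_length_ratio(text: str, summary: str, max_ratio: float = 10.0) -> bool:
    """Return True if the text is at most max_ratio times as long as the summary.

    Assumption: length is counted in characters; the card does not specify the unit.
    """
    if not summary:
        # An empty summary cannot satisfy a ratio bound; treat it as excluded.
        return False
    return len(text) <= max_ratio * len(summary)


def filter_examples(examples: Iterable[Dict[str, str]],
                    max_ratio: float = 10.0) -> List[Dict[str, str]]:
    """Keep only examples whose 'text' passes the ratio check against 'summary'."""
    return [
        ex for ex in examples
        if within_length_ratio(ex["text"], ex["summary"], max_ratio)
    ]
```

Since the cutoff is measured on English, such a filter would be applied to the English `text` and `summary` fields before translation, not to the Finnish output.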

Who are the source language producers?

Detailed in the XL-Sum paper.

Annotations

Detailed in the XL-Sum paper. DeepL was used to machine-translate from English to Finnish.

Annotation process

Detailed in the XL-Sum paper.

Who are the annotators?

Detailed in the XL-Sum paper.

Personal and Sensitive Information

More information needed

Considerations for Using the Data

Social Impact of Dataset

Discussion of Biases

Other Known Limitations

Due to the DeepL terms and conditions, this dataset must not be used for any machine translation work, namely the development or evaluation of machine translation systems of any kind. In general, we ask that you do not pair the original English data with the translations, except when working on research unrelated to machine translation, so as not to infringe on the terms and conditions.

Additional Information

Dataset Curators

Licensing Information

Contents of this repository are restricted to non-commercial research purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use any of the datasets, models, or code modules, please cite the original XL-Sum paper below and acknowledge Filip Ginter and the TurkuNLP group for the Finnish machine-translated version.

@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid  and
      Bhattacharjee, Abhik  and
      Islam, Md. Saiful  and
      Mubasshir, Kazi  and
      Li, Yuan-Fang  and
      Kang, Yong-Bin  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.413",
    pages = "4693--4703",
}

Contributions

Thanks to the creators of the XLSum dataset!