Dataset:

GEM/turku_hockey_data2text

Task:

table-to-text

Language:

Multilinguality:

unknown

Size:

size_categories:unknown

Language creators:

unknown

Annotation creators:

expert-created

Source datasets:

original

Other:

data-to-text

License:

cc-by-nc-sa-4.0

Dataset Card for GEM/turku_hockey_data2text

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

This is a Finnish data-to-text dataset in which the input is structured information about a hockey game and the output a description of the game.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/turku_hockey_data2text')

The data loader can be found here.

website

Website

paper

ACL anthology

authors

Jenna Kanerva, Samuel Rönnqvist, Riina Kekki, Tapio Salakoski, Filip Ginter (TurkuNLP / University of Turku)

Dataset Overview

Where to find the Data and its Documentation

Webpage

Website

Download

Github

Paper

ACL anthology

BibTex

@inproceedings{kanerva2019newsgen,
  Title = {Template-free Data-to-Text Generation of Finnish Sports News},
  Author = {Jenna Kanerva and Samuel R{\"o}nnqvist and Riina Kekki and Tapio Salakoski and Filip Ginter},
  booktitle = {Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa’19)},
  year={2019}
  }

Contact Name

Jenna Kanerva, Filip Ginter

Contact Email

jmnybl@utu.fi, figint@utu.fi

Has a Leaderboard?

Languages and Intended Use

Multilingual?

Covered Dialects

written standard language

Covered Languages

Finnish

Whose Language?

The original news articles were written by professional journalists. The text passages extracted during corpus annotation may have been slightly edited relative to the original wording.

License

cc-by-nc-sa-4.0: Creative Commons Attribution Non Commercial Share Alike 4.0 International

Intended Use

This dataset was developed as a benchmark for evaluating template-free, machine learning methods on Finnish news generation in the area of ice hockey reporting.

Primary Task

Data-to-Text

Communicative Goal

Describe an event from an ice hockey game based on the given structural data.

Credit

Curation Organization Type(s)

academic

Curation Organization(s)

University of Turku

Dataset Creators

Jenna Kanerva, Samuel Rönnqvist, Riina Kekki, Tapio Salakoski, Filip Ginter (TurkuNLP / University of Turku)

Funding

The project was supported by the Google Digital News Innovation Fund.

Who added the Dataset to GEM?

Jenna Kanerva, Filip Ginter (TurkuNLP / University of Turku)

Dataset Structure

Data Fields

The dataset consists of games, where each game is a list of events. If an event was annotated (that is, a corresponding sentence was found in the news article), its text field holds a non-empty string; otherwise the field is the empty string ("").

For each game (dict), there are keys gem_id (string), id (string), news_article (string), and events (list).

For each event (dict), the keys that carry non-empty values depend on the event type (e.g. goal or penalty). The mandatory keys for each event are event_id (string), event_type (string), text (string, empty string if not annotated), and multi_reference (bool). Keys that are not relevant to the specific event type are left empty.

The following keys are relevant for every event type:

event_id: Identifier of the event, unique to the game but not globally, in chronological order (string)
event_type: Type of the event, possible values are game result, goal, penalty, or saves (string)
text: Natural language description of the event, or empty string if not available (string)
multi_reference: Does this event refer to a text passage describing multiple events? (bool)

The rest of the fields are specific to the event type. The relevant fields for each event type are:

game result:
event_id: Identifier of the event, unique to the game but not globally, in chronological order (string)
event_type: Type of the event (string)
home_team: Name of the home team (string)
guest_team: Name of the guest team (string)
score: Final score of the game, in the form of home–guest (string)
periods: Scores for individual periods, each in the form of home–guest score in that period (list of strings)
features: Additional features, such as overtime win or shoot out (list of strings)
text: Natural language description of the event, or empty string if not available (string)
multi_reference: Does this event refer to a text passage describing multiple events? (bool)

goal:
event_id: Identifier of the event, unique to the game but not globally, in chronological order (string)
event_type: Type of the event (string)
player: Name of the player scoring (string)
assist: Names of the players assisting, at most two players (list of strings)
team: Team scoring with possible values of home or guest (string)
team_name: Name of the team scoring (string)
score: Score after the goal, in the form of home–guest (string)
time: Time of the goal, minutes and seconds from the beginning (string)
features: Additional features, such as power play or short-handed goal (list of strings)
text: Natural language description of the event, or empty string if not available (string)
multi_reference: Does this event refer to a text passage describing multiple events? (bool)

penalty:
event_id: Identifier of the event, unique to the game but not globally, in chronological order (string)
event_type: Type of the event (string)
player: Name of the player getting the penalty (string)
team: Team getting the penalty with possible values of home or guest (string)
team_name: Name of the team getting the penalty (string)
penalty_minutes: Penalty minutes (string)
time: Time of the penalty, minutes and seconds from the beginning (string)
text: Natural language description of the event, or empty string if not available (string)
multi_reference: Does this event refer to a text passage describing multiple events? (bool)

saves:
event_id: Identifier of the event, unique to the game but not globally, in chronological order (string)
event_type: Type of the event (string)
player: Name of the goalkeeper (string)
team: Team of the goalkeeper with possible values of home or guest (string)
team_name: Name of the team (string)
saves: Number of saves in the game (string)
text: Natural language description of the event, or empty string if not available (string)
multi_reference: Does this event refer to a text passage describing multiple events? (bool)
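The per-type field lists above can be turned into a small lookup that drops the keys left empty for a given event type. The following is only a sketch: relevant_fields and parse_score are hypothetical helper names that are not part of the dataset loader, and the score parser assumes the separator is an en dash (as in the example instance) or a plain hyphen.

```python
import re

# Key names copied from the field descriptions in this card.
COMMON_KEYS = ["event_id", "event_type", "text", "multi_reference"]
TYPE_KEYS = {
    "game result": ["home_team", "guest_team", "score", "periods", "features"],
    "goal": ["player", "assist", "team", "team_name", "score", "time", "features"],
    "penalty": ["player", "team", "team_name", "penalty_minutes", "time"],
    "saves": ["player", "team", "team_name", "saves"],
}


def relevant_fields(event):
    """Return only the keys that are meaningful for this event's type."""
    keys = COMMON_KEYS + TYPE_KEYS[event["event_type"]]
    return {key: event[key] for key in keys if key in event}


def parse_score(score):
    """Split a 'home–guest' score string into a (home, guest) pair of ints."""
    # Assumption: the separator is an en dash or a plain hyphen.
    home, guest = re.split(r"[–-]", score.strip())
    return int(home), int(guest)
```

For a penalty event, relevant_fields keeps penalty_minutes and time but drops the empty score and saves keys.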

Text passages describing multiple events (multi_reference):

Some text passages refer to multiple events in such a way that separating them into individual statements is not adequate (e.g. "The home team received two penalties towards the end of the first period."). In these cases, multiple events are aligned to the same text passage: the first event (in chronological order) includes the annotated text passage, while the remaining events that refer to the same passage carry the identifier of the first event in their text field (e.g. text : "E4").
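A minimal sketch of how such pointers could be resolved, assuming each game's events are available as a list of dicts and that a text value equal to another event's event_id in the same game is a pointer rather than a passage. The helper name group_by_passage is hypothetical and not part of the dataset loader.

```python
def group_by_passage(events):
    """Map the id of each passage-bearing event to its text and all event ids it covers."""
    ids = {event["event_id"] for event in events}
    groups = {}

    # First pass: events that carry an actual text passage.
    for event in events:
        text = event["text"]
        if text and text not in ids:
            groups[event["event_id"]] = {"text": text, "event_ids": [event["event_id"]]}

    # Second pass: events whose text field points at another event's id.
    for event in events:
        target = event["text"]
        if target in ids and target in groups:
            groups[target]["event_ids"].append(event["event_id"])

    return groups
```

Events with an empty text field are not annotated and therefore do not appear in the result.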

Example Instance

{
  'gem_id': 'gem-turku_hockey_data2text-train-0',
  'id': '20061031-TPS-HPK',
  'news_article': 'HPK:n hyvä syysvire jatkuu jääkiekon SM-liigassa. Tiistaina HPK kukisti mainiolla liikkeellä ja tehokkaalla ylivoimapelillä TPS:n vieraissa 1–0 (1–0, 0–0, 0–0).\nHPK hyödynsi ylivoimaa mennen jo ensimmäisessä erässä Mikko Mäenpään maalilla 1–0 -johtoon.\nToisessa ja kolmannessa erässä HPK tarjosi edelleen TPS:lle runsaasti tilanteita, mutta maalia eivät turkulaiset millään ilveellä saaneet. Pahin este oli loistavan pelin Hämeenlinnan maalilla pelannut Mika Oksa.\nTPS:n maalissa Jani Hurme ei osumille mitään mahtanut. Joukkueen suuri yksinäinen kenttäpelaaja oli Kai Nurminen, mutta hänelläkään ei ollut onnea maalitilanteissa.',
  'events':
    {
      'event_id': ['E1', 'E2', 'E3'],
      'event_type': ['game result', 'penalty', 'goal'],
      'text': ['HPK kukisti TPS:n vieraissa 1–0 (1–0, 0–0, 0–0).', '', 'HPK hyödynsi ylivoimaa mennen jo ensimmäisessä erässä Mikko Mäenpään maalilla 1–0 -johtoon.'],
      'home_team': ['TPS', '', ''],
      'guest_team': ['HPK', '', ''],
      'score': ['0–1', '', '0–1'],
      'periods': [['0–1', '0–0', '0–0'], [], []],
      'features': [[], [], ['power play']],
      'player': ['', 'Fredrik Svensson', 'Mikko Mäenpää'],
      'assist': [[], [], ['Jani Keinänen', 'Toni Mäkiaho']],
      'team': ['', 'guest', 'guest'],
      'team_name': ['', 'HPK', 'HPK'],
      'time': ['', '9.28', '14.57'],
      'penalty_minutes': ['', '2', ''],
      'saves': ['', '', ''],
      'multi_reference': [False, False, False]
    }
}
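The example instance stores events column-wise: each key maps to a list with one entry per event, whereas the field description speaks of a list of event dicts. A minimal sketch of converting the former into the latter, assuming every list under events has the same length (events_to_rows is a hypothetical helper name, not part of the dataset loader):

```python
def events_to_rows(events):
    """Turn {'event_id': [...], 'event_type': [...], ...} into a list of per-event dicts."""
    count = len(events["event_id"])
    return [
        {key: values[i] for key, values in events.items()}
        for i in range(count)
    ]
```

Applied to the instance above, the third row would hold the goal event with player 'Mikko Mäenpää' and time '14.57'.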

Data Splits

The corpus includes three splits: train, validation, and test.

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

The dataset was created to develop machine-learned text generation models for Finnish ice hockey news, where the generated text would reflect the natural language variation found in game reports written by professional journalists. While the original game reports often include additional information not derivable from the game statistics, the corpus was fully manually curated to remove all such information from the natural language descriptions. The rationale for this curation was to prevent models from 'hallucinating' additional facts.

Similar Datasets

yes

Unique Language Coverage

yes

Difference from other GEM datasets

This is the only data2text corpus for Finnish in GEM.

Ability that the Dataset measures

morphological inflection, language variation

GEM-Specific Curation

Modified for GEM?

yes

GEM Modifications

data points modified

Modification Details

Structural data was translated into English.

Additional Splits?

Getting Started with the Task

Previous Results

Metrics

BLEU, METEOR, ROUGE, WER

Proposed Evaluation

Automatic evaluation: BLEU, NIST, METEOR, ROUGE-L, CIDEr
Manual evaluation: factual mistakes, grammatical errors, minimum edit distance to an acceptable game report (using WER)
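For reference, word error rate is commonly computed as the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The following is a minimal sketch under that standard definition, assuming whitespace tokenization; it is not the authors' evaluation script, and word_error_rate is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)
```

In the manual evaluation setting described above, the reference would be the post-edited, acceptable game report and the hypothesis the generated one.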

Previous results available?

yes

Dataset Curation

Original Curation

Original Curation Rationale

The dataset is designed for text generation (data2text), where the original source of natural language descriptions is news articles written by journalists. While the link between structural data (ice hockey game statistics) and the news articles describing the game was quite weak (news articles including a lot of information not derivable from the statistics, while leaving many events unmentioned), the corpus includes full manual annotation aligning the events extracted from game statistics and the corresponding natural language passages extracted from the news articles.

Each event is manually aligned to a sentence-like passage, and if a suitable passage was not found, the annotation is left empty (with value None). The extracted passages were manually modified so that they do not include additional information that is neither derivable from the game statistics nor considered world knowledge. The manual curation of passages is designed to prevent model hallucination, i.e. the model learning to generate facts not derivable from the input data.

Communicative Goal

Describing the given events (structural data) in natural language, and therefore generating ice hockey game reports.

Sourced from Different Sources

Language Data

How was Language Data Obtained?

Other

Language Producers

The initial data, both game statistics and news articles, were obtained from the Finnish News Agency STT news archives released for academic use (http://urn.fi/urn:nbn:fi:lb-2019041501). The original news articles were written by professional journalists.

We (TurkuNLP) gratefully acknowledge the collaboration of Maija Paikkala, Salla Salmela and Pihla Lehmusjoki from the Finnish News Agency STT while creating the corpus.

Topics Covered

Ice hockey, news

Data Validation

not validated

Was Data Filtered?

algorithmically

Filter Criteria

Only games for which both game statistics and a news article describing the game were available (based on timestamps and team names) are included.

Structured Annotations

Additional Annotations?

expert created

Number of Raters

Rater Qualifications

Members of the TurkuNLP research group, native speakers of Finnish.

Raters per Training Example

Raters per Test Example

Annotation Service?

Annotation Values

Manual alignment of events and their natural language descriptions. Removal of information that is not derivable from the input data or world knowledge, in order to prevent model 'hallucination'.

Any Quality Control?

validated by data curators

Quality Control Details

Manual inspection of examples during the initial annotation training phase.

Consent

Any Consent Policy?

yes

Consent Policy Details

The corpus license was agreed with the providers of the source material.

Private Identifying Information (PII)

Contains PII?

yes/very likely

Categories of PII

generic PII

Any PII Identification?

no identification

Maintenance

Any Maintenance Plan?

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Discussion of Biases

Any Documented Social Biases?

Are the Language Producers Representative of the Language?

The dataset represents only written standard language.

Considerations for Using the Data

PII Risks and Liability

Potential PII Risk

None

Licenses

non-commercial use only

Known Technical Limitations

Author:

GEM

Dataset size:

20.67 MB