Dataset: GEM/turku_hockey_data2text
Tasks: table-to-text
Languages: fi
Multilinguality: unknown
Language creators: unknown
Annotations creators: expert-created
Source datasets: original
Other: data-to-text
License: cc-by-nc-sa-4.0

You can find the main data card on the GEM Website.
This is a Finnish data-to-text dataset in which the input is structured information about a hockey game and the output a description of the game.
You can load the dataset via:
    import datasets
    data = datasets.load_dataset('GEM/turku_hockey_data2text')
The data loader can be found here .
Authors: Jenna Kanerva, Samuel Rönnqvist, Riina Kekki, Tapio Salakoski, Filip Ginter (TurkuNLP / University of Turku)

@inproceedings{kanerva2019newsgen,
  Title = {Template-free Data-to-Text Generation of Finnish Sports News},
  Author = {Jenna Kanerva and Samuel R{\"o}nnqvist and Riina Kekki and Tapio Salakoski and Filip Ginter},
  booktitle = {Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa’19)},
  year = {2019}
}

Contact Name: Jenna Kanerva, Filip Ginter
Contact Email: jmnybl@utu.fi, figint@utu.fi
Has a Leaderboard? no
no
Covered Dialects: written standard language
Covered Languages: Finnish
Whose Language? The original news articles were written by professional journalists. The text passages extracted during corpus annotation may have been slightly edited relative to the original wording.
License: cc-by-nc-sa-4.0 (Creative Commons Attribution Non Commercial Share Alike 4.0 International)
Intended Use: This dataset was developed as a benchmark for evaluating template-free machine learning methods on Finnish news generation in the area of ice hockey reporting.
Primary Task: Data-to-Text
Communicative Goal: Describe an event from an ice hockey game based on the given structural data.
academic
Curation Organization(s): University of Turku
Dataset Creators: Jenna Kanerva, Samuel Rönnqvist, Riina Kekki, Tapio Salakoski, Filip Ginter (TurkuNLP / University of Turku)
Funding: The project was supported by the Google Digital News Innovation Fund.
Who added the Dataset to GEM? Jenna Kanerva, Filip Ginter (TurkuNLP / University of Turku)
The dataset consists of games, where each game is a list of events. If an event was annotated (a corresponding sentence was found in the news article), its text field holds a value other than the empty string ("").
For each game (dict), there are keys gem_id (string), id (string), news_article (string), and events (list).
For each event (dict), the keys that carry non-empty values depend on the event type (e.g. goal or penalty). The mandatory keys for each event are event_id (string), event_type (string), text (string, empty string if not annotated), and multi_reference (bool). Keys not relevant for the specific event type are left empty.
The relevant keys in the event dictionary are listed below. The following keys are relevant for every event type:
- event_id: Identifier of the event, unique to the game but not globally, in chronological order (string)
- event_type: Type of the event, possible values are game result, goal, penalty, or saves (string)
- text: Natural language description of the event, or empty string if not available (string)
- multi_reference: Does this event refer to a text passage describing multiple events? (bool)
The rest of the fields are specific to the event type. The relevant fields for each event type are:
game result:
- event_id: Identifier of the event, unique to the game but not globally, in chronological order (string)
- event_type: Type of the event (string)
- home_team: Name of the home team (string)
- guest_team: Name of the guest team (string)
- score: Final score of the game, in the form of home–guest (string)
- periods: Scores for individual periods, each in the form of home–guest score in that period (list of strings)
- features: Additional features, such as overtime win or shoot out (list of strings)
- text: Natural language description of the event, or empty string if not available (string)
- multi_reference: Does this event refer to a text passage describing multiple events? (bool)

goal:
- event_id: Identifier of the event, unique to the game but not globally, in chronological order (string)
- event_type: Type of the event (string)
- player: Name of the player scoring (string)
- assist: Names of the players assisting, at most two players (list of strings)
- team: Team scoring with possible values of home or guest (string)
- team_name: Name of the team scoring (string)
- score: Score after the goal, in the form of home–guest (string)
- time: Time of the goal, minutes and seconds from the beginning (string)
- features: Additional features, such as power play or short-handed goal (list of strings)
- text: Natural language description of the event, or empty string if not available (string)
- multi_reference: Does this event refer to a text passage describing multiple events? (bool)

penalty:
- event_id: Identifier of the event, unique to the game but not globally, in chronological order (string)
- event_type: Type of the event (string)
- player: Name of the player getting the penalty (string)
- team: Team getting the penalty with possible values of home or guest (string)
- team_name: Name of the team getting the penalty (string)
- penalty_minutes: Penalty minutes (string)
- time: Time of the penalty, minutes and seconds from the beginning (string)
- text: Natural language description of the event, or empty string if not available (string)
- multi_reference: Does this event refer to a text passage describing multiple events? (bool)

saves:
- event_id: Identifier of the event, unique to the game but not globally, in chronological order (string)
- event_type: Type of the event (string)
- player: Name of the goalkeeper (string)
- team: Team of the goalkeeper with possible values of home or guest (string)
- team_name: Name of the team (string)
- saves: Number of saves in the game (string)
- text: Natural language description of the event, or empty string if not available (string)
- multi_reference: Does this event refer to a text passage describing multiple events? (bool)
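The example instance in this card stores events column-wise: one list per key, with one position per event. A minimal sketch of turning that layout into one dictionary per event, keeping only the keys relevant to each event type, might look as follows. The helper name `events_to_records` and the trimmed `sample_events` are illustrative assumptions, not part of the data loader.

```python
# Keys that every event carries, even when their value is empty.
MANDATORY_KEYS = {"event_id", "event_type", "text", "multi_reference"}


def events_to_records(events):
    """Convert column-wise events (a dict of equal-length lists) into one dict per event."""
    records = []
    for i in range(len(events["event_id"])):
        record = {}
        for key, column in events.items():
            value = column[i]
            # Drop type-specific keys that were left empty ('' or []).
            if key in MANDATORY_KEYS or value not in ("", []):
                record[key] = value
        records.append(record)
    return records


# Trimmed copy of the example instance from this card (hypothetical subset of keys).
sample_events = {
    "event_id": ["E1", "E2", "E3"],
    "event_type": ["game result", "penalty", "goal"],
    "text": [
        "HPK kukisti TPS:n vieraissa 1–0 (1–0, 0–0, 0–0).",
        "",
        "HPK hyödynsi ylivoimaa mennen jo ensimmäisessä erässä Mikko Mäenpään maalilla 1–0 -johtoon.",
    ],
    "home_team": ["TPS", "", ""],
    "guest_team": ["HPK", "", ""],
    "score": ["0–1", "", "0–1"],
    "features": [[], [], ["power play"]],
    "player": ["", "Fredrik Svensson", "Mikko Mäenpää"],
    "team": ["", "guest", "guest"],
    "time": ["", "9.28", "14.57"],
    "penalty_minutes": ["", "2", ""],
    "multi_reference": [False, False, False],
}

records = events_to_records(sample_events)

# Only annotated events (non-empty text) can serve as input/output pairs.
annotated = [r for r in records if r["text"]]
```

With this sample, the penalty event E2 keeps its player, team, time, and penalty_minutes fields but is excluded from `annotated`, because its text is empty.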
Text passages describing multiple events (multi_reference):
Some text passages refer to multiple events in such a way that separating them into individual statements is not adequate (e.g. "The home team received two penalties towards the end of the first period."). In these cases, multiple events are aligned to the same text passage: the first event (in chronological order) includes the annotated text passage, while the remaining events referring to the same passage carry the identifier of the first event in their text field (e.g. text: "E4").
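The pointer convention described above can be resolved by grouping events under the passage they share. The following is a sketch under two assumptions: events are available as per-event dictionaries, and a pointer is a text value that consists solely of an event identifier of the form "E" followed by digits (inferred from the "E4" example). The function name `group_by_passage` is hypothetical.

```python
import re

# Assumed pointer format, inferred from the "E4" example in this card.
POINTER = re.compile(r"E\d+")


def group_by_passage(events):
    """Map each annotated passage to all event ids that it describes."""
    groups = {}
    # First pass: events that carry an actual text passage.
    for ev in events:
        text = ev["text"]
        is_pointer = ev["multi_reference"] and POINTER.fullmatch(text) is not None
        if text and not is_pointer:
            groups[ev["event_id"]] = {"text": text, "event_ids": [ev["event_id"]]}
    # Second pass: events whose text field points at an earlier event.
    for ev in events:
        text = ev["text"]
        if ev["multi_reference"] and POINTER.fullmatch(text) and text in groups:
            groups[text]["event_ids"].append(ev["event_id"])
    return groups


events = [
    {"event_id": "E4", "multi_reference": True,
     "text": "The home team received two penalties towards the end of the first period."},
    {"event_id": "E5", "multi_reference": True, "text": "E4"},
    {"event_id": "E6", "multi_reference": False, "text": ""},
]
groups = group_by_passage(events)
```

Here E4 and E5 end up in one group keyed by "E4", while the unannotated E6 is left out. Two passes are used so that the result does not depend on the pointer appearing after its target in the list.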
Example Instance
{
  'gem_id': 'gem-turku_hockey_data2text-train-0',
  'id': '20061031-TPS-HPK',
  'news_article': 'HPK:n hyvä syysvire jatkuu jääkiekon SM-liigassa. Tiistaina HPK kukisti mainiolla liikkeellä ja tehokkaalla ylivoimapelillä TPS:n vieraissa 1–0 (1–0, 0–0, 0–0).\nHPK hyödynsi ylivoimaa mennen jo ensimmäisessä erässä Mikko Mäenpään maalilla 1–0 -johtoon.\nToisessa ja kolmannessa erässä HPK tarjosi edelleen TPS:lle runsaasti tilanteita, mutta maalia eivät turkulaiset millään ilveellä saaneet. Pahin este oli loistavan pelin Hämeenlinnan maalilla pelannut Mika Oksa.\nTPS:n maalissa Jani Hurme ei osumille mitään mahtanut. Joukkueen suuri yksinäinen kenttäpelaaja oli Kai Nurminen, mutta hänelläkään ei ollut onnea maalitilanteissa.',
  'events': {
    'event_id': ['E1', 'E2', 'E3'],
    'event_type': ['game result', 'penalty', 'goal'],
    'text': ['HPK kukisti TPS:n vieraissa 1–0 (1–0, 0–0, 0–0).', '', 'HPK hyödynsi ylivoimaa mennen jo ensimmäisessä erässä Mikko Mäenpään maalilla 1–0 -johtoon.'],
    'home_team': ['TPS', '', ''],
    'guest_team': ['HPK', '', ''],
    'score': ['0–1', '', '0–1'],
    'periods': [['0–1', '0–0', '0–0'], [], []],
    'features': [[], [], ['power play']],
    'player': ['', 'Fredrik Svensson', 'Mikko Mäenpää'],
    'assist': [[], [], ['Jani Keinänen', 'Toni Mäkiaho']],
    'team': ['', 'guest', 'guest'],
    'team_name': ['', 'HPK', 'HPK'],
    'time': ['', '9.28', '14.57'],
    'penalty_minutes': ['', '2', ''],
    'saves': ['', '', ''],
    'multi_reference': [false, false, false]
  }
}
Data Splits
The corpus includes three splits: train, validation, and test.
The dataset was created to develop machine-learned text generation models for Finnish ice hockey news, where the generation would reflect the natural language variation found in game reports written by professional journalists. While the original game reports often include additional information not derivable from the game statistics, the corpus was fully manually curated to remove all such information from the natural language descriptions. The rationale for this curation was to prevent models from 'hallucinating' additional facts.
Similar Datasets: yes
Unique Language Coverage: yes
Difference from other GEM datasets: This is the only data2text corpus for Finnish in GEM.
Ability that the Dataset measures: morphological inflection, language variation
yes
GEM Modifications: data points modified
Modification Details: Structural data was translated into English.
Additional Splits? no
BLEU, METEOR, ROUGE, WER
Proposed Evaluation: Automatic evaluation: BLEU, NIST, METEOR, ROUGE-L, CIDEr. Manual evaluation: factual mistakes, grammatical errors, minimum edit distance to an acceptable game report (using WER).
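As an illustration of the edit-distance measure mentioned in the manual evaluation, the sketch below computes word error rate in its common form: word-level Levenshtein distance divided by the number of reference words. Whitespace tokenization and this particular normalization are assumptions; the card does not specify the exact procedure the authors used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two texts, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
score = word_error_rate("a b c d", "a x c")
```

In a post-editing setup, the reference would be the corrected, acceptable report and the hypothesis the generated one, so the score approximates the amount of editing needed.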
Previous results available? yes
The dataset is designed for text generation (data2text), where the original source of natural language descriptions is news articles written by journalists. While the link between structural data (ice hockey game statistics) and the news articles describing the game was quite weak (news articles including a lot of information not derivable from the statistics, while leaving many events unmentioned), the corpus includes full manual annotation aligning the events extracted from game statistics and the corresponding natural language passages extracted from the news articles.
Each event is manually aligned to a sentence-like passage; if no suitable passage was found, the annotation is left empty (with value None). The extracted passages were manually edited to exclude information that is neither derivable from the game statistics nor considered world knowledge. This manual curation is designed to prevent model hallucination, i.e. a model learning to generate facts not derivable from the input data.
Communicative Goal: Describing the given events (structural data) in natural language, and therefore generating ice hockey game reports.
Sourced from Different Sources: no
Other
Language Producers: The initial data, both game statistics and news articles, were obtained from the Finnish News Agency STT news archives released for academic use (http://urn.fi/urn:nbn:fi:lb-2019041501). The original news articles are written by professional journalists.
We (TurkuNLP) gratefully acknowledge the collaboration of Maija Paikkala, Salla Salmela and Pihla Lehmusjoki from the Finnish News Agency STT while creating the corpus.
Topics Covered: Ice hockey, news
Data Validation: not validated
Was Data Filtered? algorithmically
Filter Criteria: Include only games for which both game statistics and a news article describing the game were available (based on timestamps and team names).
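The filter criterion amounts to an inner join between the two sources. The sketch below shows that idea only: the record layout (`date`, `home_team`, `guest_team`, `text`) and the function name `match_games` are hypothetical, and the actual matching on timestamps and team names in the source archive may well have been less direct.

```python
def match_games(stats, articles):
    """Keep only games that appear in both sources, matched on date and team names."""
    # Index articles by the assumed matching key.
    articles_by_key = {
        (a["date"], a["home_team"], a["guest_team"]): a for a in articles
    }
    matched = []
    for game in stats:
        key = (game["date"], game["home_team"], game["guest_team"])
        article = articles_by_key.get(key)
        if article is not None:
            matched.append({"statistics": game, "news_article": article["text"]})
    return matched


# Hypothetical records for illustration.
stats = [
    {"date": "20061031", "home_team": "TPS", "guest_team": "HPK"},
    {"date": "20061101", "home_team": "HPK", "guest_team": "TPS"},
]
articles = [
    {"date": "20061031", "home_team": "TPS", "guest_team": "HPK",
     "text": "HPK kukisti TPS:n vieraissa 1–0 (1–0, 0–0, 0–0)."},
]
matched = match_games(stats, articles)
```

Only the first game survives here, since the second has statistics but no article.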
expert created
Number of Raters: 1
Rater Qualifications: Members of the TurkuNLP research group, native speakers of Finnish.
Raters per Training Example: 1
Raters per Test Example: 1
Annotation Service? no
Annotation Values: Manual alignment of events and their natural language descriptions. Removing information not derivable from the input data or world knowledge in order to prevent model 'hallucination'.
Any Quality Control? validated by data curators
Quality Control Details: Manual inspection of examples during the initial annotation training phase.
yes
Consent Policy Details: The corpus license was agreed with the providers of the source material.
yes/very likely
Categories of PII: generic PII
Any PII Identification? no identification
no
no
no
no
Are the Language Producers Representative of the Language? The dataset represents only written standard language.
None
non-commercial use only
Copyright Restrictions on the Language Data: non-commercial use only