Dataset: GEM/RotoWire_English-German
Tasks: table-to-text
Multilinguality: unknown
Language creators: unknown
Annotations creators: automatically-created
Source datasets: original
Other: data-to-text
License: cc-by-4.0

You can find the main data card on the GEM Website.
This is a data-to-text dataset in the basketball domain. The inputs are tables in a fixed format with statistics about a game (in English), and the target is a German translation of the original English game description. The translations were produced by professional translators with basketball experience. The dataset can be used to evaluate the cross-lingual data-to-text capabilities of a model with complex inputs.
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/RotoWire_English-German')
The data loader can be found here.
Authors: Graham Neubig (Carnegie Mellon University), Hiroaki Hayashi (Carnegie Mellon University)
@inproceedings{hayashi-etal-2019-findings,
  title = "Findings of the Third Workshop on Neural Generation and Translation",
  author = "Hayashi, Hiroaki and Oda, Yusuke and Birch, Alexandra and Konstas, Ioannis and Finch, Andrew and Luong, Minh-Thang and Neubig, Graham and Sudoh, Katsuhito",
  booktitle = "Proceedings of the 3rd Workshop on Neural Generation and Translation",
  month = nov,
  year = "2019",
  address = "Hong Kong",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/D19-5601",
  doi = "10.18653/v1/D19-5601",
  pages = "1--14",
  abstract = "This document describes the findings of the Third Workshop on Neural Generation and Translation, held in concert with the annual conference of the Empirical Methods in Natural Language Processing (EMNLP 2019). First, we summarize the research trends of papers presented in the proceedings. Second, we describe the results of the two shared tasks 1) efficient neural machine translation (NMT) where participants were tasked with creating NMT systems that are both accurate and efficient, and 2) document generation and translation (DGT) where participants were tasked with developing systems that generate summaries from structured data, potentially with assistance from text in another language.",
}
Contact Name: Hiroaki Hayashi
Contact Email: hiroakih@andrew.cmu.edu
Has a Leaderboard? no
yes
Covered Languages: English, German
License: cc-by-4.0 (Creative Commons Attribution 4.0 International)
Intended Use: Foster research on document-level generation technology and contrast methods for different types of inputs.
Primary Task: Data-to-Text
Communicative Goal: Describe a basketball game given its box score table (and possibly a summary in a foreign language).
academic
Curation Organization(s): Carnegie Mellon University
Dataset Creators: Graham Neubig (Carnegie Mellon University), Hiroaki Hayashi (Carnegie Mellon University)
Funding: Graham Neubig
Who added the Dataset to GEM? Hiroaki Hayashi (Carnegie Mellon University)
{
  'id': '11_02_16-Jazz-Mavericks-TheUtahJazzdefeatedthe',
  'gem_id': 'GEM-RotoWire_English-German-train-0',
  'day': '11_02_16',
  'home_city': 'Utah',
  'home_name': 'Jazz',
  'vis_city': 'Dallas',
  'vis_name': 'Mavericks',
  'home_line': {
    'TEAM-FT_PCT': '58',
    ...
  },
  'vis_line': {
    'TEAM-FT_PCT': '80',
    ...
  },
  'box_score': {
    'PLAYER_NAME': {
      '0': 'Harrison Barnes',
      ...
    },
    ...
  },
  'summary_en': ['The', 'Utah', 'Jazz', 'defeated', 'the', 'Dallas', 'Mavericks', ...],
  'sentence_end_index_en': [16, 52, 100, 137, 177, 215, 241, 256, 288],
  'summary_de': ['Die', 'Utah', 'Jazz', 'besiegten', 'am', 'Mittwoch', 'in', 'der', ...],
  'sentence_end_index_de': [19, 57, 107, 134, 170, 203, 229, 239, 266],
  'detok_summary_org': "The Utah Jazz defeated the Dallas Mavericks 97 - 81 ...",
  'detok_summary': "The Utah Jazz defeated the Dallas Mavericks 97-81 ...",
  'summary': ['The', 'Utah', 'Jazz', 'defeated', 'the', 'Dallas', 'Mavericks', ...],
}
Data Splits
The use of two modalities (data, foreign text) to generate a document-level text summary.
Similar Datasets: yes
Unique Language Coverage: yes
Difference from other GEM datasets: The potential use of two modalities (data, foreign text) as input.
Ability that the Dataset measures: yes
GEM Modifications: other
Modification Details: no
BLEU, ROUGE, Other: Other Metrics
Other Metrics: Model-based measures proposed by Wiseman et al. (2017):
To evaluate the fidelity of the generated content to the input data.
Previous results available? yes
Other Evaluation Approaches: N/A.
Relevant Previous Results: See Tables 2 to 7 of https://aclanthology.org/D19-5601 for previous results on this dataset.
A random subset of the RotoWire dataset was chosen for German translation annotation.
Communicative Goal: Foster research on document-level generation technology and contrast methods for different types of inputs.
Sourced from Different Sources: yes
Source Details: RotoWire
Created for the dataset
Creation Process: Professional German-language translators were hired to translate basketball summaries from a subset of the RotoWire dataset.
Language Producers: Translators are familiar with basketball terminology.
Topics Covered: Basketball (NBA) game summaries.
Data Validation: validated by data curator
Data Preprocessing: Sentence-level translations were aligned back to the original English summary sentences.
Was Data Filtered? not filtered
automatically created
Annotation Service? no
Annotation Values: Sentence-end indices for the tokenized summaries. Sentence boundaries help users accurately identify aligned sentences in both languages and allow accurate evaluation with metrics that depend on sentence boundaries (ROUGE-L).
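As a rough illustration of how the sentence-end indices might be used, the sketch below cuts the tokenized summaries into sentences and pairs them across languages. The field names follow the example instance; the helper names are hypothetical. It assumes that each index is an exclusive end offset (Python slice style) and that both languages have the same number of sentences, neither of which is confirmed by the card, so check both against real data before relying on this.

```python
from typing import Dict, List, Tuple


def split_sentences(tokens: List[str], end_indices: List[int]) -> List[List[str]]:
    """Cut a token list into sentences.

    Assumption: each end index is an exclusive offset, so sentence i
    spans tokens[previous_end:end_indices[i]].
    """
    sentences, start = [], 0
    for end in end_indices:
        sentences.append(tokens[start:end])
        start = end
    return sentences


def aligned_pairs(example: Dict) -> List[Tuple[str, str]]:
    """Pair the i-th English sentence with the i-th German sentence."""
    en = split_sentences(example["summary_en"], example["sentence_end_index_en"])
    de = split_sentences(example["summary_de"], example["sentence_end_index_de"])
    if len(en) != len(de):
        raise ValueError("sentence counts differ; one-to-one alignment does not hold")
    return [(" ".join(e), " ".join(d)) for e, d in zip(en, de)]


# Toy record invented for illustration; it is not taken from the dataset.
toy = {
    "summary_en": ["The", "Jazz", "won", ".", "Hayward", "scored", "20", "."],
    "sentence_end_index_en": [4, 8],
    "summary_de": ["Die", "Jazz", "gewannen", ".",
                   "Hayward", "erzielte", "20", "Punkte", "."],
    "sentence_end_index_de": [4, 9],
}
pairs = aligned_pairs(toy)
for en_sentence, de_sentence in pairs:
    print(en_sentence, "|", de_sentence)
```

If the indices turn out to be inclusive, the slice would need `end + 1` instead of `end`.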
Any Quality Control? validated through automated script
Quality Control Details: Token and number overlaps between pairs of aligned sentences are measured.
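The card does not include the validation script itself. The following is only a minimal sketch of what measuring token and number overlap between an aligned sentence pair could look like; the function names, the number pattern, and the choice of Jaccard overlap are all assumptions rather than the curators' method.

```python
import re
from typing import List

# Assumed number format: digits with an optional decimal part ("20", "58.3", "58,3").
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def number_overlap(src_tokens: List[str], tgt_tokens: List[str]) -> float:
    """Fraction of distinct source numbers that also appear in the target."""
    src = {t for t in src_tokens if NUMBER.fullmatch(t)}
    tgt = {t for t in tgt_tokens if NUMBER.fullmatch(t)}
    if not src:
        return 1.0  # nothing to check
    return len(src & tgt) / len(src)


def token_overlap(src_tokens: List[str], tgt_tokens: List[str]) -> float:
    """Jaccard overlap of case-folded tokens.

    Names, numbers, and punctuation usually survive translation unchanged,
    so a very low value can hint at a misaligned pair.
    """
    src = {t.casefold() for t in src_tokens}
    tgt = {t.casefold() for t in tgt_tokens}
    union = src | tgt
    return len(src & tgt) / len(union) if union else 1.0


# Toy sentences invented for illustration.
en = ["Hayward", "scored", "20", "points", "on", "7", "shots", "."]
de = ["Hayward", "erzielte", "20", "Punkte", "bei", "7", "Würfen", "."]
de_bad = ["Hayward", "erzielte", "21", "Punkte", "."]

good_numbers = number_overlap(en, de)
bad_numbers = number_overlap(en, de_bad)
shared_tokens = token_overlap(en, de)
```

A check like this misses numbers that a translator spells out as words, so a low score is a prompt for manual review, not proof of an error.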
no
Justification for Using the Data: Reusing by citing the original papers:
unlikely
Categories of PII: generic PII
Any PII Identification? no identification
no
no
no
no
Are the Language Producers Representative of the Language? open license - commercial use allowed
Copyright Restrictions on the Language Data: open license - commercial use allowed
Potential overlap of box score tables between splits. This was extensively studied and pointed out by [1].
[1]: Thomson, Craig, Ehud Reiter, and Somayajulu Sripada. "SportSett: Basketball-A robust and maintainable data-set for Natural Language Generation." Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation. 2020.
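One simple way to look for this kind of overlap is to fingerprint the structured input of every example and compare fingerprints across splits. The sketch below is an assumption-laden illustration, not the analysis from [1]: the helper names are hypothetical, the set of hashed fields is a guess based on the example instance, and only exact matches are detected.

```python
import hashlib
import json
from typing import Dict, Iterable, List

# Fields treated as the structured input (assumed from the example instance).
TABLE_KEYS = ("day", "home_city", "home_name", "vis_city", "vis_name",
              "home_line", "vis_line", "box_score")


def table_fingerprint(example: Dict) -> str:
    """Hash the table fields only, ignoring ids and summaries."""
    payload = {key: example.get(key) for key in TABLE_KEYS}
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def overlapping_ids(reference: Iterable[Dict], other: Iterable[Dict]) -> List[str]:
    """Return gem_ids in `other` whose table also occurs in `reference`."""
    seen = {table_fingerprint(ex) for ex in reference}
    return [ex["gem_id"] for ex in other if table_fingerprint(ex) in seen]


def toy_example(gem_id: str, day: str) -> Dict:
    """Build a tiny record, invented for illustration only."""
    return {
        "gem_id": gem_id,
        "day": day,
        "home_city": "Utah", "home_name": "Jazz",
        "vis_city": "Dallas", "vis_name": "Mavericks",
        "home_line": {"TEAM-FT_PCT": "58"},
        "vis_line": {"TEAM-FT_PCT": "80"},
        "box_score": {"PLAYER_NAME": {"0": "Harrison Barnes"}},
    }


train_split = [toy_example("train-0", "11_02_16"), toy_example("train-1", "11_03_16")]
test_split = [toy_example("test-0", "11_02_16"), toy_example("test-1", "11_05_16")]
leaked = overlapping_ids(train_split, test_split)
print(leaked)
```

With real data, the two lists would be replaced by the corresponding splits of the loaded dataset. Near-duplicate tables that differ in a single cell would need a fuzzier comparison than a hash.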
Unsuited Applications: Users may interact with a trained model to learn about an NBA game in textual form. In the generated texts, they may observe factual errors that contradict the data the model conditions on. Such errors include wrong player statistics (e.g., 3PT) and non-existent injury information.
Discouraged Use Cases: Publishing the generated text as is. Even if the model achieves high scores on the evaluation metrics, there is a risk of the factual errors mentioned above.