Dataset:

GEM/RotoWire_English-German

Task:

Table-to-text

Language:

Multilinguality:

unknown

Size:

size_categories:unknown

Language creators:

unknown

Annotation creators:

automatically-created

Source datasets:

original

Other:

data-to-text

License:

cc-by-4.0

Dataset Card for GEM/RotoWire_English-German

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

This dataset is a data-to-text dataset in the basketball domain. The inputs are tables in a fixed format containing statistics about a game (in English), and the target is a German translation of the original English description. The translations were produced by professional translators with basketball experience. The dataset can be used to evaluate the cross-lingual data-to-text capabilities of a model on complex inputs.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/RotoWire_English-German')

The data loader can be found here.

website

Website

paper

ACL Anthology

authors

Graham Neubig (Carnegie Mellon University), Hiroaki Hayashi (Carnegie Mellon University)

Dataset Overview

Where to find the Data and its Documentation

Webpage

Website

Download

Github

Paper

ACL Anthology

BibTex

@inproceedings{hayashi-etal-2019-findings,
    title = "Findings of the Third Workshop on Neural Generation and Translation",
    author = "Hayashi, Hiroaki  and
      Oda, Yusuke  and
      Birch, Alexandra  and
      Konstas, Ioannis  and
      Finch, Andrew  and
      Luong, Minh-Thang  and
      Neubig, Graham  and
      Sudoh, Katsuhito",
    booktitle = "Proceedings of the 3rd Workshop on Neural Generation and Translation",
    month = nov,
    year = "2019",
    address = "Hong Kong",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D19-5601",
    doi = "10.18653/v1/D19-5601",
    pages = "1--14",
    abstract = "This document describes the findings of the Third Workshop on Neural Generation and Translation, held in concert with the annual conference of the Empirical Methods in Natural Language Processing (EMNLP 2019). First, we summarize the research trends of papers presented in the proceedings. Second, we describe the results of the two shared tasks 1) efficient neural machine translation (NMT) where participants were tasked with creating NMT systems that are both accurate and efficient, and 2) document generation and translation (DGT) where participants were tasked with developing systems that generate summaries from structured data, potentially with assistance from text in another language.",
}

Contact Name

Hiroaki Hayashi

Contact Email

hiroakih@andrew.cmu.edu

Has a Leaderboard?

Languages and Intended Use

Multilingual?

yes

Covered Languages

English, German

License

cc-by-4.0: Creative Commons Attribution 4.0 International

Intended Use

Foster research on document-level generation technology and contrast methods for different types of inputs.

Primary Task

Data-to-Text

Communicative Goal

Describe a basketball game given its box score table (and possibly a summary in a foreign language).

Credit

Curation Organization Type(s)

academic

Curation Organization(s)

Carnegie Mellon University

Dataset Creators

Graham Neubig (Carnegie Mellon University), Hiroaki Hayashi (Carnegie Mellon University)

Funding

Graham Neubig

Who added the Dataset to GEM?

Hiroaki Hayashi (Carnegie Mellon University)

Dataset Structure

Data Fields

id ( string ): The identifier from the original dataset.
gem_id ( string ): The identifier from GEMv2.
day ( string ): Date of the game (Format: MM_DD_YY )
home_name ( string ): Home team name.
home_city ( string ): Home team city name.
vis_name ( string ): Visiting (Away) team name.
vis_city ( string ): Visiting (Away) team city name.
home_line ( Dict[str, str] ): Home team statistics (e.g., team free throw percentage).
vis_line ( Dict[str, str] ): Visiting team statistics (e.g., team free throw percentage).
box_score ( Dict[str, Dict[str, str]] ): Box score table. (Stat_name to [player ID to stat_value].)
summary_en ( List[string] ): Tokenized target summary in English.
sentence_end_index_en ( List[int] ): Sentence end indices for summary_en.
summary_de ( List[string] ): Tokenized target summary in German.
sentence_end_index_de ( List[int] ): Sentence end indices for summary_de.
(Unused) detok_summary_org ( string ): Original summary provided by RotoWire dataset.
(Unused) summary ( List[string] ): Tokenized summary of detok_summary_org .
(Unused) detok_summary ( string ): Detokenized (with organizer's detokenizer) summary of summary .
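
The box_score field is keyed by statistic name first and player ID second, which is awkward when you want one record per player. The following is a minimal sketch of pivoting it into per-player rows. The helper name box_score_to_rows and the toy values are illustrative, and the numeric sort assumes that player IDs are digit strings such as '0', as in the example instance.

```python
from collections import defaultdict
from typing import Dict, List


def box_score_to_rows(box_score: Dict[str, Dict[str, str]]) -> List[Dict[str, str]]:
    """Pivot {stat_name: {player_id: value}} into one dict per player."""
    per_player: Dict[str, Dict[str, str]] = defaultdict(dict)
    for stat_name, values_by_player in box_score.items():
        for player_id, value in values_by_player.items():
            per_player[player_id][stat_name] = value
    # Assumption: player IDs are digit strings ('0', '1', ...), so sort numerically.
    return [per_player[pid] for pid in sorted(per_player, key=int)]


# Toy input in the documented layout; names and values are made up.
toy_box_score = {
    "PLAYER_NAME": {"0": "Player A", "1": "Player B"},
    "PTS": {"0": "12", "1": "7"},
}
rows = box_score_to_rows(toy_box_score)
print(rows[0])
```

All values stay strings, as in the dataset; convert them to numbers only where a statistic is known to be numeric.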

Reason for Structure

Structured data are imported directly from the original RotoWire dataset.
Textual data (English, German) are associated with each sample.

Example Instance

{
  'id': '11_02_16-Jazz-Mavericks-TheUtahJazzdefeatedthe',
  'gem_id': 'GEM-RotoWire_English-German-train-0',
  'day': '11_02_16',
  'home_city': 'Utah',
  'home_name': 'Jazz',
  'vis_city': 'Dallas',
  'vis_name': 'Mavericks',
  'home_line': {
    'TEAM-FT_PCT': '58', ...
  },
  'vis_line': {
    'TEAM-FT_PCT': '80', ...
  },
  'box_score': {
    'PLAYER_NAME': {
      '0': 'Harrison Barnes', ...
    }, ...
  },
  'summary_en': ['The', 'Utah', 'Jazz', 'defeated', 'the', 'Dallas', 'Mavericks', ...],
  'sentence_end_index_en': [16, 52, 100, 137, 177, 215, 241, 256, 288],
  'summary_de': ['Die', 'Utah', 'Jazz', 'besiegten', 'am', 'Mittwoch', 'in', 'der', ...],
  'sentence_end_index_de': [19, 57, 107, 134, 170, 203, 229, 239, 266],
  'detok_summary_org': "The Utah Jazz defeated the Dallas Mavericks 97 - 81 ...",
  'detok_summary': "The Utah Jazz defeated the Dallas Mavericks 97-81 ...",
  'summary': ['The', 'Utah', 'Jazz', 'defeated', 'the', 'Dallas', 'Mavericks', ...],
}
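
The day field is a string in MM_DD_YY format, so it usually needs to be parsed before it can be sorted or compared. A small sketch using only the standard library follows; the function name parse_game_day is illustrative. Python's %y directive maps two-digit years 00 to 68 onto 2000 to 2068, which is assumed to be appropriate for this data.

```python
from datetime import date, datetime


def parse_game_day(day: str) -> date:
    """Parse a `day` field in MM_DD_YY format into a date object."""
    return datetime.strptime(day, "%m_%d_%y").date()


# The `day` value from the example instance above.
game_date = parse_game_day("11_02_16")
print(game_date.isoformat())  # 2016-11-02
```

For the example instance this yields a Wednesday, which is consistent with "am Mittwoch" in the German summary.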

Data Splits

Train
Validation
Test

Splitting Criteria

English summaries were provided sentence by sentence to professional German translators with basketball knowledge to obtain sentence-level German translations.
The split criteria follow the original RotoWire dataset.

The (English) summary length in the training set varies from 145 to 650 words, with an average of 323 words.
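
Length statistics like these can be recomputed from the tokenized summaries. The sketch below counts tokens in summary_en, which is only an approximation of a word count because punctuation marks are separate tokens. The function name and the toy examples are illustrative, not part of the dataset tooling.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summary_length_stats(
    examples: List[Dict[str, List[str]]], field: str = "summary_en"
) -> Tuple[int, int, float]:
    """Return (min, max, mean) token counts of a tokenized summary field."""
    lengths = [len(example[field]) for example in examples]
    return min(lengths), max(lengths), mean(lengths)


# Toy examples standing in for the training split.
toy_examples = [
    {"summary_en": ["The", "Jazz", "won"]},
    {"summary_en": ["The", "Mavericks", "lost", "on", "Wednesday"]},
]
shortest, longest, average = summary_length_stats(toy_examples)
print(shortest, longest, average)
```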

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

The use of two modalities (data, foreign text) to generate a document-level text summary.

Similar Datasets

yes

Unique Language Coverage

yes

Difference from other GEM datasets

The potential use of two modalities (data, foreign text) as input.

Ability that the Dataset measures

Translation
Data-to-text verbalization
Aggregation of the two above.

GEM-Specific Curation

Modified for GEM?

yes

GEM Modifications

other

Modification Details

Added a GEM ID to each sample.
Normalized the number of players in each sample by padding with "N/A" for consistent data loading.
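
Because of this normalization, a model input may contain placeholder players. One way to skip them is sketched below, under the assumption that a padded slot carries "N/A" as its PLAYER_NAME value; the card does not spell out which fields are padded, so check this against the loaded data. The helper name real_player_ids is illustrative.

```python
from typing import Dict, List


def real_player_ids(
    box_score: Dict[str, Dict[str, str]],
    name_field: str = "PLAYER_NAME",
    pad_value: str = "N/A",
) -> List[str]:
    """Return the IDs of players whose name is not the padding value."""
    # Assumption: padded slots have pad_value in the name field.
    return [pid for pid, name in box_score[name_field].items() if name != pad_value]


# Toy box score with one padded slot; names are made up.
padded_box_score = {
    "PLAYER_NAME": {"0": "Player A", "1": "Player B", "2": "N/A"},
    "PTS": {"0": "12", "1": "7", "2": "N/A"},
}
kept_ids = real_player_ids(padded_box_score)
print(kept_ids)
```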

Additional Splits?

Getting Started with the Task

Pointers to Resources

Technical Terms

Data-to-text
Neural machine translation (NMT)
Document-level generation and translation (DGT)

Previous Results

Measured Model Abilities

Textual accuracy towards the gold-standard summary.
Content faithfulness to the input structured data.

Metrics

BLEU, ROUGE, Other: Other Metrics

Other Metrics

Model-based measures proposed by Wiseman et al. (2017):

Relation Generation
Content Selection
Content Ordering
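
As an illustration of how one of these measures is computed, Content Selection compares the unique relations extracted from a generated summary with those extracted from the gold summary and reports precision and recall. The sketch below assumes the relations have already been extracted (in the original work this is done by an information-extraction model, which is not reproduced here). The tuple layout and the toy relations are illustrative.

```python
from typing import Iterable, Tuple

# Assumed relation layout: (entity, value, record type).
Relation = Tuple[str, str, str]


def content_selection(
    generated: Iterable[Relation], gold: Iterable[Relation]
) -> Tuple[float, float]:
    """Precision and recall of unique generated relations against gold ones."""
    generated_set, gold_set = set(generated), set(gold)
    overlap = len(generated_set & gold_set)
    precision = overlap / len(generated_set) if generated_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy relations; in practice these come from an extraction model.
generated_relations = [
    ("Player A", "12", "PTS"),
    ("Player B", "7", "PTS"),
    ("Player A", "3", "AST"),
]
gold_relations = [("Player A", "12", "PTS"), ("Player B", "9", "REB")]
cs_precision, cs_recall = content_selection(generated_relations, gold_relations)
print(cs_precision, cs_recall)
```

Scores computed this way are only as reliable as the extraction step that produces the relations.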

Proposed Evaluation

To evaluate the fidelity of the generated content to the input data.

Previous results available?

yes

Other Evaluation Approaches

N/A.

Relevant Previous Results

See Tables 2 to 7 of https://aclanthology.org/D19-5601 for previous results on this dataset.

Dataset Curation

Original Curation

Original Curation Rationale

A random subset of the RotoWire dataset was chosen for German translation annotation.

Communicative Goal

Foster research on document-level generation technology and contrast methods for different types of inputs.

Sourced from Different Sources

yes

Source Details

RotoWire

Language Data

How was Language Data Obtained?

Created for the dataset

Creation Process

Professional German language translators were hired to translate basketball summaries from a subset of the RotoWire dataset.

Language Producers

Translators are familiar with basketball terminology.

Topics Covered

Basketball (NBA) game summaries.

Data Validation

validated by data curator

Data Preprocessing

Sentence-level translations were aligned back to the original English summary sentences.

Was Data Filtered?

not filtered

Structured Annotations

Additional Annotations?

automatically created

Annotation Service?

Annotation Values

Sentence-end indices for the tokenized summaries. Sentence boundaries help users accurately identify aligned sentences in both languages and allow accurate evaluation with metrics that involve sentence boundaries (ROUGE-L).
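
A minimal sketch of using these indices to recover aligned sentence pairs follows. It rests on two assumptions that the card does not state explicitly: each index is an exclusive end offset into the token list (Python slice style), and the i-th English sentence corresponds to the i-th German sentence. The function names and toy tokens are illustrative.

```python
from typing import Dict, List, Tuple


def split_sentences(tokens: List[str], end_indices: List[int]) -> List[List[str]]:
    """Cut a token list into sentences at the given end offsets."""
    sentences, start = [], 0
    for end in end_indices:
        # Assumption: `end` is exclusive, so the sentence is tokens[start:end].
        sentences.append(tokens[start:end])
        start = end
    return sentences


def aligned_pairs(example: Dict[str, list]) -> List[Tuple[List[str], List[str]]]:
    """Pair English and German sentences by position."""
    english = split_sentences(example["summary_en"], example["sentence_end_index_en"])
    german = split_sentences(example["summary_de"], example["sentence_end_index_de"])
    return list(zip(english, german))


# Toy example using the documented field names.
toy_example = {
    "summary_en": ["The", "Jazz", "won", ".", "They", "played", "well", "."],
    "sentence_end_index_en": [4, 8],
    "summary_de": ["Die", "Jazz", "gewannen", ".", "Sie", "spielten", "gut", "."],
    "sentence_end_index_de": [4, 8],
}
pairs = aligned_pairs(toy_example)
print(pairs[0])
```

If the indices turn out to be inclusive, the slice bounds need to be shifted by one.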

Any Quality Control?

validated through automated script

Quality Control Details

Token and number overlaps between pairs of aligned sentences are measured.
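
The card does not describe the validation script itself. As one possible reading, the sketch below measures overlap as the Jaccard similarity between the token sets, and between the numeric tokens, of an aligned sentence pair. The Jaccard choice, the number pattern, and all names are assumptions made for illustration.

```python
import re
from typing import List, Set

# Assumed pattern for numeric tokens such as '97' or '45.5'.
NUMBER = re.compile(r"^\d+(?:[.,]\d+)?$")


def jaccard(left: Set[str], right: Set[str]) -> float:
    """Jaccard similarity; two empty sets count as identical."""
    union = left | right
    return len(left & right) / len(union) if union else 1.0


def token_overlap(en_tokens: List[str], de_tokens: List[str]) -> float:
    """Overlap of lowercased tokens (names and numbers are shared across languages)."""
    return jaccard({t.lower() for t in en_tokens}, {t.lower() for t in de_tokens})


def number_overlap(en_tokens: List[str], de_tokens: List[str]) -> float:
    """Overlap restricted to numeric tokens."""
    en_numbers = {t for t in en_tokens if NUMBER.match(t)}
    de_numbers = {t for t in de_tokens if NUMBER.match(t)}
    return jaccard(en_numbers, de_numbers)


en_sentence = ["The", "Jazz", "won", "97", "-", "81", "."]
de_sentence = ["Die", "Jazz", "gewannen", "97", "-", "81", "."]
print(token_overlap(en_sentence, de_sentence), number_overlap(en_sentence, de_sentence))
```

A low number overlap on an aligned pair would flag a likely misalignment. Note that this string comparison treats a decimal point and a decimal comma as different numbers.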

Consent

Any Consent Policy?

Justification for Using the Data

Reusing by citing the original papers:

Sam Wiseman, Stuart M. Shieber, Alexander M. Rush: Challenges in Data-to-Document Generation. EMNLP 2017.
Hiroaki Hayashi, Yusuke Oda, Alexandra Birch, Ioannis Konstas, Andrew Finch, Minh-Thang Luong, Graham Neubig, Katsuhito Sudoh. Findings of the Third Workshop on Neural Generation and Translation. WNGT 2019.

Private Identifying Information (PII)

Contains PII?

unlikely

Categories of PII

generic PII

Any PII Identification?

no identification

Maintenance

Any Maintenance Plan?

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Discussion of Biases

Any Documented Social Biases?

Are the Language Producers Representative of the Language?

English text in this dataset is from RotoWire, originally written by writers at Rotowire.com who are likely US-based.
German text was produced by professional translators proficient in both English and German.

Considerations for Using the Data

PII Risks and Liability

Potential PII Risk

Structured data contain real National Basketball Association player and organization names.

Licenses

open license - commercial use allowed

Known Technical Limitations

Technical Limitations

Potential overlap of box score tables between splits. This was extensively studied and pointed out by [1].

[1]: Thomson, Craig, Ehud Reiter, and Somayajulu Sripada. "SportSett: Basketball-A robust and maintainable data-set for Natural Language Generation." Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation. 2020.
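
A rough way to check for such overlap in your own copy of the data is to fingerprint the structured fields of each example and look for identical fingerprints across splits. The sketch below detects exact duplicates only and is not the analysis performed in [1]. The function names and toy records are illustrative.

```python
import hashlib
import json
from typing import Dict, Iterable, List

TABLE_FIELDS = ("home_line", "vis_line", "box_score")


def table_fingerprint(example: Dict) -> str:
    """Hash the structured fields with sorted keys so key order does not matter."""
    payload = json.dumps(
        {field: example[field] for field in TABLE_FIELDS},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def overlapping_ids(split_a: Iterable[Dict], split_b: Iterable[Dict]) -> List[str]:
    """Return IDs in split_b whose tables also appear in split_a."""
    seen = {table_fingerprint(example) for example in split_a}
    return [ex["id"] for ex in split_b if table_fingerprint(ex) in seen]


# Toy splits; the second test record reuses the training table.
table_x = {"home_line": {"TEAM-FT_PCT": "58"}, "vis_line": {"TEAM-FT_PCT": "80"},
           "box_score": {"PLAYER_NAME": {"0": "Player A"}}}
table_y = {"home_line": {"TEAM-FT_PCT": "70"}, "vis_line": {"TEAM-FT_PCT": "65"},
           "box_score": {"PLAYER_NAME": {"0": "Player B"}}}
toy_train = [{"id": "train-0", **table_x}]
toy_test = [{"id": "test-0", **table_y}, {"id": "test-1", **table_x}]
duplicates = overlapping_ids(toy_train, toy_test)
print(duplicates)
```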

Unsuited Applications

Users may interact with a trained model to learn about an NBA game in a textual manner. In the generated texts, they may observe factual errors that contradict the actual data that the model conditions on. Factual errors include wrong statistics for a player (e.g., 3PT) and non-existent injury information.

Discouraged Use Cases

Publishing the generated text as is. Even if the model achieves high scores on the evaluation metrics, there remains a risk of the factual errors mentioned above.

Author:

GEM

Dataset size:

14.31 MB