Dataset: GEM/e2e_nlg
Language: en
Machine Processing: unknown
Language Creators: unknown
Annotation Creators: none
Source Datasets: original
Other: data-to-text
License: cc-by-sa-4.0
Task: Table-to-Text

You can find the main data card on the GEM website.
The E2E NLG dataset is an English benchmark dataset for data-to-text models that verbalize a set of 2-9 key-value attribute pairs in the restaurant domain. The version used for GEM is the cleaned E2E NLG dataset, which filters examples with hallucinations and outputs that don't fully cover all input attributes.
You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/e2e_nlg')

The data loader can be found here.
Website and Papers: First data release, Detailed E2E Challenge writeup, Cleaned E2E version
Authors: Jekaterina Novikova, Ondrej Dusek and Verena Rieser
BibTeX:
@inproceedings{e2e_cleaned,
  address = {Tokyo, Japan},
  title = {Semantic {Noise} {Matters} for {Neural} {Natural} {Language} {Generation}},
  url = {https://www.aclweb.org/anthology/W19-8652/},
  booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
  author = {Dušek, Ondřej and Howcroft, David M and Rieser, Verena},
  year = {2019},
  pages = {421--426},
}
Contact Name: Ondrej Dusek
Contact Email: odusek@ufal.mff.cuni.cz
Has a Leaderboard? no
Covered Dialects: Dialect-specific data was not collected; the language is general British English.
Covered Languages: English
Whose Language? The original dataset was collected using the CrowdFlower (now Appen) platform with native English speakers (self-reported). No demographic information was provided, but the collection was geographically limited to English-speaking countries.
License: cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International
Intended Use: The dataset was collected to test neural models on a very well-specified realization task.
Primary Task: Data-to-Text
Communicative Goal: Producing a text informing about/recommending a restaurant, given all and only the attributes specified in the input.
Organization Type: academic
Curation Organization(s): Heriot-Watt University
Dataset Creators: Jekaterina Novikova, Ondrej Dusek and Verena Rieser
Funding: This research received funding from the EPSRC projects DILiGENt (EP/M005429/1) and MaDrIgAL (EP/N017536/1).
Who added the Dataset to GEM? Simon Mille wrote the initial data card and Yacine Jernite the data loader. Sebastian Gehrmann migrated the data card to the v2 format and moved the data loader to the Hub.
The data is in a CSV format, with the following fields:
There are additional fields (fixed, orig_mr) indicating whether the data was modified in the cleaning process and what the original MR was before cleaning, but these aren't used for NLG.
The MR has a flat structure: attribute-value pairs are comma-separated, with values enclosed in brackets (see the example instance below). There are 8 attributes:
The same MR is often repeated multiple times with different synonymous references.
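To make the format concrete, a flat MR can be parsed with a single regular expression. This is an illustrative sketch, not part of the dataset tooling; the function name parse_mr is our own:

```python
import re

def parse_mr(mr: str) -> dict:
    """Parse a flat E2E MR such as 'name[Alimentum], area[riverside]'
    into an attribute -> value dictionary."""
    # Each pair looks like 'attr[value]'; pairs are comma-separated,
    # so we capture everything up to the opening bracket as the attribute.
    return {k.strip(): v for k, v in re.findall(r"([^,\[]+)\[([^\]]*)\]", mr)}

example = "name[Alimentum], area[riverside], familyFriendly[yes], near[Burger King]"
print(parse_mr(example))
# {'name': 'Alimentum', 'area': 'riverside', 'familyFriendly': 'yes', 'near': 'Burger King'}
```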
How were labels chosen? The source MRs were generated automatically at random from a set of valid attribute values. The labels were crowdsourced and are natural language.
Example Instance:
{
  "input": "name[Alimentum], area[riverside], familyFriendly[yes], near[Burger King]",
  "target": "Alimentum is a kids friendly place in the riverside area near Burger King."
}

Data Splits
| Split | MRs | Distinct MRs | References |
|---|---|---|---|
| Training | 12,568 | 8,362 | 33,525 |
| Development | 1,484 | 1,132 | 4,299 |
| Test | 1,847 | 1,358 | 4,693 |
| Total | 15,899 | 10,852 | 42,517 |
“Distinct MRs” are MRs that remain distinct even if restaurant/place names (attributes name and near) are delexicalized, i.e., replaced with a placeholder.
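The delexicalization used for this statistic can be sketched as follows. This is a minimal illustration under our own assumptions, not the original counting script, and the helper name delexicalize is ours:

```python
import re

def delexicalize(mr: str, text: str):
    """Replace the values of the name/near attributes with placeholders
    in both the MR and the reference text (sketch only)."""
    attrs = {k.strip(): v for k, v in re.findall(r"([^,\[]+)\[([^\]]*)\]", mr)}
    for slot in ("name", "near"):
        if slot in attrs:
            mr = mr.replace(f"{slot}[{attrs[slot]}]", f"{slot}[X-{slot}]")
            text = text.replace(attrs[slot], f"X-{slot}")
    return mr, text

mr = "name[Alimentum], area[riverside], near[Burger King]"
ref = "Alimentum is in the riverside area near Burger King."
print(delexicalize(mr, ref))
# ('name[X-name], area[riverside], near[X-near]', 'X-name is in the riverside area near X-near.')
```

Counting unique delexicalized MRs then yields the “Distinct MRs” column above.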
Splitting Criteria: The data are divided so that MRs in different splits do not overlap.
The E2E dataset is one of the largest limited-domain NLG datasets and is frequently used as a data-to-text generation benchmark. The E2E Challenge included 20 systems of very different architectures, with system outputs available for download.
Similar Datasets: yes
Unique Language Coverage: no
Difference from other GEM datasets: The dataset is much cleaner than comparable datasets, and it is also a relatively easy task, making for a straightforward evaluation.
Ability that the Dataset measures: Surface realization.
yes
Additional Splits? yes
Split Information: 4 special test sets for E2E were added to the GEM evaluation suite.
| Input length | Frequency (English) |
|---|---|
| 2 | 5 |
| 3 | 120 |
| 4 | 389 |
| 5 | 737 |
| 6 | 1187 |
| 7 | 1406 |
| 8 | 774 |
| 9 | 73 |
| 10 | 2 |
Generalization and robustness
Surface realization.
Metrics: BLEU, METEOR, ROUGE
Proposed Evaluation: The official evaluation script combines the MT-Eval and COCO Captioning libraries to compute the metrics listed above.
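For orientation, the n-gram-overlap idea behind BLEU can be reproduced in a few lines of standard-library Python. This is an illustrative sketch only; for numbers comparable to published results, the official e2e-metrics script should be used:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, references, max_n=4):
    """Sentence-level BLEU with multiple references (illustrative only)."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hyp, n)
        # Clip each hypothesis n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for r in refs:
            for g, c in ngram_counts(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        overlap = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty against the closest reference length.
    closest = min(refs, key=lambda r: abs(len(r) - len(hyp)))
    bp = min(1.0, math.exp(1 - len(closest) / len(hyp)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A hypothesis identical to a reference scores 1.0; one sharing no words scores 0.0.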
yes
Other Evaluation Approaches: Most previous results, including the shared task results, used the library provided by the dataset creators. The shared task also conducted a human evaluation using the following two criteria:
The shared task writeup has in-depth evaluations of systems (https://www.sciencedirect.com/science/article/pii/S0885230819300919).
The dataset was collected to showcase/test neural NLG models. It is larger and contains more lexical richness and syntactic variation than previous closed-domain NLG datasets.
Communicative Goal: Producing a text informing about/recommending a restaurant, given all and only the attributes specified in the input.
Sourced from Different Sources: no
Crowdsourced
Where was it crowdsourced? Other crowdworker platform
Language Producers: Human references describing the MRs were collected by crowdsourcing on the CrowdFlower (now Appen) platform, with either textual or pictorial MRs as input. The pictorial MRs, used in 20% of cases, yield higher lexical variation but introduce noise.
Topics Covered: The dataset is focused on descriptions of restaurants.
Data Validation: validated by data curator
Data Preprocessing: There were basic checks (length, valid characters, repetition).
Was Data Filtered? algorithmically
Filter Criteria: The cleaned version of the dataset used in GEM was filtered algorithmically. The creators used regular expressions to match each human-generated reference to a more accurate input when attributes were hallucinated or dropped. Additionally, train-test overlap stemming from this transformation was removed. As a result, the data is much cleaner than the original dataset but not perfect (about 20% of instances may have misaligned slots, compared to 40% in the original data).
none
Annotation Service? no
yes
Consent Policy Details: Since a crowdsourcing platform was used, the involved raters waived their rights to the data and are aware that the produced annotations can be publicly released.
no PII
Justification for no PII: The dataset is artificial and does not contain any description of people.
no
no
no
no
Are the Language Producers Representative of the Language? The source data is generated randomly, so it should not contain biases. The human references may be biased by the workers' demographics, but this was not investigated during data collection.
open license - commercial use allowed
Copyright Restrictions on the Language Data: open license - commercial use allowed
The cleaned version still has data points with hallucinated or omitted attributes.
Unsuited Applications: The data only pertains to the restaurant domain and the included attributes. A model cannot be expected to handle other domains or attributes.