Datasets:

e2e_nlg

Tasks:

Text2Text Generation

Languages:

en

Multilinguality:

monolingual

Size Categories:

10K<n<100K

Language Creators:

crowdsourced

Annotations Creators:

crowdsourced

Source Datasets:

original

Dataset Card for End-to-End NLG Challenge

Dataset Summary

The E2E dataset is used for training end-to-end, data-driven natural language generation systems in the restaurant domain. It is ten times bigger than existing, frequently used datasets in this area. The E2E dataset poses new challenges: (1) its human reference texts show more lexical richness and syntactic variation, including discourse phenomena; (2) generating from this set requires content selection. As such, learning from this dataset promises more natural, more varied and less template-like system utterances.

E2E is released in the following paper where you can find more details and baseline results: https://arxiv.org/abs/1706.09254

Supported Tasks and Leaderboards

  • text2text-generation-other-meaning-representation-to-text: The dataset can be used to train a model to generate descriptions in the restaurant domain from meaning representations. The model takes some data about a restaurant as input and generates a natural-language sentence that presents the different aspects of that data. Success on this task is typically measured by achieving high BLEU, NIST, METEOR, ROUGE-L and CIDEr scores. The TGen model (Dušek and Jurčíček, 2016a), which was used as a baseline, had the following scores:
          BLEU    NIST    METEOR  ROUGE_L  CIDEr
BASELINE  0.6593  8.6094  0.4483  0.6850   2.2338

This task has an inactive leaderboard which can be found here and ranks models based on the metrics above.

Languages

The dataset is in English (en).

Dataset Structure

Data Instances

Example of one instance:

{'human_reference': 'The Vaults pub near Café Adriatic has a 5 star rating.  Prices start at £30.',
 'meaning_representation': 'name[The Vaults], eatType[pub], priceRange[more than £30], customer rating[5 out of 5], near[Café Adriatic]'}

Data Fields

  • human_reference: string, a natural-language text that describes the different characteristics in the meaning representation
  • meaning_representation: string, a comma-separated list of slots and values from which to generate a description

Each MR consists of 3–8 attributes (slots), such as name, food or area, and their values.
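The slot[value] format shown in the example instance can be turned into a dictionary with a short regular expression. The following is a minimal sketch, not part of the dataset tooling: the function name parse_meaning_representation is hypothetical, and the pattern assumes that values never contain a closing square bracket.

```python
import re

# One "slot[value]" pair. Slot names may contain spaces (e.g. "customer rating").
# Assumption: values never contain "]".
_SLOT_PATTERN = re.compile(r"\s*([^\[\],]+?)\s*\[([^\]]*)\]")


def parse_meaning_representation(mr):
    """Parse an MR string such as 'name[The Vaults], eatType[pub]' into a dict."""
    return {slot: value for slot, value in _SLOT_PATTERN.findall(mr)}


# The MR from the example instance above.
example_mr = (
    "name[The Vaults], eatType[pub], priceRange[more than £30], "
    "customer rating[5 out of 5], near[Café Adriatic]"
)
parsed = parse_meaning_representation(example_mr)
print(parsed["customer rating"])  # prints: 5 out of 5
```

A dictionary makes it easy to count the attributes per MR or to check which slots a generated sentence fails to mention.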

Data Splits

The dataset is split into training, validation and testing sets (in a 76.5-8.5-15 ratio), keeping a similar distribution of MR and reference text lengths and ensuring that MRs in different sets are distinct.

              train  validation  test
N. Instances  42061  4672        4693
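The claim that MRs in different sets are distinct can be checked by intersecting the sets of MR strings per split. The sketch below assumes each split is an iterable of rows with a meaning_representation key, as in the example instance. The helper name overlapping_mrs is hypothetical, and the toy rows are invented for illustration, not taken from the dataset.

```python
from itertools import combinations


def overlapping_mrs(splits):
    """Return the MRs shared by any two splits, keyed by the pair of split names."""
    mr_sets = {
        name: {row["meaning_representation"] for row in rows}
        for name, rows in splits.items()
    }
    overlaps = {}
    for first, second in combinations(mr_sets, 2):
        shared = mr_sets[first] & mr_sets[second]
        if shared:
            overlaps[(first, second)] = shared
    return overlaps


# Toy rows, invented for illustration only (not real dataset entries).
toy_splits = {
    "train": [
        {"meaning_representation": "name[A], eatType[pub]"},
        {"meaning_representation": "name[A], eatType[pub]"},  # one MR, several references
    ],
    "validation": [{"meaning_representation": "name[B], eatType[restaurant]"}],
    "test": [{"meaning_representation": "name[C], eatType[coffee shop]"}],
}
print(overlapping_mrs(toy_splits))  # prints: {}
```

An empty result means no MR string occurs in more than one split. Because the comparison is on exact strings, MRs that differ only in slot order or whitespace would count as distinct here.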

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

The data was collected using the CrowdFlower platform and quality-controlled following Novikova et al. (2016).

Who are the source language producers?

[More Information Needed]

Annotations

Following Novikova et al. (2016), the E2E data was collected using pictures as stimuli, which was shown to elicit significantly more natural, more informative, and better phrased human references than textual MRs.

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@article{dusek.etal2020:csl,
  title = {Evaluating the {{State}}-of-the-{{Art}} of {{End}}-to-{{End Natural Language Generation}}: {{The E2E NLG Challenge}}},
  author = {Du{\v{s}}ek, Ond\v{r}ej and Novikova, Jekaterina and Rieser, Verena},
  year = {2020},
  month = jan,
  volume = {59},
  pages = {123--156},
  doi = {10.1016/j.csl.2019.06.009},
  archivePrefix = {arXiv},
  eprint = {1901.11528},
  eprinttype = {arxiv},
  journal = {Computer Speech \& Language}
}

Contributions

Thanks to @lhoestq for adding this dataset.