Dataset:

GEM/cs_restaurants

Tasks:

conversational

Languages:

Multilinguality:

unknown

Size:

size_categories:unknown

Language creators:

unknown

Annotation creators:

none

Source datasets:

original

Other:

dialog-response-generation

License:

cc-by-sa-4.0

Dataset Card for GEM/cs_restaurants

Link to Main Data Card

You can find the main data card on the GEM website.

Dataset Summary

The Czech Restaurants dataset is a task-oriented dialog dataset in which a model must verbalize a response that a service agent could give. The content of the response is specified through a series of dialog acts. The dataset originated as a translation of an English dataset and was created to test the generation capabilities of NLG systems on a morphologically rich language such as Czech.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/cs_restaurants')

The data loader can be found here.

website

n/a

paper

Github

authors

Ondrej Dusek and Filip Jurcicek

Dataset Overview

Where to find the Data and its Documentation

Download

Github

Paper

Github

BibTex

@inproceedings{cs_restaurants,
    address = {Tokyo, Japan},
    title = {Neural {Generation} for {Czech}: {Data} and {Baselines}},
    shorttitle = {Neural {Generation} for {Czech}},
    url = {https://www.aclweb.org/anthology/W19-8670/},
    urldate = {2019-10-18},
    booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
    author = {Dušek, Ondřej and Jurčíček, Filip},
    month = oct,
    year = {2019},
    pages = {563--574},
}

Contact Name

Ondrej Dusek

Contact Email

odusek@ufal.mff.cuni.cz

Has a Leaderboard?

Languages and Intended Use

Multilingual?

Covered Dialects

No breakdown of dialects is provided.

Covered Languages

Czech

Whose Language?

Six professional translators produced the outputs.

License

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Intended Use

The dataset was created to test neural NLG systems in Czech and their ability to deal with rich morphology.

Primary Task

Dialog Response Generation

Communicative Goal

Producing a text expressing the given intent/dialogue act and all and only the attributes specified in the input meaning representation.

Credit

Curation Organization Type(s)

academic

Curation Organization(s)

Charles University, Prague

Dataset Creators

Ondrej Dusek and Filip Jurcicek

Funding

This research was supported by the Charles University project PRIMUS/19/SCI/10 and by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221. This work used language resources distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

Who added the Dataset to GEM?

Simon Mille wrote the initial data card and Yacine Jernite the data loader. Sebastian Gehrmann migrated the data card and loader to the v2 format.

Dataset Structure

Data Fields

The data is stored in both JSON and CSV formats, with identical contents. Each instance has 4 fields:

da: the input meaning representation/dialogue act (MR)
delex_da: the input MR, delexicalized (all slot values are replaced with placeholders, such as X-name)
text: the corresponding target natural language text (reference)
delex_text: the target text, delexicalized (delexicalization is applied regardless of inflection)

In addition, the data contains a JSON file with all possible inflected forms for all slot values in the dataset (surface_forms.json). Each slot -> value entry contains a list of inflected forms for the given value, with the base form (lemma), the inflected form, and a morphological tag.

The same MR is often repeated multiple times with different synonymous reference texts.
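
Delexicalizing a reference text regardless of inflection can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the FORMS table is a hand-written, hypothetical stand-in for surface_forms.json, the X-near placeholder is assumed by analogy with X-name, and the set of slots that the real data delexicalizes is not specified here.

```python
# Hypothetical stand-in for surface_forms.json: slot -> inflected forms that may
# occur in a reference text. The real file also stores lemmas and morphological tags.
FORMS = {
    "name": ["Švejk Restaurant"],
    "near": ["Karlova mostu"],  # inflected form, as it appears in the reference text
}


def delexicalize_text(text, forms):
    """Replace every known surface form of a slot value with an X-<slot> placeholder."""
    pairs = [(form, f"X-{slot}") for slot, slot_forms in forms.items() for form in slot_forms]
    # Replace longer forms first so a short form never overwrites part of a longer one.
    for form, placeholder in sorted(pairs, key=lambda pair: -len(pair[0])):
        text = text.replace(form, placeholder)
    return text


reference = (
    "Našla jsem pouze jednu levnou restauraci poblíž Karlova mostu , "
    "kde podávají tureckou kuchyni , Švejk Restaurant ."
)
delex_reference = delexicalize_text(reference, FORMS)
print(delex_reference)
```

Because the placeholder replaces whatever inflected form occurs in the text, a system trained on delexicalized text still has to pick the correctly inflected form when the placeholder is filled back in.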

Reason for Structure

The data originated as a translation and localization of Wen et al.'s SF restaurant NLG dataset.

How were labels chosen?

The input MRs were collected from Wen et al.'s SF restaurant NLG data and localized by randomly replacing slot values (using a list of Prague restaurant names, neighborhoods etc.).

The generated slot values were then automatically replaced in reference texts in the data.

Example Instance

{
  "input": "inform_only_match(food=Turkish,name='Švejk Restaurant',near='Charles Bridge',price_range=cheap)",
  "target": "Našla jsem pouze jednu levnou restauraci poblíž Karlova mostu , kde podávají tureckou kuchyni , Švejk Restaurant ."
}
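
The input string in the instance above can be split into an act type and its slot-value pairs with a small parser. The sketch below infers the syntax from this single example (an act name followed by comma-separated slot=value pairs, with optional single quotes around values). Support for slots without a value and for a leading "?" in the act name is an assumption, and values containing an apostrophe are not handled.

```python
import re

# Act name (optionally prefixed with "?"), followed by the argument list in parentheses.
MR_PATTERN = re.compile(r"(?P<act>\??\w+)\((?P<args>.*)\)")
# One slot, optionally followed by =value, where the value may be single-quoted.
SLOT_PATTERN = re.compile(r"(\w+)(?:=(?:'([^']*)'|([^,]*)))?")


def parse_mr(mr):
    """Return (act, [(slot, value), ...]); value is None when the slot has no value."""
    match = MR_PATTERN.fullmatch(mr.strip())
    if match is None:
        raise ValueError(f"Unrecognized MR syntax: {mr!r}")
    slots = []
    for slot, quoted, bare in SLOT_PATTERN.findall(match.group("args")):
        has_value = f"{slot}=" in match.group("args")
        value = (quoted or bare) if has_value else None
        slots.append((slot, value))
    return match.group("act"), slots


example = (
    "inform_only_match(food=Turkish,name='Švejk Restaurant',"
    "near='Charles Bridge',price_range=cheap)"
)
act, slots = parse_mr(example)
print(act, slots)
```

A list of pairs is used instead of a dictionary so that slot order and repeated slots are preserved.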

Data Splits

Property	Value
Total instances	5,192
Unique MRs	2,417
Unique delexicalized instances	2,752
Unique delexicalized MRs	248

The data is split in a roughly 3:1:1 proportion into training, development and test sections, making sure no delexicalized MR appears in two different parts. On the other hand, most DA types/intents are represented in all data parts.

Splitting Criteria

The creators ensured that after delexicalization of the meaning representation there was no overlap between training and test.

The data is split in a 3:1:1 ratio between training, validation, and test.
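
One way to obtain a split with this property is to assign whole groups of instances that share a delexicalized MR to a single part. The sketch below is an assumption about how such a split could be built, not the creators' actual procedure. The function name, the greedy assignment, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_delex_mr(instances, ratios=(3, 1, 1), seed=0):
    """Split instances into parts so that no delex_da value occurs in two parts."""
    groups = defaultdict(list)
    for instance in instances:
        groups[instance["delex_da"]].append(instance)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    parts = [[] for _ in ratios]
    for key in keys:
        # Give the whole group to the part that is furthest below its target share.
        target = min(range(len(parts)), key=lambda i: len(parts[i]) / ratios[i])
        parts[target].extend(groups[key])
    return parts


# Toy data: 10 hypothetical delexicalized MRs with 3 references each.
toy = [{"delex_da": f"mr_{i % 10}", "text": f"reference {i}"} for i in range(30)]
train, dev, test = split_by_delex_mr(toy)
print(len(train), len(dev), len(test))
```

Grouping before splitting is what prevents a delexicalized MR seen in training from reappearing in the test section. The resulting proportions are only approximately 3:1:1 when group sizes vary.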

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

This is one of a few non-English data-to-text datasets, in a well-known domain, but covering a morphologically rich language that is harder to generate since named entities need to be inflected. This makes it harder to apply common techniques such as delexicalization or copy mechanisms.

Similar Datasets

yes

Unique Language Coverage

yes

Difference from other GEM datasets

The dialog acts in this dataset are much more varied than those in the E2E dataset, which is the closest in style.

Ability that the Dataset measures

surface realization

GEM-Specific Curation

Modified for GEM?

yes

Additional Splits?

yes

Split Information

Five challenge sets for the Czech Restaurants dataset were added to the GEM evaluation suite.

Data shift: We created subsets of the training and development sets of 500 randomly selected inputs each.

Scrambling: We applied input scrambling on a subset of 500 randomly selected test instances; the order of the input dialogue acts was randomly reassigned.
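
A minimal sketch of input scrambling is shown below. It assumes that scrambling means randomly reordering the slot-value pairs inside one input MR. The function names are hypothetical, and this is not the GEM implementation.

```python
import random


def scramble_slots(slots, rng):
    """Return a randomly reordered copy of the slot strings; the input list is untouched."""
    shuffled = list(slots)
    rng.shuffle(shuffled)
    return shuffled


def format_mr(act, slots):
    """Rebuild the MR string from an act type and a list of slot strings."""
    return f"{act}({','.join(slots)})"


act = "inform_only_match"
slots = ["food=Turkish", "name='Švejk Restaurant'", "near='Charles Bridge'", "price_range=cheap"]

rng = random.Random(0)  # fixed seed so the scrambled set is reproducible
scrambled = scramble_slots(slots, rng)
print(format_mr(act, scrambled))
```

Only the order changes, so the reference text stays valid for the scrambled input. A model that relies on a fixed slot order is expected to do worse on this subset.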

We identified different subsets of the test set that we could compare to each other so that we would have a better understanding of the results. There are currently two selections that we have made:

The first comparison is based on input size: the number of predicates differs between inputs, ranging from 1 to 5. The table below shows the distribution of inputs by length. The distribution is clearly not balanced, so comparisons between groups should be made with caution. For input sizes 4 and 5 in particular, there may not be enough data to draw reliable conclusions.

Input length	Number of inputs
1	183
2	267
3	297
4	86
5	9

The second comparison is based on the type of act. Again, we caution against comparing groups that have relatively few items. It is probably OK to compare inform and ?request, but the other acts all have low frequencies.

Act	Frequency
?request	149
inform	609
?confirm	22
inform_only_match	16
inform_no_match	34
?select	12
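
The imbalance in both tables is easier to judge as percentages. The short computation below uses only the counts listed above; the helper name `shares` is introduced here for illustration.

```python
# Counts copied from the two tables above.
by_length = {1: 183, 2: 267, 3: 297, 4: 86, 5: 9}
by_act = {
    "?request": 149,
    "inform": 609,
    "?confirm": 22,
    "inform_only_match": 16,
    "inform_no_match": 34,
    "?select": 12,
}


def shares(counts):
    """Return each group's share of the total, as a percentage rounded to one decimal."""
    total = sum(counts.values())
    return {group: round(100 * count / total, 1) for group, count in counts.items()}


length_shares = shares(by_length)
act_shares = shares(by_act)
print(sum(by_length.values()), sum(by_act.values()))
print(length_shares)
print(act_shares)
```

Both tables sum to 842 inputs. Inputs of length 5 make up about 1% of them, and inform alone accounts for roughly 72%, which is why scores on the small groups are likely to be noisy.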

Split Motivation

Generalization and robustness.

Getting Started with the Task

Technical Terms

utterance: something a system or user may say in a turn
meaning representation (MR): a formal representation of the meaning that the system output should express. In this dataset, the MRs are dialog acts, which describe what a dialog system should do, e.g., inform a user about a value.

Previous Results

Measured Model Abilities

Surface realization

Metrics

BLEU, ROUGE, METEOR

Proposed Evaluation

This dataset uses the suite of word-overlap-based automatic metrics from the E2E NLG Challenge (BLEU, NIST, ROUGE-L, METEOR, and CIDEr). In addition, the slot error rate is measured.
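
The exact slot error rate definition is not given here. The sketch below is a simplified variant that only counts slot values missing from the output, which is an assumption. Because Czech inflects slot values, it matches against a table of surface forms. The table and the function name are hypothetical stand-ins, not the evaluation script used for the dataset.

```python
def missing_slot_rate(slot_values, output_text, surface_forms):
    """Fraction of input slot values for which no known surface form occurs in the output."""
    if not slot_values:
        return 0.0
    lowered = output_text.lower()
    missing = 0
    for value in slot_values:
        # Fall back to the value itself when no inflected forms are listed.
        forms = surface_forms.get(value, [value])
        if not any(form.lower() in lowered for form in forms):
            missing += 1
    return missing / len(slot_values)


# Hypothetical surface-form table: MR value -> forms that may appear in Czech text.
forms = {"Charles Bridge": ["Karlův most", "Karlova mostu"]}
values = ["Švejk Restaurant", "Charles Bridge"]

full = "Našla jsem pouze jednu levnou restauraci poblíž Karlova mostu , Švejk Restaurant ."
partial = "Našla jsem pouze jednu levnou restauraci poblíž Karlova mostu ."

rate_full = missing_slot_rate(values, full, forms)
rate_partial = missing_slot_rate(values, partial, forms)
print(rate_full, rate_partial)
```

A fuller measure would also penalize added or incorrect slot values. Plain substring matching without the surface-form table would wrongly flag correctly inflected values as missing.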

Previous results available?

Dataset Curation

Original Curation

Original Curation Rationale

The dataset was created to test neural NLG systems in Czech and their ability to deal with rich morphology.

Communicative Goal

Producing a text expressing the given intent/dialogue act and all and only the attributes specified in the input MR.

Sourced from Different Sources

Language Data

How was Language Data Obtained?

Created for the dataset

Creation Process

Six professional translators translated the underlying dataset with the following instructions:

Each utterance should be translated by itself.
Fluent spoken-style Czech should be produced.
Facts should be preserved.
If possible, synonyms should be varied to create diverse utterances.
Entity names should be inflected as necessary.
The reader of the generated text should be addressed using the formal form, and self-references should use the female form.

The translators did not have access to the meaning representation.

Data Validation

validated by data curator

Was Data Filtered?

not filtered

Structured Annotations

Additional Annotations?

none

Annotation Service?

Consent

Any Consent Policy?

Justification for Using the Data

It was not explicitly stated, but we can safely assume that the translators agreed to this use of their data.

Private Identifying Information (PII)

Contains PII?

no PII

Justification for no PII

This dataset does not include any information about individuals.

Maintenance

Any Maintenance Plan?

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Impact on Under-Served Communities

Addresses needs of underserved Communities?

yes

Details on how Dataset Addresses the Needs

The dataset may help improve NLG methods for morphologically rich languages beyond Czech.

Discussion of Biases

Any Documented Social Biases?

yes

Links and Summaries of Analysis Work

To ensure consistency of translation, the data always uses formal/polite address for the user, and uses the female form for first-person self-references (as if the dialogue agent producing the sentences was female). This prevents data sparsity and ensures consistent results for systems trained on the dataset, but does not represent all potential situations arising in Czech.

Considerations for Using the Data

PII Risks and Liability

Licenses

open license - commercial use allowed

Known Technical Limitations

Technical Limitations

The test set may lead users to over-estimate the performance of their NLG systems with respect to their generalisability, because there are no unseen restaurants or addresses in the test set. This is something we will look into for future editions of the GEM shared task.

Author:

GEM

Dataset size:

46.88 KB