Dataset:

GEM/cs_restaurants

Tasks:

Conversational

Languages:

cs

Multilinguality:

unknown

Language Creators:

unknown

Annotations Creators:

none

Source Datasets:

original

Dataset Card for GEM/cs_restaurants

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

The Czech Restaurants dataset is a task-oriented dialog dataset in which a model must verbalize a response that a service agent could give, with the content of the response specified through a series of dialog acts. The dataset originated as a translation of an English dataset and was created to test the generation capabilities of NLG systems on a morphologically rich language such as Czech.

You can load the dataset via:

import datasets
# Without a split argument, load_dataset returns all available splits.
data = datasets.load_dataset('GEM/cs_restaurants')

The data loader can be found here.

website

n/a

paper

Github

authors

Ondrej Dusek and Filip Jurcicek

Dataset Overview

Where to find the Data and its Documentation

Download

Github

Paper

Github

BibTex
@inproceedings{cs_restaurants,
    address = {Tokyo, Japan},
    title = {Neural {Generation} for {Czech}: {Data} and {Baselines}},
    shorttitle = {Neural {Generation} for {Czech}},
    url = {https://www.aclweb.org/anthology/W19-8670/},
    urldate = {2019-10-18},
    booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
    author = {Dušek, Ondřej and Jurčíček, Filip},
    month = oct,
    year = {2019},
    pages = {563--574},
}
Contact Name

Ondrej Dusek

Contact Email

odusek@ufal.mff.cuni.cz

Has a Leaderboard?

no

Languages and Intended Use

Multilingual?

no

Covered Dialects

No breakdown of dialects is provided.

Covered Languages

Czech

Whose Language?

Six professional translators produced the outputs.

License

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Intended Use

The dataset was created to test neural NLG systems in Czech and their ability to deal with rich morphology.

Primary Task

Dialog Response Generation

Communicative Goal

Producing a text expressing the given intent/dialogue act and all and only the attributes specified in the input meaning representation.

Credit

Curation Organization Type(s)

academic

Curation Organization(s)

Charles University, Prague

Dataset Creators

Ondrej Dusek and Filip Jurcicek

Funding

This research was supported by the Charles University project PRIMUS/19/SCI/10 and by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221. This work used language resources distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

Who added the Dataset to GEM?

Simon Mille wrote the initial data card and Yacine Jernite the data loader. Sebastian Gehrmann migrated the data card and loader to the v2 format.

Dataset Structure

Data Fields

The data is stored in a JSON or CSV format, with identical contents. The data has 4 fields:

  • da: the input meaning representation/dialogue act (MR)
  • delex_da: the input MR, delexicalized -- all slot values are replaced with placeholders, such as X-name
  • text: the corresponding target natural language text (reference)
  • delex_text: the target text, delexicalized (delexicalization is applied regardless of inflection)

In addition, the data contains a JSON file with all possible inflected forms for all slot values in the dataset (surface_forms.json). Each slot -> value entry contains a list of inflected forms for the given value, with the base form (lemma), the inflected form, and a morphological tag.

The same MR is often repeated multiple times with different synonymous reference texts.
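
The relationship between da and delex_da can be illustrated with a minimal sketch. It assumes that MRs follow the `intent(slot=value,...)` pattern seen in the example instance of this card, and it uses a hypothetical set of delexicalized slots (`DELEX_SLOTS`); the dataset's own delexicalization may cover a different set of slots. The helper names `parse_da` and `delexicalize_da` are illustrative and not part of the dataset tooling.

```python
import re

# Assumption: only these slots are delexicalized; the dataset's real list may differ.
DELEX_SLOTS = {"name", "near"}

def parse_da(da):
    """Split 'intent(slot=value,...)' into (intent, [(slot, value), ...])."""
    match = re.fullmatch(r"(\??\w+)\((.*)\)", da.strip())
    if match is None:
        raise ValueError("not a dialogue act: " + repr(da))
    intent, body = match.groups()
    # A value is either quoted ('...') or runs up to the next comma; it may be absent.
    pairs = re.findall(r"(\w+)(?:=('[^']*'|[^,]*))?", body)
    return intent, [(slot, value.strip("'")) for slot, value in pairs]

def delexicalize_da(da, slots=DELEX_SLOTS):
    """Replace the values of the chosen slots with X-<slot> placeholders."""
    intent, pairs = parse_da(da)
    parts = []
    for slot, value in pairs:
        if slot in slots and value:
            value = "X-" + slot
        parts.append(slot + "=" + value if value else slot)
    return intent + "(" + ",".join(parts) + ")"

da = "inform_only_match(food=Turkish,name='Švejk Restaurant',near='Charles Bridge',price_range=cheap)"
print(delexicalize_da(da))
```

Delexicalizing the target text is harder than this, because slot values appear in inflected forms in Czech; that is what the list of surface forms is for.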

Reason for Structure

The data originated as a translation and localization of Wen et al.'s SF restaurant NLG dataset.

How were labels chosen?

The input MRs were collected from Wen et al.'s SF restaurant NLG data and localized by randomly replacing slot values (using a list of Prague restaurant names, neighborhoods etc.).

The generated slot values were then automatically replaced in reference texts in the data.

Example Instance
{
  "input": "inform_only_match(food=Turkish,name='Švejk Restaurant',near='Charles Bridge',price_range=cheap)",
  "target": "Našla jsem pouze jednu levnou restauraci poblíž Karlova mostu , kde podávají tureckou kuchyni , Švejk Restaurant ."
}
Data Splits
Property                          Value
Total instances                   5,192
Unique MRs                        2,417
Unique delexicalized instances    2,752
Unique delexicalized MRs          248

The data is split in a roughly 3:1:1 proportion into training, development and test sections, making sure no delexicalized MR appears in two different parts. On the other hand, most DA types/intents are represented in all data parts.

Splitting Criteria

The creators ensured that after delexicalization of the meaning representation there was no overlap between training and test.

The data is split at a 3:1:1 rate between training, validation, and test.
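
A split with this property can be sketched as follows. This is not the creators' procedure, only one simple way to keep every delexicalized MR inside a single part while approximating a 3:1:1 ratio; the function name `split_by_group` and the toy instances are hypothetical.

```python
import random
from collections import defaultdict

def split_by_group(instances, key, ratios=(3, 1, 1), seed=0):
    """Split instances into parts so that all instances sharing key(inst) stay together."""
    groups = defaultdict(list)
    for inst in instances:
        groups[key(inst)].append(inst)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    targets = [len(instances) * r / sum(ratios) for r in ratios]
    parts = [[] for _ in ratios]
    for k in keys:
        # Assign the whole group to the part that is furthest below its target size.
        i = max(range(len(parts)), key=lambda j: targets[j] - len(parts[j]))
        parts[i].extend(groups[k])
    return parts

# Toy data: 20 delexicalized MRs with 1 to 4 reference texts each.
toy = [
    {"delex_da": "inform(name=X-name,slot%d)" % g, "text": "t%d-%d" % (g, i)}
    for g in range(20)
    for i in range(1 + g % 4)
]
train, dev, test = split_by_group(toy, key=lambda inst: inst["delex_da"])
print(len(train), len(dev), len(test))
```

Because whole groups are assigned at once, the resulting sizes only approximate the requested ratio, which matches the "roughly 3:1:1" wording above.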

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

This is one of a few non-English data-to-text datasets, in a well-known domain, but covering a morphologically rich language that is harder to generate since named entities need to be inflected. This makes it harder to apply common techniques such as delexicalization or copy mechanisms.

Similar Datasets

yes

Unique Language Coverage

yes

Difference from other GEM datasets

The dialog acts in this dataset are much more varied than those in the E2E dataset, which is the closest in style.

Ability that the Dataset measures

surface realization

GEM-Specific Curation

Modified for GEM?

yes

Additional Splits?

yes

Split Information

5 challenge sets for the Czech Restaurants dataset were added to the GEM evaluation suite.

  • Data shift: We created subsets of the training and development sets of 500 randomly selected inputs each.
  • Scrambling: We applied input scrambling on a subset of 500 randomly selected test instances; the order of the input dialogue acts was randomly reassigned.
  • We identified subsets of the test set that can be compared with each other to give a better understanding of the results. There are currently two such selections:
  • The first comparison is based on input size: the number of predicates in an input ranges from 1 to 5. The table below shows how many inputs there are of each length. The distribution is clearly unbalanced, so comparisons between groups should be made with caution. For input sizes 4 and 5 in particular, there may not be enough data to draw reliable conclusions.

    Input length    Number of inputs
    1               183
    2               267
    3               297
    4               86
    5               9

    The second comparison is based on the type of act. Again, we caution against comparing groups that have relatively few items. It is probably fine to compare inform and ?request, but all other acts are infrequent.

    Act                  Frequency
    ?request             149
    inform               609
    ?confirm             22
    inform_only_match    16
    inform_no_match      34
    ?select              12

    Split Motivation

    Generalization and robustness.
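
The scrambling and the two test-set comparisons described under Split Information can be sketched as follows. This is a rough illustration rather than the GEM implementation: it assumes inputs follow the `intent(slot=value,...)` pattern, that scrambling means shuffling the slot order, and that the input size is the number of slots. The helper names are hypothetical.

```python
import random
import re

# One slot: a name, optionally followed by '=' and a quoted or bare value.
SLOT_RE = re.compile(r"\w+(?:=(?:'[^']*'|[^,]*))?")

def split_da(da):
    """Return (intent, list of 'slot=value' strings) for one input."""
    intent, body = re.fullmatch(r"(\??\w+)\((.*)\)", da.strip()).groups()
    return intent, SLOT_RE.findall(body)

def scramble(da, rng):
    """Randomly reorder the slots of an input, keeping the intent."""
    intent, slots = split_da(da)
    rng.shuffle(slots)
    return intent + "(" + ",".join(slots) + ")"

def subset_keys(da):
    """Return (act type, input size), the two keys used to group test inputs."""
    intent, slots = split_da(da)
    return intent, len(slots)

da = "inform_only_match(food=Turkish,name='Švejk Restaurant',near='Charles Bridge',price_range=cheap)"
scrambled = scramble(da, random.Random(0))
print(scrambled)
print(subset_keys(da))
```

Passing an explicit `random.Random` instance keeps the scrambled set reproducible across runs.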

    Getting Started with the Task

    Technical Terms
    • utterance: something a system or user may say in a turn
    • meaning representation: a representation of the meaning that the system output should express. The specific type of MR in this dataset is the dialog act, which describes what a dialog system should do, e.g., inform a user about a value.

    Previous Results

    Previous Results

    Measured Model Abilities

    Surface realization

    Metrics

    BLEU, ROUGE, METEOR

    Proposed Evaluation

    This dataset uses the suite of word-overlap-based automatic metrics from the E2E NLG Challenge (BLEU, NIST, ROUGE-L, METEOR, and CIDEr). In addition, the slot error rate is measured.
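
The card does not define how the slot error rate is computed. The sketch below shows one simplified variant, under the assumption that an error is a required slot value for which no known surface form occurs in the output; it ignores added or wrong values. The function name, the simplified `surface_forms` layout, and the example forms are illustrative and do not reproduce the format of surface_forms.json.

```python
def slot_error_rate(slot_values, output, surface_forms):
    """Fraction of required slot values with no known surface form in `output`."""
    if not slot_values:
        return 0.0
    text = output.lower()
    missing = 0
    for slot, value in slot_values:
        # Fall back to the value itself when no inflected forms are listed.
        forms = surface_forms.get(slot, {}).get(value, [value])
        if not any(form.lower() in text for form in forms):
            missing += 1
    return missing / len(slot_values)

# Illustrative data: nominative and genitive forms of one place name.
forms = {"near": {"Karlův most": ["Karlův most", "Karlova mostu"]}}
mr = [("name", "Švejk Restaurant"), ("near", "Karlův most")]
good = "Našla jsem pouze jednu levnou restauraci poblíž Karlova mostu , kde podávají tureckou kuchyni , Švejk Restaurant ."
bad = "Švejk Restaurant je levná restaurace ."
print(slot_error_rate(mr, good, forms))
print(slot_error_rate(mr, bad, forms))
```

Matching against inflected forms rather than the base form matters here: a check for the lemma alone would count the first output as missing the `near` value even though it is expressed correctly.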

    Previous results available?

    no

    Dataset Curation

    Original Curation

    Original Curation Rationale

    The dataset was created to test neural NLG systems in Czech and their ability to deal with rich morphology.

    Communicative Goal

    Producing a text expressing the given intent/dialogue act and all and only the attributes specified in the input MR.

    Sourced from Different Sources

    no

    Language Data

    How was Language Data Obtained?

    Created for the dataset

    Creation Process

    Six professional translators translated the underlying dataset with the following instructions:

    • Each utterance should be translated by itself
    • Fluent spoken-style Czech should be produced
    • Facts should be preserved
    • If possible, synonyms should be varied to create diverse utterances
    • Entity names should be inflected as necessary
    • The reader of the generated text should be addressed using the formal form, and self-references should use the female form

    The translators did not have access to the meaning representation.

    Data Validation

    validated by data curator

    Was Data Filtered?

    not filtered

    Structured Annotations

    Additional Annotations?

    none

    Annotation Service?

    no

    Consent

    Any Consent Policy?

    no

    Justification for Using the Data

    It was not explicitly stated, but we can safely assume that the translators agreed to this use of their data.

    Private Identifying Information (PII)

    Contains PII?

    no PII

    Justification for no PII

    This dataset does not include any information about individuals.

    Maintenance

    Any Maintenance Plan?

    no

    Broader Social Context

    Previous Work on the Social Impact of the Dataset

    Usage of Models based on the Data

    no

    Impact on Under-Served Communities

    Addresses needs of underserved Communities?

    yes

    Details on how Dataset Addresses the Needs

    The dataset may help improve NLG methods for morphologically rich languages beyond Czech.

    Discussion of Biases

    Any Documented Social Biases?

    yes

    Links and Summaries of Analysis Work

    To ensure consistency of translation, the data always uses formal/polite address for the user, and uses the female form for first-person self-references (as if the dialogue agent producing the sentences was female). This prevents data sparsity and ensures consistent results for systems trained on the dataset, but does not represent all potential situations arising in Czech.

    Considerations for Using the Data

    PII Risks and Liability

    Licenses

    Copyright Restrictions on the Dataset

    open license - commercial use allowed

    Copyright Restrictions on the Language Data

    open license - commercial use allowed

    Known Technical Limitations

    Technical Limitations

    The test set may lead users to over-estimate the performance of their NLG systems with respect to their generalisability, because there are no unseen restaurants or addresses in the test set. This is something we will look into for future editions of the GEM shared task.