You can find the main data card on the GEM Website.
The Czech Restaurants dataset is a task-oriented dialog dataset in which a model must verbalize a response that a service agent could give, where the response is specified through a series of dialog acts. The dataset originated as a translation of an English dataset and was created to test the generation capabilities of NLG systems on a morphologically rich language such as Czech.
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/cs_restaurants')
The data loader can be found here.
Website: n/a
Paper authors: Ondrej Dusek and Filip Jurcicek
@inproceedings{cs_restaurants,
  address = {Tokyo, Japan},
  title = {Neural {Generation} for {Czech}: {Data} and {Baselines}},
  shorttitle = {Neural {Generation} for {Czech}},
  url = {https://www.aclweb.org/anthology/W19-8670/},
  urldate = {2019-10-18},
  booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
  author = {Dušek, Ondřej and Jurčíček, Filip},
  month = oct,
  year = {2019},
  pages = {563--574},
}
Contact Name: Ondrej Dusek
Contact Email: odusek@ufal.mff.cuni.cz
Has a Leaderboard? no
no
Covered Dialects: No breakdown of dialects is provided.
Covered Languages: Czech
Whose Language? Six professional translators produced the outputs.
License: cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International
Intended Use: The dataset was created to test neural NLG systems in Czech and their ability to deal with rich morphology.
Primary Task: Dialog Response Generation
Communicative Goal: Producing a text expressing the given intent/dialogue act and all and only the attributes specified in the input meaning representation.
academic
Curation Organization(s): Charles University, Prague
Dataset Creators: Ondrej Dusek and Filip Jurcicek
Funding: This research was supported by the Charles University project PRIMUS/19/SCI/10 and by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221. This work used language resources distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).
Who added the Dataset to GEM? Simon Mille wrote the initial data card and Yacine Jernite the data loader. Sebastian Gehrmann migrated the data card and loader to the v2 format.
The data is stored in a JSON or CSV format, with identical contents. The data has 4 fields:
In addition, the data contains a JSON file with all possible inflected forms for all slot values in the dataset (surface_forms.json). Each slot -> value entry contains a list of inflected forms for the given value, with the base form (lemma), the inflected form, and a morphological tag.
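As a rough illustration of how such a table might be queried, the sketch below looks up the inflected forms listed for one slot value. The nested layout (slot, then value, then a list of [lemma, inflected form, tag] triples), the sample entries, the placeholder tags, and the helper name `forms_for` are all assumptions made for this example; inspect the actual surface_forms.json before relying on any of them.

```python
import json

# Hypothetical excerpt mimicking surface_forms.json. The real file's layout,
# keys, and tag format may differ; the tags here are placeholders.
SAMPLE = json.loads("""
{
  "near": {
    "Charles Bridge": [
      ["Karlův most", "Karlův most", "TAG-NOMINATIVE"],
      ["Karlův most", "Karlova mostu", "TAG-GENITIVE"]
    ]
  }
}
""")


def forms_for(table, slot, value):
    """Return the inflected forms listed for a slot value (empty if unknown)."""
    entries = table.get(slot, {}).get(value, [])
    return [inflected for _lemma, inflected, _tag in entries]


print(forms_for(SAMPLE, "near", "Charles Bridge"))
```

A lookup like this is what makes lexicalization harder than in English: a generator has to choose among several forms of the same value depending on the sentence context.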
The same MR is often repeated multiple times with different synonymous reference texts.
Reason for Structure: The data originated as a translation and localization of Wen et al.'s SF restaurant NLG dataset.
How were labels chosen? The input MRs were collected from Wen et al.'s SF restaurant NLG data and localized by randomly replacing slot values (using a list of Prague restaurant names, neighborhoods etc.).
The generated slot values were then automatically replaced in reference texts in the data.
Example Instance:
{
  "input": "inform_only_match(food=Turkish,name='Švejk Restaurant',near='Charles Bridge',price_range=cheap)",
  "target": "Našla jsem pouze jednu levnou restauraci poblíž Karlova mostu , kde podávají tureckou kuchyni , Švejk Restaurant ."
}
Data Splits
Property | Value |
---|---|
Total instances | 5,192 |
Unique MRs | 2,417 |
Unique delexicalized instances | 2,752 |
Unique delexicalized MRs | 248 |
The data is split in a roughly 3:1:1 proportion into training, development and test sections, making sure no delexicalized MR appears in two different parts. On the other hand, most DA types/intents are represented in all data parts.
Splitting Criteria: The creators ensured that after delexicalization of the meaning representation there was no overlap between training and test.
The data is split at a 3:1:1 rate between training, validation, and test.
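One way to obtain a split with this property is to group instances by their delexicalized MR and assign whole groups to parts. The sketch below is only an illustration of that idea, not the creators' procedure; the field name `delex_mr`, the function name, and the greedy balancing rule are assumptions.

```python
import random
from collections import defaultdict


def split_by_delex_mr(instances, ratios=(3, 1, 1), seed=0):
    """Split instances so each delexicalized MR lands in exactly one part."""
    # Group instances that share a delexicalized MR (hypothetical field name).
    groups = defaultdict(list)
    for inst in instances:
        groups[inst["delex_mr"]].append(inst)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    total = sum(ratios)
    n = len(instances)
    parts = tuple([] for _ in ratios)
    for key in keys:
        # Assign the whole group to the part furthest below its target size.
        deficits = [r / total * n - len(p) for r, p in zip(ratios, parts)]
        parts[deficits.index(max(deficits))].extend(groups[key])
    return parts


toy = [{"delex_mr": f"mr{i % 10}", "text": str(i)} for i in range(50)]
train, dev, test = split_by_delex_mr(toy)
print(len(train), len(dev), len(test))
```

Because groups are never divided, the resulting proportions are only approximately 3:1:1, which matches the "roughly" in the description above.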
This is one of a few non-English data-to-text datasets, in a well-known domain, but covering a morphologically rich language that is harder to generate since named entities need to be inflected. This makes it harder to apply common techniques such as delexicalization or copy mechanisms.
Similar Datasets: yes
Unique Language Coverage: yes
Difference from other GEM datasets: The dialog acts in this dataset are much more varied than those in the E2E dataset, which is the closest in style.
Ability that the Dataset measures: surface realization
yes
Additional Splits? yes
Split Information: 5 challenge sets for the Czech Restaurants dataset were added to the GEM evaluation suite.
The first comparison is based on input size: the number of predicates differs between inputs, ranging from 1 to 5. The table below indicates how many inputs have each length. The distribution is clearly not balanced, so comparisons between groups should be made with caution. For input sizes 4 and 5 in particular, there may not be enough data to draw reliable conclusions.
Input length | Number of inputs |
---|---|
1 | 183 |
2 | 267 |
3 | 297 |
4 | 86 |
5 | 9 |
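To reproduce a breakdown like the one above, each input needs to be reduced to its act type and its list of slots. The sketch below parses MR strings in the format of the example instance and counts slots per input. It assumes that "predicates" corresponds to slot entries in the MR and that values containing commas are single-quoted; the regular expressions and function names are illustrative, not taken from the GEM code.

```python
import re
from collections import Counter

# act_type(slot=value,slot='quoted value',slot)
MR_PATTERN = re.compile(r"^(?P<act>[?\w]+)\((?P<body>.*)\)$")
SLOT_PATTERN = re.compile(r"(\w+)(?:=('[^']*'|[^,]*))?")


def parse_mr(mr):
    """Return (act_type, [(slot, value_or_None), ...]) for one MR string."""
    match = MR_PATTERN.match(mr.strip())
    if match is None:
        raise ValueError(f"unrecognized MR: {mr!r}")
    slots = [
        (name, value.strip("'") or None)
        for name, value in SLOT_PATTERN.findall(match.group("body"))
    ]
    return match.group("act"), slots


def input_length_histogram(mrs):
    """Count how many MRs have each number of slots."""
    return Counter(len(parse_mr(mr)[1]) for mr in mrs)


example = ("inform_only_match(food=Turkish,name='Švejk Restaurant',"
           "near='Charles Bridge',price_range=cheap)")
act, slots = parse_mr(example)
print(act, len(slots))  # inform_only_match 4
```

The same `parse_mr` output also gives the act type, so a frequency count over acts can be built with a `Counter` in the same way.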
The second comparison is based on the type of act. Again, we caution against comparing groups that have relatively few items. It is probably OK to compare inform and ?request, but the other acts are all infrequent.
Act | Frequency |
---|---|
?request | 149 |
inform | 609 |
?confirm | 22 |
inform_only_match | 16 |
inform_no_match | 34 |
?select | 12 |
Generalization and robustness.
Surface realization
Metrics: BLEU, ROUGE, METEOR
Proposed Evaluation: This dataset uses the suite of word-overlap-based automatic metrics from the E2E NLG Challenge (BLEU, NIST, ROUGE-L, METEOR, and CIDEr). In addition, the slot error rate is measured.
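To give a feel for what a slot-based check involves in Czech, the sketch below computes a simplified proxy: the fraction of input slot values for which no known surface form appears in the output. It is not the evaluation script used for this dataset, and it counts only missing values, not added or wrong ones. The function name, the `(slot, value)` keyed table, and the substring matching are assumptions; the Czech forms are taken from the example instance in this card.

```python
def missing_slot_rate(slots, output, surface_forms):
    """Fraction of slot values with no known surface form in the output."""
    if not slots:
        return 0.0
    text = output.lower()
    missing = 0
    for slot, value in slots.items():
        # Fall back to the raw value when no inflected forms are listed.
        candidates = surface_forms.get((slot, value), [value])
        if not any(form.lower() in text for form in candidates):
            missing += 1
    return missing / len(slots)


slots = {
    "food": "Turkish",
    "name": "Švejk Restaurant",
    "near": "Charles Bridge",
    "price_range": "cheap",
}
# Hypothetical lookup table; forms copied from the example reference text.
forms = {
    ("food", "Turkish"): ["tureckou"],
    ("near", "Charles Bridge"): ["Karlova mostu"],
    ("price_range", "cheap"): ["levnou"],
}
reference = ("Našla jsem pouze jednu levnou restauraci poblíž Karlova mostu , "
             "kde podávají tureckou kuchyni , Švejk Restaurant .")
print(missing_slot_rate(slots, reference, forms))  # 0.0
```

Without the table of inflected forms, a plain string match on the input values would flag most slots as missing, which is why the surface forms matter for this kind of check.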
Previous results available? no
The dataset was created to test neural NLG systems in Czech and their ability to deal with rich morphology.
Communicative Goal: Producing a text expressing the given intent/dialogue act and all and only the attributes specified in the input MR.
Sourced from Different Sources: no
Created for the dataset
Creation Process: Six professional translators translated the underlying dataset with the following instructions:
The translators did not have access to the meaning representation.
Data Validation: validated by data curator
Was Data Filtered? not filtered
none
Annotation Service? no
no
Justification for Using the Data: It was not explicitly stated, but we can safely assume that the translators agreed to this use of their data.
no PII
Justification for no PII: This dataset does not include any information about individuals.
no
no
yes
Details on how Dataset Addresses the Needs: The dataset may help improve NLG methods for morphologically rich languages beyond Czech.
yes
Links and Summaries of Analysis Work: To ensure consistency of translation, the data always uses formal/polite address for the user, and uses the female form for first-person self-references (as if the dialogue agent producing the sentences was female). This prevents data sparsity and ensures consistent results for systems trained on the dataset, but does not represent all potential situations arising in Czech.
open license - commercial use allowed
Copyright Restrictions on the Language Data: open license - commercial use allowed
The test set may lead users to over-estimate the performance of their NLG systems with respect to their generalisability, because there are no unseen restaurants or addresses in the test set. This is something we will look into for future editions of the GEM shared task.