Dataset:
schema_guided_dstc8
The Schema-Guided Dialogue dataset (SGD) was developed for the Dialogue State Tracking task of the Eighth Dialogue Systems Technology Challenge (DSTC8). The SGD dataset consists of over 18k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 17 domains, ranging from banks and events to media, calendar, travel, and weather. For most of these domains, the SGD dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios.
This dataset is designed to serve as an effective test-bed for intent prediction, slot filling, state tracking (i.e., estimating the user's goal) and language generation, among other tasks for large-scale virtual assistants.
The text in the dataset is in English (en).
Each dialog instance has the following fields:
The mapping from action ID to action name is the following:
0: AFFIRM 1: AFFIRM_INTENT 2: CONFIRM 3: GOODBYE 4: INFORM 5: INFORM_COUNT 6: INFORM_INTENT 7: NEGATE 8: NEGATE_INTENT 9: NOTIFY_FAILURE 10: NOTIFY_SUCCESS 11: OFFER 12: OFFER_INTENT 13: REQUEST 14: REQUEST_ALTS 15: REQ_MORE 16: SELECT 17: THANK_YOU
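The mapping above can be written as a small lookup table for decoding action IDs. This is a minimal sketch: the names `ACTION_NAMES`, `ID_TO_ACTION`, `ACTION_TO_ID`, and `decode_actions` are illustrative and not part of the dataset or any loader.

```python
# Action names in ID order, copied from the mapping above (0 = AFFIRM ... 17 = THANK_YOU).
ACTION_NAMES = [
    "AFFIRM", "AFFIRM_INTENT", "CONFIRM", "GOODBYE", "INFORM", "INFORM_COUNT",
    "INFORM_INTENT", "NEGATE", "NEGATE_INTENT", "NOTIFY_FAILURE", "NOTIFY_SUCCESS",
    "OFFER", "OFFER_INTENT", "REQUEST", "REQUEST_ALTS", "REQ_MORE", "SELECT",
    "THANK_YOU",
]

# Hypothetical helper tables for converting in both directions.
ID_TO_ACTION = dict(enumerate(ACTION_NAMES))
ACTION_TO_ID = {name: idx for idx, name in ID_TO_ACTION.items()}


def decode_actions(action_ids):
    """Translate a sequence of integer action IDs into their names."""
    return [ID_TO_ACTION[i] for i in action_ids]
```

For example, `decode_actions([13, 4])` looks up IDs 13 and 4 and returns their names in the same order.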
The dataset is split into train, validation, and test splits with the following sizes:
| | train | validation | test |
|---|---|---|---|
| Number of dialogues | 16142 | 2482 | 4201 |
| Number of turns | 48426 | 7446 | 12603 |
The data was collected by first using a dialogue simulator to generate dialogue outlines and then paraphrasing them to obtain natural utterances. Using a dialogue simulator ensures coverage of a large variety of dialogue flows, because similar flows are filtered out in the simulation phase to create a diverse dataset. It also means dialogues are generated together with their annotations, as opposed to a Wizard-of-Oz setup, which is prone to manual annotation errors.
The dialogue outlines are generated by a simulator that interacts with the services. It consists of two agents playing the roles of the user and the system, interacting with each other using a finite set of actions specified through dialogue acts over a probabilistic automaton designed to capture varied dialogue trajectories. It is worth noting that the simulation automaton does not include any domain-specific constraints: all domain-specific constraints are encoded in the schema and scenario.
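To make the idea of sampling outlines from a probabilistic automaton concrete, here is a toy sketch. The transition table and its probabilities are invented for illustration and are not the ones used to build SGD; the sketch also ignores the split between user and system agents and treats the outline as a single sequence of dialogue acts.

```python
import random

# Hypothetical automaton: for each dialogue act, the acts that may follow it
# and their probabilities. Acts without an entry (e.g. GOODBYE) are terminal.
TRANSITIONS = {
    "INFORM_INTENT": [("REQUEST", 0.7), ("OFFER", 0.3)],
    "REQUEST": [("INFORM", 1.0)],
    "INFORM": [("OFFER", 0.6), ("REQUEST", 0.4)],
    "OFFER": [("SELECT", 0.5), ("REQUEST_ALTS", 0.3), ("NEGATE", 0.2)],
    "REQUEST_ALTS": [("OFFER", 1.0)],
    "NEGATE": [("GOODBYE", 1.0)],
    "SELECT": [("GOODBYE", 1.0)],
}


def sample_outline(start="INFORM_INTENT", max_steps=20, seed=None):
    """Random-walk the automaton until a terminal act or max_steps is reached."""
    rng = random.Random(seed)
    outline = [start]
    while outline[-1] in TRANSITIONS and len(outline) < max_steps:
        next_acts, weights = zip(*TRANSITIONS[outline[-1]])
        outline.append(rng.choices(next_acts, weights=weights, k=1)[0])
    return outline
```

Note that the table contains only act names and no domain knowledge, which mirrors the point above: in this sketch, anything domain-specific would have to come from a separate schema or scenario.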
The dialogue paraphrasing framework then converts the outlines generated by the simulator into a natural conversation. Users may refer to the slot values in the dialogue acts in various ways during the conversation, e.g., “los angeles” may be referred to as “LA” or “LAX”. To introduce these natural variations, each slot value is replaced with a randomly selected variation, which is kept consistent across user turns in a dialogue. The actions are then converted to pseudo-natural language utterances using a set of manually defined action-to-text templates, and the resulting utterances for the different actions in a turn are concatenated together.
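The two steps described above, choosing a consistent variation per slot value and rendering actions through templates, can be sketched as follows. The variation list for "los angeles" comes from the example in the text; the templates, function names, and the `(act, slot, value)` tuple layout are assumptions made for this illustration and are not the templates used to build the dataset.

```python
import random

# Surface variations per canonical slot value (example taken from the text).
SLOT_VARIATIONS = {"los angeles": ["los angeles", "LA", "LAX"]}

# Hypothetical action-to-text templates, one per dialogue act.
TEMPLATES = {
    "INFORM": "{slot} is {value}.",
    "REQUEST": "What is the {slot}?",
}


def choose_variations(values, rng):
    """Pick one variation per canonical value; reuse the result for the whole dialogue."""
    return {v: rng.choice(SLOT_VARIATIONS.get(v, [v])) for v in values}


def render_turn(actions, chosen):
    """Render each (act, slot, value) action with its template and concatenate the results."""
    parts = []
    for act, slot, value in actions:
        surface = chosen.get(value, value) if value is not None else ""
        parts.append(TEMPLATES[act].format(slot=slot, value=surface))
    return " ".join(parts)
```

Calling `choose_variations` once per dialogue and passing the same `chosen` dictionary to every `render_turn` call is what keeps a value such as "LAX" consistent across turns.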
Finally, the dialogue transformed by these steps is sent to the crowd workers to be reformulated into more natural language. One crowd worker is tasked with paraphrasing all utterances of a dialogue to ensure naturalness and coherence. The crowd workers are asked to exactly repeat the slot values in their paraphrases so that the span indices for the slots can be recovered via string matching.
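Recovering a span by string matching can be sketched in a few lines. This assumes an exact, case-sensitive match of the first occurrence and an exclusive end index; the function name and the index convention are illustrative choices, not taken from the dataset tooling.

```python
def find_slot_span(utterance, slot_value):
    """Return (start, exclusive_end) of the first exact match, or None if absent."""
    start = utterance.find(slot_value)
    if start == -1:
        return None
    return start, start + len(slot_value)


utterance = "I want to fly to LAX next Friday."
span = find_slot_span(utterance, "LAX")
# Slicing the utterance with the recovered span gives back the slot value.
recovered = utterance[span[0]:span[1]]
```

This also shows why the exact-repetition instruction matters: if a paraphrase altered the slot value, `find_slot_span` would return `None` and the span could not be recovered.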
Who are the source language producers?
The language structure is machine-generated, and the language realizations are produced by crowd workers. The dataset paper does not provide demographic information for the crowd workers.
The annotations are automatically obtained during the initial sampling process and by string matching after reformulation.
Who are the annotators?
[N/A]
The dataset was created by a team of researchers working at Google Mountain View.
The dataset is released under the CC BY-SA 4.0 license.
For the DSTC8 task, please cite:
@article{corr/abs-2002-01359,
  author = {Abhinav Rastogi and Xiaoxue Zang and Srinivas Sunkara and Raghav Gupta and Pranav Khaitan},
  title = {Schema-Guided Dialogue State Tracking Task at {DSTC8}},
  journal = {CoRR},
  volume = {abs/2002.01359},
  year = {2020},
  url = {https://arxiv.org/abs/2002.01359},
  archivePrefix = {arXiv},
  eprint = {2002.01359}
}
For the initial release paper, please cite:
@inproceedings{aaai/RastogiZSGK20,
  author = {Abhinav Rastogi and Xiaoxue Zang and Srinivas Sunkara and Raghav Gupta and Pranav Khaitan},
  title = {Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset},
  booktitle = {The Thirty-Fourth {AAAI} Conference on Artificial Intelligence, {AAAI} 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, {IAAI} 2020, The Tenth {AAAI} Symposium on Educational Advances in Artificial Intelligence, {EAAI} 2020, New York, NY, USA, February 7-12, 2020},
  pages = {8689--8696},
  publisher = {{AAAI} Press},
  year = {2020},
  url = {https://aaai.org/ojs/index.php/AAAI/article/view/6394}
}
Thanks to @yjernite for adding this dataset.