Dataset: taskmaster2
Sub-tasks: dialogue-modeling
Languages: en
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: crowdsourced
Annotation creators: crowdsourced
Source datasets: original
ArXiv: arxiv:1909.05358
License: cc-by-4.0

Taskmaster is a dataset for goal-oriented conversations. The Taskmaster-2 dataset consists of 17,289 dialogs in seven domains: restaurants, food ordering, movies, hotels, flights, music, and sports. Unlike Taskmaster-1, which includes both written "self-dialogs" and spoken two-person dialogs, Taskmaster-2 consists entirely of spoken two-person dialogs. In addition, while Taskmaster-1 is almost exclusively task-based, Taskmaster-2 contains a good number of search- and recommendation-oriented dialogs. All dialogs in this release were created using a Wizard of Oz (WOz) methodology in which crowdsourced workers played the role of a "user" and trained call center operators played the role of the "assistant". Users were thus led to believe they were interacting with an automated system that "spoke" using text-to-speech (TTS), even though a human was in fact behind the scenes. As a result, users could express themselves however they chose in the context of an automated interface.
[More Information Needed]
The dataset is in English.
A typical example looks like this:
{
  "conversation_id": "dlg-0047a087-6a3c-4f27-b0e6-268f53a2e013",
  "instruction_id": "flight-6",
  "utterances": [
    {
      "index": 0,
      "segments": [],
      "speaker": "USER",
      "text": "Hi, I'm looking for a flight. I need to visit a friend."
    },
    {
      "index": 1,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "Hello, how can I help you?"
    },
    {
      "index": 2,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "Sure, I can help you with that."
    },
    {
      "index": 3,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "On what dates?"
    },
    {
      "index": 4,
      "segments": [
        {
          "annotations": [
            { "name": "flight_search.date.depart_origin" }
          ],
          "end_index": 37,
          "start_index": 27,
          "text": "March 20th"
        },
        {
          "annotations": [
            { "name": "flight_search.date.return" }
          ],
          "end_index": 45,
          "start_index": 41,
          "text": "22nd"
        }
      ],
      "speaker": "USER",
      "text": "I'm looking to travel from March 20th to 22nd."
    }
  ]
}
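Each conversation in the data file has the following structure:
- conversation_id: a unique identifier for the conversation, prefixed with "dlg-" in the example above.
- instruction_id: an identifier for the instructions given for the conversation (for example, "flight-6").
- utterances: the list of utterances that make up the conversation.

Each utterance has the following fields:
- index: the position of the utterance in the conversation, starting at 0.
- speaker: either USER or ASSISTANT.
- text: the text of the utterance.
- segments: a list of annotated text spans, empty when the utterance has no annotations.

Each segment has the following fields:
- start_index: the character position in the utterance text where the annotated span starts.
- end_index: the character position in the utterance text where the annotated span ends.
- text: the text of the annotated span.
- annotations: a list of annotations for the span.

Each annotation has a single field:
- name: the annotation name (for example, "flight_search.date.return").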
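
As a minimal sketch of how this nesting can be traversed, the Python snippet below walks a trimmed copy of the example conversation and collects every annotated span. The function name `extract_slots` is hypothetical and not part of the dataset or any library. The slicing assumes that `start_index` and `end_index` behave like a half-open Python range, which holds for the example shown here but is not stated by the source for the dataset as a whole.

```python
from typing import Any, Dict, List, Tuple

# Trimmed copy of the example conversation (only utterances 3 and 4 are kept).
conversation: Dict[str, Any] = {
    "conversation_id": "dlg-0047a087-6a3c-4f27-b0e6-268f53a2e013",
    "instruction_id": "flight-6",
    "utterances": [
        {"index": 3, "segments": [], "speaker": "ASSISTANT", "text": "On what dates?"},
        {
            "index": 4,
            "speaker": "USER",
            "text": "I'm looking to travel from March 20th to 22nd.",
            "segments": [
                {
                    "annotations": [{"name": "flight_search.date.depart_origin"}],
                    "start_index": 27,
                    "end_index": 37,
                    "text": "March 20th",
                },
                {
                    "annotations": [{"name": "flight_search.date.return"}],
                    "start_index": 41,
                    "end_index": 45,
                    "text": "22nd",
                },
            ],
        },
    ],
}


def extract_slots(conv: Dict[str, Any]) -> List[Tuple[int, str, str]]:
    """Return (utterance index, annotation name, annotated text) triples.

    Hypothetical helper for illustration only.
    """
    slots: List[Tuple[int, str, str]] = []
    for utterance in conv["utterances"]:
        for segment in utterance["segments"]:
            # Assumption: the indices are character offsets forming a
            # half-open range [start_index, end_index) into the utterance text.
            span = utterance["text"][segment["start_index"]:segment["end_index"]]
            for annotation in segment["annotations"]:
                slots.append((utterance["index"], annotation["name"], span))
    return slots


slots = extract_slots(conversation)
for row in slots:
    print(row)
```

For this example, the sliced span matches each segment's own `text` field, so either can be used to recover the annotated value.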
There are no default splits for any of the configs. The table below lists the number of examples in each config.
Config | Train |
---|---|
flights | 2481 |
food-orderings | 1050 |
hotels | 2355 |
movies | 3047 |
music | 1602 |
restaurant-search | 3276 |
sports | 3478 |
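
As a quick consistency check, the per-config counts can be summed and compared with the 17,289 dialogs reported in the dataset description. The sketch below simply copies the numbers from the table; the variable names are illustrative.

```python
# Number of examples per config, copied from the table above.
config_sizes = {
    "flights": 2481,
    "food-orderings": 1050,
    "hotels": 2355,
    "movies": 3047,
    "music": 1602,
    "restaurant-search": 3276,
    "sports": 3478,
}

total = sum(config_sizes.values())
print(total)  # 17289
```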
[More Information Needed]
[More Information Needed]
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
The dataset is licensed under the Creative Commons Attribution 4.0 License.
[More Information Needed]
@inproceedings{48484,
  title = {Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset},
  author = {Bill Byrne and Karthik Krishnamoorthi and Chinnadhurai Sankar and Arvind Neelakantan and Daniel Duckworth and Semih Yavuz and Ben Goodrich and Amit Dubey and Kyu-Young Kim and Andy Cedilnik},
  year = {2019}
}
Thanks to @patil-suraj for adding this dataset.