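Dataset: taskmaster2
Sub-tasks: dialogue-modeling
Language: English
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: crowdsourced
Annotations creators: crowdsourced
Source datasets: original
ArXiv: arxiv:1909.05358
License: Creative Commons Attribution 4.0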
Taskmaster is a dataset for goal-oriented conversations. The Taskmaster-2 dataset consists of 17,289 dialogs in seven domains: restaurants, food ordering, movies, hotels, flights, music, and sports. Unlike Taskmaster-1, which includes both written "self-dialogs" and spoken two-person dialogs, Taskmaster-2 consists entirely of spoken two-person dialogs. In addition, while Taskmaster-1 is almost exclusively task-based, Taskmaster-2 contains a good number of search- and recommendation-oriented dialogs. All dialogs in this release were created using a Wizard of Oz (WOz) methodology in which crowdsourced workers played the role of a 'user' and trained call center operators played the role of the 'assistant'. Users were thus led to believe they were interacting with an automated system that “spoke” using text-to-speech (TTS), even though a human was in fact behind the scenes. As a result, users could express themselves however they chose in the context of an automated interface.
The dataset is in English.
A typical example looks like this:
{
  "conversation_id": "dlg-0047a087-6a3c-4f27-b0e6-268f53a2e013",
  "instruction_id": "flight-6",
  "utterances": [
    {
      "index": 0,
      "segments": [],
      "speaker": "USER",
      "text": "Hi, I'm looking for a flight. I need to visit a friend."
    },
    {
      "index": 1,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "Hello, how can I help you?"
    },
    {
      "index": 2,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "Sure, I can help you with that."
    },
    {
      "index": 3,
      "segments": [],
      "speaker": "ASSISTANT",
      "text": "On what dates?"
    },
    {
      "index": 4,
      "segments": [
        {
          "annotations": [
            {
              "name": "flight_search.date.depart_origin"
            }
          ],
          "end_index": 37,
          "start_index": 27,
          "text": "March 20th"
        },
        {
          "annotations": [
            {
              "name": "flight_search.date.return"
            }
          ],
          "end_index": 45,
          "start_index": 41,
          "text": "22nd"
        }
      ],
      "speaker": "USER",
      "text": "I'm looking to travel from March 20th to 22nd."
    }
  ]
}
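Each conversation in the data file has the following fields: "conversation_id", "instruction_id", and "utterances" (a list of utterances).
Each utterance has the following fields: "index", "speaker" (USER or ASSISTANT in the example above), "text", and "segments" (a list of annotated spans, possibly empty).
Each segment has the following fields: "start_index" and "end_index" (character offsets into the utterance text; in the example above the end offset is exclusive), "text" (the annotated span), and "annotations" (a list of annotations).
Each annotation has a single field: "name".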
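The nesting of conversations, utterances, segments, and annotations can be walked with a few lines of Python. The following is a minimal sketch, assuming a conversation is available as a plain Python dict shaped like the example above (for instance after parsing the raw JSON); the helper name `iter_slots` is hypothetical and not part of any library. It also checks that slicing the utterance text with the segment offsets reproduces the stored segment text, which holds for the example shown here.

```python
from typing import Iterator, Tuple

# Trimmed copy of the example conversation (only two utterances kept).
conversation = {
    "conversation_id": "dlg-0047a087-6a3c-4f27-b0e6-268f53a2e013",
    "instruction_id": "flight-6",
    "utterances": [
        {"index": 3, "segments": [], "speaker": "ASSISTANT",
         "text": "On what dates?"},
        {
            "index": 4,
            "speaker": "USER",
            "text": "I'm looking to travel from March 20th to 22nd.",
            "segments": [
                {"annotations": [{"name": "flight_search.date.depart_origin"}],
                 "start_index": 27, "end_index": 37, "text": "March 20th"},
                {"annotations": [{"name": "flight_search.date.return"}],
                 "start_index": 41, "end_index": 45, "text": "22nd"},
            ],
        },
    ],
}


def iter_slots(conv: dict) -> Iterator[Tuple[int, str, str]]:
    """Yield (utterance index, annotation name, span text) per annotation.

    Hypothetical helper for illustration only.
    """
    for utt in conv["utterances"]:
        for seg in utt["segments"]:
            # Offsets are character positions; the end offset is exclusive here.
            span = utt["text"][seg["start_index"]:seg["end_index"]]
            # Sanity check: the slice matches the stored segment text.
            assert span == seg["text"]
            for ann in seg["annotations"]:
                yield utt["index"], ann["name"], span


slots = list(iter_slots(conversation))
for idx, name, value in slots:
    print(idx, name, value)
```

Utterances with an empty "segments" list simply contribute nothing, so the same loop works for unannotated turns.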
There are no default splits for any config; each config has a single train split. The table below lists the number of examples in each config.
| Config | Train |
|---|---|
| flights | 2481 |
| food-orderings | 1050 |
| hotels | 2355 |
| movies | 3047 |
| music | 1602 |
| restaurant-search | 3276 |
| sports | 3478 |
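As a quick consistency check, the per-config counts in the table can be summed and compared with the 17,289 dialogs quoted in the description. The sketch below simply hard-codes the table values; the variable names are illustrative.

```python
# Per-config example counts, copied from the table above.
config_sizes = {
    "flights": 2481,
    "food-orderings": 1050,
    "hotels": 2355,
    "movies": 3047,
    "music": 1602,
    "restaurant-search": 3276,
    "sports": 3478,
}

# Total number of dialogs across all seven configs.
total = sum(config_sizes.values())

# Config with the most examples.
largest = max(config_sizes, key=config_sizes.get)

print(total)    # 17289
print(largest)  # sports
```

The total matches the figure given in the dataset description, so the seven configs together account for the full release.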
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
The dataset is licensed under the Creative Commons Attribution 4.0 License.
@inproceedings{48484,
title = {Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset},
author = {Bill Byrne and Karthik Krishnamoorthi and Chinnadhurai Sankar and Arvind Neelakantan and Daniel Duckworth and Semih Yavuz and Ben Goodrich and Amit Dubey and Kyu-Young Kim and Andy Cedilnik},
year = {2019}
}
Thanks to @patil-suraj for adding this dataset.