Dataset:

nlu_evaluation_data

Tasks:

text classification

Sub-tasks:

intent-classification multi-class-classification

Languages:

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

expert-generated

Annotation creators:

expert-generated

Source datasets:

original

Preprint:

arxiv:1903.05566

License:

cc-by-4.0

Dataset Card for NLU Evaluation Data

Dataset Summary

A dataset of short utterances from the conversational domain, annotated with their corresponding intents and scenarios.

It has 25 715 non-zero examples (the original dataset has 25 716 examples) belonging to 18 scenarios and 68 intents. Originally, the dataset was crowd-sourced and annotated with both intents and named entities in order to evaluate commercial NLU systems such as RASA, IBM's Watson, Microsoft's LUIS and Google's Dialogflow. This version of the dataset includes only the intent annotations.

Contrary to the claims in the paper, the released data contains 68 unique intents. This is because the NLU systems were evaluated on a more curated part of the dataset, which included only the 64 most important intents. Read more in the GitHub issue.

Supported Tasks and Leaderboards

Intent classification, intent detection

Languages

English

Dataset Structure

Data Instances

An example of 'train' looks as follows:

{
  'label': 2, # integer label corresponding to "alarm_set" intent
  'scenario': 'alarm', 
  'text': 'wake me up at five am this week'
}

Data Fields

text : a string feature containing the utterance.
label : a classification label (an integer from 0 to 67) identifying one of the 68 unique intents.
scenario : a string naming one of the 18 unique scenarios.
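
The field descriptions above can be checked mechanically. The following is a minimal Python sketch, not part of the original card: the function name `validate_example` and the constant `NUM_INTENTS` are hypothetical, and the checks assume only what the field list states (a string `text`, an integer `label` between 0 and 67, and a string `scenario`).

```python
NUM_INTENTS = 68  # labels run from 0 to 67, per the field description


def validate_example(example):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if not isinstance(example.get("text"), str):
        problems.append("text must be a string")
    label = example.get("label")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(label, bool) or not isinstance(label, int):
        problems.append("label must be an integer")
    elif not 0 <= label < NUM_INTENTS:
        problems.append("label must be in the range 0-67")
    if not isinstance(example.get("scenario"), str):
        problems.append("scenario must be a string")
    return problems


# The record shown under "Data Instances".
example = {"label": 2, "scenario": "alarm", "text": "wake me up at five am this week"}
print(validate_example(example))
# A hypothetical record with an out-of-range label.
print(validate_example({**example, "label": 68}))
```

The sketch does not check that `scenario` is one of the 18 known values, because the card does not list them explicitly.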

Intent names are mapped to labels as follows:

label	intent
0	alarm_query
1	alarm_remove
2	alarm_set
3	audio_volume_down
4	audio_volume_mute
5	audio_volume_other
6	audio_volume_up
7	calendar_query
8	calendar_remove
9	calendar_set
10	cooking_query
11	cooking_recipe
12	datetime_convert
13	datetime_query
14	email_addcontact
15	email_query
16	email_querycontact
17	email_sendemail
18	general_affirm
19	general_commandstop
20	general_confirm
21	general_dontcare
22	general_explain
23	general_greet
24	general_joke
25	general_negate
26	general_praise
27	general_quirky
28	general_repeat
29	iot_cleaning
30	iot_coffee
31	iot_hue_lightchange
32	iot_hue_lightdim
33	iot_hue_lightoff
34	iot_hue_lighton
35	iot_hue_lightup
36	iot_wemo_off
37	iot_wemo_on
38	lists_createoradd
39	lists_query
40	lists_remove
41	music_dislikeness
42	music_likeness
43	music_query
44	music_settings
45	news_query
46	play_audiobook
47	play_game
48	play_music
49	play_podcasts
50	play_radio
51	qa_currency
52	qa_definition
53	qa_factoid
54	qa_maths
55	qa_stock
56	recommendation_events
57	recommendation_locations
58	recommendation_movies
59	social_post
60	social_query
61	takeaway_order
62	takeaway_query
63	transport_query
64	transport_taxi
65	transport_ticket
66	transport_traffic
67	weather_query
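
The table above can be turned into a lookup in code. The following is a minimal Python sketch, not part of the original card: the names `INTENTS`, `id2label`, `label2id`, and `scenario_prefix` are illustrative. It also assumes that the part of an intent name before the first underscore acts as a scenario-like prefix. That is consistent with the 'alarm' scenario of the `alarm_set` example and with the count of 18 scenarios, but the card does not state it explicitly.

```python
# Intent names in label order (list index == integer label),
# copied from the table in this card.
INTENTS = """
alarm_query alarm_remove alarm_set
audio_volume_down audio_volume_mute audio_volume_other audio_volume_up
calendar_query calendar_remove calendar_set
cooking_query cooking_recipe
datetime_convert datetime_query
email_addcontact email_query email_querycontact email_sendemail
general_affirm general_commandstop general_confirm general_dontcare
general_explain general_greet general_joke general_negate general_praise
general_quirky general_repeat
iot_cleaning iot_coffee iot_hue_lightchange iot_hue_lightdim iot_hue_lightoff
iot_hue_lighton iot_hue_lightup iot_wemo_off iot_wemo_on
lists_createoradd lists_query lists_remove
music_dislikeness music_likeness music_query music_settings
news_query
play_audiobook play_game play_music play_podcasts play_radio
qa_currency qa_definition qa_factoid qa_maths qa_stock
recommendation_events recommendation_locations recommendation_movies
social_post social_query
takeaway_order takeaway_query
transport_query transport_taxi transport_ticket transport_traffic
weather_query
""".split()

# Two-way mapping between integer labels and intent names.
id2label = dict(enumerate(INTENTS))
label2id = {name: idx for idx, name in id2label.items()}


def scenario_prefix(intent):
    """Return the text before the first underscore (assumed scenario prefix)."""
    return intent.split("_", 1)[0]


prefixes = sorted({scenario_prefix(name) for name in INTENTS})

print(id2label[2])                # alarm_set
print(label2id["weather_query"])  # 67
print(len(INTENTS), len(prefixes))  # 68 18
```

Keeping the names in a single ordered list means the integer label is simply the list index, so the two dictionaries cannot drift out of sync with each other.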

Data Splits

Dataset statistics	Train
Number of examples	25 715
Average character length	34.32
Number of intents	68
Number of scenarios	18
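
Statistics like those in the table above can be recomputed from the records. The following is a minimal Python sketch under stated assumptions: the helper name `dataset_stats` is hypothetical, "average character length" is assumed to mean the mean length of the `text` field, and only the first toy record comes from this card (the other two are made up for illustration).

```python
from statistics import mean


def dataset_stats(records):
    """Summarize a list of {'text', 'label', 'scenario'} dicts."""
    return {
        "num_examples": len(records),
        "avg_char_length": round(mean(len(r["text"]) for r in records), 2),
        "num_intents": len({r["label"] for r in records}),
        "num_scenarios": len({r["scenario"] for r in records}),
    }


records = [
    # Taken from the "Data Instances" example in this card.
    {"text": "wake me up at five am this week", "label": 2, "scenario": "alarm"},
    # Hypothetical records, for illustration only.
    {"text": "set an alarm", "label": 2, "scenario": "alarm"},
    {"text": "what is the weather", "label": 67, "scenario": "weather"},
]

stats = dataset_stats(records)
print(stats)
```

Run on the full train split, the same function would be expected to report figures comparable to the table, provided the length assumption above matches how the card's numbers were produced.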

Dataset Creation

Curation Rationale

The dataset was prepared for a wide-coverage evaluation and comparison of some of the most popular NLU services. At that time, previous benchmarks covered few intents and spanned a limited number of domains. This dataset contains 68 intents from 18 scenarios, which is much larger than any previous evaluation. For more discussion, see the paper.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

To build the NLU component we collected real user data via Amazon Mechanical Turk (AMT). We designed tasks where the Turker’s goal was to answer questions about how people would interact with the home robot, in a wide range of scenarios designed in advance, namely: alarm, audio, audiobook, calendar, cooking, datetime, email, game, general, IoT, lists, music, news, podcasts, general Q&A, radio, recommendations, social, food takeaway, transport, and weather. The questions put to Turkers were designed to capture the different requests within each given scenario. In the ‘calendar’ scenario, for example, these pre-designed intents were included: ‘set event’, ‘delete event’ and ‘query event’. An example question for intent ‘set event’ is: “How would you ask your PDA to schedule a meeting with someone?” for which a user’s answer example was “Schedule a chat with Adam on Thursday afternoon”. The Turkers would then type in their answers to these questions and select possible entities from the pre-designed suggested entities list for each of their answers. The Turkers didn’t always follow the instructions fully, e.g. for the specified ‘delete event’ Intent, an answer was: “PDA what is my next event?”, which clearly belongs to the ‘query event’ Intent. We have manually corrected all such errors either during post-processing or the subsequent annotations.

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

The purpose of this dataset is to help develop better intent detection systems.

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Creative Commons Attribution 4.0 International License (CC BY 4.0)

Citation Information

@InProceedings{XLiu.etal:IWSDS2019,
  author    = {Xingkun Liu and Arash Eshghi and Pawel Swietojanski and Verena Rieser},
  title     = {Benchmarking Natural Language Understanding Services for building Conversational Agents},
  booktitle = {Proceedings of the Tenth International Workshop on Spoken Dialogue Systems Technology (IWSDS)},
  month     = {April},
  year      = {2019},
  address   = {Ortigia, Siracusa (SR), Italy},
  publisher = {Springer},
  pages     = {xxx--xxx},
  url       = {http://www.xx.xx/xx/}
}

Contributions

Thanks to @dkajtoch for adding this dataset.

Author:

Anonymous

Dataset size:

24.44 KB