Datasets: mdd
Sub-tasks: dialogue-modeling
Languages: en
Multilinguality: monolingual
Language creators: found
Annotations creators: no-annotation
Source datasets: original
ArXiv: arxiv:1511.06931
License: cc-by-3.0

The Movie Dialog dataset (MDD) is designed to measure how well models perform at goal-oriented and non-goal-oriented dialog centered on movies (question answering, recommendation, and discussion), drawn from various movie review sources such as MovieLens and OMDb.
The data is in English, as written by users of the OMDb and MovieLens websites.
An instance from the task3_qarecs config's train split:
{'dialogue_turns': {'speaker': [0, 1, 0, 1, 0, 1], 'utterance': ["I really like Jaws, Bottle Rocket, Saving Private Ryan, Tommy Boy, The Muppet Movie, Face/Off, and Cool Hand Luke. I'm looking for a Documentary movie.", 'Beyond the Mat', 'Who is that directed by?', 'Barry W. Blaustein', 'I like Jon Fauer movies more. Do you know anything else?', 'Cinematographer Style']}}
An instance from the task4_reddit config's cand_valid split:
{'dialogue_turns': {'speaker': [0], 'utterance': ['MORTAL KOMBAT !']}}
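The instances above store a dialog as two parallel lists, speaker and utterance. A minimal sketch of how they might be paired into ordered turns is shown below; the helper name `iter_turns` is hypothetical and not part of the dataset or any library, and the example dictionary is copied from the task3_qarecs instance above.

```python
from typing import Dict, List, Tuple


def iter_turns(instance: Dict) -> List[Tuple[int, str]]:
    """Pair each speaker ID with its utterance, preserving turn order.

    `iter_turns` is a hypothetical helper written for illustration only.
    """
    turns = instance["dialogue_turns"]
    speakers, utterances = turns["speaker"], turns["utterance"]
    if len(speakers) != len(utterances):
        raise ValueError("speaker and utterance lists must have equal length")
    return list(zip(speakers, utterances))


# The task3_qarecs training instance shown above.
example = {
    "dialogue_turns": {
        "speaker": [0, 1, 0, 1, 0, 1],
        "utterance": [
            "I really like Jaws, Bottle Rocket, Saving Private Ryan, Tommy Boy, "
            "The Muppet Movie, Face/Off, and Cool Hand Luke. "
            "I'm looking for a Documentary movie.",
            "Beyond the Mat",
            "Who is that directed by?",
            "Barry W. Blaustein",
            "I like Jon Fauer movies more. Do you know anything else?",
            "Cinematographer Style",
        ],
    }
}

for speaker, utterance in iter_turns(example):
    print(f"[{speaker}] {utterance}")
```

In this instance, speaker 0 is the user asking for recommendations and answers, and speaker 1 is the responding side, so the even-indexed turns form the model's input context and the odd-indexed turns are the targets.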
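For all configurations, each instance has a single dialogue_turns field holding two parallel lists: speaker (an integer ID per turn) and utterance (the text of each turn).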
The splits and corresponding sizes are:
config | train | test | validation | cand_valid | cand_test |
---|---|---|---|---|---|
task1_qa | 96185 | 9952 | 9968 | - | - |
task2_recs | 1000000 | 10000 | 10000 | - | - |
task3_qarecs | 952125 | 4915 | 5052 | - | - |
task4_reddit | 945198 | 10000 | 10000 | 10000 | 10000 |
The cand_valid and cand_test splits contain negative candidates for the task4_reddit configuration. Models rank the true response against these candidates, and hits@k (or another ranking metric) is reported (see the paper).
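As a rough illustration of the ranking evaluation described above, the sketch below computes hits@k for a single example. It is not the paper's evaluation code: the `overlap_score` function is a toy stand-in for a real model, the function names are hypothetical, and the tie-breaking rule (ties count in favor of the true response) is an assumption.

```python
from typing import Callable, Sequence


def overlap_score(context: str, candidate: str) -> float:
    """Toy scorer (hypothetical): number of distinct words shared with the context."""
    return float(len(set(context.lower().split()) & set(candidate.lower().split())))


def hits_at_k(
    score: Callable[[str, str], float],
    context: str,
    true_response: str,
    negatives: Sequence[str],
    k: int,
) -> bool:
    """Return True if the true response ranks within the top k candidates.

    Assumption: ties are resolved in favor of the true response, so only
    negatives with a strictly higher score push it down the ranking.
    """
    true_score = score(context, true_response)
    better = sum(1 for cand in negatives if score(context, cand) > true_score)
    return better < k


context = "did you see mortal kombat"
negatives = ["i like pizza", "kombat"]
print(hits_at_k(overlap_score, context, "mortal kombat was fun", negatives, k=1))
```

Averaging this boolean over all evaluation examples gives the hits@k rate. In practice the negatives would come from the cand_valid or cand_test split and the scorer would be the model under evaluation.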
The construction of the tasks depended on some existing datasets:
MovieLens. The data was downloaded from: http://grouplens.org/datasets/movielens/20m/ on May 27th, 2015.
OMDB. The data was downloaded from: http://beforethecode.com/projects/omdb/download.aspx on May 28th, 2015.
For task4_reddit, the data is a processed subset (movie subreddit only) of the data available at: https://www.reddit.com/r/datasets/comments/3bxlg7
Users of the MovieLens, OMDB, and Reddit websites, among others.
Who are the annotators? [More Information Needed]
Jesse Dodge and Andreea Gane and Xiang Zhang and Antoine Bordes and Sumit Chopra and Alexander Miller and Arthur Szlam and Jason Weston (at Facebook Research).
Creative Commons Attribution 3.0 License
@misc{dodge2016evaluating,
  title={Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems},
  author={Jesse Dodge and Andreea Gane and Xiang Zhang and Antoine Bordes and Sumit Chopra and Alexander Miller and Arthur Szlam and Jason Weston},
  year={2016},
  eprint={1511.06931},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Thanks to @gchhablani for adding this dataset.