数据集:

mdd

任务:

文本生成

填充掩码

子任务:

dialogue-modeling

语言:

计算机处理:

monolingual

大小:

100K<n<1M 1M<n<10M

语言创建人:

found

批注创建人:

no-annotation

源数据集:

original

预印本库:

arxiv:1511.06931

许可:

cc-by-3.0

数据集介绍文件清单

中文

Dataset Card for MDD

Dataset Summary

The Movie Dialog dataset (MDD) is designed to measure how well models can perform at goal and non-goal orientated dialog centered around the topic of movies (question answering, recommendation and discussion), from various movie reviews sources such as MovieLens and OMDb.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The data is present in English language as written by users on OMDb and MovieLens websites.

Dataset Structure

Data Instances

An instance from the task3_qarecs config's train split:

{'dialogue_turns': {'speaker': [0, 1, 0, 1, 0, 1], 'utterance': ["I really like Jaws, Bottle Rocket, Saving Private Ryan, Tommy Boy, The Muppet Movie, Face/Off, and Cool Hand Luke. I'm looking for a Documentary movie.", 'Beyond the Mat', 'Who is that directed by?', 'Barry W. Blaustein', 'I like Jon Fauer movies more. Do you know anything else?', 'Cinematographer Style']}}

An instance from the task4_reddit config's cand-valid split:

{'dialogue_turns': {'speaker': [0], 'utterance': ['MORTAL KOMBAT !']}}

Data Fields

For all configurations:

dialogue_turns : a dictionary feature containing:
- speaker : an integer with possible values including 0 , 1 , indicating which speaker wrote the utterance.
- utterance : a string feature containing the text utterance.

Data Splits

The splits and corresponding sizes are:

config	train	test	validation	cand_valid	cand_test
task1_qa	96185	9952	9968	-	-
task2_recs	1000000	10000	10000	-	-
task3_qarecs	952125	4915	5052	-	-
task4_reddit	945198	10000	10000	10000	10000

The cand_valid and cand_test are negative candidates for the task4_reddit configuration which is used in ranking true positive against these candidates and hits@k (or another ranking metric) is reported. (See paper)

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

The construction of the tasks depended on some existing datasets:

MovieLens. The data was downloaded from: http://grouplens.org/datasets/movielens/20m/ on May 27th, 2015.

OMDB. The data was downloaded from: http://beforethecode.com/projects/omdb/download.aspx on May 28th, 2015.

For task4_reddit , the data is a processed subset (movie subreddit only) of the data available at: https://www.reddit.com/r/datasets/comments/3bxlg7

Who are the source language producers?

Users on MovieLens, OMDB website and reddit websites, among others.

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Jesse Dodge and Andreea Gane and Xiang Zhang and Antoine Bordes and Sumit Chopra and Alexander Miller and Arthur Szlam and Jason Weston (at Facebook Research).

Licensing Information

Creative Commons Attribution 3.0 License

Citation Information

@misc{dodge2016evaluating,
      title={Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems}, 
      author={Jesse Dodge and Andreea Gane and Xiang Zhang and Antoine Bordes and Sumit Chopra and Alexander Miller and Arthur Szlam and Jason Weston},
      year={2016},
      eprint={1511.06931},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @gchhablani for adding this dataset.

作者:

佚名

数据集大小:

25.43 KB