Dataset: crd3
Subtasks: dialogue-modeling
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: crowdsourced
Annotation creators: no-annotation
Source datasets: original
License: cc-by-sa-4.0

Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset. Critical Role is an unscripted, live-streamed show where a fixed group of people play Dungeons and Dragons, an open-ended role-playing game. The dataset is collected from 159 Critical Role episodes transcribed to text dialogues, consisting of 398,682 turns. It also includes corresponding abstractive summaries collected from the Fandom wiki. The dataset is linguistically unique in that the narratives are generated entirely through player collaboration and spoken interaction. Each dialogue has a large number of turns, multiple abstractive summaries with varying levels of detail, and semantic ties to the previous dialogues.
summarization: The dataset can be used to train a model for abstractive summarization. The baseline presented with the dataset, a fast abstractive summarization-RL model, achieves a ROUGE-L F1 score of 25.18.
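For readers unfamiliar with the metric, the following is a minimal sketch of how a ROUGE-L F1 score can be computed from the longest common subsequence (LCS) of two token sequences. It assumes lowercase whitespace tokenization and an unweighted F1 (beta = 1), and it returns a value between 0 and 1; the reported baseline figure is presumably on a 0 to 100 scale. The function names `lcs_length` and `rouge_l_f1` are illustrative, and this is not the evaluation script used for the baseline.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Sentence-level ROUGE-L F1 (assumption: whitespace tokens, beta = 1)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up sentences, purely for illustration.
    print(rouge_l_f1("the party enters the cave", "the party explores the cave"))
```

Published scores are usually produced with an established ROUGE implementation that applies its own tokenization and stemming, so numbers from this sketch are not directly comparable to the baseline.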
The text in the dataset is in English, as spoken by the actors on the Critical Role show, a weekly, unscripted live stream in which a fixed group of people play Dungeons and Dragons, a popular role-playing game.
An example of 'train' looks as follows.
{ "alignment_score": 3.679936647415161, "chunk": "Wish them a Happy Birthday on their Facebook and Twitter pages! Also, as a reminder: D&D Beyond streams their weekly show (\"And Beyond\") every Wednesday on twitch.tv/dndbeyond.", "chunk_id": 1, "turn_end": 6, "turn_num": 4, "turn_start": 4, "turns": { "names": ["SAM"], "utterances": ["Yesterday, guys, was D&D Beyond's first one--", "first one-year anniversary. Take two. Hey guys,", "yesterday was D&D Beyond's one-year anniversary.", "Wish them a happy birthday on their Facebook and", "Twitter pages."] } }
The data fields are the same among all splits.
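As a rough illustration of how one record might be turned into a source/target pair for summarization, the sketch below works directly on the 'train' example shown above. It assumes that `chunk` holds the summary text and that `turns` holds the aligned dialogue as parallel lists of `names` and `utterances`, exactly as in that example; the helpers `flatten_turns` and `to_summarization_pair` are hypothetical and not part of any library.

```python
# Record copied from the 'train' example in this card.
record = {
    "alignment_score": 3.679936647415161,
    "chunk": (
        "Wish them a Happy Birthday on their Facebook and Twitter pages! "
        "Also, as a reminder: D&D Beyond streams their weekly show "
        "(\"And Beyond\") every Wednesday on twitch.tv/dndbeyond."
    ),
    "chunk_id": 1,
    "turn_end": 6,
    "turn_num": 4,
    "turn_start": 4,
    "turns": {
        "names": ["SAM"],
        "utterances": [
            "Yesterday, guys, was D&D Beyond's first one--",
            "first one-year anniversary. Take two. Hey guys,",
            "yesterday was D&D Beyond's one-year anniversary.",
            "Wish them a happy birthday on their Facebook and",
            "Twitter pages.",
        ],
    },
}


def flatten_turns(rec):
    """Join the utterance fragments into one speaker-prefixed string (hypothetical helper)."""
    speakers = ", ".join(rec["turns"]["names"])
    text = " ".join(rec["turns"]["utterances"])
    return f"{speakers}: {text}"


def to_summarization_pair(rec):
    """Pair the flattened dialogue (source) with its summary chunk (target)."""
    return {"source": flatten_turns(rec), "target": rec["chunk"]}


pair = to_summarization_pair(record)
print(pair["source"])
print(pair["target"])
```

Records in the released data may be structured differently from this single example (for instance, with several speakers per chunk), so the flattening step would need to be adapted after inspecting the actual fields.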
name | train | validation | test |
---|---|---|---|
default | 38,969 | 6,327 | 7,500 |
Dialogue understanding and abstractive summarization remain both important and challenging problems for computational linguistics. Current paradigms in summarization modeling have specific failures in capturing semantics and pragmatics, content selection, rewriting, and evaluation in the domain of long, story-telling dialogue. CRD3 offers a linguistically rich dataset to explore these domains.
Dungeons and Dragons is a popular role-playing game that is driven by structured storytelling. Critical Role is an unscripted, live-streamed show where a fixed group of people play Dungeons and Dragons. This dataset consists of transcripts of 159 episodes of the show. Inconsistencies (e.g., in the spelling of speaker names) were manually resolved.
The abstractive summaries were collected from the Critical Role Fandom wiki.
Who are the source language producers?

The language producers are the actors on the Critical Role show, a weekly, unscripted live stream in which a fixed group of people play Dungeons and Dragons, a popular role-playing game.
Who are the annotators?

[N/A]
CRTranscript provided transcripts of the show; contributors of the Critical Role Wiki provided the abstractive summaries.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0), corresponding to the license of the Critical Role Wiki: https://criticalrole.fandom.com/
@inproceedings{
  title = {Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset},
  author = {Rameshkumar, Revanth and Bailey, Peter},
  year = {2020},
  publisher = {Association for Computational Linguistics},
  conference = {ACL}
}
Thanks to @thomwolf, @lhoestq, @mariamabarham, and @lewtun for adding this dataset.