Datasets:

crd3

Languages:

en

Multilinguality:

monolingual

Size categories:

10K<n<100K

Language creators:

crowdsourced

Annotation creators:

no-annotation

Source datasets:

original

Dataset Card for "crd3"

Dataset Summary

Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset. Critical Role is an unscripted, live-streamed show where a fixed group of people play Dungeons and Dragons, an open-ended role-playing game. The dataset is collected from 159 Critical Role episodes transcribed to text dialogues, consisting of 398,682 turns. It also includes corresponding abstractive summaries collected from the Fandom wiki. The dataset is linguistically unique in that the narratives are generated entirely through player collaboration and spoken interaction. For each dialogue, there are a large number of turns, multiple abstractive summaries with varying levels of detail, and semantic ties to the previous dialogues.

Supported Tasks and Leaderboards

summarization: The dataset can be used to train a model for abstractive summarization. A fast abstractive summarization-RL model was presented as a baseline; it achieves a ROUGE-L F1 score of 25.18.
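To make the reported metric concrete, the following is a minimal sketch of ROUGE-L F1 based on the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization and a plain (unweighted) F1; the function names are illustrative, and this is not the scorer that produced the baseline figure above. It returns a value in [0, 1], whereas the figure of 25.18 is presumably reported on a 0 to 100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 between a candidate and a reference summary."""
    # Assumption: lowercased whitespace tokens, no stemming.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)
```

For comparable numbers, an established ROUGE implementation should be used instead, since tokenization and stemming choices change the score.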

Languages

The text in the dataset is in English, as spoken by the actors on Critical Role, a weekly, unscripted, live-streamed show in which a fixed group of people play Dungeons and Dragons, a popular role-playing game.

Dataset Structure

Data Instances

An example from the 'train' split looks as follows.

{
    "alignment_score": 3.679936647415161,
    "chunk": "Wish them a Happy Birthday on their Facebook and Twitter pages! Also, as a reminder: D&D Beyond streams their weekly show (\"And Beyond\") every Wednesday on twitch.tv/dndbeyond.",
    "chunk_id": 1,
    "turn_end": 6,
    "turn_num": 4,
    "turn_start": 4,
    "turns": {
        "names": ["SAM"],
        "utterances": ["Yesterday, guys, was D&D Beyond's first one--", "first one-year anniversary. Take two. Hey guys,", "yesterday was D&D Beyond's one-year anniversary.", "Wish them a happy birthday on their Facebook and", "Twitter pages."]
    }
}
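As a rough illustration of how the `turns` field might be used, the sketch below joins the utterance fragments of the example above into one speaker-prefixed line. The helper name `turns_to_text` and the "SPEAKER: text" layout are assumptions made for illustration, not part of the dataset.

```python
# The "turns" value is copied from the example record above.
record = {
    "chunk_id": 1,
    "turns": {
        "names": ["SAM"],
        "utterances": [
            "Yesterday, guys, was D&D Beyond's first one--",
            "first one-year anniversary. Take two. Hey guys,",
            "yesterday was D&D Beyond's one-year anniversary.",
            "Wish them a happy birthday on their Facebook and",
            "Twitter pages.",
        ],
    },
}


def turns_to_text(turns):
    """Join the utterance fragments of a turn into one speaker-prefixed line."""
    # Assumption: if several names are listed, they spoke together.
    speaker = "/".join(turns["names"])
    return f"{speaker}: " + " ".join(turns["utterances"])


dialogue_line = turns_to_text(record["turns"])
print(dialogue_line)
```

The resulting line can then be paired with the `chunk` field, which holds the summary text aligned to these turns.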

Data Fields

The data fields are the same among all splits.

  • chunk : a string feature.
  • chunk_id : an int32 feature.
  • turn_start : an int32 feature.
  • turn_end : an int32 feature.
  • alignment_score : a float32 feature.
  • turn_num : an int32 feature.
  • turns : a dictionary feature containing:
    • names : a list of string features.
    • utterances : a list of string features.
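The field list above can be turned into a lightweight sanity check. The sketch below uses only the Python standard library and maps int32 and float32 to Python's `int` and `float`. The name `validate_record` is hypothetical, and treating `names` and `utterances` as lists of strings is an assumption based on the example instance.

```python
EXPECTED_TYPES = {
    "chunk": str,
    "chunk_id": int,
    "turn_start": int,
    "turn_end": int,
    "alignment_score": float,
    "turn_num": int,
    "turns": dict,
}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    turns = record.get("turns")
    if isinstance(turns, dict):
        # Assumption: both sub-fields hold lists of strings.
        for key in ("names", "utterances"):
            values = turns.get(key)
            if not isinstance(values, list) or not all(
                isinstance(v, str) for v in values
            ):
                problems.append(f"turns.{key} should be a list of strings")
    return problems
```

A check like this is mainly useful when records are exported to JSON and re-read outside the original loader.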

Data Splits

name      train    validation   test
default   38,969   6,327        7,500
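For reference, the relative size of each split follows directly from the counts above. The short sketch below computes the shares; the variable names are illustrative.

```python
# Split sizes as listed in the table above.
SPLIT_SIZES = {"train": 38_969, "validation": 6_327, "test": 7_500}

total = sum(SPLIT_SIZES.values())
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

for name, size in SPLIT_SIZES.items():
    print(f"{name:<10} {size:>6,} {shares[name]:6.1%}")
print(f"{'total':<10} {total:>6,}")
```

The counts sum to 52,796 examples, so the train, validation, and test splits hold roughly 73.8%, 12.0%, and 14.2% of the data.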

Dataset Creation

Curation Rationale

Dialogue understanding and abstractive summarization remain both important and challenging problems for computational linguistics. Current paradigms in summarization modeling have specific failures in capturing semantics and pragmatics, content selection, rewriting, and evaluation in the domain of long, story-telling dialogue. CRD3 offers a linguistically rich dataset to explore these domains.

Source Data

Initial Data Collection and Normalization

Dungeons and Dragons is a popular role-playing game driven by structured storytelling. Critical Role is an unscripted, live-streamed show where a fixed group of people play Dungeons and Dragons. This dataset consists of transcripts of 159 episodes of the show. Inconsistencies (e.g. spelling of speaker names) were manually resolved.

The abstractive summaries were collected from the Critical Role Fandom wiki.

Who are the source language producers?

The language producers are the actors on Critical Role, a weekly, unscripted, live-streamed show in which a fixed group of people play Dungeons and Dragons, a popular role-playing game.

Annotations

Annotation process

[N/A]

Who are the annotators?

[N/A]

Personal and Sensitive Information

[N/A]

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

CRTranscript provided transcripts of the show; contributors of the Critical Role Wiki provided the abstractive summaries.

Licensing Information

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0), corresponding to the license of the Critical Role Wiki (https://criticalrole.fandom.com/).

Citation Information

@inproceedings{rameshkumar2020storytelling,
title = {Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset},
author = {Rameshkumar, Revanth and Bailey, Peter},
year = {2020},
publisher = {Association for Computational Linguistics},
booktitle = {ACL}
}

Contributions

Thanks to @thomwolf, @lhoestq, @mariamabarham, and @lewtun for adding this dataset.