Dataset:
allenai/prosocial-dialog
ProsocialDialog is the first large-scale multi-turn English dialogue dataset to teach conversational agents to respond to problematic content following social norms. Covering diverse unethical, problematic, biased, and toxic situations, ProsocialDialog contains responses that encourage prosocial behavior, grounded in commonsense social rules (i.e., rules-of-thumb, RoTs). Created via a human-AI collaborative framework, ProsocialDialog consists of 58K dialogues, with 331K utterances, 160K unique RoTs, and 497K dialogue safety labels accompanied by free-form rationales.
Language: English
| attribute | type | description |
|---|---|---|
| `context` | str | the potentially unsafe utterance |
| `response` | str | the guiding utterance grounded in rules-of-thumb (RoTs) |
| `rots` | list of str \| null | the relevant rules-of-thumb for text not labeled as `__casual__` |
| `safety_label` | str | the final verdict on the context according to `safety_annotations`: {`__casual__`, `__possibly_needs_caution__`, `__probably_needs_caution__`, `__needs_caution__`, `__needs_intervention__`} |
| `safety_annotations` | list of str | raw annotations from three workers: {`casual`, `needs caution`, `needs intervention`} |
| `safety_annotation_reasons` | list of str | the reasons behind the safety annotations, in free-form text from each worker |
| `source` | str | the source of the seed text that was used to craft the first utterance of the dialogue: {`socialchemistry`, `sbic`, `ethics_amt`, `ethics_reddit`} |
| `etc` | str \| null | other information |
| `dialogue_id` | int | the dialogue index |
| `response_id` | int | the response index |
| `episode_done` | bool | an indicator of whether it is the end of the dialogue |
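Since each row holds a single context/response pair, downstream use typically starts by regrouping rows into whole dialogues. The following is a minimal sketch, assuming the Hugging Face `datasets` library and a standard `train` split; the dialogue id `0` is used purely as an illustration. It loads the data and reassembles one conversation using the indexing fields from the table above.

```python
from collections import defaultdict

from datasets import load_dataset

# Load the dataset from the Hugging Face Hub.
dataset = load_dataset("allenai/prosocial-dialog")
train = dataset["train"]  # assuming a standard "train" split is present

# Each row is one (context, response) turn; `dialogue_id` groups the turns of
# a conversation, `response_id` orders them, and `episode_done` flags the last.
dialogues = defaultdict(list)
for row in train:
    dialogues[row["dialogue_id"]].append(row)

# Reassemble one dialogue (id 0 is only an illustration) and inspect it.
turns = sorted(dialogues[0], key=lambda r: r["response_id"])
for turn in turns:
    print(f"[{turn['safety_label']}] {turn['context']}")
    print(f"  -> {turn['response']} (RoTs: {turn['rots']})")
assert turns[-1]["episode_done"]  # the final turn should close the episode
```

Sorting on `response_id` keeps the turns in conversational order, and the `episode_done` flag on the last row is a cheap sanity check that the grouping recovered a complete episode.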
To create ProsocialDialog, we set up a human-AI collaborative data creation framework, where GPT-3 generates the potentially unsafe utterances and crowdworkers provide prosocial responses to them. This approach allows us to circumvent two substantial challenges: (1) there are no available large-scale corpora of multi-turn prosocial conversations between humans, and (2) asking humans to write unethical, toxic, or problematic utterances could result in psychological harm (Roberts, 2017; Steiger et al., 2021).
For further details, please refer to our paper.
Please cite our work if you find the resources in this repository useful:
@inproceedings{kim2022prosocialdialog,
    title={ProsocialDialog: A Prosocial Backbone for Conversational Agents},
    author={Hyunwoo Kim and Youngjae Yu and Liwei Jiang and Ximing Lu and Daniel Khashabi and Gunhee Kim and Yejin Choi and Maarten Sap},
    booktitle={EMNLP},
    year=2022
}