Dataset: ubuntu_dialogs_corpus
Tasks: conversational
Sub-tasks: dialogue-generation
Languages: en
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: found
Annotations creators: found
Source datasets: original
ArXiv: arxiv:1506.08909
License: unknown

The Ubuntu Dialogue Corpus is a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. It provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets and the unstructured nature of interactions from microblog services such as Twitter.
An example of 'train' looks as follows.
This example was too long and was cropped:
{
  "Context": "\"i think we could import the old comment via rsync , but from there we need to go via email . i think it be easier than cach the...",
  "Label": 1,
  "Utterance": "basic each xfree86 upload will not forc user to upgrad 100mb of font for noth __eou__ no someth i do in my spare time . __eou__"
}
The data fields are the same among all splits.
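The cropped example shows three fields (`Context`, `Label`, `Utterance`) and a recurring `__eou__` token inside the text. The following is a minimal sketch of how such a record might be split into individual utterances, assuming `__eou__` marks the end of an utterance. The helper name `split_utterances` is hypothetical and not part of the dataset or any library, and the `Context` field is left out because its value is cropped in the example.

```python
from typing import List

# Assumption: "__eou__" marks the end of one utterance in a text field.
EOU = "__eou__"


def split_utterances(text: str) -> List[str]:
    """Split a text field on the end-of-utterance marker, dropping empty pieces."""
    return [part.strip() for part in text.split(EOU) if part.strip()]


# Record copied from the cropped 'train' example, without the cropped Context field.
example = {
    "Label": 1,
    "Utterance": (
        "basic each xfree86 upload will not forc user to upgrad 100mb of font "
        "for noth __eou__ no someth i do in my spare time . __eou__"
    ),
}

utterances = split_utterances(example["Utterance"])
for i, utterance in enumerate(utterances):
    print(i, utterance)
```

Stripping whitespace and discarding empty pieces keeps the trailing `__eou__` from producing a blank final utterance.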
name | train |
---|---|
train | 127422 |
@article{DBLP:journals/corr/LowePSP15,
  author        = {Ryan Lowe and Nissan Pow and Iulian Serban and Joelle Pineau},
  title         = {The Ubuntu Dialogue Corpus: {A} Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems},
  journal       = {CoRR},
  volume        = {abs/1506.08909},
  year          = {2015},
  url           = {http://arxiv.org/abs/1506.08909},
  archivePrefix = {arXiv},
  eprint        = {1506.08909},
  timestamp     = {Mon, 13 Aug 2018 16:48:23 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/LowePSP15.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
Thanks to @thomwolf, @patrickvonplaten, and @lewtun for adding this dataset.