Dataset:

cmu_hinglish_dog

Tasks:

translation

Languages:

en hi

Size:

1K<n<10K

Language creators:

crowdsourced

Annotation creators:

machine-generated

Source datasets:

original

ArXiv:

arxiv:1809.07358

Dataset Card for CMU Document Grounded Conversations

Dataset Summary

This is a collection of text conversations in Hinglish (code-mixed Hindi and English) and their corresponding English versions. It can be used for translating between the two. The dataset has been provided by Prof. Alan Black's group at CMU.

Supported Tasks and Leaderboards

  • abstractive-mt

Languages

Dataset Structure

Data Instances

A typical data point comprises a Hinglish text, under the key hi_en, and its English version, under the key en. The docIdx field contains the current section index of the wiki document at the time the utterance was made. There are 4 sections in total for each document. The uid field holds the user id for this utterance.

An example from the CMU_Hinglish_DoG train set looks as follows:

{'rating': 2,
 'wikiDocumentIdx': 13,
 'utcTimestamp': '2018-03-16T17:48:22.037Z',
 'uid': 'user2',
 'date': '2018-03-16T17:47:21.964Z',
 'uid2response': {'response': [1, 2, 3, 5], 'type': 'finish'},
 'uid1LogInTime': '2018-03-16T17:47:21.964Z',
 'user2_id': 'USR664',
 'uid1LogOutTime': '2018-03-16T18:02:29.072Z',
 'whoSawDoc': ['user1', 'user2'],
 'status': 1,
 'docIdx': 0,
 'uid1response': {'response': [1, 2, 3, 4], 'type': 'finish'},
 'translation': {'en': 'The director is Zack Snyder, 27% Rotten Tomatoes, 4.9/10.',
  'hi_en': 'Zack Snyder director hai, 27% Rotten Tomatoes, 4.9/10.'}}
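As a minimal sketch of how a record like the one above might be consumed for translation, the snippet below extracts the (Hinglish, English) pair and keeps only records at or above a rating threshold. The helper names to_pair and iter_pairs and the min_rating parameter are hypothetical, introduced only for illustration, and the record is an abridged copy of the example above rather than the output of any loader.

```python
from typing import Iterable, Iterator, Tuple

# Abridged copy of the example record shown above (only the fields used here).
record = {
    "rating": 2,
    "docIdx": 0,
    "uid": "user2",
    "translation": {
        "en": "The director is Zack Snyder, 27% Rotten Tomatoes, 4.9/10.",
        "hi_en": "Zack Snyder director hai, 27% Rotten Tomatoes, 4.9/10.",
    },
}


def to_pair(rec: dict) -> Tuple[str, str]:
    """Return (Hinglish text, English text) for one record."""
    translation = rec["translation"]
    return translation["hi_en"], translation["en"]


def iter_pairs(records: Iterable[dict], min_rating: int = 1) -> Iterator[Tuple[str, str]]:
    """Yield translation pairs from records whose rating is at least min_rating."""
    for rec in records:
        if rec["rating"] >= min_rating:
            yield to_pair(rec)


pairs = list(iter_pairs([record], min_rating=2))
```

Filtering on rating is a design choice, not something the card prescribes; it simply uses the fact that a larger rating indicates a better conversation.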

Data Fields

  • date : the time the file was created, as a string
  • docIdx : the current section index of the wiki document at the time the utterance was made. There are 4 sections in total for each document.
  • translation :
    • hi_en : The text in Hinglish
    • en : The text in English
  • uid : the user id of this utterance.
  • utcTimestamp : the server UTC timestamp of this utterance, as a string
  • rating : an integer that is 1, 2, or 3. A larger number means the quality of the conversation is better.
  • status : status as an integer
  • uid1LogInTime : optional login time of user 1, as a string
  • uid1LogOutTime : optional logout time of user 1, as a string
  • uid1response : a JSON object that contains the status and response of the user after finishing the conversation. Fields in the object include:
    • type : should be one of ['finish', 'abandon','abandonWithouAnsweringFeedbackQuestion']. 'finish' means the user successfully finished the conversation, either by completing 12 or 15 turns or because the other user left the conversation first. 'abandon' means the user abandoned the conversation partway through but still entered the feedback page. 'abandonWithouAnsweringFeedbackQuestion' means the user disconnected or closed the web page without providing feedback.
    • response : the answer to the post-conversation questions. The worker can choose multiple of them. The options presented to the user are as follows:
      • For type 'finish' : 1: The conversation is understandable. 2: The other user is actively responding me. 3: The conversation goes smoothly.
      • For type 'abandon' : 1: The other user is too rude. 2: I don't know how to proceed with the conversation. 3: The other user is not responding to me.
      • For users given the document : 4: I have watched the movie before. 5: I have not watched the movie before.
      • For the users without the document : 4: I will watch the movie after the other user's introduction. 5: I will not watch the movie after the other user's introduction.
  • uid2response : same as uid1response
  • user2_id : the generated user id of user 2
  • whoSawDoc : one of ['user1'], ['user2'], ['user1', 'user2'], indicating which user read the document.
  • wikiDocumentIdx : the index of the wiki document.
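
The numeric codes in uid1response and uid2response only make sense together with type and whoSawDoc. The sketch below shows one way they might be decoded into the option texts listed in the field description above. It assumes that uid1response belongs to 'user1' and uid2response to 'user2', and that codes 1 to 3 carry no meaning for 'abandonWithouAnsweringFeedbackQuestion'; neither assumption is confirmed by this card, and decode_response is a hypothetical helper.

```python
from typing import List

# Option texts copied from the field description above.
FINISH_OPTIONS = {
    1: "The conversation is understandable.",
    2: "The other user is actively responding me.",
    3: "The conversation goes smoothly.",
}
ABANDON_OPTIONS = {
    1: "The other user is too rude.",
    2: "I don't know how to proceed with the conversation.",
    3: "The other user is not responding to me.",
}
WITH_DOC_OPTIONS = {
    4: "I have watched the movie before.",
    5: "I have not watched the movie before.",
}
WITHOUT_DOC_OPTIONS = {
    4: "I will watch the movie after the other user's introduction.",
    5: "I will not watch the movie after the other user's introduction.",
}


def decode_response(record: dict, user: str) -> List[str]:
    """Map the feedback codes of `user` ('user1' or 'user2') to option texts.

    Assumption: uid1response belongs to 'user1' and uid2response to 'user2'.
    Codes that are not defined for the given type are skipped.
    """
    key = "uid1response" if user == "user1" else "uid2response"
    feedback = record[key]
    if feedback["type"] == "finish":
        options = dict(FINISH_OPTIONS)
    elif feedback["type"] == "abandon":
        options = dict(ABANDON_OPTIONS)
    else:
        # Assumed: no meaning for codes 1-3 when feedback was not answered.
        options = {}
    # Codes 4 and 5 depend on whether this user was shown the document.
    saw_doc = user in record["whoSawDoc"]
    options.update(WITH_DOC_OPTIONS if saw_doc else WITHOUT_DOC_OPTIONS)
    return [options[code] for code in feedback["response"] if code in options]


# Relevant fields of the example record from the Data Instances section.
example = {
    "whoSawDoc": ["user1", "user2"],
    "uid1response": {"response": [1, 2, 3, 4], "type": "finish"},
    "uid2response": {"response": [1, 2, 3, 5], "type": "finish"},
}
decoded = decode_response(example, "user2")
```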

Data Splits

name     train  validation  test
CMU DOG  8060   942         960
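
For a quick sense of the proportions, the following sketch computes the total size and the share of each split from the counts above. The counts are copied from this card; nothing here is read from the dataset itself.

```python
# Split sizes copied from the Data Splits table above.
splits = {"train": 8060, "validation": 942, "test": 960}

# Total number of examples across all splits.
total = sum(splits.values())

# Share of each split as a fraction of the whole dataset.
shares = {name: count / total for name, count in splits.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```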

Dataset Creation

[More Information Needed]

Curation Rationale

[More Information Needed]

Source Data

The Hinglish dataset is derived from the original CMU DoG (Document Grounded Conversations) dataset. More information about it can be found in the original CMU DoG repository.

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

The purpose of this dataset is to help develop better question answering systems.

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

The dataset was initially created by Prof. Alan W Black's group at CMU.

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{
    cmu_dog_emnlp18,
    title={A Dataset for Document Grounded Conversations},
    author={Zhou, Kangyan and Prabhumoye, Shrimai and Black, Alan W},
    year={2018},
    booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing}
}

Contributions

Thanks to @Ishan-Kumar2 for adding this dataset.