数据集:

hindi_discourse

任务:

文本分类

子任务:

multi-label-classification

语言:

计算机处理:

monolingual

大小:

1K<n<10K

语言创建人:

found

批注创建人:

other

源数据集:

original

其他:

discourse-analysis

许可:

other

数据集介绍文件清单

中文

Dataset Card for Discourse Analysis dataset

Dataset Summary

The Hindi Discourse Analysis dataset is a corpus for analyzing discourse modes present in its sentences.
It contains sentences from stories written by 11 famous authors from the 20th Century.
4-5 stories by each author have been selected which were available in the public domain resulting in a collection of 53 stories.
Most of these short stories were originally written in Hindi but some of them were written in other Indian languages and later translated to Hindi.

The corpus contains a total of 10472 sentences belonging to the following categories:

Argumentative
Descriptive
Dialogic
Informative
Narrative

Supported Tasks and Leaderboards

Discourse Analysis of Hindi.

Languages

Hindi

Dataset Structure

The dataset is structured into JSON format.

Data Instances

{'Story_no': 15, 'Sentence': ' गाँठ से साढ़े तीन रुपये लग गये, जो अब पेट में जाकर खनकते भी नहीं! जो तेरी करनी मालिक! ” “इसमें मालिक की क्या करनी है? ”', 'Discourse Mode': 'Dialogue'}

Data Fields

Sentence number, story number, sentence and discourse mode

Data Splits

Train: 9983

Dataset Creation

Curation Rationale

Present a new publicly available corpus consisting of sentences from short stories written in a low-resource language of Hindi having high quality annotation for five different discourse modes - argumentative, narrative, descriptive, dialogic and informative.
Perform a detailed analysis of the proposed annotated corpus and characterize the performance of different classification algorithms.

Source Data

Source of all the data points in this dataset is Hindi stories written by famous authors of Hindi literature.

Initial Data Collection and Normalization

All the data was collected from various Hindi websites.
We chose against crowd-sourcing the annotation pro- cess because we wanted to directly work with the an- notators for qualitative feedback and to also ensure high quality annotations.
We employed three native Hindi speakers with college level education for the an- notation task.
We first selected two random stories from our corpus and had the three annotators work on them independently and classify each sentence based on the discourse mode.
Please refer to this paper for detailed information: https://www.aclweb.org/anthology/2020.lrec-1.149/

Who are the source language producers?

Please refer to this paper for detailed information: https://www.aclweb.org/anthology/2020.lrec-1.149/

Annotations

Annotation process

The authors chose against crowd sourcing for labeling this dataset due to its highly sensitive nature.
The annotators are domain experts having degress in advanced clinical psychology and gender studies.
They were provided a guidelines document with instructions about each task and its definitions, labels and examples.
They studied the document, worked a few examples to get used to this annotation task.
They also provided feedback for improving the class definitions.
The annotation process is not mutually exclusive, implying that presence of one label does not mean the absence of the other one.

Who are the annotators?

The annotators were three native Hindi speakers with college level education.
Please refer to the accompnaying paper for a detailed annotation process.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

As a future work we would also like to use the presented corpus to see how it could be further used in certain downstream tasks such as emotion analysis, machine translation, textual entailment, and speech sythesis for improving storytelling experience in Hindi language.

Discussion of Biases

[More Information Needed]

Other Known Limitations

We could not get the best performance using the deep learning model trained on the data, due to insufficient data for DL models.

Additional Information

Please refer to this link: https://github.com/midas-research/hindi-discourse

Dataset Curators

If you use the corpus in a product or application, then please credit the authors and [Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi] ( http://midas.iiitd.edu.in ) appropriately. Also, if you send us an email, we will be thrilled to know about how you have used the corpus.
If interested in commercial use of the corpus, send email to midas@iiitd.ac.in .
Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi, India disclaims any responsibility for the use of the corpus and does not provide technical support. However, the contact listed above will be happy to respond to queries and clarifications
Please feel free to send us an email:
- with feedback regarding the corpus.
- with information on how you have used the corpus.
- if interested in having us analyze your social media data.
- if interested in a collaborative research project.

Licensing Information

If you use the corpus in a product or application, then please credit the authors and [Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi] ( http://midas.iiitd.edu.in ) appropriately.

Citation Information

Please cite the following publication if you make use of the dataset: https://aclanthology.org/2020.lrec-1.149/

@inproceedings{dhanwal-etal-2020-annotated,
    title = "An Annotated Dataset of Discourse Modes in {H}indi Stories",
    author = "Dhanwal, Swapnil  and
      Dutta, Hritwik  and
      Nankani, Hitesh  and
      Shrivastava, Nilay  and
      Kumar, Yaman  and
      Li, Junyi Jessy  and
      Mahata, Debanjan  and
      Gosangi, Rakesh  and
      Zhang, Haimin  and
      Shah, Rajiv Ratn  and
      Stent, Amanda",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.149",
    pages = "1191--1196",
    abstract = "In this paper, we present a new corpus consisting of sentences from Hindi short stories annotated for five different discourse modes argumentative, narrative, descriptive, dialogic and informative. We present a detailed account of the entire data collection and annotation processes. The annotations have a very high inter-annotator agreement (0.87 k-alpha). We analyze the data in terms of label distributions, part of speech tags, and sentence lengths. We characterize the performance of various classification algorithms on this dataset and perform ablation studies to understand the nature of the linguistic models suitable for capturing the nuances of the embedded discourse structures in the presented corpus.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Contributions

Thanks to @duttahritwik for adding this dataset.

作者:

佚名

数据集大小:

15.88 KB