Dataset:
DFKI-SLT/scidtb
SciDTB is a domain-specific discourse treebank annotated on scientific articles written in English. Different from the widely-used RST-DT and PDTB, SciDTB uses dependency trees to represent discourse structure, which is flexible and simplified to some extent but does not sacrifice structural integrity. Furthermore, this treebank serves as a benchmark for evaluating discourse dependency parsers. The dataset can benefit many downstream NLP tasks such as machine translation and automatic summarization.
[Needs More Information]
The text in the dataset is in English.
A typical data point consists of a root field, which is a list of nodes in the dependency tree. Each node in the list has four fields: id, the identifier of the node; parent, the id of the parent node; text, the text span covered by the node; and relation, the discourse relation between the node and its parent.
An example from SciDTB train set is given below:
```json
{
  "root": [
    { "id": 0, "parent": -1, "text": "ROOT", "relation": "null" },
    { "id": 1, "parent": 0, "text": "We propose a neural network approach ", "relation": "ROOT" },
    { "id": 2, "parent": 1, "text": "to benefit from the non-linearity of corpus-wide statistics for part-of-speech ( POS ) tagging . <S>", "relation": "enablement" },
    { "id": 3, "parent": 1, "text": "We investigated several types of corpus-wide information for the words , such as word embeddings and POS tag distributions . <S>", "relation": "elab-aspect" },
    { "id": 4, "parent": 5, "text": "Since these statistics are encoded as dense continuous features , ", "relation": "cause" },
    { "id": 5, "parent": 3, "text": "it is not trivial to combine these features ", "relation": "elab-addition" },
    { "id": 6, "parent": 5, "text": "comparing with sparse discrete features . <S>", "relation": "comparison" },
    { "id": 7, "parent": 1, "text": "Our tagger is designed as a combination of a linear model for discrete features and a feed-forward neural network ", "relation": "elab-aspect" },
    { "id": 8, "parent": 7, "text": "that captures the non-linear interactions among the continuous features . <S>", "relation": "elab-addition" },
    { "id": 9, "parent": 10, "text": "By using several recent advances in the activation functions for neural networks , ", "relation": "manner-means" },
    { "id": 10, "parent": 1, "text": "the proposed method marks new state-of-the-art accuracies for English POS tagging tasks . <S>", "relation": "evaluation" }
  ]
}
```
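The root list is a flat encoding of the dependency tree: each node points to its head via parent. A minimal sketch of rebuilding the parent-to-children adjacency from such an instance (field names taken from the example above; the instance here is truncated to a few nodes for illustration):

```python
# Sketch: rebuild the dependency tree from the flat "root" list of a
# SciDTB instance. Field names (id, parent, text, relation) follow the
# example instance shown above; the node texts here are shortened.
from collections import defaultdict

instance = {
    "root": [
        {"id": 0, "parent": -1, "text": "ROOT", "relation": "null"},
        {"id": 1, "parent": 0, "text": "We propose a neural network approach", "relation": "ROOT"},
        {"id": 2, "parent": 1, "text": "to benefit from corpus-wide statistics", "relation": "enablement"},
        {"id": 3, "parent": 1, "text": "We investigated several types of information", "relation": "elab-aspect"},
    ]
}

def build_children(nodes):
    """Map each node id to the ids of its dependents."""
    children = defaultdict(list)
    for node in nodes:
        if node["parent"] >= 0:  # the artificial ROOT node has parent -1
            children[node["parent"]].append(node["id"])
    return dict(children)

children = build_children(instance["root"])
print(children)  # {0: [1], 1: [2, 3]}
```

In the truncated instance, node 1 heads the discourse units 2 and 3, matching the enablement and elab-aspect relations attached to it.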
More raw data instances can be found here
The dataset consists of three splits: train, dev and test.
| Train | Valid | Test |
|---|---|---|
| 743 | 154 | 152 |
[Needs More Information]
Who are the source language producers?
[Needs More Information]
More information can be found here
Who are the annotators?
[Needs More Information]
[Needs More Information]
```bibtex
@inproceedings{yang-li-2018-scidtb,
    title = "{S}ci{DTB}: Discourse Dependency {T}ree{B}ank for Scientific Abstracts",
    author = "Yang, An and Li, Sujian",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2018",
    address = "Melbourne, Australia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P18-2071",
    doi = "10.18653/v1/P18-2071",
    pages = "444--449",
    abstract = "Annotation corpus for discourse relations benefits NLP tasks such as machine translation and question answering. In this paper, we present SciDTB, a domain-specific discourse treebank annotated on scientific articles. Different from widely-used RST-DT and PDTB, SciDTB uses dependency trees to represent discourse structure, which is flexible and simplified to some extent but do not sacrifice structural integrity. We discuss the labeling framework, annotation workflow and some statistics about SciDTB. Furthermore, our treebank is made as a benchmark for evaluating discourse dependency parsers, on which we provide several baselines as fundamental work.",
}
```