数据集:
dart
DART is a large dataset for open-domain structured data record to text generation. We consider the structured data record input as a set of RDF entity-relation triples, a format widely used for knowledge representation and semantics description. DART consists of 82,191 examples across different domains with each input being a semantic RDF triple set derived from data records in tables and the tree ontology of the schema, annotated with sentence descriptions that cover all facts in the triple set. This hierarchical, structured format with its open-domain nature differentiates DART from other existing table-to-text corpora.
The task associated to DART is text generation from data records that are RDF triplets:
BLEU | METEOR | TER | MoverScore | BERTScore | BLEURT | |
---|---|---|---|---|---|---|
BART | 37.06 | 0.36 | 0.57 | 0.44 | 0.92 | 0.22 |
This task has an active leaderboard which can be found here and ranks models based on the above metrics while also reporting.
The dataset is in english (en).
Here is an example from the dataset:
{'annotations': {'source': ['WikiTableQuestions_mturk'], 'text': ['First Clearing\tbased on Callicoon, New York and location at On NYS 52 1 Mi. Youngsville']}, 'subtree_was_extended': False, 'tripleset': [['First Clearing', 'LOCATION', 'On NYS 52 1 Mi. Youngsville'], ['On NYS 52 1 Mi. Youngsville', 'CITY_OR_TOWN', 'Callicoon, New York']]}
It contains one annotation where the textual description is 'First Clearing\tbased on Callicoon, New York and location at On NYS 52 1 Mi. Youngsville'. The RDF triplets considered to generate this description are in tripleset and are formatted as subject, predicate, object.
The different fields are:
There are three splits, train, validation and test:
train | validation | test | |
---|---|---|---|
N. Examples | 30526 | 2768 | 6959 |
Automatically generating textual descriptions from structured data inputs is crucial to improving the accessibility of knowledge bases to lay users.
DART comes from existing datasets that cover a variety of different domains while allowing to build a tree ontology and form RDF triple sets as semantic representations. The datasets used are WikiTableQuestions, WikiSQL, WebNLG and Cleaned E2E.
Initial Data Collection and NormalizationDART is constructed using multiple complementary methods: (1) human annotation on open-domain Wikipedia tables from WikiTableQuestions (Pasupat and Liang, 2015) and WikiSQL (Zhong et al., 2017), (2) automatic conversion of questions in WikiSQL to declarative sentences, and (3) incorporation of existing datasets including WebNLG 2017 (Gardent et al., 2017a,b; Shimorina and Gardent, 2018) and Cleaned E2E (Novikova et al., 2017b; Dušek et al., 2018, 2019)
Who are the source language producers?[More Information Needed]
DART is constructed using multiple complementary methods: (1) human annotation on open-domain Wikipedia tables from WikiTableQuestions (Pasupat and Liang, 2015) and WikiSQL (Zhong et al., 2017), (2) automatic conversion of questions in WikiSQL to declarative sentences, and (3) incorporation of existing datasets including WebNLG 2017 (Gardent et al., 2017a,b; Shimorina and Gardent, 2018) and Cleaned E2E (Novikova et al., 2017b; Dušek et al., 2018, 2019)
Annotation processThe two stage annotation process for constructing tripleset sentence pairs is based on a tree-structured ontology of each table. First, internal skilled annotators denote the parent column for each column header. Then, a larger number of annotators provide a sentential description of an automatically-chosen subset of table cells in a row.
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Under MIT license (see here )
@article{radev2020dart, title={DART: Open-Domain Structured Data Record to Text Generation}, author={Dragomir Radev and Rui Zhang and Amrit Rau and Abhinand Sivaprasad and Chiachun Hsieh and Nazneen Fatema Rajani and Xiangru Tang and Aadit Vyas and Neha Verma and Pranav Krishna and Yangxiaokang Liu and Nadia Irwanto and Jessica Pan and Faiaz Rahman and Ahmad Zaidi and Murori Mutuma and Yasin Tarabar and Ankit Gupta and Tao Yu and Yi Chern Tan and Xi Victoria Lin and Caiming Xiong and Richard Socher}, journal={arXiv preprint arXiv:2007.02871}, year={2020}
Thanks to @lhoestq for adding this dataset.