数据集:
enriched_web_nlg
任务:
表格到文本子任务:
rdf-to-text计算机处理:
monolingual大小:
1K<n<10K语言创建人:
crowdsourced批注创建人:
found源数据集:
extended|other-web-nlg许可:
cc-by-sa-4.0The WebNLG challenge consists in mapping data to text. The training data consists of Data/Text pairs where the data is a set of triples extracted from DBpedia and the text is a verbalisation of these triples. For instance, given the 3 DBpedia triples shown in (a), the aim is to generate a text such as (b). It is a valuable resource and benchmark for the Natural Language Generation (NLG) community. However, as other NLG benchmarks, it only consists of a collection of parallel raw representations and their corresponding textual realizations. This work aimed to provide intermediate representations of the data for the development and evaluation of popular tasks in the NLG pipeline architecture, such as Discourse Ordering, Lexicalization, Aggregation and Referring Expression Generation.
The dataset supports a other-rdf-to-text task which requires a model takes a set of RDF (Resource Description Format) triples from a database (DBpedia) of the form (subject, property, object) as input and write out a natural language sentence expressing the information contained in the triples.
The dataset is presented in two versions: English (config en ) and German (config de )
A typical example contains the original RDF triples in the set, a modified version which presented to crowd workers, and a set of possible verbalizations for this set of triples:
{ 'category': 'Politician', 'eid': 'Id10', 'lex': {'comment': ['good', 'good', 'good'], 'lid': ['Id1', 'Id2', 'Id3'], 'text': ['World War II had Chiang Kai-shek as a commander and United States Army soldier Abner W. Sibal.', 'Abner W. Sibal served in the United States Army during the Second World War and during that war Chiang Kai-shek was one of the commanders.', 'Abner W. Sibal, served in the United States Army and fought in World War II, one of the commanders of which, was Chiang Kai-shek.']}, 'modified_triple_sets': {'mtriple_set': [['Abner_W._Sibal | battle | World_War_II', 'World_War_II | commander | Chiang_Kai-shek', 'Abner_W._Sibal | militaryBranch | United_States_Army']]}, 'original_triple_sets': {'otriple_set': [['Abner_W._Sibal | battles | World_War_II', 'World_War_II | commander | Chiang_Kai-shek', 'Abner_W._Sibal | branch | United_States_Army'], ['Abner_W._Sibal | militaryBranch | United_States_Army', 'Abner_W._Sibal | battles | World_War_II', 'World_War_II | commander | Chiang_Kai-shek']]}, 'shape': '(X (X) (X (X)))', 'shape_type': 'mixed', 'size': 3}
The following fields can be found in the instances:
The en version has train , test and dev splits; the de version, only train and dev .
Natural Language Generation (NLG) is the process of automatically converting non-linguistic data into a linguistic output format (Reiter andDale, 2000; Gatt and Krahmer, 2018). Recently, the field has seen an increase in the number of available focused data resources as E2E (Novikova et al., 2017), ROTOWIRE(Wise-man et al., 2017) and WebNLG (Gardent et al.,2017a,b) corpora. Although theses recent releases are highly valuable resources for the NLG community in general,nall of them were designed to work with end-to-end NLG models. Hence, they consist of a collection of parallel raw representations and their corresponding textual realizations. No intermediate representations are available so researchersncan straight-forwardly use them to develop or evaluate popular tasks in NLG pipelines (Reiter and Dale, 2000), such as Discourse Ordering, Lexicalization, Aggregation, Referring Expression Generation, among others. Moreover, these new corpora, like many other resources in Computational Linguistics more in general, are only available in English, limiting the development of NLG-applications to other languages.
[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
The dataset uses the cc-by-nc-sa-4.0 license. The source DBpedia project uses the cc-by-sa-3.0 and gfdl-1.1 licenses.
@InProceedings{ferreiraetal2018, author = "Castro Ferreira, Thiago and Moussallem, Diego and Wubben, Sander and Krahmer, Emiel", title = "Enriching the WebNLG corpus", booktitle = "Proceedings of the 11th International Conference on Natural Language Generation", year = "2018", series = {INLG'18}, publisher = "Association for Computational Linguistics", address = "Tilburg, The Netherlands", } @inproceedings{web_nlg, author = {Claire Gardent and Anastasia Shimorina and Shashi Narayan and Laura Perez{-}Beltrachini}, editor = {Regina Barzilay and Min{-}Yen Kan}, title = {Creating Training Corpora for {NLG} Micro-Planners}, booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, {ACL} 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers}, pages = {179--188}, publisher = {Association for Computational Linguistics}, year = {2017}, url = {https://doi.org/10.18653/v1/P17-1017}, doi = {10.18653/v1/P17-1017} }
Thanks to @TevenLeScao for adding this dataset.