Dataset:
webnlg/challenge-2023
The WebNLG 2023 challenge focuses on four under-resourced languages which are severely under-represented in research on text generation, namely Maltese, Irish, Breton and Welsh. In addition, WebNLG 2023 once again includes Russian, which was first featured in WebNLG 2020.
The challenge focuses on RDF-to-text generation, similarly to WebNLG 2017, but targets Breton, Irish, Maltese, Welsh, and Russian.
The challenge consists in mapping data to text. The training data consists of Data/Text pairs where the data is a set of triples extracted from DBpedia and the text is a verbalisation of these triples.
For instance, given the 4 RDF triples:
<entry category="Company" eid="Id21" shape="(X (X) (X) (X) (X))" shape_type="sibling" size="4">
  <modifiedtripleset>
    <mtriple>Trane | foundingDate | 1913-01-01</mtriple>
    <mtriple>Trane | location | Ireland</mtriple>
    <mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>
    <mtriple>Trane | numberOfEmployees | 29000</mtriple>
  </modifiedtripleset>
</entry>
the aim is to generate a text such as (English text):
Trane, which was founded on January 1st 1913 in La Crosse, Wisconsin, is based in Ireland. It has 29,000 employees.
or (Russian text):
Компания "Тране", основанная 1 января 1913 года в Ла-Кроссе в штате Висконсин, находится в Ирландии. В компании работают 29 тысяч человек.
As the example illustrates, the task involves specific NLG subtasks such as sentence segmentation (how to chunk the input data into sentences), lexicalisation (of the DBpedia properties), aggregation (how to avoid repetitions) and surface realisation (how to build a syntactically correct and natural sounding text).
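As a minimal sketch of how the input side of such an entry could be read, the snippet below parses the Trane example with Python's standard `xml.etree.ElementTree` module and splits each `mtriple` into a (subject, property, object) tuple. The helper name `parse_entry` is hypothetical and is not part of any official WebNLG tooling.

```python
import xml.etree.ElementTree as ET

# The Trane entry from the example above, reproduced as a string.
ENTRY_XML = """
<entry category="Company" eid="Id21" shape="(X (X) (X) (X) (X))" shape_type="sibling" size="4">
  <modifiedtripleset>
    <mtriple>Trane | foundingDate | 1913-01-01</mtriple>
    <mtriple>Trane | location | Ireland</mtriple>
    <mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>
    <mtriple>Trane | numberOfEmployees | 29000</mtriple>
  </modifiedtripleset>
</entry>
"""


def parse_entry(xml_text):
    """Return (category, [(subject, property, object), ...]) for one entry.

    Hypothetical helper: assumes every <mtriple> uses ' | ' as separator.
    """
    entry = ET.fromstring(xml_text.strip())
    triples = []
    for node in entry.iter("mtriple"):
        # Split on the first two separators only, so an object that itself
        # contains ' | ' would stay in one piece.
        subject, prop, obj = (part.strip() for part in node.text.split(" | ", 2))
        triples.append((subject, prop, obj))
    return entry.get("category"), triples


category, triples = parse_entry(ENTRY_XML)
for triple in triples:
    print(triple)
```

Each tuple then corresponds to one fact that the generated text is expected to verbalise.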
The dataset supports a Structured to Text task, which requires a model to take a set of RDF (Resource Description Framework) triples of the form (subject, property, object) from a database (DBpedia) as input and to write out natural language text expressing the information contained in the triples.
The dataset is used in the WebNLG 2023 challenge.
Results are evaluated with automatic metrics: BLEU, METEOR, ChrF++, TER and BERTScore. Additionally, results are assessed by native speakers according to criteria such as grammaticality/correctness, appropriateness/adequacy and fluency/naturalness.
The dataset comprises the Breton (br), Welsh (cy), Irish (ga), Maltese (mt) and Russian (ru) languages.
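To give an intuition for what a character-level metric measures, here is a simplified character n-gram F-score written from scratch. It is only a sketch: it is not the official ChrF++ implementation (which also uses word n-grams), and it is not the challenge's evaluation script. The function names, the choice to ignore whitespace, and the defaults `max_n=6` and `beta=2.0` are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, with whitespace removed."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-beta score in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)


reference = "Trane is based in Ireland. It has 29,000 employees."
hypothesis = "Trane is located in Ireland and has 29,000 employees."
score = simple_chrf(hypothesis, reference)
print(f"simplified chrF: {score:.3f}")
```

For reported results, an established implementation of each metric should be used rather than a sketch like this one.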
A typical example contains the original RDF triples in the set, a modified version that was presented to crowd workers, and a set of possible verbalizations for this set of triples:
{'category': 'Airport', 'size': 1, 'eid': '1', 'original_triple_sets': {'otriple_set': [['Aarhus_Airport | cityServed | "Aarhus, Denmark"@en']]}, 'modified_triple_sets': {'mtriple_set': [['Aarhus_Airport | cityServed | "Aarhus, Denmark"']]}, 'shape': '(X (X))', 'shape_type': 'NA', 'lex': {'comment': ['good', 'good', '', ''], 'lid': ['Id1', 'Id2', 'Id3', 'Id3'], 'text': ['Aarhus a zo an aro-vezh Aarhus.', "Aarhus a servijit ar c'hêr Aarhus.", 'The Aarhus is the airport of Aarhus, Denmark.', 'Aarhus Airport serves the city of Aarhus, Denmark.'], 'lang': ['br', 'br', 'en', 'en']}}
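The instance above can be handled as a plain Python dictionary. The sketch below, which copies only the fields it needs from that instance, selects the verbalisations for one language and joins the modified triples into a single input string. The helper names and the `" && "` separator are assumptions chosen for illustration, not a format prescribed by the challenge.

```python
# Subset of the example instance shown above.
example = {
    "category": "Airport",
    "size": 1,
    "eid": "1",
    "modified_triple_sets": {
        "mtriple_set": [['Aarhus_Airport | cityServed | "Aarhus, Denmark"']]
    },
    "lex": {
        "comment": ["good", "good", "", ""],
        "lid": ["Id1", "Id2", "Id3", "Id3"],
        "text": [
            "Aarhus a zo an aro-vezh Aarhus.",
            "Aarhus a servijit ar c'hêr Aarhus.",
            "The Aarhus is the airport of Aarhus, Denmark.",
            "Aarhus Airport serves the city of Aarhus, Denmark.",
        ],
        "lang": ["br", "br", "en", "en"],
    },
}


def texts_for_language(instance, lang):
    """Return the verbalisations of one instance written in `lang`."""
    lex = instance["lex"]
    # The lists under "lex" are parallel: text[i] is written in lang[i].
    return [text for text, code in zip(lex["text"], lex["lang"]) if code == lang]


def linearise(instance, separator=" && "):
    """Join the first modified triple set into one string (assumed format)."""
    triples = instance["modified_triple_sets"]["mtriple_set"][0]
    return separator.join(triples)


breton_refs = texts_for_language(example, "br")
source = linearise(example)
print(source)
print(breton_refs)
```

Pairing `source` with each entry of `breton_refs` would give one data/text training pair per reference.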
The following fields can be found in the instances:
The dataset is split into train and validation:
language | train | validation |
---|---|---|
br | 13211 | 1399 |
cy | 13211 | 1665 |
ga | 13211 | 1665 |
mt | 13211 | 1665 |
ru | 5573 | 790 |
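As a quick sanity check on the table, the following sketch recomputes the total size per language and the share held out for validation. The numbers are copied from the table; the variable names are illustrative only.

```python
# (train, validation) counts per language, copied from the table above.
SPLIT_SIZES = {
    "br": (13211, 1399),
    "cy": (13211, 1665),
    "ga": (13211, 1665),
    "mt": (13211, 1665),
    "ru": (5573, 790),
}

totals = {lang: train + val for lang, (train, val) in SPLIT_SIZES.items()}

for lang, (train, val) in SPLIT_SIZES.items():
    share = val / totals[lang]
    print(f"{lang}: {totals[lang]} examples, {share:.1%} in validation")
```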
The WebNLG dataset was created to promote the development (i) of RDF verbalisers and (ii) of microplanners able to handle a wide range of linguistic constructions. The dataset aims at covering knowledge in different domains ("categories"). The same properties and entities can appear in several categories.
The data was compiled from raw DBpedia triples. This paper explains how the triples were selected.
Initial Data Collection and Normalization
Initial triples extracted from DBpedia were modified in several ways. See official documentation for the most frequent changes that have been made. An original tripleset and a modified tripleset usually represent a one-to-one mapping. However, there are cases with many-to-one mappings when several original triplesets are mapped to one modified tripleset.
Entities that served as roots of RDF trees are listed in this file.
The English WebNLG 2020 dataset (v3.0) for training comprises data-text pairs for 16 distinct DBpedia categories:
The Russian dataset (v3.0) comprises data-text pairs for 9 distinct categories: Airport, Astronaut, Building, CelestialBody, ComicsCharacter, Food, Monument, SportsTeam, and University.
Who are the source language producers?
There are no source texts; all textual material was compiled during the annotation process.
Annotators were first asked to create sentences that verbalise single triples. In a second round, annotators were asked to combine single-triple sentences into sentences that cover 2 triples, and so on up to 7 triples. Quality checks were performed to ensure the quality of the annotations. See Section 3.3 in the dataset paper.
Russian data was translated from English with an MT system and then post-edited by crowdworkers. See Section 2.2 of this paper.
Who are the annotators?
All references were collected through crowdsourcing platforms (CrowdFlower/Figure 8 and Amazon Mechanical Turk). For Russian, post-editing was done using the Yandex.Toloka crowdsourcing platform.
Neither the dataset as published nor the annotation process involves the collection or sharing of any kind of personal or demographic information.
We do not foresee any negative social impact in particular from this dataset or task.
Positive outlooks: Being able to generate good quality text from RDF data would permit, e.g., making this data more accessible to lay users, enriching existing text with information drawn from knowledge bases such as DBpedia or describing, comparing and relating entities present in these knowledge bases.
This dataset was created using DBpedia RDF triples, which naturally exhibit biases that have been found to exist in Wikipedia, such as some forms of gender bias.
The choice of entities, described by RDF trees, was not controlled. As such, the data may contain gender biases; for instance, all the astronauts described by RDF triples are male. Hence, in the texts, the pronouns he/him/his occur more often. Similarly, entities may be related to Western culture more often than to other cultures.
The quality of the crowdsourced references is limited, in particular in terms of fluency/naturalness of the collected texts.
Russian data was machine-translated and then post-edited by crowdworkers, so some examples may still exhibit issues related to bad translations.
The principal curator of the dataset is Anastasia Shimorina (Université de Lorraine / LORIA, France). Throughout the WebNLG releases, several people contributed to their construction: Claire Gardent (CNRS / LORIA, France), Shashi Narayan (Google, UK), Laura Perez-Beltrachini (University of Edinburgh, UK), Elena Khasanova, and Thiago Castro Ferreira (Federal University of Minas Gerais, Brazil). The dataset construction was funded by the French National Research Agency (ANR).
The dataset uses the cc-by-nc-sa-4.0 license. The source DBpedia project uses the cc-by-sa-3.0 and gfdl-1.1 licenses.
If you use the WebNLG corpus, cite:
@inproceedings{web_nlg,
  author    = {Claire Gardent and Anastasia Shimorina and Shashi Narayan and Laura Perez{-}Beltrachini},
  editor    = {Regina Barzilay and Min{-}Yen Kan},
  title     = {Creating Training Corpora for {NLG} Micro-Planners},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, {ACL} 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers},
  pages     = {179--188},
  publisher = {Association for Computational Linguistics},
  year      = {2017},
  url       = {https://doi.org/10.18653/v1/P17-1017},
  doi       = {10.18653/v1/P17-1017}
}
Thanks to @albertvillanova for adding this dataset.