Datasets:

GEM/web_nlg

Languages:

en

Multilinguality:

unknown

Language creators:

unknown

Annotations creators:

unknown

Source datasets:

original

Dataset Card for GEM/web_nlg

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

WebNLG is a bilingual dataset (English, Russian) of parallel DBpedia triple sets and short texts that cover about 450 different DBpedia properties. The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short text and to handle micro-planning (i.e., sentence segmentation and ordering, referring expression generation, aggregation). The goal of the task is to generate texts starting from 1 to 7 input triples which have entities in common (so the input is actually a connected Knowledge Graph). The dataset contains about 17,000 triple sets and 45,000 crowdsourced texts in English, and 7,000 triple sets and 19,000 crowdsourced texts in Russian. A challenging test set section with entities and/or properties that have not been seen at training time is available.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/web_nlg')

The data loader can be found here.

website

Website

paper

First Dataset Release, WebNLG Challenge 2017 Report, WebNLG Challenge 2020 Report

authors

The principal curator of the dataset is Anastasia Shimorina (Université de Lorraine / LORIA, France). Throughout the WebNLG releases, several people contributed to their construction: Claire Gardent (CNRS / LORIA, France), Shashi Narayan (Google, UK), Laura Perez-Beltrachini (University of Edinburgh, UK), Elena Khasanova, and Thiago Castro Ferreira (Federal University of Minas Gerais, Brazil).

Dataset Overview

Where to find the Data and its Documentation

Webpage

Website

Download

Gitlab

Paper

First Dataset Release, WebNLG Challenge 2017 Report, WebNLG Challenge 2020 Report

BibTex

Initial release of the dataset:

@inproceedings{gardent2017creating,
  author = 	"Gardent, Claire
        and Shimorina, Anastasia
        and Narayan, Shashi
        and Perez-Beltrachini, Laura",
  title = 	"Creating Training Corpora for NLG Micro-Planners",
  booktitle = 	"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = 	"2017",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"179--188",
  location = 	"Vancouver, Canada",
  doi = 	"10.18653/v1/P17-1017",
  url = 	"http://www.aclweb.org/anthology/P17-1017"
}

The latest version 3.0:

@inproceedings{castro-ferreira20:bilin-bi-direc-webnl-shared,
  title = {The 2020 Bilingual, Bi-Directional WebNLG+ Shared Task Overview and Evaluation Results (WebNLG+ 2020)},
  author = {Castro Ferreira, Thiago and
            Gardent, Claire and
            Ilinykh, Nikolai and
            van der Lee, Chris and
            Mille, Simon and
            Moussallem, Diego and
            Shimorina, Anastasia},
  booktitle = {Proceedings of the 3rd WebNLG Workshop on Natural Language Generation from the Semantic Web (WebNLG+ 2020)},
  pages = {55--76},
  year = {2020},
  address = {Dublin, Ireland (Virtual)},
  publisher = {Association for Computational Linguistics}
}

Contact Email

webnlg-challenge@inria.fr

Has a Leaderboard?

yes

Leaderboard Link

Website

Leaderboard Details

The model outputs are evaluated against the crowdsourced references; the leaderboard reports BLEU-4, METEOR, chrF++, TER, BERTScore and BLEURT scores.

Languages and Intended Use

Multilingual?

yes

Covered Languages

Russian, English

License

cc-by-nc-4.0: Creative Commons Attribution Non Commercial 4.0 International

Intended Use

The WebNLG dataset was created to promote the development (i) of RDF verbalisers and (ii) of microplanners able to handle a wide range of linguistic constructions. The dataset aims at covering knowledge in different domains ("categories"). The same properties and entities can appear in several categories.

Primary Task

Data-to-Text

Communicative Goal

A model should verbalize all and only the provided input triples in natural language.

Credit

Curation Organization Type(s)

academic

Curation Organization(s)

Université de Lorraine / LORIA, France, CNRS / LORIA, France, University of Edinburgh, UK, Federal University of Minas Gerais, Brazil

Dataset Creators

The principal curator of the dataset is Anastasia Shimorina (Université de Lorraine / LORIA, France). Throughout the WebNLG releases, several people contributed to their construction: Claire Gardent (CNRS / LORIA, France), Shashi Narayan (Google, UK), Laura Perez-Beltrachini (University of Edinburgh, UK), Elena Khasanova, and Thiago Castro Ferreira (Federal University of Minas Gerais, Brazil).

Funding

The dataset construction was funded by the French National Research Agency (ANR).

Who added the Dataset to GEM?

Simon Mille and Sebastian Gehrmann added the dataset and wrote the data card.

Dataset Structure

Data Fields

See official documentation.

entry: a data instance of the benchmark. Each entry has five attributes: a DBpedia category (category), entry ID (eid), shape (shape), shape type (shape_type), and triple set size (size).

  • shape: a string representation of the RDF tree with nested parentheses where X is a node (see Newick tree format).

  • shape_type: a type of the tree shape. We identify three types of tree shapes:

    • chain (the object of one triple is the subject of the other);
    • sibling (triples with a shared subject);
    • mixed (both chain and sibling types present).

  • eid: an entry ID. It is unique only within a category and a size.

  • category: a DBpedia category (Astronaut, City, MusicalWork, Politician, etc.).

  • size: the number of RDF triples in a set. Ranges from 1 to 7.
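The three shape types can be derived from a triple set itself. The following is a minimal sketch, assuming triples are given as (subject, property, object) tuples. The function name classify_shape_type is hypothetical and not part of the WebNLG tooling, and returning None when neither pattern is present (for instance, for a single triple) is an assumption rather than documented behaviour.

```python
from collections import Counter


def classify_shape_type(triples):
    """Return 'chain', 'sibling', 'mixed', or None for (subject, property, object) triples."""
    subject_counts = Counter(subject for subject, _, _ in triples)
    objects = {obj for _, _, obj in triples}
    # sibling: at least two triples share a subject
    has_sibling = any(count > 1 for count in subject_counts.values())
    # chain: the object of one triple is the subject of another
    has_chain = any(subject in objects for subject in subject_counts)
    if has_sibling and has_chain:
        return "mixed"
    if has_chain:
        return "chain"
    if has_sibling:
        return "sibling"
    return None  # assumption: no shape type when neither pattern occurs


trane = [
    ("Trane", "foundingDate", "1913-01-01"),
    ("Trane", "location", "Ireland"),
]
print(classify_shape_type(trane))  # sibling
```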

Each entry has three fields: originaltripleset, modifiedtripleset, and lexs.

originaltripleset: a set of RDF triples as extracted from DBpedia. Each set of RDF triples is a tree. Triples have the subject-predicate-object structure.

modifiedtripleset: a set of RDF triples as presented to crowdworkers (for more details on modifications, see below).

Original and modified triples serve different purposes: the original triples link the data to a knowledge base (DBpedia), whereas the modified triples ensure consistency and homogeneity throughout the data. To train models, the modified triples should be used.
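As an illustration of feeding the modified triples to a model, the sketch below flattens a triple set into a single input string. It assumes each triple is a dictionary with subject, property, and object keys. The helper name linearize_triples and the " | " and " && " separators are illustrative choices, not a format prescribed by WebNLG or GEM.

```python
def linearize_triples(modified_tripleset, sep=" && "):
    """Join triples into one string of 'subject | property | object' parts.

    The separators are illustrative assumptions, not a mandated format.
    """
    parts = []
    for triple in modified_tripleset:
        parts.append(f"{triple['subject']} | {triple['property']} | {triple['object']}")
    return sep.join(parts)


example = [
    {"subject": "Trane", "property": "foundingDate", "object": "1913-01-01"},
    {"subject": "Trane", "property": "location", "object": "Ireland"},
]
print(linearize_triples(example))
# Trane | foundingDate | 1913-01-01 && Trane | location | Ireland
```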

lexs (short for lexicalisations): a natural language text verbalising the triples. Each lexicalisation has two attributes: a comment (comment), and a lexicalisation ID (lid). By default, comments have the value good, except in rare cases when they were manually marked as toFix. That was done during the corpus creation, when it was seen that a lexicalisation did not exactly match a triple set.
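If only lexicalisations marked good are wanted as references, they can be selected by their comment attribute. This is a small sketch, assuming lexicalisations are dictionaries with comment, lex, and lid keys. The helper name and the second toy lexicalisation are made up for illustration, and whether to exclude toFix texts at all is a choice left to the user.

```python
def good_lexicalisations(lexs):
    """Return the texts of all lexicalisations whose comment is 'good'."""
    return [lex["lex"] for lex in lexs if lex.get("comment") == "good"]


lexs = [
    {"comment": "good", "lex": "Trane is based in Ireland.", "lid": "Id1"},
    # Hypothetical entry, invented to show the filter at work.
    {"comment": "toFix", "lex": "Trane has employees.", "lid": "Id2"},
]
print(good_lexicalisations(lexs))  # ['Trane is based in Ireland.']
```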

Russian data has additional optional fields compared to English:

<dbpedialinks>: RDF triples extracted from DBpedia between English and Russian entities by means of the property sameAs.

<links>: RDF triples created manually for some entities to serve as pointers to translators. There are two types of them:

  • with sameAs (Spaniards | sameAs | испанцы)

  • with includes (Tomatoes, guanciale, cheese, olive oil | includes | гуанчиале). Those were mostly created for string literals to translate some parts of them.

Lexicalisations in the Russian WebNLG have a new parameter lang (values: en, ru) because original English texts were kept in the Russian version (see the example above).

Example Instance
{
"entry": {
    "category": "Company",
    "size": "4",
    "shape": "(X (X) (X) (X) (X))",
    "shape_type": "sibling",
    "eid": "Id21",
    "lexs": [
        {
        "comment": "good",
        "lex": "Trane, which was founded on January 1st 1913 in La Crosse, Wisconsin, is based in Ireland. It has 29,000 employees.",
        "lid": "Id1"
        }
    ],
    "modifiedtripleset": [
        {
        "subject": "Trane",
        "property": "foundingDate",
        "object": "1913-01-01"
        },
        {
        "subject": "Trane",
        "property": "location",
        "object": "Ireland"
        },
        {
        "subject": "Trane",
        "property": "foundationPlace",
        "object": "La_Crosse,_Wisconsin"
        },
        {
        "subject": "Trane",
        "property": "numberOfEmployees",
        "object": "29000"
        }

    ],
    "originaltriplesets": {
        "originaltripleset": [
            {
            "subject": "Trane",
            "property": "foundingDate",
            "object": "1913-01-01"
            },
            {
            "subject": "Trane",
            "property": "location",
            "object": "Ireland"
            },
            {
            "subject": "Trane",
            "property": "foundationPlace",
            "object": "La_Crosse,_Wisconsin"
            },
            {
            "subject": "Trane",
            "property": "numberOfEmployees",
            "object": "29000"
            }
        ]
    }

    }
}

The XML-formatted example is here.

Data Splits

English (v3.0)   Train    Dev     Test
triple sets      13,211   1,667   1,779
texts            35,426   4,464   5,150
properties       372      290     220

Russian (v3.0)   Train    Dev     Test
triple sets      5,573    790     1,102
texts            14,239   2,026   2,780
properties       226      115     192

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

Due to the constrained generation task, this dataset can be used to evaluate very specific and narrow generation capabilities.

Similar Datasets

yes

Unique Language Coverage

yes

Difference from other GEM datasets

The RDF-triple format is unique to WebNLG.

Ability that the Dataset measures

surface realization

GEM-Specific Curation

Modified for GEM?

yes

GEM Modifications

other

Modification Details

No changes were made to the main content of the dataset. Version 3.0 of the dataset is used.

Additional Splits?

yes

Split Information

23 special test sets for WebNLG were added to the GEM evaluation suite, 12 for English and 11 for Russian. For both languages, we created subsets of the training and development sets of ~500 randomly selected inputs each. The inputs were sampled proportionally from each category.
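Proportional sampling by category can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used for GEM: entries are assumed to be dictionaries with a category key, the helper name sample_proportionally is hypothetical, and per-category quotas are rounded, so the total is only approximately the requested size.

```python
import random
from collections import defaultdict


def sample_proportionally(entries, n, seed=0):
    """Sample about n entries, each category represented in proportion to its size."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for entry in entries:
        by_category[entry["category"]].append(entry)
    total = len(entries)
    sample = []
    for category in sorted(by_category):
        group = by_category[category]
        # Rounded quota: the overall sample size is only approximately n.
        quota = round(n * len(group) / total)
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample


toy = [{"category": "City", "eid": i} for i in range(80)]
toy += [{"category": "Astronaut", "eid": i} for i in range(20)]
subset = sample_proportionally(toy, 10)
print(len(subset))  # 10
```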

Two types of transformations have been applied to WebNLG: (i) input scrambling (English and Russian) and (ii) numerical value replacements (English); in both cases, a subset of about 500 inputs was randomly selected. For (i), the order of the triples was randomly reassigned (each triple kept the same Subject-Property-Object internal order). For (ii), each value was replaced with a new random value in the same format as the original cardinal value (e.g., alpha, integer, or floating-point). The new number is lower-bounded by zero and upper-bounded by the next power of 10 above the given value (e.g., replacing 54 yields a random value between 0 and 100). Floating-point values keep their degree of precision.
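Both transformations can be sketched in a few lines. The code below is an illustration, not the GEM implementation: the function names are hypothetical, the upper bound is taken to be the smallest power of 10 strictly greater than the value (one reading of the description), only integer and floating-point strings are handled, and "alpha" values are left out.

```python
import random


def scramble_triples(triples, rng):
    """Return the triples in a random order; each triple itself is unchanged."""
    shuffled = list(triples)
    rng.shuffle(shuffled)
    return shuffled


def replace_number(value, rng):
    """Replace a numeric string with a random one in the same format.

    Assumption: the upper bound is the smallest power of 10 above the value.
    """
    number = abs(float(value))
    upper = 10
    while upper <= number:
        upper *= 10
    if "." in value:
        # Keep the same number of decimal places as the original value.
        decimals = len(value.split(".")[1])
        return f"{rng.uniform(0, upper):.{decimals}f}"
    return str(rng.randrange(0, upper + 1))


rng = random.Random(0)
triples = [("Trane", "location", "Ireland"), ("Trane", "numberOfEmployees", "29000")]
print(scramble_triples(triples, rng))
print(replace_number("54", rng))    # an integer string between 0 and 100
print(replace_number("3.14", rng))  # a value between 0 and 10 with two decimals
```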

For both languages, we identified different subsets of the test set that can be compared to each other to give a better understanding of the results. There are currently 8 selections:

Selection 1 (size): input length. This selection corresponds to the number of predicates in the input. By comparing inputs of different lengths, we can see to what extent NLG systems are able to handle different input sizes. The table below provides the relevant frequencies. Please be aware that comparing selections with fewer than 100 items may result in unreliable comparisons.

Input length   Frequency English   Frequency Russian
1              369                 254
2              349                 200
3              350                 214
4              305                 214
5              213                 159
6              114                 32
7              79                  29

Selection 2 (frequency): seen/unseen single predicates. This selection corresponds to the inputs with only one predicate. We compare which predicates are seen/unseen in the training data. The table below provides the relevant frequencies. Note that the comparison is only valid for English, not for Russian, since there is only one Russian example with an unseen single predicate.

Single predicate in training   Frequency English   Frequency Russian
Seen                           297                 253
Unseen                         72                  1

Selection 3 (frequency): seen/unseen combinations of predicates. This selection checks for all combinations of predicates whether that combination has been seen in the training data. For example: if the combination of predicates A and B is seen, that means that there is an input in the training data consisting of two triples, where one triple uses predicate A and the other uses predicate B. If the combination is unseen, then the converse is true. The table below provides the relevant frequencies.

Predicate combination in training   Frequency English   Frequency Russian
unseen                              1295                354
seen                                115                 494
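The seen/unseen check for predicate combinations can be sketched as below. This is an illustration under assumptions: inputs are lists of (subject, property, object) tuples, a combination is treated as the set of predicates in an input (ignoring order and repetition), and single-predicate inputs are skipped because Selection 2 covers them. The function names are hypothetical.

```python
def predicate_combination(triples):
    """The set of predicates in an input, ignoring order and repetition."""
    return frozenset(prop for _, prop, _ in triples)


def split_by_seen_combination(train_inputs, test_inputs):
    """Group multi-triple test inputs by whether their predicate combination occurs in training."""
    seen_combinations = {predicate_combination(triples) for triples in train_inputs}
    result = {"seen": [], "unseen": []}
    for triples in test_inputs:
        if len(triples) < 2:
            continue  # single-predicate inputs are handled by Selection 2
        key = "seen" if predicate_combination(triples) in seen_combinations else "unseen"
        result[key].append(triples)
    return result


train = [[("a", "birthPlace", "b"), ("a", "occupation", "c")]]
test = [
    [("x", "occupation", "y"), ("x", "birthPlace", "z")],  # same combination as in train
    [("x", "birthPlace", "y"), ("x", "almaMater", "z")],   # new combination
    [("x", "birthPlace", "y")],                            # skipped: single predicate
]
groups = split_by_seen_combination(train, test)
print(len(groups["seen"]), len(groups["unseen"]))  # 1 1
```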

Selection 4 (frequency): seen/unseen arguments. This selection checks, for each input, whether the arg1s and arg2s in the input have been seen during training. Seen is the default: the arg1s of an input count as unseen only if all arg1 instances in that input are unseen, and the same holds for arg2. So "seen" here really means that at least some of the arg1s or arg2s in the input were seen in training. The table below provides the relevant frequencies. Note that the comparison is only valid for English, not for Russian, since there are very few Russian examples with unseen arguments.

Arguments seen in training?   Frequency English   Frequency Russian
both_seen                     518                 1075
both_unseen                   1177                4
arg1_unseen                   56                  19
arg2_unseen                   28                  4
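The "seen by default" rule can be written down compactly. The sketch below assumes that arg1 is the subject and arg2 the object of a triple, and that each is compared against the training arguments in the same position; both are assumptions, as is the function name argument_label.

```python
def argument_label(triples, train_arg1s, train_arg2s):
    """Label an input as both_seen, both_unseen, arg1_unseen, or arg2_unseen.

    An argument position counts as unseen only if every instance of it
    in the input is absent from the training data.
    """
    arg1_unseen = all(subject not in train_arg1s for subject, _, _ in triples)
    arg2_unseen = all(obj not in train_arg2s for _, _, obj in triples)
    if arg1_unseen and arg2_unseen:
        return "both_unseen"
    if arg1_unseen:
        return "arg1_unseen"
    if arg2_unseen:
        return "arg2_unseen"
    return "both_seen"


train_arg1s = {"Trane"}
train_arg2s = {"Ireland"}
mixed_input = [("Trane", "location", "Ireland"), ("Acme", "location", "Mars")]
print(argument_label(mixed_input, train_arg1s, train_arg2s))  # both_seen
```

In the example, one of the two subjects and one of the two objects were seen in training, so the input still counts as both_seen.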

Selection 5 (shape): repeated subjects. For this selection, the subsets are based on the number of times a subject is repeated in the input. Only the maximum number of repetitions is taken into account: if in one input a subject appears 3 times and a different subject 2 times, this input will be in the "3_subjects_same" split. Unique_subjects means all subjects are different.

Max num. of repeated subjects   Frequency English   Frequency Russian
unique_subjects                 453                 339
2_subjects_same                 414                 316
3_subjects_same                 382                 217
4_subjects_same                 251                 143
5_subjects_same                 158                 56
6_subjects_same                 80                  19
7_subjects_same                 41                  12
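The split label for repeated subjects follows from counting subjects and taking the maximum. A minimal sketch, assuming (subject, property, object) tuples; the function name is hypothetical.

```python
from collections import Counter


def repeated_subject_label(triples):
    """Return the split name based on the most frequently repeated subject."""
    max_repeats = max(Counter(subject for subject, _, _ in triples).values())
    if max_repeats == 1:
        return "unique_subjects"
    return f"{max_repeats}_subjects_same"


triples = [
    ("A", "p1", "x"), ("A", "p2", "y"), ("A", "p3", "z"),  # subject A three times
    ("B", "p4", "u"), ("B", "p5", "v"),                    # subject B twice
]
print(repeated_subject_label(triples))  # 3_subjects_same
```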

Selection 6 (shape): repeated objects. Same as for subjects above, but for objects. There are far fewer cases of repeated objects, so there are only two categories for this selection, unique_objects and some_objects_repeated; for the latter, we have up to 3 coreferring objects in English, and XXX in Russian.

Max num. of repeated objects   Frequency English   Frequency Russian
unique_objects                 1654                1099
some_objects_same              125                 3

Selection 7 (shape): repeated properties. Same as for objects above, but for properties; up to two properties can be the same in English, up to XXX in Russian.

Max num. of repeated properties   Frequency English   Frequency Russian
unique_properties                 1510                986
some_properties_same              269                 116

Selection 8 (shape): entities that appear both as subject and object. For this selection, we grouped together the inputs in which no entity is found as both subject and object and, on the other hand, the inputs in which one or more entities appear both as subject and as object. We found up to two such entities per input in English, and up to XXX in Russian.

Max num. of objects and subjects in common   Frequency English   Frequency Russian
unique_properties                            1322                642
some_properties_same                         457                 460

Split Motivation

Robustness

Getting Started with the Task

Pointers to Resources

Dataset construction: main dataset paper, RDF triple extraction, Russian translation

WebNLG Challenge 2017: webpage, paper

WebNLG Challenge 2020: webpage, paper

Enriched version of WebNLG: repository, paper

Related research papers: webpage

Previous Results

Previous Results

Proposed Evaluation

For both languages, the participating systems are automatically evaluated in a multi-reference scenario. Each English hypothesis is compared to a maximum of 5 references, and each Russian one to a maximum of 7 references. On average, English data has 2.89 references per test instance, and Russian data has 2.52 references per instance.
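The average reference counts can be reproduced from the test-split figures in the Data Splits table by dividing the number of texts by the number of triple sets. A short worked check; the function name is only illustrative.

```python
def average_references(texts, triple_sets):
    """Average number of reference texts per test instance."""
    return texts / triple_sets


# Test-split counts (texts, triple sets) from the Data Splits table.
test_counts = {"English": (5150, 1779), "Russian": (2780, 1102)}
for language, (texts, triple_sets) in test_counts.items():
    print(f"{language}: {average_references(texts, triple_sets):.2f}")
# English: 2.89
# Russian: 2.52
```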

In a human evaluation, examples are uniformly sampled across triple set sizes and the following dimensions are assessed (on MTurk and Yandex.Toloka):

  • Data Coverage: Does the text include descriptions of all predicates presented in the data?
  • Relevance: Does the text describe only such predicates (with related subjects and objects), which are found in the data?
  • Correctness: When describing predicates which are found in the data, does the text mention the correct objects and adequately introduce the subject for this specific predicate?
  • Text Structure: Is the text grammatical, well-structured, and written in acceptable English?
  • Fluency: Is it possible to say that the text progresses naturally, forms a coherent whole, and is easy to understand?

For additional information like the instructions, we refer to the original paper.

Previous results available?

yes

Other Evaluation Approaches

We evaluated a wide range of models as part of the GEM benchmark.

Relevant Previous Results

Results can be found on the GEM website.

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

yes - related tasks

Social Impact Observations

We do not foresee any negative social impact in particular from this dataset or task.

Positive outlooks: Being able to generate good quality text from RDF data would permit, e.g., making this data more accessible to lay users, enriching existing text with information drawn from knowledge bases such as DBpedia or describing, comparing and relating entities present in these knowledge bases.

Impact on Under-Served Communities

Addresses needs of underserved Communities?

no

Discussion of Biases

Any Documented Social Biases?

yes

Links and Summaries of Analysis Work

This dataset is created using DBpedia RDF triples, which naturally exhibit biases that have been found to exist in Wikipedia, such as some forms of gender bias.

The choice of entities, described by RDF trees, was not controlled. As such, they may contain gender biases; for instance, all the astronauts described by RDF triples are male. Hence, in texts, pronouns he/him/his occur more often. Similarly, entities can be related to the Western culture more often than to other cultures.

Are the Language Producers Representative of the Language?

In English, the dataset is limited to the language that crowdraters speak. In Russian, the language is heavily biased by the translationese of the translation system whose output was post-edited.

Considerations for Using the Data

PII Risks and Liability

Potential PII Risk

There is no PII in this dataset.

Licenses

Copyright Restrictions on the Dataset

non-commercial use only

Copyright Restrictions on the Language Data

public domain

Known Technical Limitations

Technical Limitations

The quality of the crowdsourced references is limited, in particular in terms of fluency/naturalness of the collected texts.

Russian data was machine-translated and then post-edited by crowdworkers, so some examples may still exhibit issues related to bad translations.

Unsuited Applications

Only a limited number of domains are covered in this dataset. As a result, it cannot be used as a general-purpose realizer.