Dataset:

GEM/TaTA

Task:

table-to-text

Languages:

Multilinguality:

yes

Size:

size_categories:unknown

Language creators:

unknown

Annotation creators:

none

Source datasets:

original

Preprint library:

arxiv:2211.00142 arxiv:2112.12870

Other:

data-to-text

License:

cc-by-sa-4.0


Dataset Card for GEM/TaTA

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

Existing data-to-text generation datasets are mostly limited to English. Table-to-Text in African languages (TaTA) addresses this lack of data as the first large multilingual table-to-text dataset with a focus on African languages. TaTA was created by transcribing figures and accompanying text in bilingual reports by the Demographic and Health Surveys Program, followed by professional translation to make the dataset fully parallel. TaTA includes 8,700 examples in nine languages including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and a zero-shot test language (Russian).

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/TaTA')

The data loader can be found here.

website

Github

paper

ArXiv

authors

Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, Clara Rivera

Dataset Overview

Where to find the Data and its Documentation

Webpage

Github

Download

Github

Paper

ArXiv

BibTex

@misc{gehrmann2022TaTA,
  Author = {Sebastian Gehrmann and Sebastian Ruder and Vitaly Nikolaev and Jan A. Botha and Michael Chavinda and Ankur Parikh and Clara Rivera},
  Title = {TaTa: A Multilingual Table-to-Text Dataset for African Languages},
  Year = {2022},
  Eprint = {arXiv:2211.00142},
}

Contact Name

Sebastian Ruder

Contact Email

ruder@google.com

Has a Leaderboard?

yes

Leaderboard Link

Github

Leaderboard Details

The paper introduces StATA, a learned metric trained on human ratings, which is used to rank the approaches submitted to the leaderboard.

Languages and Intended Use

Multilingual?

yes

Covered Languages

English, Portuguese, Arabic, French, Hausa, Swahili (macrolanguage), Igbo, Yoruba, Russian

Whose Language?

The language is taken from reports published by the Demographic and Health Surveys (DHS) Program.

License

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Intended Use

The dataset poses significant reasoning challenges and is thus meant as a way to assess the verbalization and reasoning capabilities of structure-to-text models.

Primary Task

Data-to-Text

Communicative Goal

Summarize key information from a table in a single sentence.

Credit

Curation Organization Type(s)

industry

Curation Organization(s)

Google Research

Dataset Creators

Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, Clara Rivera

Funding

Google Research

Who added the Dataset to GEM?

Sebastian Gehrmann (Google Research)

Dataset Structure

Data Fields

example_id: The ID of the example. Each ID (e.g., AB20-ar-1) consists of three parts: the document ID, the ISO 639-1 language code, and the index of the table within the document.
title: The title of the table.
unit_of_measure: A description of the numerical value of the data, e.g., percentage of households with clean water.
chart_type: The kind of chart associated with the data. We consider the following (normalized) types: horizontal bar chart, map chart, pie graph, tables, line chart, pie chart, vertical chart type, line graph, vertical bar chart, and other.
was_translated: Whether the table was transcribed in the original language of the report or translated.
table_data: The table content, stored as a JSON-encoded string of a two-dimensional list, organized by row, from left to right, starting from the top of the table. The number of items varies per table. Empty cells are given as empty string values in the corresponding table cell.
table_text: The sentences forming the description of each table, encoded as a JSON object. In the case of more than one sentence, these are separated by commas. The number of items varies per table.
linearized_input: A single string that contains the title, the unit of measure, and the table content, separated by vertical bars (|). Each cell is given together with its row and column headers in parentheses, e.g., (Medium Empowerment, Mali, 17.9).
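The example_id and table_data fields described above can be unpacked with the standard library alone. The following is a minimal sketch, not part of the official data loader: the helper names parse_example_id and decode_table are hypothetical, and splitting the ID from the right assumes that the language code and the table index never contain a hyphen.

```python
import json


def parse_example_id(example_id: str) -> dict:
    """Split an ID such as 'AB20-ar-1' into its three documented parts."""
    # Assumption: only the document ID could contain '-', so split from the right.
    document_id, language, table_index = example_id.rsplit("-", 2)
    return {
        "document_id": document_id,
        "language": language,
        "table_index": int(table_index),
    }


def decode_table(table_data: str) -> list:
    """Decode the JSON-encoded two-dimensional list stored in `table_data`."""
    return json.loads(table_data)


if __name__ == "__main__":
    print(parse_example_id("AB20-ar-1"))
    print(decode_table('[["", "Child mortality"], ["1990 JPFHS", 5]]'))
```

Decoded rows keep the documented layout: the first row holds the column headers, and empty cells remain empty strings.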

Reason for Structure

The structure includes all available information for the infographics on which the dataset is based.

How were labels chosen?

Annotators looked through the English text to identify sentences that describe an infographic. They then identified the corresponding location in the parallel non-English document. All sentences were extracted.

Example Instance

{
    "example_id": "FR346-en-39",
    "title": "Trends in early childhood mortality rates",
    "unit_of_measure": "Deaths per 1,000 live births for the 5-year period before the survey",
    "chart_type": "Line chart",
    "was_translated": "False",
    "table_data": "[[\"\", \"Child mortality\", \"Neonatal mortality\", \"Infant mortality\", \"Under-5 mortality\"], [\"1990 JPFHS\", 5, 21, 34, 39], [\"1997 JPFHS\", 6, 19, 29, 34], [\"2002 JPFHS\", 5, 16, 22, 27], [\"2007 JPFHS\", 2, 14, 19, 21], [\"2009 JPFHS\", 5, 15, 23, 28], [\"2012 JPFHS\", 4, 14, 17, 21], [\"2017-18 JPFHS\", 3, 11, 17, 19]]",
    "table_text": [
      "neonatal, infant, child, and under-5 mortality rates for the 5 years preceding each of seven JPFHS surveys (1990 to 2017-18).",
      "Under-5 mortality declined by half over the period, from 39 to 19 deaths per 1,000 live births.",
      "The decline in mortality was much greater between the 1990 and 2007 surveys than in the most recent period.",
      "Between 2012 and 2017-18, under-5 mortality decreased only modestly, from 21 to 19 deaths per 1,000 live births, and infant mortality remained stable at 17 deaths per 1,000 births."
    ],
    "linearized_input": "Trends in early childhood mortality rates | Deaths per 1,000 live births for the 5-year period before the survey | (Child mortality, 1990 JPFHS, 5) (Neonatal mortality, 1990 JPFHS, 21) (Infant mortality, 1990 JPFHS, 34) (Under-5 mortality, 1990 JPFHS, 39) (Child mortality, 1997 JPFHS, 6) (Neonatal mortality, 1997 JPFHS, 19) (Infant mortality, 1997 JPFHS, 29) (Under-5 mortality, 1997 JPFHS, 34) (Child mortality, 2002 JPFHS, 5) (Neonatal mortality, 2002 JPFHS, 16) (Infant mortality, 2002 JPFHS, 22) (Under-5 mortality, 2002 JPFHS, 27) (Child mortality, 2007 JPFHS, 2) (Neonatal mortality, 2007 JPFHS, 14) (Infant mortality, 2007 JPFHS, 19) (Under-5 mortality, 2007 JPFHS, 21) (Child mortality, 2009 JPFHS, 5) (Neonatal mortality, 2009 JPFHS, 15) (Infant mortality, 2009 JPFHS, 23) (Under-5 mortality, 2009 JPFHS, 28) (Child mortality, 2012 JPFHS, 4) (Neonatal mortality, 2012 JPFHS, 14) (Infant mortality, 2012 JPFHS, 17) (Under-5 mortality, 2012 JPFHS, 21) (Child mortality, 2017-18 JPFHS, 3) (Neonatal mortality, 2017-18 JPFHS, 11) (Infant mortality, 2017-18 JPFHS, 17) (Under-5 mortality, 2017-18 JPFHS, 19)"
  }
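The relationship between table_data and linearized_input in this instance can be sketched in a few lines. This is an approximation inferred from the example above, not the dataset's official linearization code: the function name linearize is hypothetical, and the sketch assumes that the first row holds column headers, the first column holds row headers, and each cell is emitted as (column header, row header, value). How the original pipeline treats empty cells is not specified here, so the sketch simply keeps them.

```python
import json


def linearize(title: str, unit_of_measure: str, table_data: str) -> str:
    """Build a string in the style of `linearized_input` from the raw fields."""
    rows = json.loads(table_data)
    header, body = rows[0], rows[1:]
    triples = []
    for row in body:
        row_header = row[0]
        # Pair every value with its column header, skipping the header column.
        for column_header, value in zip(header[1:], row[1:]):
            triples.append(f"({column_header}, {row_header}, {value})")
    return " | ".join([title, unit_of_measure, " ".join(triples)])


if __name__ == "__main__":
    table = '[["", "Child mortality", "Neonatal mortality"], ["1990 JPFHS", 5, 21]]'
    print(linearize("Trends in early childhood mortality rates",
                    "Deaths per 1,000 live births", table))
```

In practice the precomputed linearized_input field can be used directly; a function like this is mainly useful for experimenting with alternative input formats.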

Data Splits

Train: Training set, includes examples with 0 or more references.
Validation: Validation set, includes examples with 3 or more references.
Test: Test set, includes examples with 3 or more references.
Ru: Russian zero-shot set. Includes English and Russian examples (Russian is not included in any of the other splits).

Splitting Criteria

The same table across languages is always in the same split, i.e., if table X is in the test split in language A, it will also be in the test split in language B. In addition to filtering examples without transcribed table values, every example of the development and test splits has at least 3 references. From the examples that fulfilled these criteria, 100 tables were sampled for each of the development and test sets, for a total of 800 examples each (100 tables in each of the eight non-Russian languages). A manual review process excluded a few tables in each set, resulting in a training set of 6,962 tables, a development set of 752 tables, and a test set of 763 tables.
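The guarantee that a table never appears in two splits can be checked from the example IDs alone. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the same table carries the same document ID and table index in every language, which the ID scheme suggests but this card does not state explicitly. It is meant for the train, validation, and test splits; the Ru split also contains English examples and may need separate treatment.

```python
from collections import defaultdict


def table_key(example_id: str) -> tuple:
    """Drop the language code so parallel versions of a table share one key."""
    document_id, _language, table_index = example_id.rsplit("-", 2)
    return (document_id, table_index)


def tables_in_multiple_splits(splits: dict) -> dict:
    """Map each table key found in more than one split to the splits it is in.

    `splits` maps a split name to an iterable of example_id strings.
    """
    seen = defaultdict(set)
    for split_name, example_ids in splits.items():
        for example_id in example_ids:
            seen[table_key(example_id)].add(split_name)
    return {key: names for key, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    toy = {"train": ["AB20-en-1", "AB20-fr-1"], "test": ["AB20-ar-2"]}
    print(tables_in_multiple_splits(toy))
```

An empty result means no table key is shared between the splits that were passed in.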

There are tables without references, without values, and others that are very large. The dataset is distributed as-is, but the paper describes multiple strategies to deal with data issues.
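One simple way to handle these issues is to filter examples before training. The sketch below is only one possible filter and is not necessarily any of the strategies described in the paper: the function name is_usable and the thresholds min_references and max_cells are hypothetical, and the code accepts table_text and table_data either as JSON-encoded strings or as already-decoded lists because the delivered type may depend on the loader.

```python
import json


def _as_list(value):
    # Fields may arrive as JSON-encoded strings or as already-decoded lists.
    return json.loads(value) if isinstance(value, str) else value


def is_usable(example: dict, min_references: int = 1, max_cells: int = 200) -> bool:
    """Return True if the example has references, values, and a moderate size."""
    references = [s for s in _as_list(example["table_text"]) if s.strip()]
    rows = _as_list(example["table_data"])
    # Values are the non-empty cells outside the header row and header column.
    values = [cell for row in rows[1:] for cell in row[1:] if cell != ""]
    n_cells = sum(len(row) for row in rows)
    return (
        len(references) >= min_references
        and len(values) > 0
        and n_cells <= max_cells
    )


if __name__ == "__main__":
    example = {"table_text": ["A sentence."], "table_data": '[["", "A"], ["r", 5]]'}
    print(is_usable(example))
```

Setting min_references=3 mirrors the documented requirement for the validation and test splits; the max_cells default is arbitrary and should be tuned to the model's input length.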

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

There is no other multilingual data-to-text dataset that is parallel across languages. Moreover, over 70% of the references in the dataset require reasoning, which makes the dataset both high-quality and challenging for models.

Similar Datasets

yes

Unique Language Coverage

yes

Difference from other GEM datasets

More languages, parallel across languages, grounded in infographics, not centered on Western entities or source documents

Ability that the Dataset measures

reasoning, verbalization, content planning

GEM-Specific Curation

Modified for GEM?

Additional Splits?

Getting Started with the Task

Pointers to Resources

The background section of the paper provides a list of related datasets.

Technical Terms

data-to-text: Term that refers to NLP tasks in which the input is structured information and the output is natural language.

Previous Results

Metrics

Other: Other Metrics

Other Metrics

StATA: A new metric associated with TaTA that is trained on human judgments and correlates with them much more strongly than existing metrics.

Proposed Evaluation

The creators used a human evaluation that measured attribution and reasoning capabilities of various models. Based on these ratings, they trained a new metric and showed that existing metrics fail to measure attribution.

Previous results available?

Dataset Curation

Original Curation

Original Curation Rationale

The curation rationale is to create a multilingual data-to-text dataset that is high-quality and challenging.

Communicative Goal

The communicative goal is to describe a table in a single sentence.

Sourced from Different Sources

Language Data

How was Language Data Obtained?

Found

Where was it found?

Single website

Language Producers

The language was produced by USAID as part of the Demographic and Health Surveys program (https://dhsprogram.com/).

Topics Covered

The topics are related to fertility, family planning, maternal and child health, gender, and nutrition.

Data Validation

validated by crowdworker

Was Data Filtered?

not filtered

Structured Annotations

Additional Annotations?

expert created

Number of Raters

11<n<50

Rater Qualifications

Professional annotator who is a fluent speaker of the respective language

Raters per Training Example

Raters per Test Example

Annotation Service?

yes

Which Annotation Service

other

Annotation Values

The additional annotations are for system outputs and references and serve to develop metrics for this task.

Any Quality Control?

validated by data curators

Quality Control Details

Ratings were compared to a small (English) expert-curated set of ratings to ensure high agreement. There were additional rounds of training and feedback for annotators to ensure high-quality judgments.

Consent

Any Consent Policy?

yes

Other Consented Downstream Use

In addition to data-to-text generation, the dataset can be used for translation or multimodal research.

Private Identifying Information (PII)

Contains PII?

no PII

Justification for no PII

The DHS program only publishes aggregate survey information and thus, no personal information is included.

Maintenance

Any Maintenance Plan?

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Impact on Under-Served Communities

Addresses needs of underserved Communities?

yes

Details on how Dataset Addresses the Needs

The dataset focuses on data about African countries, and the languages included in the dataset are spoken in Africa. It aims to improve the representation of African languages in the NLP and NLG communities.

Discussion of Biases

Any Documented Social Biases?

Are the Language Producers Representative of the Language?

The language producers for this dataset are employees of the DHS program, which is US-funded. While the data is focused on African countries, there may be implicit Western biases in how the data is presented.

Considerations for Using the Data

PII Risks and Liability

Licenses

open license - commercial use allowed

Known Technical Limitations

Technical Limitations

While tables were transcribed in the available languages, the majority of the tables were published in English as the first language. Professional translators were used to translate the data, which makes it plausible that some translationese exists in the data. Moreover, it was unavoidable to collect reference sentences that are only partially entailed by the source tables.

Unsuited Applications

The domain of health reports includes potentially sensitive topics relating to reproduction, violence, sickness, and death. Perceived negative values could be used to amplify stereotypes about people from the respective regions or countries. The intended academic use of this dataset is to develop and evaluate models that neutrally report the content of these tables, not to use the outputs to make value judgments; applications that do so are discouraged.

Author:

GEM

Dataset size:

86.78 KB