Dataset:
GEM/TaTA
You can find the main data card on the GEM Website.
Existing data-to-text generation datasets are mostly limited to English. Table-to-Text in African languages (TaTA) addresses this lack of data as the first large multilingual table-to-text dataset with a focus on African languages. TaTA was created by transcribing figures and accompanying text in bilingual reports by the Demographic and Health Surveys Program, followed by professional translation to make the dataset fully parallel. TaTA includes 8,700 examples in nine languages including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and a zero-shot test language (Russian).
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/TaTA')
The data loader can be found here.
Authors: Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, Clara Rivera
@misc{gehrmann2022TaTA,
  Author = {Sebastian Gehrmann and Sebastian Ruder and Vitaly Nikolaev and Jan A. Botha and Michael Chavinda and Ankur Parikh and Clara Rivera},
  Title = {TaTa: A Multilingual Table-to-Text Dataset for African Languages},
  Year = {2022},
  Eprint = {arXiv:2211.00142},
}
Contact Name: Sebastian Ruder
Contact Email: ruder@google.com
Has a Leaderboard? yes
Leaderboard Link:
Leaderboard Details: The paper introduces StATA, a metric that is trained on human ratings and used to rank approaches submitted to the leaderboard.
yes
Covered Languages: English, Portuguese, Arabic, French, Hausa, Swahili (macrolanguage), Igbo, Yoruba, Russian
Whose Language? The language is taken from reports by the Demographic and Health Surveys Program.
License: cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International
Intended Use: The dataset poses significant reasoning challenges and is thus meant as a way to assess the verbalization and reasoning capabilities of structure-to-text models.
Primary Task: Data-to-Text
Communicative Goal: Summarize key information from a table in a single sentence.
industry
Curation Organization(s): Google Research
Dataset Creators: Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur Parikh, Clara Rivera
Funding: Google Research
Who added the Dataset to GEM? Sebastian Gehrmann (Google Research)
The structure includes all available information for the infographics on which the dataset is based.
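The example instance in this card stores the table twice: as a JSON-encoded `table_data` string and as a flat `linearized_input` string. The sketch below shows one way the second could be rebuilt from the first. The format (title, unit of measure, then one `(column, row, value)` triple per cell, separated by ` | ` and spaces) is inferred from that single example and is an assumption, not a documented specification. The helper name `linearize` is hypothetical.

```python
import json


def linearize(title, unit_of_measure, table_data):
    """Rebuild a linearized_input-style string (format inferred from one example)."""
    rows = json.loads(table_data)  # table_data is stored as a JSON-encoded string
    header, body = rows[0], rows[1:]
    triples = []
    for row in body:
        row_label = row[0]
        # Pair each value with its column header; header[0] is the empty corner cell.
        for column_label, value in zip(header[1:], row[1:]):
            triples.append(f"({column_label}, {row_label}, {value})")
    return f"{title} | {unit_of_measure} | " + " ".join(triples)


# First two data rows of the example instance shown in this card.
example = {
    "title": "Trends in early childhood mortality rates",
    "unit_of_measure": "Deaths per 1,000 live births for the 5-year period before the survey",
    "table_data": json.dumps([
        ["", "Child mortality", "Neonatal mortality", "Infant mortality", "Under-5 mortality"],
        ["1990 JPFHS", 5, 21, 34, 39],
        ["1997 JPFHS", 6, 19, 29, 34],
    ]),
}
linearized = linearize(example["title"], example["unit_of_measure"], example["table_data"])
print(linearized)
```

Tables with a different layout (for instance, no header row) may be linearized differently in the released data, so the output is worth comparing against the stored `linearized_input` before relying on it.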
How were labels chosen? Annotators looked through English text to identify sentences that describe an infographic. They then identified the corresponding location of the parallel non-English document. All sentences were extracted.
Example Instance
{
  "example_id": "FR346-en-39",
  "title": "Trends in early childhood mortality rates",
  "unit_of_measure": "Deaths per 1,000 live births for the 5-year period before the survey",
  "chart_type": "Line chart",
  "was_translated": "False",
  "table_data": "[[\"\", \"Child mortality\", \"Neonatal mortality\", \"Infant mortality\", \"Under-5 mortality\"], [\"1990 JPFHS\", 5, 21, 34, 39], [\"1997 JPFHS\", 6, 19, 29, 34], [\"2002 JPFHS\", 5, 16, 22, 27], [\"2007 JPFHS\", 2, 14, 19, 21], [\"2009 JPFHS\", 5, 15, 23, 28], [\"2012 JPFHS\", 4, 14, 17, 21], [\"2017-18 JPFHS\", 3, 11, 17, 19]]",
  "table_text": [
    "neonatal, infant, child, and under-5 mortality rates for the 5 years preceding each of seven JPFHS surveys (1990 to 2017-18).",
    "Under-5 mortality declined by half over the period, from 39 to 19 deaths per 1,000 live births.",
    "The decline in mortality was much greater between the 1990 and 2007 surveys than in the most recent period.",
    "Between 2012 and 2017-18, under-5 mortality decreased only modestly, from 21 to 19 deaths per 1,000 live births, and infant mortality remained stable at 17 deaths per 1,000 births."
  ],
  "linearized_input": "Trends in early childhood mortality rates | Deaths per 1,000 live births for the 5-year period before the survey | (Child mortality, 1990 JPFHS, 5) (Neonatal mortality, 1990 JPFHS, 21) (Infant mortality, 1990 JPFHS, 34) (Under-5 mortality, 1990 JPFHS, 39) (Child mortality, 1997 JPFHS, 6) (Neonatal mortality, 1997 JPFHS, 19) (Infant mortality, 1997 JPFHS, 29) (Under-5 mortality, 1997 JPFHS, 34) (Child mortality, 2002 JPFHS, 5) (Neonatal mortality, 2002 JPFHS, 16) (Infant mortality, 2002 JPFHS, 22) (Under-5 mortality, 2002 JPFHS, 27) (Child mortality, 2007 JPFHS, 2) (Neonatal mortality, 2007 JPFHS, 14) (Infant mortality, 2007 JPFHS, 19) (Under-5 mortality, 2007 JPFHS, 21) (Child mortality, 2009 JPFHS, 5) (Neonatal mortality, 2009 JPFHS, 15) (Infant mortality, 2009 JPFHS, 23) (Under-5 mortality, 2009 JPFHS, 28) (Child mortality, 2012 JPFHS, 4) (Neonatal mortality, 2012 JPFHS, 14) (Infant mortality, 2012 JPFHS, 17) (Under-5 mortality, 2012 JPFHS, 21) (Child mortality, 2017-18 JPFHS, 3) (Neonatal mortality, 2017-18 JPFHS, 11) (Infant mortality, 2017-18 JPFHS, 17) (Under-5 mortality, 2017-18 JPFHS, 19)"
}
Data Splits
The same table across languages is always in the same split, i.e., if table X is in the test split in language A, it will also be in the test split in language B. In addition to filtering examples without transcribed table values, every example of the development and test splits has at least 3 references. From the examples that fulfilled these criteria, 100 tables were sampled for both development and test for a total of 800 examples each. A manual review process excluded a few tables in each set, resulting in a training set of 6,962 tables, a development set of 752 tables, and a test set of 763 tables.
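The guarantee that a table stays in one split across languages can be checked with a short script. The sketch below is only an illustration. It assumes that `example_id` values follow the pattern `<report>-<language>-<index>` (inferred from the single ID `FR346-en-39` in this card) and that dropping the language code identifies the table. Both assumptions should be verified against the real data. The function names and the toy IDs other than `FR346-en-39` are hypothetical.

```python
from collections import defaultdict


def table_key(example_id):
    """Drop the language code from an ID such as 'FR346-en-39' -> 'FR346-39'.

    ASSUMPTION: IDs have the form <report>-<language>-<index>; this is inferred
    from a single example and may not hold for every row.
    """
    report, _language, index = example_id.split("-", 2)
    return f"{report}-{index}"


def tables_in_multiple_splits(ids_by_split):
    """Return table keys that occur in more than one split (ideally none)."""
    splits_per_table = defaultdict(set)
    for split_name, example_ids in ids_by_split.items():
        for example_id in example_ids:
            splits_per_table[table_key(example_id)].add(split_name)
    return {key: sorted(names) for key, names in splits_per_table.items() if len(names) > 1}


# Toy data: the first layout is consistent, the second leaks a table across splits.
consistent = {"train": ["FR346-en-39", "FR346-fr-39"], "test": ["FR346-en-40", "FR346-sw-40"]}
leaky = {"train": ["FR346-en-39"], "test": ["FR346-fr-39"]}
print(tables_in_multiple_splits(consistent))
print(tables_in_multiple_splits(leaky))
```

On the real dataset, `ids_by_split` would be built from the `example_id` column of each loaded split.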
There are tables without references, without values, and others that are very large. The dataset is distributed as-is, but the paper describes multiple strategies to deal with data issues.
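Because the data is distributed as-is, users may want to flag problematic examples before training. The following is a minimal sketch, not one of the strategies described in the paper. It assumes that `table_text` is a list of reference sentences, that `table_data` is a JSON-encoded list of rows whose first row and first column hold labels, and that missing tables are encoded as an empty string or an empty list. The `max_cells` threshold is a hypothetical value chosen for illustration.

```python
import json


def describe_issues(example, max_cells=100):
    """Return a list of issue labels for one example; an empty list means none found.

    max_cells is a hypothetical threshold chosen for illustration only.
    """
    issues = []
    if not example.get("table_text"):
        issues.append("no_references")
    rows = json.loads(example.get("table_data") or "[]")
    # Count non-empty cells outside the header row and the row-label column.
    values = [cell for row in rows[1:] for cell in row[1:] if cell not in ("", None)]
    if not values:
        issues.append("no_values")
    elif len(values) > max_cells:
        issues.append("very_large")
    return issues


good = {"table_text": ["A sentence."], "table_data": json.dumps([["", "A"], ["r1", 0]])}
empty = {"table_text": [], "table_data": "[]"}
print(describe_issues(good))
print(describe_issues(empty))
```

Returning labels instead of a single boolean leaves the choice of whether to drop, truncate, or keep each kind of example to the caller.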
There is no other multilingual data-to-text dataset that is parallel across languages. Moreover, over 70% of the references in the dataset require reasoning, which makes the dataset both high quality and challenging for models.
Similar Datasets: yes
Unique Language Coverage: yes
Difference from other GEM datasets: More languages, parallel across languages, grounded in infographics, not centered on Western entities or source documents
Ability that the Dataset measures: reasoning, verbalization, content planning
no
Additional Splits? no
The background section of the paper provides a list of related datasets.
Technical TermsOther: Other Metrics
Other Metrics: StATA, a new metric associated with TaTA that is trained on human judgments and correlates with them much more strongly than existing metrics do.
Proposed Evaluation: The creators used a human evaluation that measured the attribution and reasoning capabilities of various models. Based on these ratings, they trained a new metric and showed that existing metrics fail to measure attribution.
Previous results available? no
The curation rationale is to create a multilingual data-to-text dataset that is high-quality and challenging.
Communicative Goal: The communicative goal is to describe a table in a single sentence.
Sourced from Different Sources: no
Found
Where was it found? Single website
Language Producers: The language was produced by USAID as part of the Demographic and Health Surveys program (https://dhsprogram.com/).
Topics Covered: The topics are related to fertility, family planning, maternal and child health, gender, and nutrition.
Data Validation: validated by crowdworker
Was Data Filtered? not filtered
expert created
Number of Raters: 11<n<50
Rater Qualifications: Professional annotator who is a fluent speaker of the respective language
Raters per Training Example: 0
Raters per Test Example: 1
Annotation Service? yes
Which Annotation Service: other
Annotation Values: The additional annotations are for system outputs and references and serve to develop metrics for this task.
Any Quality Control? validated by data curators
Quality Control Details: Ratings were compared to a small (English) expert-curated set of ratings to ensure high agreement. There were additional rounds of training and feedback to annotators to ensure high-quality judgments.
yes
Other Consented Downstream Use: In addition to data-to-text generation, the dataset can be used for translation or multimodal research.
no PII
Justification for no PII: The DHS program only publishes aggregate survey information, and thus no personal information is included.
no
no
yes
Details on how Dataset Addresses the Needs: The dataset focuses on data about African countries, and the languages included in the dataset are spoken in Africa. It aims to improve the representation of African languages in the NLP and NLG communities.
no
Are the Language Producers Representative of the Language? The language producers for this dataset are those employed by the DHS program, which is a US-funded program. While the data is focused on African countries, there may be implicit Western biases in how the data is presented.
open license - commercial use allowed
Copyright Restrictions on the Language Data: open license - commercial use allowed
While tables were transcribed in the available languages, the majority of the tables were published in English as the first language. Professional translators were used to translate the data, which makes it plausible that some translationese exists in the data. Moreover, it was unavoidable to collect reference sentences that are only partially entailed by the source tables.
Unsuited Applications: The domain of health reports includes potentially sensitive topics relating to reproduction, violence, sickness, and death. Perceived negative values could be used to amplify stereotypes about people from the respective regions or countries. The intended academic use of this dataset is to develop and evaluate models that neutrally report the content of these tables, not to use model outputs to make value judgments; such applications are discouraged.