Dataset:

GEM/xwikis

Multilinguality:

unknown

Language creators:

unknown

Annotation creators:

found

Source datasets:

original

arXiv:

arxiv:2202.09583

Dataset Card for GEM/xwikis

Link to Main Data Card

You can find the main data card on the GEM Website.

Dataset Summary

The XWikis Corpus provides datasets with different language pairs and directions for cross-lingual and multi-lingual abstractive document summarisation.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/xwikis')

The data loader can be found here.

website

Github

paper

https://arxiv.org/abs/2202.09583

authors

Laura Perez-Beltrachini (University of Edinburgh)

Dataset Overview

Where to find the Data and its Documentation

Webpage

Github

Paper

https://arxiv.org/abs/2202.09583

BibTeX
@InProceedings{clads-emnlp,
  author =      "Laura Perez-Beltrachini and Mirella Lapata",
  title =       "Models and Datasets for Cross-Lingual Summarisation",
  booktitle =   "Proceedings of The 2021 Conference on Empirical Methods in Natural Language Processing",
  year =        "2021",
  address =     "Punta Cana, Dominican Republic",
}
Contact Name

Laura Perez-Beltrachini

Contact Email

lperez@ed.ac.uk

Has a Leaderboard?

no

Languages and Intended Use

Multilingual?

yes

Covered Languages

German, English, French, Czech, Chinese

License

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Intended Use

Cross-lingual and multi-lingual abstractive summarisation of a single long input document.

Primary Task

Summarization

Communicative Goal

Entity-descriptive summarisation, that is, generating a summary that conveys the most salient facts of a document about a given entity.

Credit

Curation Organization Type(s)

academic

Dataset Creators

Laura Perez-Beltrachini (University of Edinburgh)

Who added the Dataset to GEM?

Laura Perez-Beltrachini (University of Edinburgh) and Ronald Cardenas (University of Edinburgh)

Dataset Structure

Data Splits

For each language pair and direction there is a train/valid/test split. The test split is a sample of 7k titles drawn from the intersection of titles that exist in all four languages (cs, fr, en, de). The remaining data is split at random into train and valid.
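
The splitting scheme described above can be sketched roughly as follows. This is only an illustration on toy data, not the authors' actual script: the function name `make_splits`, the `valid_fraction` parameter, the seed, and the choice to exclude test titles from train/valid are all assumptions made for the example.

```python
import random


def make_splits(titles_by_lang, test_size, valid_fraction=0.05, seed=0):
    """Hypothetical helper: build train/valid/test title lists per language.

    titles_by_lang maps a language code to the list of article titles
    available in that language.
    """
    rng = random.Random(seed)

    # Titles that exist in every language (the "intersection of titles").
    shared = set.intersection(*(set(t) for t in titles_by_lang.values()))

    # Sample the test titles from that intersection.
    test_titles = set(rng.sample(sorted(shared), min(test_size, len(shared))))

    splits = {}
    for lang, titles in titles_by_lang.items():
        # Assumption: test titles are held out of train and valid.
        remaining = sorted(set(titles) - test_titles)
        rng.shuffle(remaining)  # random train/valid division
        n_valid = int(len(remaining) * valid_fraction)
        splits[lang] = {
            "test": sorted(test_titles),
            "valid": remaining[:n_valid],
            "train": remaining[n_valid:],
        }
    return splits
```

With this sketch every language shares the same test titles, which is what makes results comparable across language pairs and directions.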

Dataset in GEM

Rationale for Inclusion in GEM

Similar Datasets

no

GEM-Specific Curation

Modified for GEM?

no

Additional Splits?

no

Getting Started with the Task

Previous Results

Previous Results

Measured Model Abilities
  • identification of entity salient information
  • translation
  • multi-linguality
  • cross-lingual transfer, zero-shot, few-shot
Metrics

ROUGE

Previous results available?

yes

Other Evaluation Approaches

ROUGE-1/2/L
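
For readers unfamiliar with these scores, the following is a minimal sketch of ROUGE-1/2 (n-gram overlap F1) and ROUGE-L (longest common subsequence F1). It is a simplification and not the official ROUGE implementation used in the paper: it assumes lowercased whitespace tokenisation, a single reference, no stemming, and equal weighting of precision and recall. The function names are chosen for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n):
    """ROUGE-N F1 using clipped n-gram counts."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # multiset intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return f1(lcs_length(ref, cand), len(cand), len(ref))
```

For reported numbers, an established ROUGE package should be preferred, since tokenisation and stemming choices change the scores.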

Dataset Curation

Original Curation

Sourced from Different Sources

no

Language Data

How was Language Data Obtained?

Found

Where was it found?

Single website

Data Validation

other

Was Data Filtered?

not filtered

Structured Annotations

Additional Annotations?

found

Annotation Service?

no

Annotation Values

The input documents have section structure information.

Any Quality Control?

validated by another rater

Quality Control Details

Bilingual annotators assessed the content overlap between source documents and target summaries.

Consent

Any Consent Policy?

no

Private Identifying Information (PII)

Contains PII?

no PII

Maintenance

Any Maintenance Plan?

no

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

no

Discussion of Biases

Any Documented Social Biases?

no

Considerations for Using the Data

PII Risks and Liability

Licenses

Copyright Restrictions on the Dataset

public domain

Copyright Restrictions on the Language Data

public domain

Known Technical Limitations