Datasets:

projecte-aina/teca

Languages:

ca

Multilinguality:

monolingual

Language creators:

found

Annotations creators:

expert-generated

ArXiv:

arxiv:2107.07903

Dataset Card for TE-ca

Dataset Summary

TE-ca is a textual entailment dataset in Catalan. It contains 21,163 premise-hypothesis pairs, each annotated with the inference relation between the two texts (entailment, contradiction or neutral).

This dataset was developed by BSC TeMU as part of Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).

Supported Tasks and Leaderboards

Textual entailment, Text classification, Language Model

Languages

The dataset is in Catalan (ca-CA).

Dataset Structure

Data Instances

Three JSON files, one for each split.

Example:

    
    {
        "id": 3247,
        "premise": "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions",
        "hypothesis": "S'acorden unes recomanacions per les persones migrades a Marràqueix",
        "label": "0"
    },
    {
        "id": 2825,
        "premise": "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions",
        "hypothesis": "Les persones migrades seran acollides a Marràqueix",
        "label": "1"
    },
    {
        "id": 2431,
        "premise": "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions",
        "hypothesis": "L'acord impulsat per l'ONU lluny de tancar-se",
        "label": "2"
    },

Data Fields

  • id: numeric identifier of the pair
  • premise: text
  • hypothesis: text related to the premise
  • label: relation between premise and hypothesis:
    • 0: entailment
    • 1: neutral
    • 2: contradiction
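
The label encoding above can be sketched in a few lines of Python. This is only an illustration, not part of the dataset tooling: the names LABEL_NAMES and label_name are hypothetical, and the conversion with int() assumes labels may arrive as strings, as they do in the example instances.

```python
import json

# Label ids as documented in the Data Fields list.
# LABEL_NAMES and label_name are illustrative names, not part of the dataset.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def label_name(label):
    """Return the relation name for a label given as an int or a numeric string."""
    return LABEL_NAMES[int(label)]


# One record copied from the example instances (its label is stored as a string).
record = json.loads("""
{
    "id": 2825,
    "premise": "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions",
    "hypothesis": "Les persones migrades seran acollides a Marràqueix",
    "label": "1"
}
""")

print(record["id"], label_name(record["label"]))
```

Normalizing the label with int() keeps the lookup working whether a loader yields "1" or 1.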

Data Splits

  • dev.json: 2116 examples
  • test.json: 2117 examples
  • train.json: 16930 examples
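
As a quick consistency check, the split sizes listed above can be summed and compared with the total given in the Dataset Summary. The snippet below is a small sketch that uses only the numbers from this card; SPLIT_SIZES is an illustrative name.

```python
# Split sizes as listed in the Data Splits section of this card.
SPLIT_SIZES = {"train": 16930, "dev": 2116, "test": 2117}

# Total number of pairs across the three splits.
total = sum(SPLIT_SIZES.values())

# Share of each split, as a percentage rounded to one decimal place.
shares = {name: round(100 * size / total, 1) for name, size in SPLIT_SIZES.items()}

print(total)
print(shares)
```

The total matches the 21,163 pairs reported in the summary, and the splits follow a roughly 80/10/10 proportion.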

Dataset Creation

Curation Rationale

We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

Source Data

Source sentences are extracted from the Catalan Textual Corpus and from VilaWeb newswire.

Initial Data Collection and Normalization

12,000 sentences from the BSC Catalan Textual Corpus, together with 6,200 headlines from the Catalan news site VilaWeb, were chosen randomly. We filtered them by different criteria, such as length and stand-alone intelligibility. For each selected text, we commissioned 3 hypotheses (one for each entailment category) to be written by a team of native annotators.

Some sentence pairs were excluded because of inconsistencies.
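
Because each selected text was meant to receive one hypothesis per category, and some pairs were later excluded, a user may want to check which premises still carry all three labels. The following is a minimal sketch of such a check, not the curators' procedure. It assumes records shaped like the example instances, and the function names are hypothetical.

```python
from collections import defaultdict

PREMISE = "L'ONU adopta a Marràqueix un pacte no vinculant per les migracions"

# The three example records from the Data Instances section
# (hypotheses omitted here because the check does not need them).
pairs = [
    {"id": 3247, "premise": PREMISE, "label": "0"},
    {"id": 2825, "premise": PREMISE, "label": "1"},
    {"id": 2431, "premise": PREMISE, "label": "2"},
]


def labels_by_premise(records):
    """Map each premise to the set of integer labels attached to it."""
    grouped = defaultdict(set)
    for rec in records:
        grouped[rec["premise"]].add(int(rec["label"]))
    return dict(grouped)


def incomplete_premises(records, expected=frozenset({0, 1, 2})):
    """Return the premises that lack at least one of the expected labels."""
    return [p for p, labels in labels_by_premise(records).items() if labels != expected]


print(incomplete_premises(pairs))      # the example premise has all three labels
print(incomplete_premises(pairs[:2]))  # dropping one pair leaves it incomplete
```

Run per split, a check like this would show how many premises lost a hypothesis, though a premise may also be complete only across splits.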

Who are the source language producers?

The Catalan Textual Corpus consists of several corpora gathered from web crawling and public corpora. More information can be found here.

VilaWeb is a Catalan newswire.

Annotations

Annotation process

We commissioned 3 hypotheses (one for each entailment category) to be written by a team of annotators.

Who are the annotators?

Annotators are a team of native language collaborators from two independent companies.

Personal and Sensitive Information

No personal or sensitive information is included.

Considerations for Using the Data

Social Impact of Dataset

We hope this dataset contributes to the development of language models in Catalan, a low-resource language.

Discussion of Biases

[N/A]

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

Licensing Information

This work is licensed under an Attribution-NonCommercial-NoDerivatives 4.0 International License.

Citation Information

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}
