Dataset:
projecte-aina/Parafraseja
Parafraseja is a dataset of 21,984 sentence pairs, each with a label indicating whether the two sentences are paraphrases. The original sentences were collected from TE-ca and STS-ca. For each original sentence, an annotator wrote one sentence that was a paraphrase and another that was not. The annotation guidelines are available.
This dataset is mainly intended to train models for paraphrase detection.
The dataset is in Catalan (ca-CA).
The dataset consists of sentence pairs labelled "Parafrasis" or "No Parafrasis", stored in JSONL format.
{
  "id": "te1_14977_1",
  "source": "teca",
  "original": "La 2a part consta de 23 cap\u00edtols, cadascun dels quals descriu un ocell diferent.",
  "new": "La segona part consisteix en vint-i-tres cap\u00edtols, cada un dels quals descriu un ocell diferent.",
  "label": "Parafrasis"
}
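As a minimal sketch of how records in this format might be read, the following Python snippet parses JSONL text with the standard library and converts each record into a sentence pair plus an integer label. The helper names (`parse_jsonl`, `to_pair_example`) and the integer ids in `LABEL_TO_ID` are assumptions made for illustration; the dataset itself only defines the two label strings.

```python
import json

# The two label strings come from the dataset card; the integer ids
# assigned to them here are an assumption made for illustration.
LABEL_TO_ID = {"No Parafrasis": 0, "Parafrasis": 1}


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def to_pair_example(record):
    """Return (original sentence, new sentence, integer label) for one record."""
    return record["original"], record["new"], LABEL_TO_ID[record["label"]]


# The sample record shown above, written as a single JSONL line.
sample = (
    '{"id": "te1_14977_1", "source": "teca", '
    '"original": "La 2a part consta de 23 cap\\u00edtols, cadascun dels quals descriu un ocell diferent.", '
    '"new": "La segona part consisteix en vint-i-tres cap\\u00edtols, cada un dels quals descriu un ocell diferent.", '
    '"label": "Parafrasis"}\n'
)

records = parse_jsonl(sample)
original, new, label_id = to_pair_example(records[0])
```

The `\u00ed` escapes are decoded by the JSON parser, so `original` contains the word "capítols" with its accented character restored.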
We created this corpus to contribute to the development of language models in Catalan, a low-resource language.
The original sentences in this dataset came from STS-ca and TE-ca.
Initial Data Collection and Normalization
11,543 of the original sentences came from TE-ca, and 10,441 came from STS-ca.
Who are the source language producers?
TE-ca and STS-ca come from the Catalan Textual Corpus, which consists of several corpora gathered from web crawling and public corpora, and Vilaweb, a Catalan newswire.
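The two source counts add up to the stated total of 21,984 pairs. A minimal sketch of how that composition could be checked against loaded records follows, assuming each record carries a `source` field as in the sample record. The function name `count_by_source` is hypothetical, and only the value `"teca"` is confirmed by the sample; the value used for STS-ca is not shown in this card, so the sketch does not assume one.

```python
from collections import Counter

# Counts as stated in the dataset card.
EXPECTED_TOTAL = 21_984
FROM_TE_CA = 11_543
FROM_STS_CA = 10_441

# The two source counts should account for every pair.
assert FROM_TE_CA + FROM_STS_CA == EXPECTED_TOTAL


def count_by_source(records):
    """Tally records by their "source" field, whatever values it takes."""
    return Counter(record["source"] for record in records)
```

Calling `count_by_source(records)` on the full set of parsed records returns a `Counter` keyed by source value, which can be compared with the figures above.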
The dataset is annotated with the label "Parafrasis" or "No Parafrasis" for each pair of sentences.
Annotation process
The annotation was carried out by a single annotator and reviewed by a second one.
Who are the annotators?
The annotators were native Catalan speakers with a background in linguistics.
No personal or sensitive information is included.
We hope this corpus contributes to the development of language models in Catalan, a low-resource language.
We are aware that this data might contain biases. We have not taken any steps to reduce their impact.
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.
Creative Commons Attribution Non-commercial No-Derivatives 4.0 International.