Dataset: ruanchaves/porsimplessent

Size: 1K<n<10K

Dataset Card for PorSimplesSent

Dataset Summary

PorSimplesSent is a Portuguese corpus of aligned sentence pairs and triplets created to investigate sentence readability assessment in Portuguese. The dataset consists of 4,968 pairs and 1,141 triplets of sentences, combining the three levels of the PorSimples corpus: Original, Natural, and Strong. The dataset can be used for tasks such as sentence-pair classification, sentence retrieval, and readability assessment.

Supported Tasks and Leaderboards

The dataset supports the following tasks:

  • sentence-pair-classification : The dataset can be used to train a model for sentence-pair classification, which consists of determining whether one sentence is simpler than the other or whether both sentences are equally simple. Performance on this task is typically measured with accuracy, F1, precision, and recall.
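The metrics named above can be computed for the three-class setting as in the following minimal sketch. It assumes macro averaging over the labels 0, 1, and 2; the card does not say which averaging scheme is intended, and the function name `classification_report` is only illustrative.

```python
from collections import Counter

LABELS = (0, 1, 2)  # label values listed in this card's Data Fields section


def classification_report(y_true, y_pred, labels=LABELS):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption of this sketch, not something
    the dataset card prescribes.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # this class was predicted, but wrongly
            fn[gold] += 1  # the true class was missed

    def safe_div(num, den):
        # Classes with no predictions or no gold examples score 0.0.
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = safe_div(tp[label], tp[label] + fp[label])
        r = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Calling `classification_report(gold_labels, predicted_labels)` on two equally long lists of integer labels returns a dictionary with the four scores.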

Languages

The dataset consists of sentence pairs in Portuguese.

Dataset Structure

Data Instances

{
  'sentence1': '-- Parece que o assassinato de civis iraquianos transformou-se em um fenômeno cotidiano e banal -- disse o presidente da Associação Iraquiana dos Direitos Humanos, Muayed al-Anbaki.',
  'sentence2': '-- Parece que o assassinato de civis iraquianos transformou-se em um fenômeno comum e banal -- disse o presidente da Associação Iraquiana dos Direitos Humanos, Muayed al-Anbaki.',
  'label': 2,
  'production_id': 3,
  'level': 'ORI->NAT',
  'changed': 'S',
  'split': 'N',
  'sentence_text_from': '-- Parece que o assassinato de civis iraquianos transformou-se em um fenômeno cotidiano e banal -- disse o presidente da Associação Iraquiana dos Direitos Humanos, Muayed al-Anbaki.',
  'sentence_text_to': '-- Parece que o assassinato de civis iraquianos transformou-se em um fenômeno comum e banal -- disse o presidente da Associação Iraquiana dos Direitos Humanos, Muayed al-Anbaki.'
}

Data Fields

The dataset has the following fields:

  • sentence1 : the first sentence in the sentence pair (string).
  • sentence2 : the second sentence in the sentence pair (string).
  • label : an integer indicating the relationship between the two sentences in the pair. The possible values are 0, 1, and 2, where 0 means that sentence1 is simpler than sentence2, 1 means that both sentences have the same level of complexity, and 2 means that sentence2 is simpler than sentence1 (int).
  • production_id : an integer identifier for each sentence pair (int).
  • level : a string indicating the level of simplification between the two sentences (string). The possible values are:
    • 'ORI->NAT' (original to natural)
    • 'NAT->STR' (natural to strong)
    • 'ORI->STR' (original to strong)
  • changed : a string indicating whether the sentence was changed during the simplification process (string). The possible values are:
    • 'S' (changed)
    • 'N' (not changed)
  • split : a string indicating whether the sentence was split at this simplification level (string). The possible values are:
    • 'S' (split)
    • 'N' (not split)
  • sentence_text_from : the raw text of the source sentence (string).
  • sentence_text_to : the raw text of the target sentence (string).
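As a minimal sketch of how these fields might be read in code, the helpers below decode `label` and `changed` for a single record. The names `LABEL_MEANING`, `simpler_sentence`, and `was_rewritten` are hypothetical and not part of the dataset or any library; the sample record is abbreviated from the instance shown under Data Instances.

```python
# Meanings of `label`, as described in the field list above.
LABEL_MEANING = {
    0: "sentence1 is simpler than sentence2",
    1: "both sentences have the same level of complexity",
    2: "sentence2 is simpler than sentence1",
}


def simpler_sentence(example):
    """Return the simpler sentence of a pair, or None when label is 1."""
    label = example["label"]
    if label not in LABEL_MEANING:
        raise ValueError(f"unexpected label: {label!r}")
    if label == 0:
        return example["sentence1"]
    if label == 2:
        return example["sentence2"]
    return None  # label 1: neither sentence is simpler


def was_rewritten(example):
    """True when the pair is marked as changed ('S') in the `changed` field."""
    return example["changed"] == "S"


# Abbreviated version of the record under Data Instances (sentences shortened).
sample = {
    "sentence1": "um fenômeno cotidiano e banal",
    "sentence2": "um fenômeno comum e banal",
    "label": 2,
    "level": "ORI->NAT",
    "changed": "S",
    "split": "N",
}

print(LABEL_MEANING[sample["label"]])
print(simpler_sentence(sample))
```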

Data Splits

The dataset is split into three subsets: train, validation, and test. The size of each split is as follows:

                     Train    Validation    Test
Number of examples   4,976    1,446         1,697

The authors did not provide standard splits. We created the splits ourselves while ensuring that sentence pairs from the same document did not appear in multiple splits.
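The idea of keeping every document inside a single split can be illustrated with a group-aware split. This is only a sketch under stated assumptions, not the procedure actually used to build the splits above: the `doc_id` key, the fractions, and the seed are all hypothetical, and the card does not say which field identifies the source document.

```python
import random
from collections import defaultdict


def split_by_document(examples, doc_key="doc_id",
                      train_frac=0.6, valid_frac=0.2, seed=0):
    """Assign whole documents to train/validation/test.

    `doc_key`, the fractions, and the seed are illustrative assumptions.
    Documents left over after train and validation go to test.
    """
    # Group examples by the document they came from.
    groups = defaultdict(list)
    for example in examples:
        groups[example[doc_key]].append(example)

    doc_ids = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(doc_ids)

    n_train = round(train_frac * len(doc_ids))
    n_valid = round(valid_frac * len(doc_ids))
    assignment = {
        "train": doc_ids[:n_train],
        "validation": doc_ids[n_train:n_train + n_valid],
        "test": doc_ids[n_train + n_valid:],
    }
    # Every example follows its document into exactly one split.
    return {
        name: [ex for doc_id in ids for ex in groups[doc_id]]
        for name, ids in assignment.items()
    }
```

Because the split is decided per document rather than per sentence pair, the resulting example counts only approximate the requested fractions when documents differ in size.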

Additional Information

Dataset Curators

The PorSimplesSent dataset was created by Sidney Evaldo Leal, with guidance from his advisors Dr. Sandra Maria Aluísio and Dr. Magali Sanches Duran, during his master's degree at ICMC-USP. The Interinstitutional Center for Computational Linguistics - NILC (Núcleo Interinstitucional de Linguística Computacional) also contributed to the creation of the dataset.

Licensing Information

The PorSimplesSent dataset is released under the CC BY 4.0 license. The license terms can be found at https://creativecommons.org/licenses/by/4.0/ .

Citation Information

If you use this dataset in your work, please cite the following publication:

@inproceedings{leal2018pss,
    author = {Sidney Evaldo Leal and Magali Sanches Duran and Sandra Maria Aluísio},
    title = {A Nontrivial Sentence Corpus for the Task of Sentence Readability Assessment in Portuguese},
    booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)},
    year = {2018},
    pages = {401-413},
    month = {August},
    date = {20-26},
    address = {Santa Fe, New Mexico, USA},
}

Contributions

Thanks to @ruanchaves for adding this dataset.