Dataset: ruanchaves/porsimplessent
Size: 1K<n<10K
PorSimplesSent is a Portuguese corpus of aligned sentence pairs and triplets created to investigate sentence readability assessment in Portuguese. The dataset consists of 4,968 pairs and 1,141 triplets of sentences, combining the three levels of the PorSimples corpus: Original, Natural, and Strong. The dataset can be used for tasks such as sentence-pair classification, sentence retrieval, and readability assessment.
The dataset supports the following tasks:
The dataset consists of sentence pairs in Portuguese.
{
    'sentence1': '-- Parece que o assassinato de civis iraquianos transformou-se em um fenômeno cotidiano e banal -- disse o presidente da Associação Iraquiana dos Direitos Humanos, Muayed al-Anbaki.',
    'sentence2': '-- Parece que o assassinato de civis iraquianos transformou-se em um fenômeno comum e banal -- disse o presidente da Associação Iraquiana dos Direitos Humanos, Muayed al-Anbaki.',
    'label': 2,
    'production_id': 3,
    'level': 'ORI->NAT',
    'changed': 'S',
    'split': 'N',
    'sentence_text_from': '-- Parece que o assassinato de civis iraquianos transformou-se em um fenômeno cotidiano e banal -- disse o presidente da Associação Iraquiana dos Direitos Humanos, Muayed al-Anbaki.',
    'sentence_text_to': '-- Parece que o assassinato de civis iraquianos transformou-se em um fenômeno comum e banal -- disse o presidente da Associação Iraquiana dos Direitos Humanos, Muayed al-Anbaki.'
}
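The two sentences of an instance can differ by as little as one word, which makes the change hard to spot by eye. As a minimal sketch, Python's standard `difflib` module can surface what was replaced between `sentence1` and `sentence2`. The helper name `changed_words` is hypothetical (it is not part of the dataset or of any library), the whitespace tokenization is an assumption made for brevity, and the strings below are shortened excerpts of the example instance.

```python
import difflib


def changed_words(sentence1, sentence2):
    """Return (removed, added) word lists between two sentences.

    Assumption: naive whitespace tokenization is enough for illustration.
    """
    a, b = sentence1.split(), sentence2.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(a[i1:i2])  # words only in sentence1
            added.extend(b[j1:j2])    # words only in sentence2
    return removed, added


# Shortened excerpt of the example instance (not the full record).
example = {
    "sentence1": "Parece que o assassinato de civis iraquianos "
                 "transformou-se em um fenômeno cotidiano e banal",
    "sentence2": "Parece que o assassinato de civis iraquianos "
                 "transformou-se em um fenômeno comum e banal",
}

print(changed_words(example["sentence1"], example["sentence2"]))
# (['cotidiano'], ['comum'])
```

The same helper could be applied to rows loaded with the Hugging Face `datasets` library, since each row is exposed as a dictionary with these keys.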
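The dataset has the following fields, as they appear in the example instance: sentence1, sentence2, label, production_id, level, changed, split, sentence_text_from, and sentence_text_to.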
The dataset is split into three subsets: train, validation, and test. The size of each split is as follows:
| | Train | Validation | Test |
|---|---|---|---|
| Number of examples | 4,976 | 1,446 | 1,697 |
The authors did not provide standard splits. We created the splits ourselves while ensuring that sentence pairs from the same document did not appear in multiple splits.
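The card does not describe the splitting procedure beyond this constraint. One common way to satisfy such a constraint is to assign whole documents, rather than individual sentence pairs, to splits. The sketch below is illustrative only and is not the procedure used for this dataset: it assumes that `production_id` identifies the source document (the card does not confirm this), and the function name, ratios, and seed are hypothetical.

```python
import random
from collections import defaultdict


def split_by_document(rows, group_key="production_id",
                      ratios=(0.6, 0.2, 0.2), seed=0):
    """Assign every row of a document to exactly one split."""
    # Collect rows per document so a document is never divided.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    # Sort before shuffling so the result is reproducible for a given seed.
    doc_ids = sorted(groups)
    random.Random(seed).shuffle(doc_ids)

    # Ratios apply to the number of documents, not the number of rows.
    n_train = int(len(doc_ids) * ratios[0])
    n_val = int(len(doc_ids) * ratios[1])
    assignment = {
        "train": doc_ids[:n_train],
        "validation": doc_ids[n_train:n_train + n_val],
        "test": doc_ids[n_train + n_val:],
    }
    return {name: [row for doc_id in ids for row in groups[doc_id]]
            for name, ids in assignment.items()}


# Toy data: 10 hypothetical documents with 3 sentence pairs each.
rows = [{"production_id": d, "pair": p} for d in range(10) for p in range(3)]
splits = split_by_document(rows)
```

Because documents differ in how many pairs they contribute, the resulting split sizes only approximate the requested ratios when counted in examples.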
The PorSimplesSent dataset was created by Sidney Evaldo Leal, with guidance from his advisors Dr. Sandra Maria Aluísio and Dr. Magali Sanches Duran, during his master's degree at ICMC-USP. The Interinstitutional Center for Computational Linguistics (NILC, Núcleo Interinstitucional de Linguística Computacional) also contributed to the creation of the dataset.
The PorSimplesSent dataset is released under the CC BY 4.0 license. The license terms can be found at https://creativecommons.org/licenses/by/4.0/.
If you use this dataset in your work, please cite the following publication:
@inproceedings{leal2018pss,
    author = {Sidney Evaldo Leal and Magali Sanches Duran and Sandra Maria Aluísio},
    title = {A Nontrivial Sentence Corpus for the Task of Sentence Readability Assessment in Portuguese},
    booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)},
    year = {2018},
    pages = {401-413},
    month = {August},
    date = {20-26},
    address = {Santa Fe, New Mexico, USA},
}
Thanks to @ruanchaves for adding this dataset.