Dataset:
oeg/CelebA_RoBERTa_Sp
This corpus contains 250,000 entries, each made up of a pair of sentences in Spanish and their similarity value in the range 0 to 1. The corpus was used with the sentence-transformers library to train and improve the RoBERTa-large-bne base model. The sentences in each pair are textual descriptions of faces from the CelebA dataset, previously translated into Spanish. The corpus was generated as follows:
First, the original English text was translated into Spanish. The original English corpus was obtained from the work Text2faceGAN.
An algorithm was implemented that randomly selects two sentences from the translated corpus and calculates their similarity value. spaCy was used to obtain the similarity value of each pair of sentences.
Since spaCy and most other libraries for calculating sentence similarity only work for English, the algorithm also selects the corresponding pair of sentences from the original English corpus, and the similarity is computed on that English pair. The final training corpus for RoBERTa is then defined by the Spanish text and the similarity score.
Each pair of Spanish sentences and its similarity value, separated by the character "|", is saved as an entry of the new corpus.
RoBERTa-large-bne was then trained on the present corpus (RoBERTa-large-bne + CelebA), resulting in the new model RoBERTa-celebA-Sp.
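The pair-generation steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' actual script: the function names `build_pairs` and `jaccard_similarity`, the seed handling, and the four-decimal score format are all assumptions. The default scorer is a simple token-overlap stand-in so that the example runs without extra downloads. To follow the described process more closely, one could pass a spaCy-based scorer instead, for example `lambda a, b: nlp(a).similarity(nlp(b))` with `nlp = spacy.load("en_core_web_md")` (the model choice is also an assumption).

```python
import random
from typing import Callable, List, Sequence


def jaccard_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in scorer: token overlap in the range 0 to 1."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def build_pairs(
    english: Sequence[str],
    spanish: Sequence[str],
    n_pairs: int,
    similarity: Callable[[str, str], float] = jaccard_similarity,
    seed: int = 0,
) -> List[str]:
    """Build 'Sentence A | Sentence B | score' entries.

    english[i] and spanish[i] are assumed to be the same description
    in the two languages (aligned corpora).
    """
    if len(english) != len(spanish):
        raise ValueError("English and Spanish corpora must be aligned")
    if len(english) < 2:
        raise ValueError("need at least two sentences to form a pair")

    rng = random.Random(seed)
    entries = []
    for _ in range(n_pairs):
        # Randomly pick two distinct sentence indices.
        i, j = rng.sample(range(len(english)), 2)
        # Score the English pair, but store the Spanish text.
        score = similarity(english[i], english[j])
        entries.append(f"{spanish[i]} | {spanish[j]} | {score:.4f}")
    return entries


if __name__ == "__main__":
    # Tiny made-up sample, not taken from the real corpus.
    en = [
        "The woman has black hair.",
        "The man has black hair and a beard.",
        "She is smiling.",
    ]
    es = [
        "La mujer tiene el pelo negro.",
        "El hombre tiene el pelo negro y barba.",
        "Ella está sonriendo.",
    ]
    for entry in build_pairs(en, es, n_pairs=3):
        print(entry)
```

Passing the scorer as a parameter keeps the sampling logic independent of the similarity library, so the same loop works with spaCy or any other English sentence-similarity function.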
Each corpus entry is composed of two Spanish sentences and their similarity value.
The components are separated by the character "|" with the structure:
Sentence A | Sentence B | similarity value
You can download the file with a .txt or .csv extension as appropriate.
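As a rough sketch of how the entries could be read once downloaded, the snippet below parses lines in the `Sentence A | Sentence B | similarity value` layout. The names `Entry` and `read_entries` are hypothetical, and the sample lines are made up for illustration. It assumes one entry per line and that the sentences themselves do not contain the "|" character.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Entry(NamedTuple):
    sentence_a: str
    sentence_b: str
    similarity: float


def read_entries(handle: TextIO) -> Iterator[Entry]:
    """Yield one Entry per non-empty 'A | B | score' line."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(parts)}"
            )
        yield Entry(parts[0], parts[1], float(parts[2]))


if __name__ == "__main__":
    # Made-up sample in the documented layout, not real corpus data.
    sample = io.StringIO(
        "La mujer tiene el pelo negro. | El hombre tiene barba. | 0.8312\n"
        "Ella está sonriendo. | El hombre lleva gafas. | 0.7645\n"
    )
    for entry in read_entries(sample):
        print(entry.sentence_a, entry.sentence_b, entry.similarity)
```

With a downloaded file, the same function could be called on `open(path, encoding="utf-8")`, where `path` is wherever the .txt or .csv file was saved.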
Citing: If you use the CelebA_RoBERTa_Sp corpus in your work, please cite the ???? :
This corpus is available under the Apache License 2.0.
Universidad Nacional de Ingeniería, Ontology Engineering Group, Universidad Politécnica de Madrid.
See the full list of contributors here.