oeg/CelebA_RoBERTa_Sp | ATYUN.COM official site: a full-service platform for AI tutorials and news

Dataset:

oeg/CelebA_RoBERTa_Sp

Tasks:

Table question answering

Question answering

Translation

Languages:

Size:

100M<n<1B

Preprint:

arxiv:1911.11378

Other:

CelebA Spanish celebFaces attributes celebFaces+attributes

DOI:

10.57967/hf/0447

License:

apache-2.0

Corpus Summary

This corpus contains 250,000 entries, each made up of a pair of sentences in Spanish and their similarity value in the range 0 to 1. The corpus was used with the sentence-transformers library to train the RoBERTa-large-bne base model and improve its efficiency. Each sentence in a pair is a textual description of a face from the CelebA dataset, previously translated into Spanish. The corpus was generated as follows:

First, the original English text was translated into Spanish. The original English corpus was obtained from the work Text2faceGAN.
An algorithm was implemented that randomly selects two sentences from the translated corpus and calculates their similarity value. spaCy was used to obtain the similarity value of each pair of sentences.
Since spaCy and most other libraries for calculating sentence similarity only work in English, the algorithm also selects the corresponding pair of sentences from the original English corpus. The final training corpus for RoBERTa is then defined by the Spanish text and the similarity score.
Each pair of Spanish sentences and its similarity value, separated by the character "|", is saved as an entry of the new corpus.
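The generation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names, the `seed` parameter, and the four-decimal formatting are assumptions, and the sketch assumes the English and Spanish corpora are aligned line by line. It also reads the description as scoring the English pair and attaching that score to the Spanish pair. The default `token_overlap` measure is a simple stand-in so the example runs without extra dependencies; with spaCy one would instead pass a function built on `nlp(a).similarity(nlp(b))` using an English model that ships word vectors.

```python
import random
from typing import Callable, List


def token_overlap(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of lowercase tokens, in [0, 1].

    This is NOT the measure used for the corpus. With spaCy, a comparable
    hook would compute nlp(a).similarity(nlp(b)) on an English model.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def build_entries(
    english: List[str],
    spanish: List[str],
    n_pairs: int,
    similarity: Callable[[str, str], float] = token_overlap,
    seed: int = 0,
) -> List[str]:
    """Return n_pairs lines of the form 'Sentence A | Sentence B | score'."""
    if len(english) != len(spanish):
        raise ValueError("the two corpora must be aligned line by line")
    rng = random.Random(seed)
    entries = []
    for _ in range(n_pairs):
        # Pick two distinct sentence indices at random.
        i, j = rng.sample(range(len(spanish)), 2)
        # Score the English pair, then attach the score to the Spanish pair.
        score = similarity(english[i], english[j])
        entries.append(f"{spanish[i]} | {spanish[j]} | {score:.4f}")
    return entries
```

Passing the similarity function as a parameter keeps the sampling logic independent of the scoring library, so the stand-in can be swapped for a spaCy-based scorer without touching the rest of the code.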

RoBERTa-large-bne was then trained on the present corpus (RoBERTa-large-bne + CelebA), resulting in the new model RoBERTa-celebA-Sp.

Corpus Fields

Each corpus entry is composed of:

Sentence A: Descriptive sentence of a CelebA face in Spanish.
Sentence B: Descriptive sentence of a CelebA face in Spanish.
Similarity Value: Similarity of sentence A and sentence B.

Each component is separated by the character "|" with the structure:

Sentence A | Sentence B | Similarity Value

You can download the file with a .txt or .csv extension as appropriate.
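A minimal sketch of reading entries in the format described above, using only the Python standard library. The function names are hypothetical, and the sketch assumes a UTF-8 file with one entry per line and no header row; adjust it if the downloaded file differs.

```python
from pathlib import Path
from typing import Iterator, Tuple


def parse_entry(line: str) -> Tuple[str, str, float]:
    """Split 'Sentence A | Sentence B | similarity' into its three fields."""
    # Split from the right so the numeric field is isolated first.
    sentence_a, sentence_b, value = (
        part.strip() for part in line.rsplit("|", 2)
    )
    score = float(value)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"similarity out of range [0, 1]: {score}")
    return sentence_a, sentence_b, score


def load_corpus(path: str) -> Iterator[Tuple[str, str, float]]:
    """Yield parsed entries from a file, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_entry(line)
```

For example, `for a, b, score in load_corpus("corpus.txt"): ...` iterates over the entries lazily, where `corpus.txt` is a placeholder for the downloaded file. Reading lazily avoids holding all 250,000 entries in memory at once.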

Citation information

Citing: If you used the CelebA_RoBERTa_Sp corpus in your work, please cite the ???? :

License

This corpus is available under the Apache License 2.0.

Authors

Universidad Nacional de Ingeniería, Ontology Engineering Group, Universidad Politécnica de Madrid.

Contributors

See the full list of contributors here.

Author:

oeg

Dataset size:

159.23 MB