Dataset:

GIL-UNAM/SpanishParaphraseCorpora

Language:

es

Size:

n<1K

License:

cc0-1.0

:page_with_curl: Spanish Paraphrase Corpora

Manually paraphrased corpus in Spanish

The Sushi Corpus

This corpus is designed to assess the similarity between a pair of texts and to evaluate different similarity measures, for both whole documents and individual sentences. It is built around a Spanish blog article about sushi. Several volunteers (undergraduate, graduate, and Ph.D. students) were asked to intentionally reformulate or paraphrase this article. The paraphrases were produced at two levels, and the corpus also includes two categories of non-paraphrased texts:

  • Low level: Only lexical variation
  • High level: Variation in lexical, syntactic, textual, or discursive organization, plus fusion or separation of sentences.
  • No Paraphrase: Texts on the same theme and source as the original article, related to sushi.
  • No Sushi: Texts on a different theme from the original article but with overlapping vocabulary. That is, texts not related to sushi, but with exactly the same vocabulary as the original one. Some volunteers wrote a free text using the same content words as the original.
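
As a rough illustration of the kind of similarity measure this corpus is meant to evaluate, the sketch below computes Jaccard similarity over word tokens. The function names `tokens` and `jaccard` and the three example sentences are hypothetical, written for this sketch; the sentences are not taken from the corpus, and this is not necessarily one of the measures evaluated by the authors.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


# Hypothetical sentences for illustration only; NOT taken from the corpus.
original = "El sushi es un plato japonés de arroz y pescado"
low_level = "El sushi es una comida japonesa de arroz y pescado"
unrelated = "El pescado y el arroz son caros en el mercado japonés"

for label, text in [("low level", low_level), ("different theme", unrelated)]:
    print(f"{label}: {jaccard(original, text):.2f}")
```

A purely lexical measure like this one still assigns a non-zero score to a text on a different theme that shares vocabulary with the original, which is the situation the No Sushi category is built to probe.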

:pencil: How to cite

If you use the corpus, please cite the following articles:

  • Gómez-Adorno H., Bel-Enguix G., Sierra G., Torres-Moreno JM., Martinez R., Serrano P. (2020) Evaluation of Similarity Measures in a Benchmark for Spanish Paraphrasing Detection. In: Martínez-Villaseñor L., Herrera-Alcántara O., Ponce H., Castro-Espinoza F.A. (eds) Advances in Computational Intelligence. MICAI 2020. Lecture Notes in Computer Science, vol 12469. Springer, Cham. https://doi.org/10.1007/978-3-030-60887-3_19

  • Castro, B., Sierra, G., Torres-Moreno, J.M., Da Cunha, I.: El discurso y la semántica como recursos para la detección de similitud textual. In: Proceedings of the III RST Meeting (8th Brazilian Symposium in Information and Human Language Technology, STIL 2011). Brazilian Computer Society, Cuiabá (2011)

Acknowledgments

This work was partially supported by CONACYT project A1-S-27780 and UNAM-PAPIIT projects IA401219, TA100520, and AG400119.

License

CC0 1.0 Universal