Dataset: assin
Task: text-classification
Language: pt
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: expert-generated
Source datasets: original
License: unknown

The ASSIN (Avaliação de Similaridade Semântica e INferência textual) corpus consists of annotated pairs of sentences written in Portuguese and is suitable for exploring textual entailment and paraphrase classifiers. The sentence pairs were extracted from news articles written in European Portuguese (EP) and Brazilian Portuguese (BP), obtained from Google News Portugal and Google News Brazil, respectively.

To create the corpus, the authors started by collecting from Google News a set of news articles describing the same event (one news article from Google News Portugal and another from Google News Brazil). Then, they employed Latent Dirichlet Allocation (LDA) models to retrieve pairs of similar sentences from sets of news articles that had been grouped around the same topic. For that, two LDA models were trained (one for EP and one for BP) on large external collections of unannotated news articles from Portuguese and Brazilian news providers, respectively. The authors then defined a lower and an upper threshold for the sentence similarity score of the retrieved pairs, taking into account that high similarity scores correspond to sentences with almost the same content (paraphrase candidates), and low similarity scores correspond to sentences that differ greatly in content (no-relation candidates).

From the collection of sentence pairs obtained at this stage, the authors made some manual grammatical corrections and discarded some wrongly retrieved pairs. Furthermore, a preliminary analysis of the retrieved sentence pairs showed that very few contradictions had been retrieved in the previous stage. The authors also noticed that, even though paraphrases are not very frequent, they do occur with some frequency in news articles.
Consequently, in contrast with most currently available corpora for other languages, which use the labels “neutral”, “entailment” and “contradiction” for the task of recognizing textual entailment (RTE), the authors of the ASSIN corpus decided to use the labels “none”, “entailment” and “paraphrase”. Finally, the sentence pairs were manually annotated by human annotators. At least four annotators were randomly selected to annotate each pair of sentences in two steps: (i) assigning a semantic similarity label (a score between 1 and 5, from unrelated to very similar); and (ii) providing an entailment label (one sentence entails the other, the sentences are paraphrases, or there is no relation). Sentence pairs for which no entailment label was shared by at least three annotators were considered controversial and thus discarded from the gold standard annotations. The full dataset has 10,000 sentence pairs, half in Brazilian Portuguese (ptbr) and half in European Portuguese (ptpt). Each language variant has 2,500 pairs for training, 500 for validation and 2,000 for testing.
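The two-step annotation and the agreement filter can be illustrated with a small sketch. It is not the authors' code: the function name `aggregate_pair` and the constant `MIN_AGREEMENT` are hypothetical, the rule is read as "keep a pair only if at least three annotators chose the same entailment label", and averaging the similarity scores is an assumption, since the text above does not say how individual scores are combined.

```python
from collections import Counter
from statistics import mean
from typing import List, Optional

# Assumed reading of the agreement rule: three matching votes are required.
MIN_AGREEMENT = 3


def aggregate_pair(similarity_scores: List[float],
                   entailment_votes: List[str]) -> Optional[dict]:
    """Combine per-annotator judgments for one sentence pair.

    Returns None when the pair is controversial, i.e. when no entailment
    label was chosen by at least MIN_AGREEMENT annotators.
    """
    label, count = Counter(entailment_votes).most_common(1)[0]
    if count < MIN_AGREEMENT:
        return None  # discarded from the gold standard
    return {
        # Assumption: the gold similarity is the mean of the 1-5 scores.
        "relatedness_score": mean(similarity_scores),
        "entailment_judgment": label,
    }


# Three of four annotators agree, so the pair is kept.
kept = aggregate_pair([3, 4, 4, 3],
                      ["entailment", "entailment", "entailment", "none"])

# No label reaches three votes, so the pair is dropped.
dropped = aggregate_pair([1, 2, 3, 4],
                         ["none", "none", "entailment", "paraphrase"])
```

With four annotators, the threshold of three means a 2-2 split or a 2-1-1 split both count as controversial.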
The language supported is Portuguese.
An example from the ASSIN dataset looks as follows:
{ "entailment_judgment": 0, "hypothesis": "André Gomes entra em campo quatro meses depois de uma lesão na perna esquerda o ter afastado dos relvados.", "premise": "Relembre-se que o atleta estava afastado dos relvados desde maio, altura em que contraiu uma lesão na perna esquerda.", "relatedness_score": 3.5, "sentence_pair_id": 1 }
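As a rough illustration of how such a record might be read, the sketch below parses the example with the standard `json` module and turns the integer label into a name. The list `LABEL_NAMES` and the helper `describe` are hypothetical; the order "none", "entailment", "paraphrase" is assumed from the order in which the labels are listed in the description, and the card does not state the integer-to-name mapping explicitly.

```python
import json

# Assumed label order (index 0, 1, 2); not confirmed by the text of the card.
LABEL_NAMES = ["none", "entailment", "paraphrase"]

RECORD_JSON = """
{
  "entailment_judgment": 0,
  "hypothesis": "André Gomes entra em campo quatro meses depois de uma lesão na perna esquerda o ter afastado dos relvados.",
  "premise": "Relembre-se que o atleta estava afastado dos relvados desde maio, altura em que contraiu uma lesão na perna esquerda.",
  "relatedness_score": 3.5,
  "sentence_pair_id": 1
}
"""


def describe(record: dict) -> str:
    """Summarize one record using the assumed label names."""
    label = LABEL_NAMES[record["entailment_judgment"]]
    return (f'pair {record["sentence_pair_id"]}: {label}, '
            f'relatedness {record["relatedness_score"]}')


record = json.loads(RECORD_JSON)
summary = describe(record)
print(summary)
```

Each record therefore carries both annotations described earlier: a relatedness score on the 1 to 5 scale and a three-way entailment label.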
The data is split into train, validation and test sets. The split sizes are as follows:
| | Train | Val | Test |
|---|---|---|---|
| full | 5000 | 1000 | 4000 |
| ptbr | 2500 | 500 | 2000 |
| ptpt | 2500 | 500 | 2000 |
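The arithmetic behind the table can be checked with a short sketch. The dictionary `SPLIT_SIZES` and the helper `combined` are hypothetical names; the numbers are copied from the table, and the only assumption is that `full` is the union of `ptbr` and `ptpt`, as the description suggests.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "full": {"train": 5000, "validation": 1000, "test": 4000},
    "ptbr": {"train": 2500, "validation": 500, "test": 2000},
    "ptpt": {"train": 2500, "validation": 500, "test": 2000},
}


def combined(sizes: dict, parts: list) -> dict:
    """Add up the per-split sizes of the given subsets."""
    splits = sizes[parts[0]].keys()
    return {split: sum(sizes[part][split] for part in parts) for split in splits}


# The two language variants together should match the "full" row.
variants_total = combined(SPLIT_SIZES, ["ptbr", "ptpt"])
assert variants_total == SPLIT_SIZES["full"]

# All splits of "full" together give the 10,000 pairs quoted in the description.
total_pairs = sum(SPLIT_SIZES["full"].values())
print(total_pairs)  # 10000
```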
@inproceedings{fonseca2016assin, title={ASSIN: Avaliacao de similaridade semantica e inferencia textual}, author={Fonseca, E and Santos, L and Criscuolo, Marcelo and Aluisio, S}, booktitle={Computational Processing of the Portuguese Language-12th International Conference, Tomar, Portugal}, pages={13--15}, year={2016} }
Thanks to @jonatasgrosman for adding this dataset.