Dataset:

assin2

Task:

Text classification

Subtasks:

text-scoring, natural-language-inference, semantic-similarity-scoring

Language:

Portuguese

Multilinguality:

monolingual

Size:

1K<n<10K

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

original

License:

unknown

Dataset Card for ASSIN 2

Dataset Summary

The ASSIN 2 corpus is composed of rather simple sentences and follows the procedures of SemEval 2014 Task 1. The training and validation data consist of 6,500 and 500 sentence pairs, respectively, in Brazilian Portuguese, annotated for entailment and semantic similarity. Semantic similarity values range from 1 to 5, and the textual entailment classes are either entailment or none. The test data consist of 2,448 sentence pairs with the same annotation. All data were manually annotated.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The supported language is Brazilian Portuguese.

Dataset Structure

Data Instances

An example from the ASSIN 2 dataset looks as follows:

{
  "entailment_judgment": 1,
  "hypothesis": "Uma criança está segurando uma pistola de água",
  "premise": "Uma criança risonha está segurando uma pistola de água e sendo espirrada com água",
  "relatedness_score": 4.5,
  "sentence_pair_id": 1
}

Data Fields

sentence_pair_id: an int64 feature.
premise: a string feature.
hypothesis: a string feature.
relatedness_score: a float32 feature.
entailment_judgment: a classification label, with possible values NONE (0) and ENTAILMENT (1).
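
The field list above can be mirrored in a small typed record, which makes the expected types and value ranges explicit. The following is a minimal sketch that uses only the Python standard library; the class name `Assin2Example` and the constant `ENTAILMENT_LABELS` are hypothetical, and the integer-to-label mapping is an assumption based on the order in which the labels are listed (NONE first, ENTAILMENT second).

```python
import json
from dataclasses import dataclass

# Assumed mapping: the index follows the order in which the labels are
# listed in the field description (NONE first, ENTAILMENT second).
ENTAILMENT_LABELS = ("NONE", "ENTAILMENT")


@dataclass(frozen=True)
class Assin2Example:
    """One sentence pair, with the five fields described above."""

    sentence_pair_id: int
    premise: str
    hypothesis: str
    relatedness_score: float
    entailment_judgment: int

    def __post_init__(self):
        # Similarity scores are documented as ranging from 1 to 5.
        if not 1.0 <= self.relatedness_score <= 5.0:
            raise ValueError(f"relatedness_score out of range: {self.relatedness_score}")
        # The label must be a valid index into ENTAILMENT_LABELS.
        if self.entailment_judgment not in range(len(ENTAILMENT_LABELS)):
            raise ValueError(f"unknown entailment_judgment: {self.entailment_judgment}")

    @property
    def entailment_label(self) -> str:
        """Human-readable name of the entailment class."""
        return ENTAILMENT_LABELS[self.entailment_judgment]


# The example instance from the Data Instances section, as a JSON string.
raw = """
{
  "entailment_judgment": 1,
  "hypothesis": "Uma criança está segurando uma pistola de água",
  "premise": "Uma criança risonha está segurando uma pistola de água e sendo espirrada com água",
  "relatedness_score": 4.5,
  "sentence_pair_id": 1
}
"""

example = Assin2Example(**json.loads(raw))
print(example.sentence_pair_id, example.entailment_label, example.relatedness_score)
```

Validating in `__post_init__` is one possible design choice: malformed rows fail when the record is built rather than later during training or evaluation.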

Data Splits

The data is split into train, validation, and test sets. The split sizes are as follows:

Train	Val	Test
6500	500	2448
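
As a quick check on the arithmetic, the split sizes in the table can be summed and converted to proportions. This is only an illustrative sketch; the names `SPLIT_SIZES`, `total`, and `fractions` are hypothetical, and the numbers are copied from the table above.

```python
# Split sizes as listed in the table above.
SPLIT_SIZES = {"train": 6500, "validation": 500, "test": 2448}

# Total number of sentence pairs across all splits.
total = sum(SPLIT_SIZES.values())

# Share of the whole corpus held by each split.
fractions = {name: size / total for name, size in SPLIT_SIZES.items()}

print(f"total pairs: {total}")
for name, share in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} pairs ({share:.1%})")
```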

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{real2020assin,
  title={The assin 2 shared task: a quick overview},
  author={Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
  booktitle={International Conference on Computational Processing of the Portuguese Language},
  pages={406--412},
  year={2020},
  organization={Springer}
}

Contributions

Thanks to @jonatasgrosman for adding this dataset.

Author:

Anonymous

Dataset size:

12.7 KB