Dataset:

KBLab/overlim

Tasks:

Text classification

Sub-tasks:

natural-language-inference semantic-similarity-classification sentiment-classification

Languages:

Multilinguality:

translation

Size:

size_categories:unknown

Language creators:

other

Annotation creators:

other

Source datasets:

extended|glue extended|super_glue

Other:

qa-nli paraphrase-identification

License:

cc-by-4.0

Dataset Card for OverLim

Dataset Summary

The OverLim dataset contains some of the GLUE and SuperGLUE tasks automatically translated to Swedish, Danish, and Norwegian (bokmål), using the OpusMT models for MarianMT.

The translations were not manually checked and may contain errors, so results on these datasets should be interpreted with care.

If you want an easy script to train and evaluate your models, have a look here.

Supported Tasks and Leaderboards

The data contains the following tasks from GLUE and SuperGLUE:

GLUE
- mnli
- mrpc
- qnli
- qqp
- rte
- sst
- stsb
- wnli
SuperGLUE
- boolq
- cb
- copa
- rte

Languages

Swedish
Danish
Norwegian (bokmål)

Dataset Structure

Data Instances

Every task has its own set of features, but all share an idx and a label.

GLUE
- mnli
  - premise, hypothesis
- mrpc
  - text_a, text_b
- qnli
  - premise, hypothesis
- qqp
  - text_a, text_b
- sst
  - text
- stsb
  - text_a, text_b
- wnli
  - premise, hypothesis
SuperGLUE
- boolq
  - question, passage
- cb
  - premise, hypothesis
- copa
  - premise, choice1, choice2, question
- rte
  - premise, hypothesis
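
The per-task feature names listed above can be kept in a small lookup table, which is handy when one preprocessing function has to serve every task. The following is only a minimal sketch: the table contents are copied from the list above, while the names TASK_TEXT_FIELDS and text_fields and the sample record are hypothetical and not part of the dataset.

```python
# Text feature names per task, copied from the list above.
# Every task additionally carries "idx" and "label".
TASK_TEXT_FIELDS = {
    # GLUE
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("text_a", "text_b"),
    "qnli": ("premise", "hypothesis"),
    "qqp": ("text_a", "text_b"),
    "sst": ("text",),
    "stsb": ("text_a", "text_b"),
    "wnli": ("premise", "hypothesis"),
    # SuperGLUE
    "boolq": ("question", "passage"),
    "cb": ("premise", "hypothesis"),
    "copa": ("premise", "choice1", "choice2", "question"),
    "rte": ("premise", "hypothesis"),
}


def text_fields(task, example):
    """Return the text features of one example, in the order listed for the task."""
    return [example[name] for name in TASK_TEXT_FIELDS[task]]


# Hypothetical record, made up for illustration only.
sample = {"idx": 0, "label": 1, "text_a": "Hej!", "text_b": "Hallå!"}
print(text_fields("mrpc", sample))
```

Keeping idx and label out of the table reflects that they are shared by all tasks, so only the task-specific text columns need to be looked up.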

Data Splits

To provide a test split, we repurpose the original validation split as the test split and divide the original training split into new training and validation splits with an 80-20 ratio.
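
The re-splitting scheme described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the script used to build the dataset: the function name resplit, the shuffling step, and the seed are hypothetical, since the card does not say whether or how the training data was shuffled before the cut.

```python
import random


def resplit(original_train, original_validation, valid_fraction=0.2, seed=0):
    """Return (train, validation, test) following the scheme described above."""
    # Assumption: examples are shuffled before the cut; the card does not
    # state whether the original training data was shuffled.
    shuffled = list(original_train)
    random.Random(seed).shuffle(shuffled)

    # 20% of the original training split becomes the new validation split.
    n_valid = int(len(shuffled) * valid_fraction)
    new_validation = shuffled[:n_valid]
    new_train = shuffled[n_valid:]

    # The original validation split is reused unchanged as the test split.
    new_test = list(original_validation)
    return new_train, new_validation, new_test


# Toy data: integers stand in for examples.
train_split, valid_split, test_split = resplit(range(100), range(100, 120))
print(len(train_split), len(valid_split), len(test_split))  # 80 20 20
```

Because the test split comes from the original validation data, none of its examples can also appear in the new training or validation splits.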

Dataset Creation

For more information about the individual tasks, see the GLUE (https://gluebenchmark.com) and SuperGLUE (https://super.gluebenchmark.com) websites.

Curation Rationale

Training non-English models is easy, but there is a lack of evaluation datasets to compare their actual performance.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

Thanks to @kb-labb for adding this dataset.

Author:

KBLab

Dataset size:

199.16 MB