Dataset:

shunk031/jsnli

Other:

natural-language-inference nli jsnli

License:

cc-by-sa-4.0

Multilinguality:

monolingual

Language:

Subtasks:

multi-input-text-classification natural-language-inference

Task:

Text classification

Dataset Card for JSNLI

Dataset Summary

From "Japanese SNLI (JSNLI) dataset - KUROHASHI-CHU-MURAWAKI LAB":

This dataset is a Japanese translation of SNLI, the standard benchmark for natural language inference (NLI).

Dataset Preprocessing

Supported Tasks and Leaderboards

Languages

All annotations use Japanese as the primary language.

Dataset Structure

The dataset is in TSV format; each row is a triple of label, premise, and hypothesis. The premise and hypothesis are segmented into morphemes by JUMAN++. An example follows.

entailment      自転車 で ２ 人 の 男性 が レース で 競い ます 。       人々 は 自転車 に 乗って います 。
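A minimal sketch of reading one such row, assuming the three fields are separated by tab characters as the TSV format implies. The names `NLIExample`, `parse_jsnli_line`, and `detokenize` are hypothetical helpers, not part of the dataset or of any library, and removing every space to undo the JUMAN++ segmentation assumes the original sentences contained no meaningful spaces.

```python
from typing import NamedTuple


class NLIExample(NamedTuple):
    """One JSNLI row: label, premise, hypothesis (hypothetical container)."""
    label: str
    premise: str
    hypothesis: str


def parse_jsnli_line(line: str) -> NLIExample:
    """Split one tab-separated row into its three fields."""
    label, premise, hypothesis = line.rstrip("\n").split("\t")
    return NLIExample(label, premise, hypothesis)


def detokenize(segmented: str) -> str:
    """Drop the spaces inserted between morphemes (assumes no other spaces)."""
    return segmented.replace(" ", "")


row = (
    "entailment\t"
    "自転車 で ２ 人 の 男性 が レース で 競い ます 。\t"
    "人々 は 自転車 に 乗って います 。"
)
example = parse_jsnli_line(row)
print(example.label)
print(detokenize(example.premise))
```

Keeping the segmented text is useful for models that expect pre-tokenized input; the detokenized form is closer to what a subword tokenizer would normally receive.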

Data Instances

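from datasets import load_dataset

# Load the unfiltered configuration; "with-filtering" selects the filtered training data.
dataset = load_dataset("shunk031/jsnli", "without-filtering")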

{
    'label': 'neutral', 
    'premise': 'ガレージ で 、 壁 に ナイフ を 投げる 男 。', 
    'hypothesis': '男 は 魔法 の ショー の ため に ナイフ を 投げる 行為 を 練習 して い ます 。'
}
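As a hedged illustration of how an instance like the one above could be prepared for a classifier, the sketch below maps the string label to an integer. The `LABEL_TO_ID` ordering and the `encode_instance` helper are assumptions made for this example; the dataset card does not specify a label order.

```python
from typing import Dict, Tuple

# Assumed ordering of the three SNLI labels; not specified by the dataset card.
LABEL_TO_ID: Dict[str, int] = {"entailment": 0, "neutral": 1, "contradiction": 2}


def encode_instance(instance: Dict[str, str]) -> Tuple[str, str, int]:
    """Return (premise, hypothesis, label_id) for one instance dictionary."""
    return (
        instance["premise"],
        instance["hypothesis"],
        LABEL_TO_ID[instance["label"]],
    )


instance = {
    "label": "neutral",
    "premise": "ガレージ で 、 壁 に ナイフ を 投げる 男 。",
    "hypothesis": "男 は 魔法 の ショー の ため に ナイフ を 投げる 行為 を 練習 して い ます 。",
}
premise, hypothesis, label_id = encode_instance(instance)
print(label_id)
```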

Data Fields

Data Splits

name	train	validation
without-filtering	548,014	3,916
with-filtering	533,005	3,916
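The effect of filtering on the training split can be read off the table with a little arithmetic. The sketch below hard-codes the split sizes from the table; `SPLIT_SIZES` is a name introduced only for this example.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "without-filtering": {"train": 548_014, "validation": 3_916},
    "with-filtering": {"train": 533_005, "validation": 3_916},
}

unfiltered_train = SPLIT_SIZES["without-filtering"]["train"]
filtered_train = SPLIT_SIZES["with-filtering"]["train"]

# Number and share of training pairs dropped by the automatic filtering.
removed = unfiltered_train - filtered_train
removed_ratio = removed / unfiltered_train

print(f"{removed:,} pairs removed ({removed_ratio:.2%} of the unfiltered training data)")
# prints: 15,009 pairs removed (2.74% of the unfiltered training data)
```

The validation split is the same size in both configurations, so only the training data differs between them.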

Dataset Creation

Curation Rationale

Source Data

Initial Data Collection and Normalization

Who are the source language producers?

Annotations

Annotation process

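The dataset was built by applying machine translation to SNLI, then filtering the evaluation data precisely through crowdsourcing and filtering the training data automatically by computer. Two versions are released: one whose training data is not filtered at all, and the filtered version that achieved the highest accuracy. The unfiltered training data contains 548,014 pairs, the filtered training data contains 533,005 pairs, and the evaluation data contains 3,916 pairs. See the references for details.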

Who are the annotators?

Personal and Sensitive Information

Considerations for Using the Data

Social Impact of Dataset

Discussion of Biases

Other Known Limitations

Additional Information

For questions about this dataset, please contact nl-resource (at) nlp.ist.i.kyoto-u.ac.jp.

Dataset Curators

Licensing Information

This dataset is licensed under CC BY-SA 4.0, the same license as SNLI. For SNLI itself, see the references.

Citation Information

@article{吉越卓見2020機械翻訳を用いた自然言語推論データセットの多言語化,
  title={機械翻訳を用いた自然言語推論データセットの多言語化},
  author={吉越卓見 and 河原大輔 and 黒橋禎夫 and others},
  journal={研究報告自然言語処理 (NL)},
  volume={2020},
  number={6},
  pages={1--8},
  year={2020}
}

@inproceedings{bowman2015large,
  title={A large annotated corpus for learning natural language inference},
  author={Bowman, Samuel and Angeli, Gabor and Potts, Christopher and Manning, Christopher D},
  booktitle={Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
  pages={632--642},
  year={2015}
}

@article{young2014image,
  title={From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions},
  author={Young, Peter and Lai, Alice and Hodosh, Micah and Hockenmaier, Julia},
  journal={Transactions of the Association for Computational Linguistics},
  volume={2},
  pages={67--78},
  year={2014},
  publisher={MIT Press}
}

Contributions

Sincere thanks to Takumi Yoshikoshi, Daisuke Kawahara, and Sadao Kurohashi for releasing the JSNLI dataset.

Author:

shunk031

Dataset size:

124.92 KB