Dataset:

emrecan/nli_tr_for_simcse

NLI-TR for Supervised SimCSE

This dataset is a modified version of the NLI-TR dataset. It is intended for training supervised SimCSE models that produce sentence embeddings. The dataset was produced with the following steps:

  • Merge the train splits of the snli_tr and multinli_tr subsets.
  • Find every premise that has both an entailment hypothesis and a contradiction hypothesis.
  • Write the resulting triplets in the format sent0 (premise), sent1 (entailment hypothesis), hard_neg (contradiction hypothesis).

See this Colab Notebook for training and evaluation on Turkish sentences.
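
The preparation steps listed above can be sketched in Python roughly as follows. This is a minimal illustration, not the script actually used to build the dataset. The helper `build_triplets`, the toy rows, and the label ids (0 for entailment, 2 for contradiction, following the SNLI convention) are assumptions; so is the choice to pair every entailment hypothesis with every contradiction hypothesis of the same premise, which the description above does not specify.

```python
from collections import defaultdict
from itertools import product

# Assumed SNLI-style label ids; verify against the actual NLI-TR label mapping.
ENTAILMENT, CONTRADICTION = 0, 2


def build_triplets(rows):
    """Hypothetical helper: turn NLI rows into SimCSE-style triplets.

    rows: iterable of dicts with 'premise', 'hypothesis' and 'label' keys.
    """
    # Group entailment and contradiction hypotheses by premise.
    by_premise = defaultdict(lambda: {"ent": [], "con": []})
    for row in rows:
        if row["label"] == ENTAILMENT:
            by_premise[row["premise"]]["ent"].append(row["hypothesis"])
        elif row["label"] == CONTRADICTION:
            by_premise[row["premise"]]["con"].append(row["hypothesis"])

    # Keep only premises that have both kinds of hypothesis.
    # Assumption: every (entailment, contradiction) pair becomes a triplet.
    triplets = []
    for premise, hyps in by_premise.items():
        for ent, con in product(hyps["ent"], hyps["con"]):
            triplets.append({"sent0": premise, "sent1": ent, "hard_neg": con})
    return triplets


# Toy rows standing in for the train splits of snli_tr and multinli_tr.
snli_rows = [
    {"premise": "Bir adam koşuyor.", "hypothesis": "Bir adam hareket ediyor.", "label": 0},
    {"premise": "Bir adam koşuyor.", "hypothesis": "Bir adam uyuyor.", "label": 2},
]
multinli_rows = [
    # Only a neutral hypothesis for this premise, so it yields no triplet.
    {"premise": "Kadın kitap okuyor.", "hypothesis": "Kadın evde.", "label": 1},
]

# Step 1: merge the splits; steps 2 and 3: find and write the triplets.
triplets = build_triplets(snli_rows + multinli_rows)
```

With these toy rows, only the first premise has both an entailment and a contradiction hypothesis, so `triplets` holds a single record with the `sent0`, `sent1`, and `hard_neg` fields.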