This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
Current version is distillation of the LaBSE model on private corpus.
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer from sentence_transformers.util import cos_sim sentences = [ "הם היו שמחים לראות את האירוע שהתקיים.", "לראות את האירוע שהתקיים היה מאוד משמח להם." ] model = SentenceTransformer('imvladikon/sentence-transformers-alephbert') embeddings = model.encode(sentences) print(cos_sim(*tuple(embeddings)).item()) # 0.883316159248352
Without sentence-transformers , you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
import torch from torch import nn from transformers import AutoTokenizer, AutoModel #Mean Pooling - Take attention mask into account for correct averaging def mean_pooling(model_output, attention_mask): token_embeddings = model_output[0] #First element of model_output contains all token embeddings input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) # Sentences we want sentence embeddings for sentences = [ "הם היו שמחים לראות את האירוע שהתקיים.", "לראות את האירוע שהתקיים היה מאוד משמח להם." ] # Load model from HuggingFace Hub tokenizer = AutoTokenizer.from_pretrained('imvladikon/sentence-transformers-alephbert') model = AutoModel.from_pretrained('imvladikon/sentence-transformers-alephbert') # Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') # Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) # Perform pooling. In this case, mean pooling. sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask']) cos_sim = nn.CosineSimilarity(dim=0, eps=1e-6) print(cos_sim(sentence_embeddings[0], sentence_embeddings[1]).item())
For an automated evaluation of this model, see the Sentence Embeddings Benchmark : https://seb.sbert.net
The model was trained with the parameters:
DataLoader :
torch.utils.data.dataloader.DataLoader of length 44999 with parameters:
{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss :
sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
Parameters of the fit()-Method:
{ "epochs": 10, "evaluation_steps": 0, "evaluator": "NoneType", "max_grad_norm": 1, "optimizer_class": "<class 'torch.optim.adamw.AdamW'>", "optimizer_params": { "lr": 2e-05 }, "scheduler": "WarmupLinear", "steps_per_epoch": null, "warmup_steps": 44999, "weight_decay": 0.01 }
SentenceTransformer( (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False}) )
@misc{seker2021alephberta, title={AlephBERT:A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With}, author={Amit Seker and Elron Bandel and Dan Bareket and Idan Brusilovsky and Refael Shaked Greenfeld and Reut Tsarfaty}, year={2021}, eprint={2104.04052}, archivePrefix={arXiv}, primaryClass={cs.CL} }
@misc{reimers2019sentencebert, title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks}, author={Nils Reimers and Iryna Gurevych}, year={2019}, eprint={1908.10084}, archivePrefix={arXiv}, primaryClass={cs.CL} }