模型:
uaritm/multilingual_en_uk_pl_ru
This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
The model is used on the resource of multilingual analysis of patient complaints to determine the specialty of the doctor that is needed in this case: Virtual General Practice
You can test the quality and speed of the model
This model is an updated version of the model: uaritm/multilingual_en_ru_uk
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
pip install -U sentence-transformers
Then you can use the model like this: ```python from sentence_transformers import SentenceTransformer sentences = ["This is an example sentence", "Each sentence is converted"] model = SentenceTransformer('{MODEL_NAME}') embeddings = model.encode(sentences) print(embeddings)
Without sentence-transformers , you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel import torch #Mean Pooling - Take attention mask into account for correct averaging def mean_pooling(model_output, attention_mask): token_embeddings = model_output[0] #First element of model_output contains all token embeddings input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) # Sentences we want sentence embeddings for sentences = ['This is an example sentence', 'Each sentence is converted'] # Load model from HuggingFace Hub tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}') model = AutoModel.from_pretrained('{MODEL_NAME}') # Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') # Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) # Perform pooling. In this case, mean pooling. sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask']) print("Sentence embeddings:") print(sentence_embeddings)
For an automated evaluation of this model, see the Sentence Embeddings Benchmark : https://seb.sbert.net
The model was trained with the parameters:
DataLoader :
torch.utils.data.dataloader.DataLoader of length 50184 with parameters:
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss :
sentence_transformers.losses.MSELoss.MSELoss
Parameters of the fit()-Method:
{ "epochs": 4, "evaluation_steps": 1000, "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator", "max_grad_norm": 1, "optimizer_class": "<class 'torch.optim.adamw.AdamW'>", "optimizer_params": { "eps": 1e-06, "lr": 2e-05 }, "scheduler": "WarmupLinear", "steps_per_epoch": null, "warmup_steps": 10000, "weight_decay": 0.01 }
SentenceTransformer( (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False}) )
@misc{Uaritm, title={sentence-transformers: Semantic similarity of medical texts}, author={Vitaliy Ostashko}, year={2023}, url={https://aihealth.site}, }