KeyBERT-BG - Bulgarian Keyword Extraction

KeyBERT-BG is a model trained for keyword extraction in Bulgarian. It was trained on a custom dataset, which I've uploaded to Kaggle.

Usage

Import the libraries:

import re
from typing import Dict
from pprint import pprint

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

First, you'll have to define the following function, since the text preprocessing is custom and the standard pipeline method won't suffice:

def extract_keywords(
    text: str, 
    model_id="auhide/keybert-bg", 
    max_len: int = 300,
    id2group: Dict[int, str] = {
        # Indicates that this is not a keyword.
        0: "O",
        # Beginning of a keyword.
        1: "B-KWD",
        # Additional keyword tokens (might also indicate the end of a keyword sequence).
        # You can merge these with the beginning keyword `B-KWD`.
        2: "I-KWD",
    },
    # Probability threshold based on which the keywords will be accepted.
    # If their probability is less than `threshold`, they won't be added to the list of keywords.
    threshold=0.50
):
    # Initialize the tokenizer and model.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    keybert = AutoModelForTokenClassification.from_pretrained(model_id)

    # Preprocess the text.
    # Surround punctuation with whitespace and convert multiple whitespaces
    # into single ones.
    text = re.sub(r"([,\.?!;:\'\"\(\)\[\]„”])", r" \1 ", text)
    text = re.sub(r"\s+", r" ", text)
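    # e.g. "Мъск, Туитър." -> "Мъск , Туитър . "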
    words = text.split()

    # Tokenize the processed `text` (this includes padding or truncation).
    tokens_data = tokenizer(
        text.strip(), 
        padding="max_length", 
        max_length=max_len, 
        truncation=True, 
        return_tensors="pt"
    )
    input_ids = tokens_data.input_ids
    attention_mask = tokens_data.attention_mask

    # Predict the keywords. Gradients aren't needed for inference,
    # so the forward pass runs under `torch.no_grad()`.
    with torch.no_grad():
        out = keybert(input_ids, attention_mask=attention_mask).logits
    # Softmax the last dimension so that the probabilities add up to 1.0.
    out = out.softmax(-1)
    # Based on the probabilities, generate the most probable keywords.
    out_argmax = out.argmax(-1)
    prediction = out_argmax.squeeze(0).tolist()
    probabilities = out.squeeze(0)
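    # `prediction` holds one predicted class id per token (length `max_len`);
    # `probabilities` holds one distribution over the 3 classes per token.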
    
    return [
        {
            # Since the list of words does not have a [CLS] token, the index `i`
            # is one step forward, which means that if we want to access the 
            # appropriate keyword we should use the index `i - 1`.
            "entity": words[i - 1],
            "entity_group": id2group[idx],
            "score": float(probabilities[i, idx])
        } 
        for i, idx in enumerate(prediction) 
        if (idx == 1 or idx == 2) and float(probabilities[i, idx]) > threshold
    ]

Choose a text and run the model on it. For example, I've chosen to use this article. You can then call extract_keywords on it to extract its keywords:

# Read the text from a file, since the article text is large.
with open("input_text.txt", "r", encoding="utf-8") as f:
    text = f.read()

# You can change the threshold based on your needs.
keywords = extract_keywords(text, threshold=0.5)
print("Keywords:")
pprint(keywords)
Keywords:
[{'entity': 'Туитър', 'entity_group': 'B-KWD', 'score': 0.9278278946876526},
 {'entity': 'Илон', 'entity_group': 'B-KWD', 'score': 0.5862686634063721},
 {'entity': 'Мъск', 'entity_group': 'B-KWD', 'score': 0.5289096832275391},
 {'entity': 'изпълнителен',
  'entity_group': 'B-KWD',
  'score': 0.679943323135376},
 {'entity': 'директор', 'entity_group': 'I-KWD', 'score': 0.6161141991615295}]
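
Notice that the last two entities form a single keyword phrase. As the comments in id2group mention, you can merge each I-KWD entity with the B-KWD entity that precedes it. Below is a minimal sketch of one way to do that (merge_keywords is a hypothetical helper, not part of the model card; it assumes the B-KWD entity of each phrase has also passed the threshold):

def merge_keywords(entities):
    # Hypothetical helper: join each `B-KWD` entity with the `I-KWD`
    # entities that follow it, producing multi-word keywords.
    keywords = []
    for entity in entities:
        if entity["entity_group"] == "B-KWD" or not keywords:
            keywords.append(entity["entity"])
        else:
            keywords[-1] += " " + entity["entity"]
    return keywords

# For the keywords above, this yields:
# ['Туитър', 'Илон', 'Мъск', 'изпълнителен директор']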

Please note that you can also use the pipeline method from transformers, but the results will be worse, since it doesn't apply the custom text preprocessing.
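
For comparison, here is a minimal sketch of that pipeline approach; since it skips the custom punctuation preprocessing defined above, expect noisier results:

from pprint import pprint

from transformers import pipeline

# A minimal sketch of the standard pipeline approach. It does not apply
# the custom punctuation preprocessing, so its keywords are worse.
keyword_extractor = pipeline(
    "token-classification",
    model="auhide/keybert-bg",
    # "simple" merges subword tokens and B-KWD/I-KWD spans into entities.
    aggregation_strategy="simple",
)

pprint(keyword_extractor(text))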