Model:
savasy/bert-base-turkish-ner-cased
An easy-to-use Python NER (Named Entity Recognition) model for Turkish (BERT + transfer learning).
Thanks to @stefan-it, I followed the steps below for training:
    mkdir -p tr-data
    cd tr-data
    for file in train.txt dev.txt test.txt labels.txt
    do
      wget https://schweter.eu/storage/turkish-bert-wikiann/$file
    done
    cd ..

This downloads the pre-processed datasets, with training, dev and test splits, into a tr-data folder.
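The same download can also be scripted in Python. The following is a minimal sketch using only the standard library; the base URL and file names come from the shell loop above, while the function names `dataset_urls` and `download_dataset` are illustrative and not part of the original setup.

```python
import os
import urllib.request

# Base URL and file names as used in the shell loop.
BASE_URL = "https://schweter.eu/storage/turkish-bert-wikiann/"
FILES = ["train.txt", "dev.txt", "test.txt", "labels.txt"]


def dataset_urls(base_url=BASE_URL, files=FILES):
    """Return (file name, URL) pairs for every dataset file."""
    return [(name, base_url + name) for name in files]


def download_dataset(target_dir="tr-data"):
    """Download all dataset files into target_dir, creating it if needed."""
    os.makedirs(target_dir, exist_ok=True)
    for name, url in dataset_urls():
        destination = os.path.join(target_dir, name)
        urllib.request.urlretrieve(url, destination)
        print(f"saved {destination}")
```

Calling `download_dataset()` from the project root should leave the four files in `tr-data`, provided the server is still reachable.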
Run pre-training
After downloading the dataset, pre-training can be started. Just set the following environment variables:
    export MAX_LENGTH=128
    export BERT_MODEL=dbmdz/bert-base-turkish-cased
    export OUTPUT_DIR=tr-new-model
    export BATCH_SIZE=32
    export NUM_EPOCHS=3
    export SAVE_STEPS=625
    export SEED=1
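To see how BATCH_SIZE, NUM_EPOCHS and SAVE_STEPS interact, the arithmetic can be sketched as below. This assumes a single GPU, no gradient accumulation, and a checkpoint written every SAVE_STEPS optimizer steps. The training-set size is not stated here, so the 20,000 used in the usage note is a hypothetical figure, and the function names are illustrative.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def checkpoints_written(num_examples, batch_size, num_epochs, save_steps):
    """Checkpoints saved when one is written every save_steps steps."""
    total_steps = steps_per_epoch(num_examples, batch_size) * num_epochs
    return total_steps // save_steps
```

With a hypothetical 20,000 training sentences, `steps_per_epoch(20_000, 32)` gives 625, so SAVE_STEPS=625 would correspond to one checkpoint per epoch and `checkpoints_written(20_000, 32, 3, 625)` gives 3.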
Then run pre-training:
    python3 run_ner_old.py --data_dir ./tr-data \
      --model_type bert \
      --labels ./tr-data/labels.txt \
      --model_name_or_path $BERT_MODEL \
      --output_dir $OUTPUT_DIR-$SEED \
      --max_seq_length $MAX_LENGTH \
      --num_train_epochs $NUM_EPOCHS \
      --per_gpu_train_batch_size $BATCH_SIZE \
      --save_steps $SAVE_STEPS \
      --seed $SEED \
      --do_train \
      --do_eval \
      --do_predict \
      --fp16
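Before training, it can help to inspect the files passed via --data_dir. The sketch below assumes a CoNLL-style layout: one token and its label per line, separated by whitespace, with a blank line between sentences. That layout is an assumption, not something stated above, and the names `read_conll` and `label_counts` are illustrative.

```python
from collections import Counter


def read_conll(lines):
    """Group 'token label' lines into sentences of (token, label) pairs."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or document marker) closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumed convention: first field is the token, last is the label;
        # a line without a label is treated as "O".
        label = parts[-1] if len(parts) > 1 else "O"
        current.append((parts[0], label))
    if current:
        sentences.append(current)
    return sentences


def label_counts(sentences):
    """Count how often each label occurs across all sentences."""
    return Counter(label for sentence in sentences for _, label in sentence)
```

Opening `tr-data/train.txt` with `encoding="utf-8"` and passing the file object to `read_conll` yields the sentence list; `label_counts` then gives a quick view of class balance.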
Usage:

    from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

    # Load the trained model and its tokenizer from the Hugging Face Hub.
    model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
    tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")

    # Build a NER pipeline and tag a sample sentence.
    ner = pipeline('ner', model=model, tokenizer=tokenizer)
    ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
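The pipeline returns one prediction per token, so word pieces and multi-word names arrive as separate items. A minimal sketch for merging them into whole entities follows. It assumes each prediction is a dict with `entity`, `word` and `index` keys, that labels follow a B-/I- prefix scheme, and that word-piece continuations start with `##`. These are assumptions about the output format, and `group_entities` is an illustrative helper, not part of the library.

```python
def group_entities(token_predictions):
    """Merge token-level predictions into (entity type, text) spans."""
    entities = []
    for pred in token_predictions:
        label = pred["entity"]
        if label == "O":
            continue  # not part of any entity
        prefix, _, entity_type = label.partition("-")
        word = pred["word"]
        previous = entities[-1] if entities else None
        # Only merge tokens that directly follow the previous entity token.
        adjacent = previous is not None and pred["index"] == previous["last_index"] + 1
        if adjacent and word.startswith("##"):
            # Word-piece continuation: glue it onto the previous token.
            previous["text"] += word[2:]
            previous["last_index"] = pred["index"]
        elif adjacent and prefix == "I" and previous["type"] == entity_type:
            # Next word of the same entity.
            previous["text"] += " " + word
            previous["last_index"] = pred["index"]
        else:
            text = word[2:] if word.startswith("##") else word
            entities.append({"type": entity_type, "text": text, "last_index": pred["index"]})
    return [(e["type"], e["text"]) for e in entities]
```

Passing the list returned by `ner(...)` to `group_entities` gives one tuple per entity. Recent versions of transformers can perform a similar merge inside the pipeline through the `aggregation_strategy` argument.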
Data1: For the data above
Eval Results:
Test Results:
Data2: https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt
The performance on the data provided by @kemalaraz is as follows:
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
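To compare the two result files programmatically rather than with `cat`, a small parser can help. The sketch below assumes each file holds one `name = value` pair per line; that format is an assumption about what the training script writes, and `parse_results` is an illustrative helper.

```python
def parse_results(text):
    """Parse 'name = value' lines into a dict of floats, skipping other lines."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # no '=' on this line
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            continue  # value is not numeric
    return metrics
```

Reading `eval_results.txt` and `test_results.txt` from the output directory and passing their contents to `parse_results` gives two dicts that can be compared key by key.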