Arabic Flair + fastText Part-of-Speech tagging Model (Egyptian and Levant)

Pretrained part-of-speech tagging model built on a joint corpus written in Egyptian and Levantine (Jordanian, Lebanese, Palestinian, Syrian) dialects, with code-switching between Egyptian Arabic and English. The model is trained using Flair (forward + backward) and fastText embeddings.

Pretraining Corpora:

This sequence labeling model was pretrained on three corpora jointly:

  • 4 Dialects: a dialectal Arabic dataset covering four dialects of Arabic: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR). The dataset for each dialect consists of 350 manually segmented and PoS-tagged tweets.
  • UD South Levantine Arabic MADAR: a dataset of 100 manually annotated sentences taken from the MADAR (Multi-Arabic Dialect Applications and Resources) project, by Shorouq Zahra.
  • Parts of the Cairo Students Code-Switch (CSCS) corpus developed for "Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus" by Hamed et al.

Usage

    from flair.data import Sentence
    from flair.models import SequenceTagger
      
    tagger = SequenceTagger.load("megantosh/flair-arabic-dialects-codeswitch-egy-lev")
    sentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .')
    tagger.predict(sentence)
    for entity in sentence.get_spans('pos'):
        print(entity)
    

    Because right-to-left text is embedded in a left-to-right context, some formatting errors might occur, and your code might appear like this (link accessed on 2020-10-27).
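
One way to reduce this kind of display reordering when printing tagged output is to put each token on its own line and wrap it in Unicode directional isolates. The sketch below is not part of the model or of Flair; the helper names `isolate_rtl` and `format_tagged` are hypothetical, and the token/tag pairs are placeholders for illustration, not real predictions. Whether the isolates are honored depends on the terminal or editor.

```python
# Unicode bidi isolate controls (from the Unicode Bidirectional Algorithm).
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_rtl(text):
    """Wrap text in a right-to-left isolate so it does not reorder its neighbours."""
    return f"{RLI}{text}{PDI}"


def format_tagged(pairs):
    """Return one 'TAG<TAB>token' line per (token, tag) pair."""
    return "\n".join(f"{tag}\t{isolate_rtl(token)}" for token, tag in pairs)


# Placeholder pairs for illustration only; these are NOT real model predictions.
example_pairs = [("الجامعة", "NOUN"), ("في", "ADP"), (".", "PUNCT")]
print(format_tagged(example_pairs))
```

Putting the Latin-script tag first keeps the start of every line left-to-right, so the Arabic token can only affect the layout of its own isolated span.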

    Scores & Tagset

    tag precision recall f1-score support
    INTJ 0.8182 0.9000 0.8571 10
    NOUN 0.9009 0.9402 0.9201 435
    NUM 0.9524 0.8333 0.8889 24
    ADJ 0.8762 0.7603 0.8142 121
    ADP 0.9903 0.9623 0.9761 106
    CCONJ 0.9600 0.9730 0.9664 74
    PROPN 0.9333 0.9333 0.9333 15
    ADV 0.9135 0.8051 0.8559 118
    VERB 0.8852 0.9231 0.9038 117
    PRON 0.9620 0.9465 0.9542 187
    SCONJ 0.8571 0.9474 0.9000 19
    PART 0.9350 0.9791 0.9565 191
    DET 0.9348 0.9149 0.9247 47
    PUNCT 1.0000 1.0000 1.0000 35
    AUX 0.9286 0.9811 0.9541 53
    MENTION 0.9231 1.0000 0.9600 12
    V 0.8571 0.8780 0.8675 82
    FUT-PART+V+PREP+PRON 1.0000 0.0000 0.0000 1
    PROG-PART+V+PRON+PREP+PRON 0.0000 1.0000 0.0000 0
    ADJ+NSUFF 0.6111 0.8462 0.7097 26
    NOUN+NSUFF 0.8182 0.8438 0.8308 64
    PREP+PRON 0.9565 0.9565 0.9565 23
    PUNC 0.9941 1.0000 0.9971 169
    EOS 1.0000 1.0000 1.0000 70
    NOUN+PRON 0.6986 0.8500 0.7669 60
    V+PRON 0.7258 0.8036 0.7627 56
    PART+PRON 1.0000 0.9474 0.9730 19
    PROG-PART+V 0.8333 0.9302 0.8791 43
    DET+NOUN 0.9625 1.0000 0.9809 77
    NOUN+NSUFF+PRON 0.9091 0.7143 0.8000 14
    PROG-PART+V+PRON 0.7083 0.9444 0.8095 18
    PREP+NOUN+NSUFF 0.6667 0.4000 0.5000 5
    NOUN+NSUFF+NSUFF 1.0000 0.0000 0.0000 3
    CONJ 0.9722 1.0000 0.9859 35
    V+PRON+PRON 0.6364 0.5833 0.6087 12
    FOREIGN 0.6667 0.6667 0.6667 3
    PREP+NOUN 0.6316 0.7500 0.6857 16
    DET+NOUN+NSUFF 0.9000 0.9310 0.9153 29
    DET+ADJ+NSUFF 1.0000 0.5714 0.7273 7
    CONJ+PRON 1.0000 0.8750 0.9333 8
    NOUN+CASE 0.0000 0.0000 0.0000 2
    DET+ADJ 1.0000 0.6667 0.8000 6
    PREP 1.0000 0.9718 0.9857 71
    CONJ+FUT-PART+V 0.0000 0.0000 0.0000 1
    CONJ+V 0.6667 0.7500 0.7059 8
    FUT-PART 1.0000 1.0000 1.0000 2
    ADJ+PRON 1.0000 0.0000 0.0000 8
    CONJ+PREP+NOUN+PRON 1.0000 0.0000 0.0000 1
    CONJ+NOUN+PRON 0.3750 1.0000 0.5455 3
    PART+ADJ 1.0000 0.0000 0.0000 1
    PART+NOUN 0.5000 1.0000 0.6667 1
    CONJ+PREP+NOUN 1.0000 0.0000 0.0000 1
    CONJ+NOUN 0.7000 0.7778 0.7368 9
    URL 1.0000 1.0000 1.0000 3
    CONJ+FUT-PART 1.0000 0.0000 0.0000 1
    FUT-PART+V 0.8571 0.6000 0.7059 10
    PREP+NOUN+NSUFF+NSUFF 1.0000 0.0000 0.0000 1
    HASH 1.0000 0.9412 0.9697 17
    ADJ+PREP+PRON 1.0000 0.0000 0.0000 3
    PREP+NOUN+PRON 0.0000 0.0000 0.0000 1
    EMOT 1.0000 0.8889 0.9412 18
    CONJ+PREP 1.0000 0.7500 0.8571 4
    PREP+DET+NOUN+NSUFF 1.0000 0.7500 0.8571 4
    PRON+DET+NOUN+NSUFF 0.0000 1.0000 0.0000 0
    V+PREP+PRON 1.0000 0.0000 0.0000 5
    V+PRON+PREP+PRON 0.0000 1.0000 0.0000 0
    CONJ+NOUN+NSUFF 0.5000 0.5000 0.5000 2
    V+NEG-PART 1.0000 0.0000 0.0000 2
    PREP+DET+NOUN 0.9091 1.0000 0.9524 10
    PREP+V 1.0000 0.0000 0.0000 2
    CONJ+PART 1.0000 0.7778 0.8750 9
    CONJ+V+PRON 1.0000 1.0000 1.0000 5
    PROG-PART+V+PREP+PRON 1.0000 0.5000 0.6667 2
    PREP+NOUN+NSUFF+PRON 1.0000 1.0000 1.0000 1
    ADJ+CASE 1.0000 0.0000 0.0000 1
    PART+NOUN+PRON 1.0000 1.0000 1.0000 1
    PART+V 1.0000 0.0000 0.0000 3
    PART+V+PRON 0.0000 1.0000 0.0000 0
    FUT-PART+V+PRON 0.0000 1.0000 0.0000 0
    FUT-PART+V+PRON+PRON 1.0000 0.0000 0.0000 1
    CONJ+PREP+PRON 1.0000 0.0000 0.0000 1
    CONJ+V+PRON+PREP+PRON 1.0000 0.0000 0.0000 1
    CONJ+V+PREP+PRON 0.0000 1.0000 0.0000 0
    CONJ+DET+NOUN+NSUFF 1.0000 0.0000 0.0000 1
    CONJ+DET+NOUN 0.6667 1.0000 0.8000 2
    CONJ+PREP+DET+NOUN 1.0000 1.0000 1.0000 1
    PREP+PART 1.0000 0.0000 0.0000 2
    PART+V+PRON+NEG-PART 0.3333 0.3333 0.3333 3
    PART+V+NEG-PART 0.3333 0.5000 0.4000 2
    PART+PREP+NEG-PART 1.0000 1.0000 1.0000 3
    PART+PROG-PART+V+NEG-PART 1.0000 0.3333 0.5000 3
    PREP+DET+NOUN+NSUFF+PREP+PRON 1.0000 0.0000 0.0000 1
    PREP+PRON+DET+NOUN 0.0000 1.0000 0.0000 0
    PART+NSUFF 1.0000 0.0000 0.0000 1
    CONJ+PROG-PART+V+PRON 1.0000 1.0000 1.0000 1
    PART+PREP+PRON 1.0000 0.0000 0.0000 1
    CONJ+PART+PREP 1.0000 0.0000 0.0000 1
    NUM+NSUFF 0.6667 0.6667 0.6667 3
    CONJ+PART+V+PRON+NEG-PART 1.0000 1.0000 1.0000 1
    PART+NOUN+NEG-PART 1.0000 1.0000 1.0000 1
    CONJ+ADJ+NSUFF 1.0000 0.0000 0.0000 1
    PREP+ADJ 1.0000 0.0000 0.0000 1
    ADJ+NSUFF+PRON 1.0000 0.0000 0.0000 2
    CONJ+PROG-PART+V 1.0000 0.0000 0.0000 1
    CONJ+PART+PROG-PART+V+PREP+PRON+NEG-PART 1.0000 0.0000 0.0000 1
    CONJ+PART+PREP+PRON+NEG-PART 0.0000 1.0000 0.0000 0
    PREP+PART+PRON 1.0000 0.0000 0.0000 1
    CONJ+ADV+NSUFF 1.0000 0.0000 0.0000 1
    CONJ+ADV 0.0000 1.0000 0.0000 0
    PART+NOUN+PRON+NEG-PART 0.0000 1.0000 0.0000 0
    CONJ+ADJ 1.0000 1.0000 1.0000 1
    • F-score (micro): 0.8974
    • F-score (macro): 0.5188
    • Accuracy (incl. no class): 0.901
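
The gap between the micro and macro F-scores is probably explained by the many rare compound tags that score 0: a macro average counts every tag equally, however few examples it has. The sketch below recomputes F1 from precision and recall for four rows copied from the table above, then compares an unweighted (macro) mean with a support-weighted mean. The support-weighted mean is only a stand-in to show the effect of class frequency; it is not the exact micro computation used to produce the reported score, and the function names are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (tag, precision, recall, support) copied from the score table.
rows = [
    ("INTJ", 0.8182, 0.9000, 10),
    ("ADP", 0.9903, 0.9623, 106),
    ("PUNCT", 1.0000, 1.0000, 35),
    ("ADJ+PRON", 1.0000, 0.0000, 8),  # rare compound tag that is never recovered
]

scores = [f1(p, r) for _, p, r, _ in rows]
supports = [s for _, _, _, s in rows]

# Macro: every tag counts equally, so the zero for ADJ+PRON weighs as much as ADP.
macro = sum(scores) / len(scores)

# Support-weighted: frequent tags dominate, so one rare failure barely matters.
weighted = sum(f * s for f, s in zip(scores, supports)) / sum(supports)

print(f"macro over these rows:            {macro:.4f}")
print(f"support-weighted over these rows: {weighted:.4f}")
```

On this small subset the macro mean comes out well below the weighted mean, which mirrors the pattern in the full report.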

    The class scores for each tag are listed in the table above. Note that compound tags (a single tag made up of multiple agglutinated parts of speech) are each treated as a separate class.
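
Since each compound tag is scored as its own class, downstream code may want to break a predicted tag back into its parts. A minimal sketch, assuming that `+` is the only separator between components (as the tag names in the table suggest) and that hyphenated names such as `PROG-PART` are single components. The function names are hypothetical and not part of Flair.

```python
from collections import Counter
from typing import Iterable, List


def split_compound_tag(tag: str) -> List[str]:
    """Split a compound tag on '+'; a simple tag yields a one-element list."""
    return tag.split("+")


def component_counts(tags: Iterable[str]) -> Counter:
    """Count how often each component appears across a sequence of tags."""
    counts = Counter()
    for tag in tags:
        counts.update(split_compound_tag(tag))
    return counts


print(split_compound_tag("CONJ+PROG-PART+V+PRON"))
print(component_counts(["DET+NOUN", "NOUN+PRON", "V"]))
```

Component counts like these give a coarser view of the tag distribution than the per-class table, which can help when a rare compound has too few examples to evaluate on its own.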

    Citation

    If you use this model, please consider citing this work:

    @unpublished{MMHU21,
    author = "M. Megahed",
    title = "Sequence Labeling Architectures in Diglossia",
    year = {2021},
    doi = "10.13140/RG.2.2.34961.10084",
    url = {https://www.researchgate.net/publication/358956953_Sequence_Labeling_Architectures_in_Diglossia_-_a_case_study_of_Arabic_and_its_dialects}
    }