Model:

novakat/nerkor-cars-onpp-hubert

Language:

hu

Other:

bert

License:

gpl

Hungarian named entity recognition model with OntoNotes 5 + additional entity types

  • Pretrained model used: SZTAKI-HLT/hubert-base-cc
  • Fine-tuned on the NerKor+CARS-ONPP corpus
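
For reference, a minimal usage sketch with the Hugging Face transformers token-classification pipeline (the example sentence is illustrative and not part of the model card; actual predictions depend on the model):

from transformers import pipeline

# Load the model with word-piece aggregation so output spans are whole words.
ner = pipeline(
    "token-classification",
    model="novakat/nerkor-cars-onpp-hubert",
    aggregation_strategy="simple",
)

for ent in ner("Kovács János tavaly júniusban Budapesten vett egy Teslát."):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))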

Limitations

  • max_seq_length = 448
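
Inputs longer than 448 word-piece tokens have to be truncated or split before inference. A minimal sketch of explicit truncation, assuming the tokenizer bundled with this repository (the variable names and placeholder text are ours):

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("novakat/nerkor-cars-onpp-hubert")
model = AutoModelForTokenClassification.from_pretrained("novakat/nerkor-cars-onpp-hubert")

long_text = "Hosszú magyar szöveg. " * 500  # stand-in for a document over the limit

# Truncate to the model's limit; tokens past 448 are dropped.
# (Chunking the text and tagging each piece is the alternative for full coverage.)
enc = tokenizer(long_text, truncation=True, max_length=448, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
labels = [model.config.id2label[i] for i in logits.argmax(-1)[0].tolist()]
print(labels[:10])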

Training data

The underlying corpus, NerKor+CARS-OntoNotes++, was derived from NYTK-NerKor, a Hungarian gold-standard named-entity-annotated corpus containing about 1 million tokens. It includes a small addition of 12k tokens of text (individual sentences) concerning motor vehicles (cars, buses, motorcycles) from the news archive of hvg.hu. While the annotation in NYTK-NerKor followed the CoNLL-2002 labelling standard with just four NE categories (PER, LOC, MISC, ORG), this version of the corpus features over 30 entity types, including all entity types used in the OntoNotes 5.0 English NER annotation. The new annotation elaborates on subtypes of the LOC and MISC entity types and adds annotation for non-names such as times and dates, quantities, languages, and nationalities or religious or political groups. It also introduces further entity subtypes not present in the OntoNotes 5 annotation (see below).

Tags derived from the OntoNotes 5.0 annotation

Names are annotated according to the following set of types:

PER = PERSON People, including fictional
FAC = FACILITY Buildings, airports, highways, bridges, etc.
ORG = ORGANIZATION Companies, agencies, institutions, etc.
GPE Geopolitical entities: countries, cities, states
LOC = LOCATION Non-GPE locations, mountain ranges, bodies of water
PROD = PRODUCT Vehicles, weapons, foods, etc. (Not services)
EVENT Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART Titles of books, songs, etc.
LAW Named documents made into laws

The following are also annotated in a style similar to names:

NORP Nationalities or religious or political groups
LANGUAGE Any named language
DATE Absolute or relative dates or periods
TIME Times smaller than a day
PERCENT Percentage (including "%")
MONEY Monetary values, including unit
QUANTITY Measurements, as of weight or distance
ORDINAL "first", "second"
CARDINAL Numerals that do not fall under another type

Additional tags (not in OntoNotes 5)

Further subtypes of names of type MISC:

AWARD Awards and prizes
CAR Cars and other motor vehicles
MEDIA Media outlets, TV channels, news portals
SMEDIA Social media platforms
PROJ Projects and initiatives
MISC Unresolved subtypes of MISC entities
MISC-ORG Organization-like unresolved subtypes of MISC entities

Further non-name entities:

DUR Time duration
AGE Age
ID Identifier
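
To illustrate working with the extended tag set, a short sketch that groups pipeline predictions by entity type (the sentence and grouping logic are our own; actual predictions depend on the model):

from collections import defaultdict
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="novakat/nerkor-cars-onpp-hubert",
    aggregation_strategy="simple",
)

# Collect predicted spans keyed by their entity type.
by_type = defaultdict(list)
for ent in ner("A Suzuki Swift két órán át parkolt a Lánchíd mellett."):
    by_type[ent["entity_group"]].append(ent["word"])

print(dict(by_type))  # e.g. spans keyed by CAR, DUR, FAC, ...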

If you use this model, please cite:

@inproceedings{novak-novak-2022-nerkor,
    title = "{N}er{K}or+{C}ars-{O}nto{N}otes++",
    author = "Nov{\'a}k, Attila  and
      Nov{\'a}k, Borb{\'a}la",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.203",
    pages = "1907--1916",
    abstract = "In this paper, we present an upgraded version of the Hungarian NYTK-NerKor named entity corpus, which contains about twice as many annotated spans and 7 times as many distinct entity types as the original version. We used an extended version of the OntoNotes 5 annotation scheme including time and numerical expressions. NerKor is the newest and biggest NER corpus for Hungarian containing diverse domains. We applied cross-lingual transfer of NER models trained for other languages based on multilingual contextual language models to preannotate the corpus. We corrected the annotation semi-automatically and manually. Zero-shot preannotation was very effective with about 0.82 F1 score for the best model. We also added a 12000-token subcorpus on cars and other motor vehicles. We trained and release a transformer-based NER tagger for Hungarian using the annotation in the new corpus version, which provides similar performance to an identical model trained on the original version of the corpus.",
}