The underlying corpus, NerKor+CARS-OntoNotes++ , was derived from NYTK-NerKor , a Hungarian gold standard named entity annotated corpus containing about 1 million tokens. It includes a small addition of 12k tokens of text (individual sentences) concerning motor vehicles (cars, buses, motorcycles) from the news archive of hvg.hu . While the annotation in NYTK-NerKor followed the CoNLL2002 labelling standard with just four NE categories ( PER , LOC , MISC , ORG ), this version of the corpus features over 30 entity types, including all entity types used in the [OntoNotes 5.0] English NER annotation. The new annotation elaborates on subtypes of the LOC and MISC entity types, and includes annotation for non-names like times and dates, quantities, languages and nationalities or religious or political groups. The annotation was elaborated with further entity subtypes not present in the Ontonotes 5 annotation (see below).
Names are annotated according to the following set of types:
| PER | = PERSON People, including fictional | 
| FAC | = FACILITY Buildings, airports, highways, bridges, etc. | 
| ORG | = ORGANIZATION Companies, agencies, institutions, etc. | 
| GPE | Geopolitical entites: countries, cities, states | 
| LOC | = LOCATION Non-GPE locations, mountain ranges, bodies of water | 
| PROD | = PRODUCT Vehicles, weapons, foods, etc. (Not services) | 
| EVENT | Named hurricanes, battles, wars, sports events, etc. | 
| WORK_OF_ART | Titles of books, songs, etc. | 
| LAW | Named documents made into laws | 
The following are also annotated in a style similar to names:
| NORP | Nationalities or religious or political groups | 
| LANGUAGE | Any named language | 
| DATE | Absolute or relative dates or periods | 
| TIME | Times smaller than a day | 
| PERCENT | Percentage (including "%") | 
| MONEY | Monetary values, including unit | 
| QUANTITY | Measurements, as of weight or distance | 
| ORDINAL | "first", "second" | 
| CARDINAL | Numerals that do not fall under another type | 
Further subtypes of names of type MISC :
| AWARD | Awards and prizes | 
| CAR | Cars and other motor vehicles | 
| MEDIA | Media outlets, TV channels, news portals | 
| SMEDIA | Social media platforms | 
| PROJ | Projects and initiatives | 
| MISC | Unresolved subtypes of MISC entities | 
| MISC-ORG | Organization-like unresolved subtypes of MISC entities | 
Further non-name entities:
| DUR | Time duration | 
| AGE | Age | 
| ID | Identifier | 
@inproceedings{novak-novak-2022-nerkor,
    title = "{N}er{K}or+{C}ars-{O}nto{N}otes++",
    author = "Nov{\'a}k, Attila  and
      Nov{\'a}k, Borb{\'a}la",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.203",
    pages = "1907--1916",
    abstract = "In this paper, we present an upgraded version of the Hungarian NYTK-NerKor named entity corpus, which contains about twice as many annotated spans and 7 times as many distinct entity types as the original version. We used an extended version of the OntoNotes 5 annotation scheme including time and numerical expressions. NerKor is the newest and biggest NER corpus for Hungarian containing diverse domains. We applied cross-lingual transfer of NER models trained for other languages based on multilingual contextual language models to preannotate the corpus. We corrected the annotation semi-automatically and manually. Zero-shot preannotation was very effective with about 0.82 F1 score for the best model. We also added a 12000-token subcorpus on cars and other motor vehicles. We trained and release a transformer-based NER tagger for Hungarian using the annotation in the new corpus version, which provides similar performance to an identical model trained on the original version of the corpus.",
}