数据集:
persian_ner
任务:
标记分类语言:
fa计算机处理:
monolingual大小:
1K<n<10K语言创建人:
expert-generated批注创建人:
expert-generated源数据集:
original许可:
cc-by-4.0The dataset includes 7,682 Persian sentences, split into 250,015 tokens and their NER labels. It is available in 3 folds to be used in turn as training and test sets. The NER tags are in IOB format.
[More Information Needed]
[More Information Needed]
The NER tags correspond to this list:
"O", "I-event", "I-fac", "I-loc", "I-org", "I-pers", "I-pro", "B-event", "B-fac", "B-loc", "B-org", "B-pers", "B-pro"
Training and test splits
[More Information Needed]
[More Information Needed]
Who are the source language producers?Hanieh Poostchi, Ehsan Zare Borzeshi, Mohammad Abdous, Massimo Piccardi
[More Information Needed]
Who are the annotators?Hanieh Poostchi, Ehsan Zare Borzeshi, Mohammad Abdous, Massimo Piccardi
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Dataset is published for academic use only
[More Information Needed]
Creative Commons Attribution 4.0 International License.
@inproceedings{poostchi-etal-2016-personer, title = "{P}erso{NER}: {P}ersian Named-Entity Recognition", author = "Poostchi, Hanieh and Zare Borzeshi, Ehsan and Abdous, Mohammad and Piccardi, Massimo", booktitle = "Proceedings of {COLING} 2016, the 26th International Conference on Computational Linguistics: Technical Papers", month = dec, year = "2016", address = "Osaka, Japan", publisher = "The COLING 2016 Organizing Committee", url = " https://www.aclweb.org/anthology/C16-1319" , pages = "3381--3389", abstract = "Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To abridge this gap, in this paper we target the Persian language that is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPerosNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNNL scores while outperforming two alternatives based on a CRF and a recurrent neural network.", }
Thanks to @KMFODA for adding this dataset.