数据集:
polyglot_ner
任务:
标记分类计算机处理:
multilingual语言创建人:
found批注创建人:
machine-generated源数据集:
original预印本库:
arxiv:1410.3791许可:
license:unknownPolyglot-NER A training dataset automatically generated from Wikipedia and Freebase the task of named entity recognition. The dataset contains the basic Wikipedia based training data for 40 languages we have (with coreference resolution) for the task of named entity recognition. The details of the procedure of generating them is outlined in Section 3 of the paper ( https://arxiv.org/abs/1410.3791 ). Each config contains the data corresponding to a different language. For example, "es" includes only spanish examples.
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": "2", "lang": "ar", "ner": ["O", "O", "O", "O", "O", "O", "O", "O", "LOC", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "PER", "PER", "PER", "PER", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "words": "[\"وفي\", \"مرحلة\", \"موالية\", \"أنشأت\", \"قبيلة\", \"مكناسة\", \"الزناتية\", \"مكناسة\", \"تازة\", \",\", \"وأقام\", \"بها\", \"المرابطون\", \"قلعة\", \"..." }bg
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": "1", "lang": "bg", "ner": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "words": "[\"Дефиниция\", \"Наименованията\", \"\\\"\", \"книжовен\", \"\\\"/\\\"\", \"литературен\", \"\\\"\", \"език\", \"на\", \"български\", \"за\", \"тази\", \"кодифи..." }ca
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": "2", "lang": "ca", "ner": "[\"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O\", \"O...", "words": "[\"Com\", \"a\", \"compositor\", \"deixà\", \"un\", \"immens\", \"llegat\", \"que\", \"inclou\", \"8\", \"simfonies\", \"(\", \"1822\", \"),\", \"diverses\", ..." }combined
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": "18", "lang": "es", "ner": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "words": "[\"Los\", \"cambios\", \"en\", \"la\", \"energía\", \"libre\", \"de\", \"Gibbs\", \"\\\\\", \"Delta\", \"G\", \"nos\", \"dan\", \"una\", \"cuantificación\", \"de..." }cs
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": "3", "lang": "cs", "ner": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "words": "[\"Historie\", \"Symfonická\", \"forma\", \"se\", \"rozvinula\", \"se\", \"především\", \"v\", \"období\", \"klasicismu\", \"a\", \"romantismu\", \",\", \"..." }
The data fields are the same among all splits.
arname | train |
---|---|
ar | 339109 |
bg | 559694 |
ca | 372665 |
combined | 21070925 |
cs | 564462 |
@article{polyglotner, author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven}, title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition}, journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30- May 2, 2015}}, month = {April}, year = {2015}, publisher = {SIAM}, }
Thanks to @joeddav for adding this dataset.