数据集:
DFKI-SLT/cross_ner
任务:
标记分类语言:
en计算机处理:
monolingual大小:
10K<n<100K语言创建人:
found批注创建人:
expert-generated源数据集:
extended|conll2003预印本库:
arxiv:2012.04373CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains. Additionally, CrossNER also includes unlabeled domain-related corpora for the corresponding five domains.
For details, see the paper: CrossNER: Evaluating Cross-Domain Named Entity Recognition
The language data in CrossNER is in English (BCP-47 en)
An example of 'train' looks as follows:
{ "id": "0", "tokens": ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."], "ner_tags": [49, 0, 41, 0, 0, 0, 41, 0, 0] }politics
An example of 'train' looks as follows:
{ "id": "0", "tokens": ["Parties", "with", "mainly", "Eurosceptic", "views", "are", "the", "ruling", "United", "Russia", ",", "and", "opposition", "parties", "the", "Communist", "Party", "of", "the", "Russian", "Federation", "and", "Liberal", "Democratic", "Party", "of", "Russia", "."], "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 55, 56, 0, 0, 0, 0, 0, 55, 56, 56, 56, 56, 56, 0, 55, 56, 56, 56, 56, 0] }science
An example of 'train' looks as follows:
{ "id": "0", "tokens": ["They", "may", "also", "use", "Adenosine", "triphosphate", ",", "Nitric", "oxide", ",", "and", "ROS", "for", "signaling", "in", "the", "same", "ways", "that", "animals", "do", "."], "ner_tags": [0, 0, 0, 0, 15, 16, 0, 15, 16, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }music
An example of 'train' looks as follows:
{ "id": "0", "tokens": ["In", "2003", ",", "the", "Stade", "de", "France", "was", "the", "primary", "site", "of", "the", "2003", "World", "Championships", "in", "Athletics", "."], "ner_tags": [0, 0, 0, 0, 35, 36, 36, 0, 0, 0, 0, 0, 0, 29, 30, 30, 30, 30, 0] }literature
An example of 'train' looks as follows:
{ "id": "0", "tokens": ["In", "1351", ",", "during", "the", "reign", "of", "Emperor", "Toghon", "Temür", "of", "the", "Yuan", "dynasty", ",", "93rd-generation", "descendant", "Kong", "Huan", "(", "孔浣", ")", "'", "s", "2nd", "son", "Kong", "Shao", "(", "孔昭", ")", "moved", "from", "China", "to", "Korea", "during", "the", "Goryeo", ",", "and", "was", "received", "courteously", "by", "Princess", "Noguk", "(", "the", "Mongolian-born", "wife", "of", "the", "future", "king", "Gongmin", ")", "."], "ner_tags": [0, 0, 0, 0, 0, 0, 0, 51, 52, 52, 0, 0, 21, 22, 0, 0, 0, 77, 78, 0, 77, 0, 0, 0, 0, 0, 77, 78, 0, 77, 0, 0, 0, 21, 0, 21, 0, 0, 41, 0, 0, 0, 0, 0, 0, 51, 52, 0, 0, 41, 0, 0, 0, 0, 0, 51, 0, 0] }ai
An example of 'train' looks as follows:
{ "id": "0", "tokens": ["Popular", "approaches", "of", "opinion-based", "recommender", "system", "utilize", "various", "techniques", "including", "text", "mining", ",", "information", "retrieval", ",", "sentiment", "analysis", "(", "see", "also", "Multimodal", "sentiment", "analysis", ")", "and", "deep", "learning", "X.Y.", "Feng", ",", "H.", "Zhang", ",", "Y.J.", "Ren", ",", "P.H.", "Shang", ",", "Y.", "Zhu", ",", "Y.C.", "Liang", ",", "R.C.", "Guan", ",", "D.", "Xu", ",", "(", "2019", ")", ",", ",", "21", "(", "5", ")", ":", "e12957", "."], "ner_tags": [0, 0, 0, 59, 60, 60, 0, 0, 0, 0, 31, 32, 0, 71, 72, 0, 71, 72, 0, 0, 0, 71, 72, 72, 0, 0, 31, 32, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }
The data fields are the same among all splits.
{"O": 0, "B-academicjournal": 1, "I-academicjournal": 2, "B-album": 3, "I-album": 4, "B-algorithm": 5, "I-algorithm": 6, "B-astronomicalobject": 7, "I-astronomicalobject": 8, "B-award": 9, "I-award": 10, "B-band": 11, "I-band": 12, "B-book": 13, "I-book": 14, "B-chemicalcompound": 15, "I-chemicalcompound": 16, "B-chemicalelement": 17, "I-chemicalelement": 18, "B-conference": 19, "I-conference": 20, "B-country": 21, "I-country": 22, "B-discipline": 23, "I-discipline": 24, "B-election": 25, "I-election": 26, "B-enzyme": 27, "I-enzyme": 28, "B-event": 29, "I-event": 30, "B-field": 31, "I-field": 32, "B-literarygenre": 33, "I-literarygenre": 34, "B-location": 35, "I-location": 36, "B-magazine": 37, "I-magazine": 38, "B-metrics": 39, "I-metrics": 40, "B-misc": 41, "I-misc": 42, "B-musicalartist": 43, "I-musicalartist": 44, "B-musicalinstrument": 45, "I-musicalinstrument": 46, "B-musicgenre": 47, "I-musicgenre": 48, "B-organisation": 49, "I-organisation": 50, "B-person": 51, "I-person": 52, "B-poem": 53, "I-poem": 54, "B-politicalparty": 55, "I-politicalparty": 56, "B-politician": 57, "I-politician": 58, "B-product": 59, "I-product": 60, "B-programlang": 61, "I-programlang": 62, "B-protein": 63, "I-protein": 64, "B-researcher": 65, "I-researcher": 66, "B-scientist": 67, "I-scientist": 68, "B-song": 69, "I-song": 70, "B-task": 71, "I-task": 72, "B-theory": 73, "I-theory": 74, "B-university": 75, "I-university": 76, "B-writer": 77, "I-writer": 78}
Train | Dev | Test | |
---|---|---|---|
conll2003 | 14,987 | 3,466 | 3,684 |
politics | 200 | 541 | 651 |
science | 200 | 450 | 543 |
music | 100 | 380 | 456 |
literature | 100 | 400 | 416 |
ai | 100 | 350 | 431 |
@article{liu2020crossner, title={CrossNER: Evaluating Cross-Domain Named Entity Recognition}, author={Zihan Liu and Yan Xu and Tiezheng Yu and Wenliang Dai and Ziwei Ji and Samuel Cahyawijaya and Andrea Madotto and Pascale Fung}, year={2020}, eprint={2012.04373}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Thanks to @phucdev for adding this dataset.