Dataset:
tner/conll2003
The CoNLL-2003 NER dataset, formatted as part of the TNER project.
An example from the train split looks as follows.
{ 'tokens': ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.'], 'tags': [0, 0, 5, 0, 0, 0, 0, 3, 0, 0, 0, 0] }
The label2id dictionary, which maps each BIO tag to its integer id, is as follows.
{ "O": 0, "B-ORG": 1, "B-MISC": 2, "B-PER": 3, "I-PER": 4, "B-LOC": 5, "I-ORG": 6, "I-MISC": 7, "I-LOC": 8 }
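As a minimal sketch of how the integer ids relate to the BIO labels, the snippet below inverts the label2id dictionary and groups tagged tokens into entity spans. The helper name `extract_entities` is hypothetical and not part of TNER or the dataset; the sample sentence pairs the word strings with the integer tag ids from the train example.

```python
# label2id exactly as listed in this dataset card.
label2id = {
    "O": 0, "B-ORG": 1, "B-MISC": 2, "B-PER": 3, "I-PER": 4,
    "B-LOC": 5, "I-ORG": 6, "I-MISC": 7, "I-LOC": 8,
}
# Invert the mapping so integer ids can be decoded back to BIO labels.
id2label = {idx: label for label, idx in label2id.items()}


def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans.

    Hypothetical helper for illustration; not part of TNER.
    """
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tags):
        prefix, _, entity_type = id2label[tag_id].partition("-")
        # An I- tag of the same type continues the open span.
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)
            continue
        # Anything else closes the open span, if there is one.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # Assumption: a stray I- tag with no open span starts a new one.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',',
          'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
tags = [0, 0, 5, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print(extract_entities(tokens, tags))
```

The result reflects the tag ids as they appear in the example: id 5 decodes to B-LOC and id 3 to B-PER.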
name | train | validation | test |
---|---|---|---|
conll2003 | 14041 | 3250 | 3453 |
From the CoNLL2003 shared task page:
The English data is a collection of news wire articles from the Reuters Corpus. The annotation has been done by people of the University of Antwerp. Because of copyright reasons we only make available the annotations. In order to build the complete data sets you will need access to the Reuters Corpus. It can be obtained for research purposes without any charge from NIST.
The copyrights are defined below, from the Reuters Corpus page:
The stories in the Reuters Corpus are under the copyright of Reuters Ltd and/or Thomson Reuters, and their use is governed by the following agreements:
- This agreement must be signed by the person responsible for the data at your organization, and sent to NIST.
- This agreement must be signed by all researchers using the Reuters Corpus at your organization, and kept on file at your organization.
@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
    title = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
    author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
    booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
    year = "2003",
    url = "https://www.aclweb.org/anthology/W03-0419",
    pages = "142--147",
}