Dataset: ronec
Task: token classification
Language: ro
Multilinguality: monolingual
Size: 1K<n<10K
Annotations creators: expert-generated
Source datasets: original
Preprint: arxiv:1909.01247
License: mit

RONEC, at version 2.0, holds 12,330 sentences with over 0.5M tokens, annotated with 15 classes, for a total of 80,283 distinctly annotated entities.
The corpus has the following classes and distribution in the train/valid/test splits:

| Class | Total # | Train # | Train % | Valid # | Valid % | Test # | Test % |
|------------- |:------: |:------: |:-------: |:------: |:-------: |:------: |:-------: |
| PERSON | 26130 | 19167 | 73.35 | 2733 | 10.46 | 4230 | 16.19 |
| GPE | 11103 | 8193 | 73.79 | 1182 | 10.65 | 1728 | 15.56 |
| LOC | 2467 | 1824 | 73.94 | 270 | 10.94 | 373 | 15.12 |
| ORG | 7880 | 5688 | 72.18 | 880 | 11.17 | 1312 | 16.65 |
| LANGUAGE | 467 | 342 | 73.23 | 52 | 11.13 | 73 | 15.63 |
| NAT_REL_POL | 4970 | 3673 | 73.90 | 516 | 10.38 | 781 | 15.71 |
| DATETIME | 9614 | 6960 | 72.39 | 1029 | 10.70 | 1625 | 16.90 |
| PERIOD | 1188 | 862 | 72.56 | 129 | 10.86 | 197 | 16.58 |
| QUANTITY | 1588 | 1161 | 73.11 | 181 | 11.40 | 246 | 15.49 |
| MONEY | 1424 | 1041 | 73.10 | 159 | 11.17 | 224 | 15.73 |
| NUMERIC | 7735 | 5734 | 74.13 | 814 | 10.52 | 1187 | 15.35 |
| ORDINAL | 1893 | 1377 | 72.74 | 212 | 11.20 | 304 | 16.06 |
| FACILITY | 1126 | 840 | 74.60 | 113 | 10.04 | 173 | 15.36 |
| WORK_OF_ART | 1596 | 1157 | 72.49 | 176 | 11.03 | 263 | 16.48 |
| EVENT | 1102 | 826 | 74.95 | 107 | 9.71 | 169 | 15.34 |

The corpus is meant to train Named Entity Recognition models for the Romanian language.
See the leaderboard at https://lirobenchmark.github.io/
RONEC is in Romanian (ro).
The dataset is a list of instances. For example, an instance looks like:
{ "id": 10454, "tokens": ["Pentru", "a", "vizita", "locația", "care", "va", "fi", "pusă", "la", "dispoziția", "reprezentanților", "consiliilor", "județene", ",", "o", "delegație", "a", "U.N.C.J.R.", ",", "din", "care", "a", "făcut", "parte", "și", "dl", "Constantin", "Ostaficiuc", ",", "președintele", "C.J.T.", ",", "a", "fost", "prezentă", "la", "Bruxelles", ",", "între", "1-3", "martie", "."], "ner_tags": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-PERSON", "O", "O", "O", "O", "O", "O", "B-ORG", "O", "O", "O", "O", "O", "O", "O", "B-PERSON", "I-PERSON", "I-PERSON", "I-PERSON", "I-PERSON", "B-ORG", "O", "O", "O", "O", "O", "B-GPE", "O", "B-PERIOD", "I-PERIOD", "I-PERIOD", "O"], "ner_ids": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 3, 0, 0, 0, 0, 0, 5, 0, 19, 20, 20, 0], "space_after": [true, true, true, true, true, true, true, true, true, true, true, true, false, true, true, true, true, false, true, true, true, true, true, true, true, true, true, false, true, true, false, true, true, true, true, true, false, true, true, true, false, false] }
Each example has the following fields, as seen in the instance above: id, tokens, ner_tags, ner_ids, and space_after.
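As a minimal sketch of how the `tokens` and `space_after` fields of an instance fit together, the snippet below rebuilds the surface text of a sentence. It assumes that `space_after[i]` says whether a space follows `tokens[i]` in the original text, which is what the field name and the sample instance suggest. The helper name `detokenize` is made up for illustration, and the sample data is the last seven tokens of the instance shown above.

```python
def detokenize(tokens, space_after):
    """Join tokens into text, adding a space only where space_after is True.

    Assumption: space_after[i] marks whether tokens[i] is followed by a space.
    """
    if len(tokens) != len(space_after):
        raise ValueError("tokens and space_after must have the same length")
    parts = []
    for token, has_space in zip(tokens, space_after):
        parts.append(token)
        if has_space:
            parts.append(" ")
    # Drop a trailing space in case the last flag is True.
    return "".join(parts).rstrip()


# Last seven tokens of the sample instance, with their space_after flags.
tokens = ["la", "Bruxelles", ",", "între", "1-3", "martie", "."]
space_after = [True, False, True, True, True, False, False]

text = detokenize(tokens, space_after)
print(text)  # la Bruxelles, între 1-3 martie.
```

Keeping the spacing flags next to the tokens means the original sentence can be recovered without language-specific detokenization rules.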
The dataset is split into train (9,000 sentences), dev (1,330 sentences), and test (2,000 sentences).
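The split sizes can be checked against the sentence total quoted for version 2.0. The short sketch below only does that arithmetic; the variable names are illustrative and nothing here is read from the dataset itself.

```python
# Sentence counts per split, as stated in this card.
split_sizes = {"train": 9000, "dev": 1330, "test": 2000}

total = sum(split_sizes.values())

# Share of each split, as a percentage of all sentences.
shares = {name: round(100 * n / total, 2) for name, n in split_sizes.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The three splits add up to the 12,330 sentences reported for the corpus, at roughly 73% / 11% / 16%, which is close to the per-class proportions in the distribution table.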
The corpus consists of copyright-free sentences drawn from older datasets, such as the freely available SEETimes, and from more recent data sources, such as the Romanian Wikipedia and Common Crawl.
Initial Data Collection and Normalization: [Needs More Information]
Who are the source language producers? [Needs More Information]
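The corpus was annotated with the following 15 classes: PERSON, GPE, LOC, ORG, LANGUAGE, NAT_REL_POL, DATETIME, PERIOD, QUANTITY, MONEY, NUMERIC, ORDINAL, FACILITY, WORK_OF_ART, and EVENT.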
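In the sample instance, these classes appear in the `ner_tags` field with `B-` (beginning) and `I-` (inside) prefixes, and `O` for tokens outside any entity. The following is a minimal sketch of turning such a tag sequence into entity spans. The function name `bio_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a new entity is a leniency chosen for this sketch, not something the corpus documentation specifies.

```python
def bio_to_spans(tags):
    """Convert B-/I-/O tags into (label, start, end) triples; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # "B-WORK_OF_ART" -> ("B", "-", "WORK_OF_ART"); "O" -> ("O", "", "")
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            # The open entity ends just before this token.
            spans.append((label, start, i))
            label, start = None, None
        if prefix == "B" or (prefix == "I" and label is None):
            # Assumption of this sketch: a stray I- tag opens a new entity.
            label, start = name, i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Last seven tokens of the sample instance and their tags.
tokens = ["la", "Bruxelles", ",", "între", "1-3", "martie", "."]
ner_tags = ["O", "B-GPE", "O", "B-PERIOD", "I-PERIOD", "I-PERIOD", "O"]

spans = bio_to_spans(ner_tags)
for label, start, end in spans:
    print(label, tokens[start:end])
```

For this fragment the sketch yields one GPE span covering "Bruxelles" and one PERIOD span covering "între 1-3 martie".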
The corpus was annotated by three language experts and cross-checked for annotation consistency. The annotation took several months to complete, and the result is a high-quality dataset.
Who are the annotators? Stefan Dumitrescu (lead).
All the source data is already freely downloadable and usable online, so there are no privacy concerns.
MIT License
@article{dumitrescu2019introducing,
  title={Introducing RONEC--the Romanian Named Entity Corpus},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius},
  journal={arXiv preprint arXiv:1909.01247},
  year={2019}
}
Thanks to @iliemihai for adding v1.0 of the dataset.