Dataset: conll2012_ontonotesv5
Tasks:
Multilinguality: multilingual
Size: 10K<n<100K
Language creators: found
Annotation creators: expert-generated
Source datasets: original
License:
OntoNotes v5.0 is the final version of the OntoNotes corpus, a large-scale, multi-genre, multilingual corpus manually annotated with syntactic, semantic, and discourse information.
This dataset is the extended version of OntoNotes v5.0 used in the CoNLL-2012 shared task. It includes v4 train/dev data and v9 test data for English, Chinese, and Arabic, plus the corrected v12 train/dev/test data (English only).
The data comes from the Mendeley Data repo ontonotes-conll2012, which appears to be the same as the official data, but users should use this dataset at their own risk.
See also the summaries on Papers with Code: OntoNotes 5.0 and CoNLL-2012.
For more details on the dataset, such as the annotation scheme and tag sets, refer to the documents in the Mendeley repo mentioned above.
V4 data for Arabic, Chinese, and English, and V12 data for English.
{'document_id': 'nw/wsj/23/wsj_2311',
 'sentences': [{'part_id': 0,
   'words': ['CONCORDE', 'trans-Atlantic', 'flights', 'are', '$', '2,400', 'to', 'Paris', 'and', '$', '3,200', 'to', 'London', '.'],
   'pos_tags': [25, 18, 27, 43, 2, 12, 17, 25, 11, 2, 12, 17, 25, 7],
   'parse_tree': '(TOP(S(NP (NNP CONCORDE) (JJ trans-Atlantic) (NNS flights) )(VP (VBP are) (NP(NP(NP ($ $) (CD 2,400) )(PP (IN to) (NP (NNP Paris) ))) (CC and) (NP(NP ($ $) (CD 3,200) )(PP (IN to) (NP (NNP London) ))))) (. .) ))',
   'predicate_lemmas': [None, None, None, 'be', None, None, None, None, None, None, None, None, None, None],
   'predicate_framenet_ids': [None, None, None, '01', None, None, None, None, None, None, None, None, None, None],
   'word_senses': [None, None, None, None, None, None, None, None, None, None, None, None, None, None],
   'speaker': None,
   'named_entities': [7, 6, 0, 0, 0, 15, 0, 5, 0, 0, 15, 0, 5, 0],
   'srl_frames': [{'frames': ['B-ARG1', 'I-ARG1', 'I-ARG1', 'B-V', 'B-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'O'],
     'verb': 'are'}],
   'coref_spans': []},
  {'part_id': 0,
   'words': ['In', 'a', 'Centennial', 'Journal', 'article', 'Oct.', '5', ',', 'the', 'fares', 'were', 'reversed', '.'],
   'pos_tags': [17, 13, 25, 25, 24, 25, 12, 4, 13, 27, 40, 42, 7],
   'parse_tree': '(TOP(S(PP (IN In) (NP (DT a) (NML (NNP Centennial) (NNP Journal) ) (NN article) ))(NP (NNP Oct.) (CD 5) ) (, ,) (NP (DT the) (NNS fares) )(VP (VBD were) (VP (VBN reversed) )) (. .) ))',
   'predicate_lemmas': [None, None, None, None, None, None, None, None, None, None, None, 'reverse', None],
   'predicate_framenet_ids': [None, None, None, None, None, None, None, None, None, None, None, '01', None],
   'word_senses': [None, None, None, None, None, None, None, None, None, None, None, None, None],
   'speaker': None,
   'named_entities': [0, 0, 4, 22, 0, 12, 30, 0, 0, 0, 0, 0, 0],
   'srl_frames': [{'frames': ['B-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'B-ARGM-TMP', 'I-ARGM-TMP', 'O', 'B-ARG1', 'I-ARG1', 'O', 'B-V', 'O'],
     'verb': 'reversed'}],
   'coref_spans': []}]}
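The `srl_frames` entries in the example instance use BIO-style tags, one per token. As a minimal sketch (not part of the dataset tooling), the tags can be collapsed into labeled token spans as follows. The helper name `bio_to_spans` is hypothetical, and the sample lists are copied from the second sentence of the example above.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIO tags into (label, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    label, start = None, 0
    # A trailing "O" sentinel flushes a span that runs to the last token.
    for i, tag in enumerate(tags + ["O"]):
        if label is not None and tag == "I-" + label:
            continue  # still inside the current span
        if label is not None:
            spans.append((label, start, i))
            label = None
        if tag != "O":
            # "B-X" opens a span; a stray "I-X" is treated the same way.
            label, start = tag[2:], i
    return spans


# Sample copied from the second sentence of the example instance.
words = ['In', 'a', 'Centennial', 'Journal', 'article', 'Oct.', '5', ',',
         'the', 'fares', 'were', 'reversed', '.']
frames = ['B-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC',
          'B-ARGM-TMP', 'I-ARGM-TMP', 'O', 'B-ARG1', 'I-ARG1', 'O', 'B-V', 'O']

spans = bio_to_spans(frames)
arguments = [(label, " ".join(words[s:e])) for label, s, e in spans]
# arguments[0] is ('ARGM-LOC', 'In a Centennial Journal article')
```

Returning half-open `(start, end)` indices keeps the spans directly usable as Python slices over `words`.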
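Every element in sentences is a dict composed of the following data fields, as seen in the example instance: part_id, words, pos_tags, parse_tree, predicate_lemmas, predicate_framenet_ids, word_senses, speaker, named_entities, srl_frames, and coref_spans.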
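In the example instance, the per-token fields of a sentence all have the same length as `words`. The sketch below checks that alignment and lists the predicates of a sentence. It assumes the alignment holds for every sentence, which is only observed in the example here; the helper names `misaligned_fields` and `predicates` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Fields that carry one value per token in the example instance.
TOKEN_LEVEL_FIELDS = ("pos_tags", "predicate_lemmas", "predicate_framenet_ids",
                      "word_senses", "named_entities")


def misaligned_fields(sentence: Dict[str, Any]) -> List[str]:
    """Return the names of per-token fields whose length differs from `words`."""
    n = len(sentence["words"])
    bad = [name for name in TOKEN_LEVEL_FIELDS if len(sentence[name]) != n]
    for k, frame in enumerate(sentence["srl_frames"]):
        if len(frame["frames"]) != n:
            bad.append(f"srl_frames[{k}]")
    return bad


def predicates(sentence: Dict[str, Any]) -> List[Tuple[int, str, str]]:
    """Return (token index, surface word, lemma) for each annotated predicate."""
    return [(i, sentence["words"][i], lemma)
            for i, lemma in enumerate(sentence["predicate_lemmas"])
            if lemma is not None]


# Sample copied from the second sentence of the example instance
# (parse_tree, speaker, and coref_spans omitted for brevity).
sentence = {
    "part_id": 0,
    "words": ['In', 'a', 'Centennial', 'Journal', 'article', 'Oct.', '5', ',',
              'the', 'fares', 'were', 'reversed', '.'],
    "pos_tags": [17, 13, 25, 25, 24, 25, 12, 4, 13, 27, 40, 42, 7],
    "predicate_lemmas": [None] * 11 + ['reverse', None],
    "predicate_framenet_ids": [None] * 11 + ['01', None],
    "word_senses": [None] * 13,
    "named_entities": [0, 0, 4, 22, 0, 12, 30, 0, 0, 0, 0, 0, 0],
    "srl_frames": [{"frames": ['B-ARGM-LOC'] + ['I-ARGM-LOC'] * 4
                    + ['B-ARGM-TMP', 'I-ARGM-TMP', 'O', 'B-ARG1', 'I-ARG1',
                       'O', 'B-V', 'O'],
                    "verb": 'reversed'}],
}

problems = misaligned_fields(sentence)  # empty list when everything lines up
found = predicates(sentence)            # [(11, 'reversed', 'reverse')]
```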
Each dataset (arabic_v4, chinese_v4, english_v4, english_v12) has three splits: train, validation, and test.
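The configuration and split names above can be enumerated and passed to the Hugging Face `datasets` library. This is a hedged sketch: the wrapper name `load_split` is hypothetical, the repository id is taken from the header of this card, and loading needs network access and an installed `datasets` package. Depending on the installed version, script-based datasets may require extra arguments or may not be supported at all.

```python
from itertools import product
from typing import List, Tuple

CONFIGS = ("arabic_v4", "chinese_v4", "english_v4", "english_v12")
SPLITS = ("train", "validation", "test")


def all_config_splits() -> List[Tuple[str, str]]:
    """Every (config, split) pair described in this card."""
    return list(product(CONFIGS, SPLITS))


def load_split(config: str, split: str):
    """Load one split of one config (assumption: the dataset id from this card).

    The import is done lazily so that enumerating the pairs above does not
    require the `datasets` package or a network connection.
    """
    if config not in CONFIGS or split not in SPLITS:
        raise ValueError(f"unknown config/split: {config!r}, {split!r}")
    from datasets import load_dataset
    return load_dataset("conll2012_ontonotesv5", config, split=split)


pairs = all_config_splits()  # 4 configs x 3 splits = 12 pairs
```

For example, `load_split("english_v12", "validation")` would request the validation split of the corrected English data.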
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
@inproceedings{pradhan-etal-2013-towards,
title = "Towards Robust Linguistic Analysis using {O}nto{N}otes",
author = {Pradhan, Sameer and
Moschitti, Alessandro and
Xue, Nianwen and
Ng, Hwee Tou and
Bj{\"o}rkelund, Anders and
Uryupina, Olga and
Zhang, Yuchen and
Zhong, Zhi},
booktitle = "Proceedings of the Seventeenth Conference on Computational Natural Language Learning",
month = aug,
year = "2013",
address = "Sofia, Bulgaria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/W13-3516",
pages = "143--152",
}
Thanks to @richarddwang for adding this dataset.