数据集:

conll2012_ontonotesv5

中文

Dataset Card for CoNLL2012 shared task data based on OntoNotes 5.0

Dataset Summary

OntoNotes v5.0 is the final version of OntoNotes corpus, and is a large-scale, multi-genre, multilingual corpus manually annotated with syntactic, semantic and discourse information.

This dataset is the version of OntoNotes v5.0 extended and is used in the CoNLL-2012 shared task. It includes v4 train/dev and v9 test data for English/Chinese/Arabic and corrected version v12 train/dev/test data (English only).

The source of data is the Mendeley Data repo ontonotes-conll2012 , which seems to be as the same as the official data, but users should use this dataset on their own responsibility.

See also summaries from paperwithcode, OntoNotes 5.0 and CoNLL-2012

For more detailed info of the dataset like annotation, tag set, etc., you can refer to the documents in the Mendeley repo mentioned above.

Supported Tasks and Leaderboards

Languages

V4 data for Arabic, Chinese, English, and V12 data for English

Dataset Structure

Data Instances

{
  {'document_id': 'nw/wsj/23/wsj_2311',
 'sentences': [{'part_id': 0,
                'words': ['CONCORDE', 'trans-Atlantic', 'flights', 'are', '$', '2, 'to', 'Paris', 'and', '$', '3, 'to', 'London', '.']},
                'pos_tags': [25, 18, 27, 43, 2, 12, 17, 25, 11, 2, 12, 17, 25, 7],
                'parse_tree': '(TOP(S(NP (NNP CONCORDE)  (JJ trans-Atlantic)  (NNS flights) )(VP (VBP are) (NP(NP(NP ($ $)  (CD 2,400) )(PP (IN to) (NP (NNP Paris) ))) (CC and) (NP(NP ($ $)  (CD 3,200) )(PP (IN to) (NP (NNP London) ))))) (. .) ))',
                'predicate_lemmas': [None, None, None, 'be', None, None, None, None, None, None, None, None, None, None],
                'predicate_framenet_ids': [None, None, None, '01', None, None, None, None, None, None, None, None, None, None],
                'word_senses': [None, None, None, None, None, None, None, None, None, None, None, None, None, None],
                'speaker': None,
                'named_entities': [7, 6, 0, 0, 0, 15, 0, 5, 0, 0, 15, 0, 5, 0],
                'srl_frames': [{'frames': ['B-ARG1', 'I-ARG1', 'I-ARG1', 'B-V', 'B-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'O'],
                                'verb': 'are'}],
                'coref_spans': [],
               {'part_id': 0,
                'words': ['In', 'a', 'Centennial', 'Journal', 'article', 'Oct.', '5', ',', 'the', 'fares', 'were', 'reversed', '.']}]}
                'pos_tags': [17, 13, 25, 25, 24, 25, 12, 4, 13, 27, 40, 42, 7],
                'parse_tree': '(TOP(S(PP (IN In) (NP (DT a) (NML (NNP Centennial)  (NNP Journal) ) (NN article) ))(NP (NNP Oct.)  (CD 5) ) (, ,) (NP (DT the)  (NNS fares) )(VP (VBD were) (VP (VBN reversed) )) (. .) ))',
                'predicate_lemmas': [None, None, None, None, None, None, None, None, None, None, None, 'reverse', None],
                'predicate_framenet_ids': [None, None, None, None, None, None, None, None, None, None, None, '01', None],
                'word_senses': [None, None, None, None, None, None, None, None, None, None, None, None, None],
                'speaker': None,
                'named_entities': [0, 0, 4, 22, 0, 12, 30, 0, 0, 0, 0, 0, 0],
                'srl_frames': [{'frames': ['B-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'B-ARGM-TMP', 'I-ARGM-TMP', 'O', 'B-ARG1', 'I-ARG1', 'O', 'B-V', 'O'],
                                'verb': 'reversed'}],
                'coref_spans': [],
}

Data Fields

  • document_id ( str ): This is a variation on the document filename
  • sentences ( List[Dict] ): All sentences of the same document are in a single example for the convenience of concatenating sentences.

Every element in sentences is a Dict composed of the following data fields:

  • part_id ( int ) : Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.
  • words ( List[str] ) :
  • pos_tags ( List[ClassLabel] or List[str] ) : This is the Penn-Treebank-style part of speech. When parse information is missing, all parts of speech except the one for which there is some sense or proposition annotation are marked with a XX tag. The verb is marked with just a VERB tag.
    • tag set : Note tag sets below are founded by scanning all the data, and I found it seems to be a little bit different from officially stated tag sets. See official documents in the Mendeley repo
      • arabic : str. Because pos tag in Arabic is compounded and complex, hard to represent it by ClassLabel
      • chinese v4 : datasets.ClassLabel(num_classes=36, names=["X", "AD", "AS", "BA", "CC", "CD", "CS", "DEC", "DEG", "DER", "DEV", "DT", "ETC", "FW", "IJ", "INF", "JJ", "LB", "LC", "M", "MSP", "NN", "NR", "NT", "OD", "ON", "P", "PN", "PU", "SB", "SP", "URL", "VA", "VC", "VE", "VV",]) , where X is for pos tag missing
      • english v4 : datasets.ClassLabel(num_classes=49, names=["XX", "``", "$", "''", ",", "-LRB-", "-RRB-", ".", ":", "ADD", "AFX", "CC", "CD", "DT", "EX", "FW", "HYPH", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NFP", "NN", "NNP", "NNPS", "NNS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB",]) , where XX is for pos tag missing, and -LRB- / -RRB- is " ( " / " ) ".
      • english v12 : datasets.ClassLabel(num_classes=51, names="english_v12": ["XX", "``", "$", "''", "*", ",", "-LRB-", "-RRB-", ".", ":", "ADD", "AFX", "CC", "CD", "DT", "EX", "FW", "HYPH", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NFP", "NN", "NNP", "NNPS", "NNS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "VERB", "WDT", "WP", "WP$", "WRB",]) , where XX is for pos tag missing, and -LRB- / -RRB- is " ( " / " ) ".
  • parse_tree ( Optional[str] ) : An serialized NLTK Tree representing the parse. It includes POS tags as pre-terminal nodes. When the parse information is missing, the parse will be None .
  • predicate_lemmas ( List[Optional[str]] ) : The predicate lemma of the words for which we have semantic role information or word sense information. All other indices are None .
  • predicate_framenet_ids ( List[Optional[int]] ) : The PropBank frameset ID of the lemmas in predicate_lemmas, or None .
  • word_senses ( List[Optional[float]] ) : The word senses for the words in the sentence, or None. These are floats because the word sense can have values after the decimal, like 1.1.
  • speaker ( Optional[str] ) : This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data. When it is not available, it will be None .
  • named_entities ( List[ClassLabel] ) : The BIO tags for named entities in the sentence.
    • tag set : datasets.ClassLabel(num_classes=37, names=["O", "B-PERSON", "I-PERSON", "B-NORP", "I-NORP", "B-FAC", "I-FAC", "B-ORG", "I-ORG", "B-GPE", "I-GPE", "B-LOC", "I-LOC", "B-PRODUCT", "I-PRODUCT", "B-DATE", "I-DATE", "B-TIME", "I-TIME", "B-PERCENT", "I-PERCENT", "B-MONEY", "I-MONEY", "B-QUANTITY", "I-QUANTITY", "B-ORDINAL", "I-ORDINAL", "B-CARDINAL", "I-CARDINAL", "B-EVENT", "I-EVENT", "B-WORK_OF_ART", "I-WORK_OF_ART", "B-LAW", "I-LAW", "B-LANGUAGE", "I-LANGUAGE",])
  • srl_frames ( List[{"word":str, "frames":List[str]}] ) : A dictionary keyed by the verb in the sentence for the given Propbank frame labels, in a BIO format.
  • coref spans ( List[List[int]] ) : The spans for entity mentions involved in coreference resolution within the sentence. Each element is a tuple composed of (cluster_id, start_index, end_index). Indices are inclusive.

Data Splits

Each dataset (arabic_v4, chinese_v4, english_v4, english_v12) has 3 splits: train , validation , and test

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{pradhan-etal-2013-towards,
    title = "Towards Robust Linguistic Analysis using {O}nto{N}otes",
    author = {Pradhan, Sameer  and
      Moschitti, Alessandro  and
      Xue, Nianwen  and
      Ng, Hwee Tou  and
      Bj{\"o}rkelund, Anders  and
      Uryupina, Olga  and
      Zhang, Yuchen  and
      Zhong, Zhi},
    booktitle = "Proceedings of the Seventeenth Conference on Computational Natural Language Learning",
    month = aug,
    year = "2013",
    address = "Sofia, Bulgaria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W13-3516",
    pages = "143--152",
}

Contributions

Thanks to @richarddwang for adding this dataset.