数据集:

tomaarsen/conllpp

语言:

en

计算机处理:

monolingual

大小:

10K<n<100K

语言创建人:

found

批注创建人:

expert-generated

源数据集:

extended|conll2003
中文

Dataset Card for "conllpp"

Dataset Summary

CoNLLpp is a corrected version of the CoNLL2003 NER dataset where labels of 5.38% of the sentences in the test set have been manually corrected. The training set and development set from CoNLL2003 is included for completeness. One correction on the test set for example, is:

{
    "tokens": ["SOCCER", "-", "JAPAN", "GET", "LUCKY", "WIN", ",", "CHINA", "IN", "SURPRISE", "DEFEAT", "."],
    "original_ner_tags_in_conll2003": ["O", "O", "B-LOC", "O", "O", "O", "O", "B-PER", "O", "O", "O", "O"],
    "corrected_ner_tags_in_conllpp": ["O", "O", "B-LOC", "O", "O", "O", "O", "B-LOC", "O", "O", "O", "O"],
}

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure

Data Instances

conllpp
  • Size of downloaded dataset files: 4.85 MB
  • Size of the generated dataset: 10.26 MB
  • Total amount of disk used: 15.11 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "id": "0",
    "document_id": 1,
    "sentence_id": 3,
    "tokens": ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers", "to", "shun", "British", "lamb", "until", "scientists", "determine", "whether", "mad", "cow", "disease", "can", "be", "transmitted", "to", "sheep", "."]
    "pos_tags": [12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7],
    "ner_tags": [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "chunk_tags": [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0],
}

Data Fields

The data fields are the same among all splits.

conllpp
  • id : a string feature.
  • document_id : an int32 feature tracking which document the sample is from.
  • sentence_id : an int32 feature tracking which sentence in this document the sample is from.
  • tokens : a list of string features.
  • pos_tags : a list of classification labels, with possible values including " (0), '' (1), # (2), $ (3), ( (4).
  • chunk_tags : a list of classification labels, with possible values including O (0), B-ADJP (1), I-ADJP (2), B-ADVP (3), I-ADVP (4).
  • ner_tags : a list of classification labels, with possible values including O (0), B-PER (1), I-PER (2), B-ORG (3), I-ORG (4).

Data Splits

name train validation test
conll2003 14041 3250 3453

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{wang2019crossweigh,
  title={CrossWeigh: Training Named Entity Tagger from Imperfect Annotations},
  author={Wang, Zihan and Shang, Jingbo and Liu, Liyuan and Lu, Lihao and Liu, Jiacheng and Han, Jiawei},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={5157--5166},
  year={2019}
}

Contributions

Thanks to @ZihanWangKi for adding this dataset.