Datasets:

id_nergrit_corpus

Languages:

id

Multilinguality:

monolingual

Size Categories:

10K<n<100K

Language Creators:

expert-generated

Annotations Creators:

expert-generated

Source Datasets:

original

License:

other

Dataset Card for Nergrit Corpus

Dataset Summary

Nergrit Corpus is a collection of Indonesian datasets for Named Entity Recognition, Statement Extraction, and Sentiment Analysis, developed by PT Gria Inovasi Teknologi (GRIT).

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Indonesian

Dataset Structure

In the raw data, sentences are separated by an empty line, and each line holds a tab-separated token and tag. A data point looks like this:

{'id': '0',
 'tokens': ['Gubernur', 'Bank', 'Indonesia', 'menggelar', 'konferensi', 'pers'],
 'ner_tags': [9, 28, 28, 38, 38, 38],
}
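
The raw layout described above can be read with a few lines of Python. This is a minimal sketch, not the loader used for the dataset: the function name `parse_tagged_sentences` is hypothetical, the column order (token first, tag second) is an assumption, and the second sentence in the sample text is made up for illustration. Tags are kept as strings here, whereas the data point shown above stores them as integer ids.

```python
from typing import Dict, List


def parse_tagged_sentences(text: str) -> List[Dict[str, object]]:
    """Parse blank-line-separated sentences of tab-separated token/tag lines.

    Hypothetical helper: assumes one "token<TAB>tag" pair per line.
    """
    samples: List[Dict[str, object]] = []
    tokens: List[str] = []
    tags: List[str] = []

    def flush() -> None:
        # Store the sentence collected so far, if any, and start a new one.
        nonlocal tokens, tags
        if tokens:
            samples.append(
                {"id": str(len(samples)), "tokens": tokens, "ner_tags": tags}
            )
            tokens, tags = [], []

    for line in text.splitlines():
        if not line.strip():
            flush()  # an empty line closes the current sentence
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)

    flush()  # the last sentence may not be followed by an empty line
    return samples


# Illustrative input: the first sentence mirrors the data point above,
# the second one is invented for this sketch.
raw = (
    "Gubernur\tB-NOR\nBank\tI-NOR\nIndonesia\tI-NOR\n"
    "menggelar\tO\nkonferensi\tO\npers\tO\n"
    "\n"
    "Jakarta\tB-GPE\n"
)

parsed = parse_tagged_sentences(raw)
for sample in parsed:
    print(sample["id"], sample["tokens"], sample["ner_tags"])
```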

Data Instances

[More Information Needed]

Data Fields

  • id : id of the sample
  • tokens : the tokens of the example text
  • ner_tags : the NER tags of each token

Named Entity Recognition

The ner_tags correspond to this list:

"B-CRD", "B-DAT", "B-EVT", "B-FAC", "B-GPE", "B-LAN", "B-LAW", "B-LOC", "B-MON", "B-NOR", 
"B-ORD", "B-ORG", "B-PER", "B-PRC", "B-PRD", "B-QTY", "B-REG", "B-TIM", "B-WOA",
"I-CRD", "I-DAT", "I-EVT", "I-FAC", "I-GPE", "I-LAN", "I-LAW", "I-LOC", "I-MON", "I-NOR",
"I-ORD", "I-ORG", "I-PER", "I-PRC", "I-PRD", "I-QTY", "I-REG", "I-TIM", "I-WOA", "O",

The ner_tags have the same format as in the CoNLL shared task: a B denotes the first item of a phrase and an I any non-initial word. The dataset contains the following 19 entity types:

    'CRD': Cardinal
    'DAT': Date
    'EVT': Event
    'FAC': Facility
    'GPE': Geopolitical Entity
    'LAW': Law Entity (such as Undang-Undang)
    'LOC': Location
    'MON': Money
    'NOR': Political Organization
    'ORD': Ordinal
    'ORG': Organization
    'PER': Person
    'PRC': Percent
    'PRD': Product
    'QTY': Quantity
    'REG': Religion
    'TIM': Time
    'WOA': Work of Art
    'LAN': Language
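
As an illustration of how the integer ner_tags map onto the label list above, the following sketch decodes the example data point. It assumes that each integer is a zero-based index into the list in the order shown (all B- tags, then all I- tags, then "O"); the names `NER_LABELS` and `decode_tags` are hypothetical and not part of the dataset.

```python
# Entity type codes in the order they appear in the NER label list.
ENTITY_TYPES = [
    "CRD", "DAT", "EVT", "FAC", "GPE", "LAN", "LAW", "LOC", "MON", "NOR",
    "ORD", "ORG", "PER", "PRC", "PRD", "QTY", "REG", "TIM", "WOA",
]

# Rebuild the label list: B- tags first, then I- tags, then "O".
NER_LABELS = (
    [f"B-{t}" for t in ENTITY_TYPES]
    + [f"I-{t}" for t in ENTITY_TYPES]
    + ["O"]
)


def decode_tags(tag_ids):
    """Map integer tag ids to label strings (assumes zero-based indexing)."""
    return [NER_LABELS[i] for i in tag_ids]


sample = {
    "id": "0",
    "tokens": ["Gubernur", "Bank", "Indonesia", "menggelar", "konferensi", "pers"],
    "ner_tags": [9, 28, 28, 38, 38, 38],
}

decoded = decode_tags(sample["ner_tags"])
for token, label in zip(sample["tokens"], decoded):
    print(f"{token}\t{label}")
```

Under that assumption, the first three tokens decode to B-NOR, I-NOR, I-NOR and the remaining tokens to O.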

Sentiment Analysis

The ner_tags correspond to this list:

"B-NEG", "B-NET", "B-POS",
"I-NEG", "I-NET", "I-POS",
"O",

Statement Extraction

The ner_tags correspond to this list:

"B-BREL", "B-FREL", "B-STAT", "B-WHO",
"I-BREL", "I-FREL", "I-STAT", "I-WHO", 
"O"

The ner_tags have the same format as in the CoNLL shared task: a B denotes the first item of a phrase and an I any non-initial word.
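
Because all three tag sets share this B/I/O scheme, one routine can group tagged tokens into labeled phrases. The sketch below is one possible way to do that, not code from the Nergrit Corpus repository: `extract_spans` is a hypothetical helper, and dropping an I- tag that does not continue an open span of the same type is a design choice made here for simplicity. The placeholder tokens in the second example are invented.

```python
from typing import List, Tuple


def extract_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (type, phrase) pairs."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new span; close any open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type continues the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open span (ignored here).
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# NER example, using the decoded labels of the sample data point.
ner_spans = extract_spans(
    ["Gubernur", "Bank", "Indonesia", "menggelar", "konferensi", "pers"],
    ["B-NOR", "I-NOR", "I-NOR", "O", "O", "O"],
)
print(ner_spans)

# Statement Extraction labels work the same way (placeholder tokens).
statement_spans = extract_spans(
    ["a", "b", "c", "d", "e"],
    ["B-WHO", "I-WHO", "O", "B-STAT", "I-STAT"],
)
print(statement_spans)
```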

Data Splits

The dataset is split into train, validation, and test sets.

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

The annotators are listed in the Nergrit Corpus repository.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

Thanks to @cahya-wirawan for adding this dataset.