Dataset:

setswana_ner_corpus

Language:

tn

Multilinguality:

monolingual

Size:

1K<n<10K

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

original

License:

other

Dataset Card for Setswana NER Corpus

Dataset Summary

The Setswana NER Corpus is a Setswana dataset developed by the Centre for Text Technology (CTexT), North-West University, South Africa. The data is based on documents from the South African government domain, crawled from gov.za websites. It was created to support named entity recognition (NER) for the Setswana language. The dataset follows the CoNLL shared task annotation standards.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The language supported is Setswana.

Dataset Structure

Data Instances

Each data point is one sentence. In the raw data, sentences are separated by an empty line, and each line holds one token and its tag, separated by a tab.

{'id': '0',
 'ner_tags': [0, 0, 0, 0, 0],
 'tokens': ['Ka', 'dinako', 'dingwe', ',', 'go']
}
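The raw layout described above can be read with a few lines of Python. This is a minimal sketch, not the loader used by the dataset: the function name `parse_conll` is hypothetical, the sample text is reconstructed from the instance shown above, and the tags are kept as strings rather than converted to integer IDs.

```python
def parse_conll(text):
    """Parse tab-separated token/tag lines; an empty line ends a sentence."""
    records = []
    tokens, tags = [], []

    def flush():
        # Store the sentence collected so far, if any, and start a new one.
        if tokens:
            records.append(
                {"id": str(len(records)), "tokens": list(tokens), "ner_tags": list(tags)}
            )
            tokens.clear()
            tags.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()
            continue
        # Assumption: exactly one tab separates the token from its tag.
        token, tag = line.split("\t", 1)
        tokens.append(token)
        tags.append(tag.strip())
    flush()  # the last sentence may not be followed by an empty line
    return records


# Sample reconstructed from the instance above, where tag ID 0 is "OUT".
sample = "Ka\tOUT\ndinako\tOUT\ndingwe\tOUT\n,\tOUT\ngo\tOUT\n\n"
records = parse_conll(sample)
print(records[0]["tokens"])
```

Each parsed record has the same keys as the data instance above, which makes it straightforward to compare raw files against the loaded dataset.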

Data Fields

  • id : id of the sample
  • tokens : the tokens of the example text
  • ner_tags : the NER tags of each token

The NER tags correspond to this list:

"OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"

The NER tags have the same format as in the CoNLL shared task: a B denotes the first word of a phrase and an I any non-initial word. There are four types of phrases: person names (PERS), organizations (ORG), locations (LOC) and miscellaneous names (MISC). OUT is used for tokens that are not part of any named entity.
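To make the B/I scheme concrete, the sketch below decodes integer tag IDs into label strings and groups tokens into entity spans. It assumes that a tag ID is the position of the label in the list above (consistent with ID 0 being OUT in the example instance, but not confirmed by the source). The names `decode_tags` and `extract_entities` are hypothetical, and the tokens are placeholders rather than real corpus text.

```python
NER_LABELS = [
    "OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG",
    "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]


def decode_tags(tag_ids):
    """Map integer tag IDs to label strings (assumes ID == list position)."""
    return [NER_LABELS[i] for i in tag_ids]


def extract_entities(tokens, tag_ids):
    """Group tokens into (entity_type, text) spans following the B/I scheme."""
    entities = []
    current_type, current_tokens = None, []

    def close():
        # Emit the open entity, if there is one.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, label in zip(tokens, decode_tags(tag_ids)):
        if label.startswith("B-"):
            close()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and label[2:] == current_type:
            current_tokens.append(token)
        else:
            # OUT, or an I- tag that does not continue the open entity.
            close()
            current_type, current_tokens = None, []
    close()
    return entities


# Placeholder tokens, not taken from the corpus.
tokens = ["tok0", "tok1", "tok2", "tok3", "tok4"]
tag_ids = [1, 2, 0, 0, 5]
print(extract_entities(tokens, tag_ids))
```

In this sketch a stray I- tag with no matching B- tag is dropped rather than promoted to a new entity. That is a design choice for the illustration, not a rule stated by the dataset.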

Data Splits

The data was not split.
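Because no official split exists, users who need held-out data have to create their own. The following is one possible approach, assuming a simple seeded random split at sentence level. The function name `split_records`, the 80/20 ratio and the seed are illustrative choices, not part of the dataset.

```python
import random


def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records with a fixed seed and cut off a test portion."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in records carrying only an id; real records also have tokens and ner_tags.
records = [{"id": str(i)} for i in range(10)]
train, test = split_records(records)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when results are compared against other work on the same corpus.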

Dataset Creation

Curation Rationale

The data was created to help provide language resources for a language that has few of them: Setswana.

[More Information Needed]

Source Data

Initial Data Collection and Normalization

The data is based on documents from the South African government domain and was crawled from gov.za websites.

[More Information Needed]

Who are the source language producers?

The data was produced by the writers of South African government websites (gov.za).

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

The data was annotated during the NCHLT text resource development project.

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

The annotated datasets were developed by the Centre for Text Technology (CTexT), North-West University, South Africa.

See: more information

Licensing Information

The data is licensed under the Creative Commons Attribution 2.5 South Africa License.

Citation Information

@inproceedings{setswana_ner_corpus,
  author    = {S.S.B.M. Phakedi and
              Roald Eiselen},
  title     = {NCHLT Setswana Named Entity Annotated Corpus},
  booktitle = {Eiselen, R. 2016. Government domain named entity recognition for South African languages. Proceedings of the 10th Language Resource and Evaluation Conference, Portorož, Slovenia.},
  year      = {2016},
  url       = {https://repo.sadilar.org/handle/20.500.12185/341},
}

Contributions

Thanks to @yvonnegitau for adding this dataset.