Dataset:

indonlp/indonlu

Tasks:

Question answering

Text classification

Token classification

Sub-tasks:

closed-domain-qa multi-class-classification named-entity-recognition

Languages:

Indonesian

Multilinguality:

monolingual

Size:

10K<n<100K 1K<n<10K n<1K

Language creators:

expert-generated

Annotation creators:

expert-generated

Source datasets:

original

arXiv:

arxiv:1809.03391

Other:

keyphrase-extraction span-extraction aspect-based-sentiment-analysis

License:

mit

Dataset Card for IndoNLU

Dataset Summary

The IndoNLU benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems for Bahasa Indonesia (the Indonesian language). The benchmark comprises the following 12 datasets.

EmoT: An emotion classification dataset collected from the social media platform Twitter. The dataset consists of around 4000 Indonesian colloquial language tweets, covering five different emotion labels: anger, fear, happy, love, and sadness.

SmSA: This sentence-level sentiment analysis dataset is a collection of comments and reviews in Indonesian obtained from multiple online platforms. The text was crawled and then annotated by several Indonesian linguists to construct this dataset. There are three possible sentiments in the SmSA dataset: positive, negative, and neutral.

CASA: An aspect-based sentiment analysis dataset consisting of around a thousand car reviews collected from multiple Indonesian online automobile platforms. The dataset covers six aspects of car quality. We define the task to be a multi-label classification task, where each label represents a sentiment for a single aspect with three possible values: positive, negative, and neutral.

HoASA: An aspect-based sentiment analysis dataset consisting of hotel reviews collected from the hotel aggregator platform AiryRooms. The dataset covers ten different aspects of hotel quality. As in the CASA dataset, each review is labeled with a single sentiment label for each aspect. There are four possible sentiment classes for each sentiment label: positive, negative, neutral, and positive-negative. The positive-negative label is given to a review that contains multiple sentiments of the same aspect but for different objects (e.g., cleanliness of bed and toilet).

WReTE: The Wiki Revision Edits Textual Entailment dataset consists of 450 sentence pairs constructed from Wikipedia revision history. The dataset contains pairs of sentences and binary semantic relations between the pairs. The data are labeled as entailed when the meaning of the second sentence can be derived from the first one, and not entailed otherwise.

POSP: This Indonesian part-of-speech (POS) tagging dataset is collected from Indonesian news websites. The dataset consists of around 8000 sentences with 26 POS tags. The POS tag labels follow the Indonesian Association of Computational Linguistics (INACL) POS Tagging Convention.

BaPOS: This POS tagging dataset contains about 1000 sentences, collected from the PAN Localization Project. In this dataset, each word is tagged with one of 23 POS tag classes. Data splitting used in this benchmark follows the experimental setting used by Kurniawan and Aji (2018).

TermA: This span-extraction dataset is collected from the hotel aggregator platform AiryRooms. The dataset consists of thousands of hotel reviews, each of which contains span labels for the aspect and sentiment words representing the opinion of the reviewer on the corresponding aspect. The labels use Inside-Outside-Beginning (IOB) tagging representation with two kinds of tags, aspect and sentiment.

KEPS: This keyphrase extraction dataset consists of text from Twitter discussing banking products and services and is written in the Indonesian language. A phrase containing important information is considered a keyphrase. Text may contain one or more keyphrases since important phrases can be located at different positions. The dataset follows the IOB chunking format, which represents the position of the keyphrase.

NERGrit: This NER dataset is taken from the Grit-ID repository, and the labels are spans in IOB chunking representation. The dataset consists of three kinds of named entity tags: PERSON (name of person), PLACE (name of location), and ORGANIZATION (name of organization).

NERP: This NER dataset (Hoesen and Purwarianti, 2018) contains texts collected from several Indonesian news websites. There are five labels available in this dataset: PER (name of person), LOC (name of location), IND (name of product or brand), EVT (name of the event), and FNB (name of food and beverage). Similar to the TermA dataset, the NERP dataset uses the IOB chunking format.

FacQA: The goal of the FacQA dataset is to find the answer to a question from a provided short passage from a news article. Each row in the FacQA dataset consists of a question, a short passage, and a label phrase, which can be found inside the corresponding short passage. There are six categories of questions: date, location, name, organization, person, and quantitative.

Supported Tasks and Leaderboards

[Needs More Information]

Languages

Indonesian

Dataset Structure

Data Instances

EmoT dataset

A data point consists of tweet and label. An example from the train set looks as follows:

{
  'tweet': 'Ini adalah hal yang paling membahagiakan saat biasku foto bersama ELF #ReturnOfTheLittlePrince #HappyHeeChulDay',
  'label': 4,
}

SmSA dataset

A data point consists of text and label. An example from the train set looks as follows:

{
  'text': 'warung ini dimiliki oleh pengusaha pabrik tahu yang sudah puluhan tahun terkenal membuat tahu putih di bandung . tahu berkualitas , dipadu keahlian memasak , dipadu kretivitas , jadilah warung yang menyajikan menu utama berbahan tahu , ditambah menu umum lain seperti ayam . semuanya selera indonesia . harga cukup terjangkau . jangan lewatkan tahu bletoka nya , tidak kalah dengan yang asli dari tegal !',
  'label': 0,
}

CASA dataset

A data point consists of sentence and the multi-label fields fuel, machine, others, part, price, and service. An example from the train set looks as follows:

{
  'sentence': 'Saya memakai Honda Jazz GK5 tahun 2014 ( pertama meluncur ) . Mobil nya bagus dan enak sesuai moto nya menyenangkan untuk dikendarai',
  'fuel': 1,
  'machine': 1,
  'others': 2,
  'part': 1,
  'price': 1,
  'service': 1
}

HoASA dataset

A data point consists of sentence and the multi-label fields ac, air_panas, bau, general, kebersihan, linen, service, sunrise_meal, tv, and wifi. An example from the train set looks as follows:

{
  'sentence': 'kebersihan kurang...',
  'ac': 1,
  'air_panas': 1,
  'bau': 1,
  'general': 1,
  'kebersihan': 0,
  'linen': 1,
  'service': 1,
  'sunrise_meal': 1,
  'tv': 1,
  'wifi': 1
}

WReTE dataset

A data point consists of premise, hypothesis, category, and label. An example from the train set looks as follows:

{
  'premise': 'Pada awalnya bangsa Israel hanya terdiri dari satu kelompok keluarga di antara banyak kelompok keluarga yang hidup di tanah Kanan pada abad 18 SM .',
  'hypothesis': 'Pada awalnya bangsa Yahudi hanya terdiri dari satu kelompok keluarga di antara banyak kelompok keluarga yang hidup di tanah Kanan pada abad 18 SM .',
  'category': 'menolak perubahan teks terakhir oleh istimewa kontribusi pengguna 141 109 98 87 141 109 98 87 dan mengembalikan revisi 6958053 oleh johnthorne',
  'label': 0,
}

POSP dataset

A data point consists of tokens and pos_tags. An example from the train set looks as follows:

{
  'tokens': ['kepala', 'dinas', 'tata', 'kota', 'manado', 'amos', 'kenda', 'menyatakan', 'tidak', 'tahu', '-', 'menahu', 'soal', 'pencabutan', 'baliho', '.', 'ia', 'enggan', 'berkomentar', 'banyak', 'karena', 'merasa', 'bukan', 'kewenangannya', '.'],
  'pos_tags': [11, 6, 11, 11, 7, 7, 7, 9, 23, 4, 21, 9, 11, 11, 11, 21, 3, 2, 4, 1, 19, 9, 23, 11, 21]
}

BaPOS dataset

A data point consists of tokens and pos_tags. An example from the train set looks as follows:

{
  'tokens': ['Kera', 'untuk', 'amankan', 'pesta', 'olahraga'],
  'pos_tags': [27, 8, 26, 27, 30]
}

TermA dataset

A data point consists of tokens and seq_label. An example from the train set looks as follows:

{
  'tokens': ['kamar', 'saya', 'ada', 'kendala', 'di', 'ac', 'tidak', 'berfungsi', 'optimal', '.', 'dan', 'juga', 'wifi', 'koneksi', 'kurang', 'stabil', '.'],
  'seq_label': [1, 1, 1, 1, 1, 4, 3, 0, 0, 1, 1, 1, 4, 2, 3, 0, 1]
}

KEPS dataset

A data point consists of tokens and seq_label. An example from the train set looks as follows:

{
  'tokens': ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'Ribet'],
  'seq_label': [0, 1, 1, 2, 0, 0, 1, 0, 1, 2, 2, 1]
}

NERGrit dataset

A data point consists of tokens and ner_tags. An example from the train set looks as follows:

{
  'tokens': ['Kontribusinya', 'terhadap', 'industri', 'musik', 'telah', 'mengumpulkan', 'banyak', 'prestasi', 'termasuk', 'lima', 'Grammy', 'Awards', ',', 'serta', 'dua', 'belas', 'nominasi', ';', 'dua', 'Guinness', 'World', 'Records', ';', 'dan', 'penjualannya', 'diperkirakan', 'sekitar', '64', 'juta', 'rekaman', '.'],
  'ner_tags': [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]}

NERP dataset

A data point consists of tokens and ner_tags. An example from the train set looks as follows:

{
  'tokens': ['kepala', 'dinas', 'tata', 'kota', 'manado', 'amos', 'kenda', 'menyatakan', 'tidak', 'tahu', '-', 'menahu', 'soal', 'pencabutan', 'baliho', '.', 'ia', 'enggan', 'berkomentar', 'banyak', 'karena', 'merasa', 'bukan', 'kewenangannya', '.'],
  'ner_tags': [9, 9, 9, 9, 2, 7, 0, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9]
}

FacQA dataset

A data point consists of question, passage, and seq_label. An example from the train set looks as follows:

{
  'passage': ['Lewat', 'telepon', 'ke', 'kantor', 'berita', 'lokal', 'Current', 'News', 'Service', ',', 'Hezb-ul', 'Mujahedeen', ',', 'kelompok', 'militan', 'Kashmir', 'yang', 'terbesar', ',', 'menyatakan', 'bertanggung', 'jawab', 'atas', 'ledakan', 'di', 'Srinagar', '.'],
  'question': ['Kelompok', 'apakah', 'yang', 'menyatakan', 'bertanggung', 'jawab', 'atas', 'ledakan', 'di', 'Srinagar', '?'],
  'seq_label': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}

Data Fields

EmoT dataset

tweet: a string feature.
label: an emotion label, with possible values including sadness, anger, love, fear, happy.
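
A minimal sketch of turning the integer label back into its name. It assumes that class indices follow the order in which the values are listed above (sadness = 0 through happy = 4); that ordering is an assumption rather than something this card states, and EMOT_LABELS and decode_emot are illustrative names that are not part of the dataset.

```python
# Assumption: index order matches the order the values are listed in this card.
EMOT_LABELS = ["sadness", "anger", "love", "fear", "happy"]


def decode_emot(example):
    """Return a copy of an EmoT example with an added 'label_name' field."""
    return {**example, "label_name": EMOT_LABELS[example["label"]]}


# The train-set example shown under Data Instances.
example = {
    "tweet": "Ini adalah hal yang paling membahagiakan saat biasku foto "
             "bersama ELF #ReturnOfTheLittlePrince #HappyHeeChulDay",
    "label": 4,
}

decoded = decode_emot(example)
print(decoded["label_name"])
```

Under the stated assumption, label 4 maps to happy, which is consistent with the content of the example tweet.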

SmSA dataset

text: a string feature.
label: a sentiment label, with possible values including positive, neutral, negative.

CASA dataset

sentence: a string feature.
fuel: a sentiment label, with possible values including negative, neutral, positive.
machine: a sentiment label, with possible values including negative, neutral, positive.
others: a sentiment label, with possible values including negative, neutral, positive.
part: a sentiment label, with possible values including negative, neutral, positive.
price: a sentiment label, with possible values including negative, neutral, positive.
service: a sentiment label, with possible values including negative, neutral, positive.

HoASA dataset

sentence: a string feature.
ac: a sentiment label, with possible values including neg, neut, pos, neg_pos.
air_panas: a sentiment label, with possible values including neg, neut, pos, neg_pos.
bau: a sentiment label, with possible values including neg, neut, pos, neg_pos.
general: a sentiment label, with possible values including neg, neut, pos, neg_pos.
kebersihan: a sentiment label, with possible values including neg, neut, pos, neg_pos.
linen: a sentiment label, with possible values including neg, neut, pos, neg_pos.
service: a sentiment label, with possible values including neg, neut, pos, neg_pos.
sunrise_meal: a sentiment label, with possible values including neg, neut, pos, neg_pos.
tv: a sentiment label, with possible values including neg, neut, pos, neg_pos.
wifi: a sentiment label, with possible values including neg, neut, pos, neg_pos.
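
Because every aspect is its own column, a common first step is to decode all columns at once and keep only the aspects a review actually comments on. The sketch below assumes that the integer values index the sentiment names in the order listed above (neg = 0, neut = 1, pos = 2, neg_pos = 3); the ordering and the helper names are assumptions made for illustration.

```python
# Assumption: index order matches the order the values are listed in this card.
HOASA_SENTIMENTS = ["neg", "neut", "pos", "neg_pos"]


def decode_aspects(example, text_field="sentence"):
    """Map every aspect column of an example to its sentiment name."""
    return {
        aspect: HOASA_SENTIMENTS[value]
        for aspect, value in example.items()
        if aspect != text_field
    }


def non_neutral(decoded):
    """Keep only the aspects whose sentiment is not 'neut'."""
    return {aspect: s for aspect, s in decoded.items() if s != "neut"}


# The train-set example shown under Data Instances.
example = {
    "sentence": "kebersihan kurang...",
    "ac": 1, "air_panas": 1, "bau": 1, "general": 1, "kebersihan": 0,
    "linen": 1, "service": 1, "sunrise_meal": 1, "tv": 1, "wifi": 1,
}

decoded = decode_aspects(example)
opinions = non_neutral(decoded)
print(opinions)
```

Under the assumed ordering, the only non-neutral aspect is kebersihan (cleanliness) with a negative sentiment, which agrees with the review text.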

WReTE dataset

premise: a string feature.
hypothesis: a string feature.
category: a string feature.
label: a classification label, with possible values including NotEntail, Entail_or_Paraphrase.

POSP dataset

tokens: a list of string features.
pos_tags: a list of POS tag labels, with possible values including B-PPO, B-KUA, B-ADV, B-PRN, B-VBI.

The POS tag labels follow the Indonesian Association of Computational Linguistics (INACL) POS Tagging Convention.

BaPOS dataset

tokens: a list of string features.
pos_tags: a list of POS tag labels, with possible values including B-PR, B-CD, I-PR, B-SYM, B-JJ.

The POS tag labels come from Tagset UI.

TermA dataset

tokens: a list of string features.
seq_label: a list of classification labels, with possible values including I-SENTIMENT, O, I-ASPECT, B-SENTIMENT, B-ASPECT.
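
The IOB labels can be grouped back into aspect and sentiment spans. The following is a minimal sketch, assuming the integer ids index the label names in the order listed above (I-SENTIMENT = 0, O = 1, I-ASPECT = 2, B-SENTIMENT = 3, B-ASPECT = 4); the ordering and the function name extract_spans are assumptions made for illustration.

```python
# Assumption: index order matches the order the values are listed in this card.
TERMA_LABELS = ["I-SENTIMENT", "O", "I-ASPECT", "B-SENTIMENT", "B-ASPECT"]


def extract_spans(tokens, label_ids, label_names):
    """Group IOB-tagged tokens into (span_type, text) tuples."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the span that is currently open, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, label_id in zip(tokens, label_ids):
        prefix, _, span_type = label_names[label_id].partition("-")
        if prefix == "I" and span_type == current_type:
            current_tokens.append(token)  # continue the open span
        elif prefix in ("B", "I"):
            # "B" starts a span; a stray "I" is treated as a start as well.
            flush()
            current_type, current_tokens = span_type, [token]
        else:  # "O"
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# The train-set example shown under Data Instances.
tokens = ['kamar', 'saya', 'ada', 'kendala', 'di', 'ac', 'tidak', 'berfungsi',
          'optimal', '.', 'dan', 'juga', 'wifi', 'koneksi', 'kurang', 'stabil', '.']
seq_label = [1, 1, 1, 1, 1, 4, 3, 0, 0, 1, 1, 1, 4, 2, 3, 0, 1]

spans = extract_spans(tokens, seq_label, TERMA_LABELS)
for span_type, text in spans:
    print(span_type, "->", text)
```

Under the assumed ordering, the example decodes to two aspect spans (ac, wifi koneksi), each followed by a sentiment span (tidak berfungsi optimal, kurang stabil).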

KEPS dataset

tokens: a list of string features.
seq_label: a list of classification labels, with possible values including O, B, I.

The labels use Inside-Outside-Beginning (IOB) tagging.

NERGrit dataset

tokens: a list of string features.
ner_tags: a list of NER tag labels, with possible values including I-PERSON, B-ORGANISATION, I-ORGANISATION, B-PLACE, I-PLACE.

The labels use Inside-Outside-Beginning (IOB) tagging.

NERP dataset

tokens: a list of string features.
ner_tags: a list of NER tag labels, with possible values including I-PPL, B-EVT, B-PLC, I-IND, B-IND.

FacQA dataset

question: a list of string features.
passage: a list of string features.
seq_label: a list of classification labels, with possible values including O, B, I.
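
Since seq_label is aligned with passage, the answer phrase can be read off by collecting the tokens tagged B and I. This is a minimal sketch, assuming the ids follow the order listed above (O = 0, B = 1, I = 2); the ordering and the name extract_answers are assumptions made for illustration.

```python
# Assumption: index order matches the order the values are listed in this card.
FACQA_LABELS = ["O", "B", "I"]


def extract_answers(passage, seq_label):
    """Return the answer phrases marked by B/I tags in the passage."""
    answers, current = [], []
    for token, label_id in zip(passage, seq_label):
        tag = FACQA_LABELS[label_id]
        if tag == "B":
            if current:
                answers.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or an "I" with no open phrase (ignored in this sketch).
            if current:
                answers.append(" ".join(current))
            current = []
    if current:
        answers.append(" ".join(current))
    return answers


# The train-set example shown under Data Instances.
passage = ['Lewat', 'telepon', 'ke', 'kantor', 'berita', 'lokal', 'Current',
           'News', 'Service', ',', 'Hezb-ul', 'Mujahedeen', ',', 'kelompok',
           'militan', 'Kashmir', 'yang', 'terbesar', ',', 'menyatakan',
           'bertanggung', 'jawab', 'atas', 'ledakan', 'di', 'Srinagar', '.']
question = ['Kelompok', 'apakah', 'yang', 'menyatakan', 'bertanggung', 'jawab',
            'atas', 'ledakan', 'di', 'Srinagar', '?']
seq_label = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0]

answers = extract_answers(passage, seq_label)
print(" ".join(question), "->", answers)
```

Under the assumed ordering, the tagged tokens form the phrase Hezb-ul Mujahedeen, which answers the question about which group claimed responsibility.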

Data Splits

Each dataset is split into training, validation, and test sets. The number of examples per split is as follows:

No.	Dataset	Train	Valid	Test
1	EmoT	3521	440	440
2	SmSA	11000	1260	500
3	CASA	810	90	180
4	HoASA	2283	285	286
5	WReTE	300	50	100
6	POSP	6720	840	840
7	BaPOS	8000	1000	1029
8	TermA	3000	1000	1000
9	KEPS	800	200	247
10	NERGrit	1672	209	209
11	NERP	6720	840	840
12	FacQA	2495	311	311
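
As a quick sanity check on the table, the per-dataset totals and training shares can be computed directly from the split sizes. The sketch below copies the numbers from the table; the names SPLITS and summarize are illustrative only.

```python
# Split sizes copied from the table above: (train, valid, test).
SPLITS = {
    "EmoT": (3521, 440, 440),
    "SmSA": (11000, 1260, 500),
    "CASA": (810, 90, 180),
    "HoASA": (2283, 285, 286),
    "WReTE": (300, 50, 100),
    "POSP": (6720, 840, 840),
    "BaPOS": (8000, 1000, 1029),
    "TermA": (3000, 1000, 1000),
    "KEPS": (800, 200, 247),
    "NERGrit": (1672, 209, 209),
    "NERP": (6720, 840, 840),
    "FacQA": (2495, 311, 311),
}


def summarize(splits):
    """Compute the total size and the training share of each dataset."""
    summary = {}
    for name, (train, valid, test) in splits.items():
        total = train + valid + test
        summary[name] = {"total": total, "train_share": train / total}
    return summary


summary = summarize(SPLITS)
for name, row in summary.items():
    print(f"{name:8s} total={row['total']:6d} train_share={row['train_share']:.3f}")
```

The totals agree with the sizes quoted in the summary where a figure is given, for example 450 sentence pairs for WReTE.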

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

The IndoNLU benchmark datasets are released under the MIT License.

Citation Information

IndoNLU citation

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

EmoT dataset citation

@inproceedings{saputri2018emotion,
  title={Emotion Classification on Indonesian Twitter Dataset},
  author={Mei Silviana Saputri and Rahmad Mahendra and Mirna Adriani},
  booktitle={Proceedings of the 2018 International Conference on Asian Language Processing (IALP)},
  pages={90--95},
  year={2018},
  organization={IEEE}
}

SmSA dataset citation

@inproceedings{purwarianti2019improving,
  title={Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector},
  author={Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti},
  booktitle={Proceedings of the 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA)},
  pages={1--5},
  year={2019},
  organization={IEEE}
}

CASA dataset citation

@inproceedings{ilmania2018aspect,
  title={Aspect Detection and Sentiment Classification Using Deep Neural Network for Indonesian Aspect-based Sentiment Analysis},
  author={Arfinda Ilmania and Abdurrahman and Samuel Cahyawijaya and Ayu Purwarianti},
  booktitle={Proceedings of the 2018 International Conference on Asian Language Processing (IALP)},
  pages={62--67},
  year={2018},
  organization={IEEE}
}

HoASA dataset citation

@inproceedings{azhar2019multi,
  title={Multi-label Aspect Categorization with Convolutional Neural Networks and Extreme Gradient Boosting},
  author={A. N. Azhar and M. L. Khodra and A. P. Sutiono},
  booktitle={Proceedings of the 2019 International Conference on Electrical Engineering and Informatics (ICEEI)},
  pages={35--40},
  year={2019}
}

WReTE dataset citation

@inproceedings{setya2018semi,
  title={Semi-supervised Textual Entailment on Indonesian Wikipedia Data},
  author={Ken Nabila Setya and Rahmad Mahendra},
  booktitle={Proceedings of the 2018 International Conference on Computational Linguistics and Intelligent Text Processing (CICLing)},
  year={2018}
}

POSP dataset citation

@inproceedings{hoesen2018investigating,
  title={Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger},
  author={Devin Hoesen and Ayu Purwarianti},
  booktitle={Proceedings of the 2018 International Conference on Asian Language Processing (IALP)},
  pages={35--38},
  year={2018},
  organization={IEEE}
}

BaPOS dataset citation

@inproceedings{dinakaramani2014designing,
  title={Designing an Indonesian Part of Speech Tagset and Manually Tagged Indonesian Corpus},
  author={Arawinda Dinakaramani and Fam Rashel and Andry Luthfi and Ruli Manurung},
  booktitle={Proceedings of the 2014 International Conference on Asian Language Processing (IALP)},
  pages={66--69},
  year={2014},
  organization={IEEE}
}
@inproceedings{kurniawan2018toward,
  title={Toward a Standardized and More Accurate Indonesian Part-of-Speech Tagging},
  author={Kemal Kurniawan and Alham Fikri Aji},
  booktitle={Proceedings of the 2018 International Conference on Asian Language Processing (IALP)},
  pages={303--307},
  year={2018},
  organization={IEEE}
}

TermA dataset citation

@article{winatmoko2019aspect,
  title={Aspect and Opinion Term Extraction for Hotel Reviews Using Transfer Learning and Auxiliary Labels},
  author={Yosef Ardhito Winatmoko and Ali Akbar Septiandri and Arie Pratama Sutiono},
  journal={arXiv preprint arXiv:1909.11879},
  year={2019}
}
@article{fernando2019aspect,
  title={Aspect and Opinion Terms Extraction Using Double Embeddings and Attention Mechanism for Indonesian Hotel Reviews},
  author={Jordhy Fernando and Masayu Leylia Khodra and Ali Akbar Septiandri},
  journal={arXiv preprint arXiv:1908.04899},
  year={2019}
}

KEPS dataset citation

@inproceedings{mahfuzh2019improving,
  title={Improving Joint Layer RNN based Keyphrase Extraction by Using Syntactical Features},
  author={Miftahul Mahfuzh and Sidik Soleman and Ayu Purwarianti},
  booktitle={Proceedings of the 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA)},
  pages={1--6},
  year={2019},
  organization={IEEE}
}

NERGrit dataset citation

@online{nergrit2019,
  title={NERGrit Corpus},
  author={NERGrit Developers},
  year={2019},
  url={https://github.com/grit-id/nergrit-corpus}
}

NERP dataset citation

@inproceedings{hoesen2018investigating,
  title={Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger},
  author={Devin Hoesen and Ayu Purwarianti},
  booktitle={Proceedings of the 2018 International Conference on Asian Language Processing (IALP)},
  pages={35--38},
  year={2018},
  organization={IEEE}
}

FacQA dataset citation

@inproceedings{purwarianti2007machine,
  title={A Machine Learning Approach for Indonesian Question Answering System},
  author={Ayu Purwarianti and Masatoshi Tsuchiya and Seiichi Nakagawa},
  booktitle={Proceedings of Artificial Intelligence and Applications},
  pages={573--578},
  year={2007}
}

Contributions

Thanks to @yasirabd for adding this dataset.

Author:

indonlp

Dataset size:

102.5 KB