The IndoNLU benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems for Bahasa Indonesia (the Indonesian language). It comprises 12 datasets.
[Needs More Information]
An EmoT data point consists of tweet and label. An example from the train set looks as follows:
{ 'tweet': 'Ini adalah hal yang paling membahagiakan saat biasku foto bersama ELF #ReturnOfTheLittlePrince #HappyHeeChulDay', 'label': 4 }
An SmSA data point consists of text and label. An example from the train set looks as follows:
{ 'text': 'warung ini dimiliki oleh pengusaha pabrik tahu yang sudah puluhan tahun terkenal membuat tahu putih di bandung . tahu berkualitas , dipadu keahlian memasak , dipadu kretivitas , jadilah warung yang menyajikan menu utama berbahan tahu , ditambah menu umum lain seperti ayam . semuanya selera indonesia . harga cukup terjangkau . jangan lewatkan tahu bletoka nya , tidak kalah dengan yang asli dari tegal !', 'label': 0 }
A CASA data point consists of sentence and the multi-label fields fuel, machine, others, part, price, and service. An example from the train set looks as follows:
{ 'sentence': 'Saya memakai Honda Jazz GK5 tahun 2014 ( pertama meluncur ) . Mobil nya bagus dan enak sesuai moto nya menyenangkan untuk dikendarai', 'fuel': 1, 'machine': 1, 'others': 2, 'part': 1, 'price': 1, 'service': 1 }
A HoASA data point consists of sentence and the multi-label fields ac, air_panas, bau, general, kebersihan, linen, service, sunrise_meal, tv, and wifi. An example from the train set looks as follows:
{ 'sentence': 'kebersihan kurang...', 'ac': 1, 'air_panas': 1, 'bau': 1, 'general': 1, 'kebersihan': 0, 'linen': 1, 'service': 1, 'sunrise_meal': 1, 'tv': 1, 'wifi': 1 }
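For multi-label records like the one above, it can be convenient to separate the text from the per-aspect labels. The following is a minimal sketch, assuming the record is a plain Python dict shaped like the example. The helper name `split_record` is hypothetical, and the integer label ids are left uninterpreted because their class names are not stated here.

```python
from typing import Dict, List, Tuple

# Aspect fields listed for this dataset, in the order given above.
HOASA_ASPECTS: List[str] = [
    "ac", "air_panas", "bau", "general", "kebersihan",
    "linen", "service", "sunrise_meal", "tv", "wifi",
]


def split_record(
    record: Dict[str, object],
    aspects: List[str] = HOASA_ASPECTS,
) -> Tuple[str, List[int]]:
    """Return (sentence, label_vector) with labels in a fixed aspect order.

    Hypothetical helper for illustration; not part of any IndoNLU tooling.
    """
    missing = [a for a in aspects if a not in record]
    if missing:
        raise KeyError(f"record is missing aspect fields: {missing}")
    return str(record["sentence"]), [int(record[a]) for a in aspects]


# The train-set example shown above.
example = {
    "sentence": "kebersihan kurang...",
    "ac": 1, "air_panas": 1, "bau": 1, "general": 1, "kebersihan": 0,
    "linen": 1, "service": 1, "sunrise_meal": 1, "tv": 1, "wifi": 1,
}

sentence, labels = split_record(example)
print(sentence)
print(labels)
```

Keeping the aspect order fixed means the label vector can be fed directly to a multi-output classifier without carrying field names around.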
A WReTE data point consists of premise, hypothesis, category, and label. An example from the train set looks as follows:
{ 'premise': 'Pada awalnya bangsa Israel hanya terdiri dari satu kelompok keluarga di antara banyak kelompok keluarga yang hidup di tanah Kanan pada abad 18 SM .', 'hypothesis': 'Pada awalnya bangsa Yahudi hanya terdiri dari satu kelompok keluarga di antara banyak kelompok keluarga yang hidup di tanah Kanan pada abad 18 SM .', 'category': 'menolak perubahan teks terakhir oleh istimewa kontribusi pengguna 141 109 98 87 141 109 98 87 dan mengembalikan revisi 6958053 oleh johnthorne', 'label': 0 }
A POSP data point consists of tokens and pos_tags. An example from the train set looks as follows:
{ 'tokens': ['kepala', 'dinas', 'tata', 'kota', 'manado', 'amos', 'kenda', 'menyatakan', 'tidak', 'tahu', '-', 'menahu', 'soal', 'pencabutan', 'baliho', '.', 'ia', 'enggan', 'berkomentar', 'banyak', 'karena', 'merasa', 'bukan', 'kewenangannya', '.'], 'pos_tags': [11, 6, 11, 11, 7, 7, 7, 9, 23, 4, 21, 9, 11, 11, 11, 21, 3, 2, 4, 1, 19, 9, 23, 11, 21] }
A BaPOS data point consists of tokens and pos_tags. An example from the train set looks as follows:
{ 'tokens': ['Kera', 'untuk', 'amankan', 'pesta', 'olahraga'], 'pos_tags': [27, 8, 26, 27, 30] }
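In the sequence-labeling examples, the tag list is aligned position by position with the token list. A minimal sketch of pairing the two and checking that alignment, assuming the record is a plain Python dict shaped like the example above; `pair_tokens` is a hypothetical helper, and the tag ids are kept as integers because the id-to-name mapping is not given here.

```python
from typing import List, Tuple


def pair_tokens(tokens: List[str], tags: List[int]) -> List[Tuple[str, int]]:
    """Pair each token with its tag id, failing loudly on misaligned input."""
    if len(tokens) != len(tags):
        raise ValueError(
            f"length mismatch: {len(tokens)} tokens vs {len(tags)} tags"
        )
    return list(zip(tokens, tags))


# The train-set example shown above.
example = {
    "tokens": ["Kera", "untuk", "amankan", "pesta", "olahraga"],
    "pos_tags": [27, 8, 26, 27, 30],
}

pairs = pair_tokens(example["tokens"], example["pos_tags"])
for token, tag in pairs:
    print(f"{token}\t{tag}")
```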
A TermA data point consists of tokens and seq_label. An example from the train set looks as follows:
{ 'tokens': ['kamar', 'saya', 'ada', 'kendala', 'di', 'ac', 'tidak', 'berfungsi', 'optimal', '.', 'dan', 'juga', 'wifi', 'koneksi', 'kurang', 'stabil', '.'], 'seq_label': [1, 1, 1, 1, 1, 4, 3, 0, 0, 1, 1, 1, 4, 2, 3, 0, 1] }
A KEPS data point consists of tokens and seq_label. An example from the train set looks as follows:
{ 'tokens': ['Setelah', 'melalui', 'proses', 'telepon', 'yang', 'panjang', 'tutup', 'sudah', 'kartu', 'kredit', 'bca', 'Ribet'], 'seq_label': [0, 1, 1, 2, 0, 0, 1, 0, 1, 2, 2, 1] }
A NERGrit data point consists of tokens and ner_tags. An example from the train set looks as follows:
{ 'tokens': ['Kontribusinya', 'terhadap', 'industri', 'musik', 'telah', 'mengumpulkan', 'banyak', 'prestasi', 'termasuk', 'lima', 'Grammy', 'Awards', ',', 'serta', 'dua', 'belas', 'nominasi', ';', 'dua', 'Guinness', 'World', 'Records', ';', 'dan', 'penjualannya', 'diperkirakan', 'sekitar', '64', 'juta', 'rekaman', '.'], 'ner_tags': [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]}
A NERP data point consists of tokens and ner_tags. An example from the train set looks as follows:
{ 'tokens': ['kepala', 'dinas', 'tata', 'kota', 'manado', 'amos', 'kenda', 'menyatakan', 'tidak', 'tahu', '-', 'menahu', 'soal', 'pencabutan', 'baliho', '.', 'ia', 'enggan', 'berkomentar', 'banyak', 'karena', 'merasa', 'bukan', 'kewenangannya', '.'], 'ner_tags': [9, 9, 9, 9, 2, 7, 0, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9] }
A FacQA data point consists of question, passage, and seq_label. An example from the train set looks as follows:
{ 'passage': ['Lewat', 'telepon', 'ke', 'kantor', 'berita', 'lokal', 'Current', 'News', 'Service', ',', 'Hezb-ul', 'Mujahedeen', ',', 'kelompok', 'militan', 'Kashmir', 'yang', 'terbesar', ',', 'menyatakan', 'bertanggung', 'jawab', 'atas', 'ledakan', 'di', 'Srinagar', '.'], 'question': ['Kelompok', 'apakah', 'yang', 'menyatakan', 'bertanggung', 'jawab', 'atas', 'ledakan', 'di', 'Srinagar', '?'], 'seq_label': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }
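The seq_label values in this example mark the answer span inside the passage. The sketch below recovers that span under an explicit assumption: that the ids map to 0 = Outside, 1 = Beginning, 2 = Inside. This mapping is inferred from the example and is not stated in this card, so check it against the dataset's label names before relying on it. The function name `iob_spans` is hypothetical.

```python
from typing import List


def iob_spans(
    tokens: List[str],
    labels: List[int],
    begin: int = 1,   # ASSUMPTION: id of the "Beginning" tag
    inside: int = 2,  # ASSUMPTION: id of the "Inside" tag
) -> List[str]:
    """Collect contiguous Beginning/Inside runs as space-joined strings.

    Any other id is treated as Outside. An Inside tag that does not
    follow an open span is ignored.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    spans: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == begin:
            if current:                      # close the previous span
                spans.append(" ".join(current))
            current = [token]
        elif label == inside and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:                              # span running to the end
        spans.append(" ".join(current))
    return spans


# The passage and labels from the train-set example shown above.
passage = ['Lewat', 'telepon', 'ke', 'kantor', 'berita', 'lokal', 'Current',
           'News', 'Service', ',', 'Hezb-ul', 'Mujahedeen', ',', 'kelompok',
           'militan', 'Kashmir', 'yang', 'terbesar', ',', 'menyatakan',
           'bertanggung', 'jawab', 'atas', 'ledakan', 'di', 'Srinagar', '.']
seq_label = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 0]

answers = iob_spans(passage, seq_label)
print(answers)  # ['Hezb-ul Mujahedeen']
```

Under the assumed mapping, the extracted span answers the example question about which group claimed responsibility.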
For POSP, the POS tag labels follow the Indonesian Association of Computational Linguistics (INACL) POS Tagging Convention.
For BaPOS, the POS tag labels come from Tagset UI.
The labels use Inside-Outside-Beginning (IOB) tagging.
Each dataset is split into training, validation, and test sets.
# | Dataset | Train | Valid | Test |
1 | EmoT | 3521 | 440 | 440 |
2 | SmSA | 11000 | 1260 | 500 |
3 | CASA | 810 | 90 | 180 |
4 | HoASA | 2283 | 285 | 286 |
5 | WReTE | 300 | 50 | 100 |
6 | POSP | 6720 | 840 | 840 |
7 | BaPOS | 8000 | 1000 | 1029 |
8 | TermA | 3000 | 1000 | 1000 |
9 | KEPS | 800 | 200 | 247 |
10 | NERGrit | 1672 | 209 | 209 |
11 | NERP | 6720 | 840 | 840 |
12 | FacQA | 2495 | 311 | 311 |
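The split sizes in the table can be turned into totals and proportions, which helps when comparing how much evaluation data each task provides. A minimal sketch: the numbers are copied from the table above, while the `SPLITS` dictionary and `split_fractions` helper are hypothetical names used only for illustration.

```python
from typing import Dict, Tuple

# (train, valid, test) sizes copied from the table above.
SPLITS: Dict[str, Tuple[int, int, int]] = {
    "EmoT": (3521, 440, 440),
    "SmSA": (11000, 1260, 500),
    "CASA": (810, 90, 180),
    "HoASA": (2283, 285, 286),
    "WReTE": (300, 50, 100),
    "POSP": (6720, 840, 840),
    "BaPOS": (8000, 1000, 1029),
    "TermA": (3000, 1000, 1000),
    "KEPS": (800, 200, 247),
    "NERGrit": (1672, 209, 209),
    "NERP": (6720, 840, 840),
    "FacQA": (2495, 311, 311),
}


def split_fractions(sizes: Tuple[int, int, int]) -> Tuple[float, float, float]:
    """Return the share of examples in train, valid, and test."""
    total = sum(sizes)
    train, valid, test = sizes
    return train / total, valid / total, test / total


for name, sizes in SPLITS.items():
    train_f, valid_f, test_f = split_fractions(sizes)
    print(
        f"{name:8s} total={sum(sizes):6d} "
        f"train={train_f:.1%} valid={valid_f:.1%} test={test_f:.1%}"
    )
```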
[Needs More Information]
[Needs More Information]
Who are the source language producers? [Needs More Information]
[Needs More Information]
Who are the annotators? [Needs More Information]
[Needs More Information]
[Needs More Information]
[Needs More Information]
[Needs More Information]
[Needs More Information]
The IndoNLU benchmark datasets are released under the MIT License.
IndoNLU citation
@inproceedings{wilie2020indonlu, title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding}, author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti}, booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing}, year={2020} }
EmoT dataset citation
@inproceedings{saputri2018emotion, title={Emotion Classification on Indonesian Twitter Dataset}, author={Mei Silviana Saputri and Rahmad Mahendra and Mirna Adriani}, booktitle={Proceedings of the 2018 International Conference on Asian Language Processing (IALP)}, pages={90--95}, year={2018}, organization={IEEE} }
SmSA dataset citation
@inproceedings{purwarianti2019improving, title={Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector}, author={Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti}, booktitle={Proceedings of the 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA)}, pages={1--5}, year={2019}, organization={IEEE} }
CASA dataset citation
@inproceedings{ilmania2018aspect, title={Aspect Detection and Sentiment Classification Using Deep Neural Network for Indonesian Aspect-based Sentiment Analysis}, author={Arfinda Ilmania and Abdurrahman and Samuel Cahyawijaya and Ayu Purwarianti}, booktitle={Proceedings of the 2018 International Conference on Asian Language Processing (IALP)}, pages={62--67}, year={2018}, organization={IEEE} }
HoASA dataset citation
@inproceedings{azhar2019multi, title={Multi-label Aspect Categorization with Convolutional Neural Networks and Extreme Gradient Boosting}, author={A. N. Azhar and M. L. Khodra and A. P. Sutiono}, booktitle={Proceedings of the 2019 International Conference on Electrical Engineering and Informatics (ICEEI)}, pages={35--40}, year={2019} }
WReTE dataset citation
@inproceedings{setya2018semi, title={Semi-supervised Textual Entailment on Indonesian Wikipedia Data}, author={Ken Nabila Setya and Rahmad Mahendra}, booktitle={Proceedings of the 2018 International Conference on Computational Linguistics and Intelligent Text Processing (CICLing)}, year={2018} }
POSP dataset citation
@inproceedings{hoesen2018investigating, title={Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger}, author={Devin Hoesen and Ayu Purwarianti}, booktitle={Proceedings of the 2018 International Conference on Asian Language Processing (IALP)}, pages={35--38}, year={2018}, organization={IEEE} }
BaPOS dataset citation
@inproceedings{dinakaramani2014designing, title={Designing an Indonesian Part of Speech Tagset and Manually Tagged Indonesian Corpus}, author={Arawinda Dinakaramani and Fam Rashel and Andry Luthfi and Ruli Manurung}, booktitle={Proceedings of the 2014 International Conference on Asian Language Processing (IALP)}, pages={66--69}, year={2014}, organization={IEEE} }
@inproceedings{kurniawan2018toward, title={Toward a Standardized and More Accurate Indonesian Part-of-Speech Tagging}, author={Kemal Kurniawan and Alham Fikri Aji}, booktitle={Proceedings of the 2018 International Conference on Asian Language Processing (IALP)}, pages={303--307}, year={2018}, organization={IEEE} }
TermA dataset citation
@article{winatmoko2019aspect, title={Aspect and Opinion Term Extraction for Hotel Reviews Using Transfer Learning and Auxiliary Labels}, author={Yosef Ardhito Winatmoko and Ali Akbar Septiandri and Arie Pratama Sutiono}, journal={arXiv preprint arXiv:1909.11879}, year={2019} }
@article{fernando2019aspect, title={Aspect and Opinion Terms Extraction Using Double Embeddings and Attention Mechanism for Indonesian Hotel Reviews}, author={Jordhy Fernando and Masayu Leylia Khodra and Ali Akbar Septiandri}, journal={arXiv preprint arXiv:1908.04899}, year={2019} }
KEPS dataset citation
@inproceedings{mahfuzh2019improving, title={Improving Joint Layer RNN based Keyphrase Extraction by Using Syntactical Features}, author={Miftahul Mahfuzh and Sidik Soleman and Ayu Purwarianti}, booktitle={Proceedings of the 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA)}, pages={1--6}, year={2019}, organization={IEEE} }
NERGrit dataset citation
@online{nergrit2019, title={NERGrit Corpus}, author={NERGrit Developers}, year={2019}, url={https://github.com/grit-id/nergrit-corpus} }
NERP dataset citation
@inproceedings{hoesen2018investigating, title={Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger}, author={Devin Hoesen and Ayu Purwarianti}, booktitle={Proceedings of the 2018 International Conference on Asian Language Processing (IALP)}, pages={35--38}, year={2018}, organization={IEEE} }
FacQA dataset citation
@inproceedings{purwarianti2007machine, title={A Machine Learning Approach for Indonesian Question Answering System}, author={Ayu Purwarianti and Masatoshi Tsuchiya and Seiichi Nakagawa}, booktitle={Proceedings of Artificial Intelligence and Applications}, pages={573--578}, year={2007} }
Thanks to @yasirabd for adding this dataset.