数据集:
indonlp/NusaX-senti
任务:
文本分类计算机处理:
multilingual大小:
10K<n<100K语言创建人:
expert-generated批注创建人:
expert-generated源数据集:
original预印本库:
arxiv:2205.15960许可:
cc-by-sa-4.0NusaX is a high-quality multilingual parallel corpus that covers 12 languages, Indonesian, English, and 10 Indonesian local languages, namely Acehnese, Balinese, Banjarese, Buginese, Madurese, Minangkabau, Javanese, Ngaju, Sundanese, and Toba Batak. NusaX-Senti is a 3-labels (positive, neutral, negative) sentiment analysis dataset for 10 Indonesian local languages + Indonesian and English.
There is a shortage of NLP research and resources for the Indonesian languages, despite the country having over 700 languages. With this in mind, we have created this dataset to support future research for the underrepresented languages in Indonesia.
NusaX-senti is a dataset for sentiment analysis in Indonesian that has been expertly translated by native speakers.
Who are the source language producers?The data was produced by humans (native speakers).
NusaX-senti is derived from SmSA, which is the biggest publicly available dataset for Indonesian sentiment analysis. It comprises of comments and reviews from multiple online platforms. To ensure the quality of our dataset, we have filtered it by removing any abusive language and personally identifying information by manually reviewing all sentences. To ensure balance in the label distribution, we randomly picked 1,000 samples through stratified sampling and then translated them to the corresponding languages.
Who are the annotators?Native speakers of both Indonesian and the corresponding languages. Annotators were compensated based on the number of translated samples.
Personal information is removed.
NusaX is created from review text. These data sources may contain some bias.
No other known limitations
CC-BY-SA 4.0.
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Please contact authors for any information on the dataset.
@misc{winata2022nusax, title={NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages}, author={Winata, Genta Indra and Aji, Alham Fikri and Cahyawijaya, Samuel and Mahendra, Rahmad and Koto, Fajri and Romadhony, Ade and Kurniawan, Kemal and Moeljadi, David and Prasojo, Radityo Eko and Fung, Pascale and Baldwin, Timothy and Lau, Jey Han and Sennrich, Rico and Ruder, Sebastian}, year={2022}, eprint={2205.15960}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Thanks to @afaji for adding this dataset.