数据集:
indonlp/NusaX-MT
许可:
cc-by-sa-4.0预印本库:
arxiv:2205.15960源数据集:
original批注创建人:
expert-generated语言创建人:
expert-generated大小:
10K<n<100K计算机处理:
multilingual任务:
翻译NusaX is a high-quality multilingual parallel corpus that covers 12 languages, Indonesian, English, and 10 Indonesian local languages, namely Acehnese, Balinese, Banjarese, Buginese, Madurese, Minangkabau, Javanese, Ngaju, Sundanese, and Toba Batak. NusaX-MT is a parallel corpus for training and benchmarking machine translation models across 10 Indonesian local languages + Indonesian and English. The data is presented in csv format with 12 columns, one column for each language.
All possible pairs of the following:
There is a shortage of NLP research and resources for the Indonesian languages, despite the country having over 700 languages. With this in mind, we have created this dataset to support future research for the underrepresented languages in Indonesia.
NusaX-MT is a dataset for machine translation in Indonesian langauges that has been expertly translated by native speakers.
Who are the source language producers?The data was produced by humans (native speakers).
NusaX-MT is derived from SmSA, which is the biggest publicly available dataset for Indonesian sentiment analysis. It comprises of comments and reviews from multiple online platforms. To ensure the quality of our dataset, we have filtered it by removing any abusive language and personally identifying information by manually reviewing all sentences. To ensure balance in the label distribution, we randomly picked 1,000 samples through stratified sampling and then translated them to the corresponding languages.
Who are the annotators?Native speakers of both Indonesian and the corresponding languages. Annotators were compensated based on the number of translated samples.
Personal information is removed.
NusaX is created from review text. These data sources may contain some bias.
No other known limitations
CC-BY-SA 4.0.
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Please contact authors for any information on the dataset.
@misc{winata2022nusax, title={NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages}, author={Winata, Genta Indra and Aji, Alham Fikri and Cahyawijaya, Samuel and Mahendra, Rahmad and Koto, Fajri and Romadhony, Ade and Kurniawan, Kemal and Moeljadi, David and Prasojo, Radityo Eko and Fung, Pascale and Baldwin, Timothy and Lau, Jey Han and Sennrich, Rico and Ruder, Sebastian}, year={2022}, eprint={2205.15960}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Thanks to @afaji for adding this dataset.