Dataset: w11wo/imdb-javanese
Tasks: Text Classification
Languages: jv
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: machine-generated
Annotation creators: found
Source datasets: original
License: odbl

Large Movie Review Dataset translated to Javanese. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. We translated the original IMDB Dataset to Javanese using the multilingual MarianMT Transformer model from Helsinki-NLP/opus-mt-en-mul.
An example of javanese_imdb_train.csv looks as follows.
label | text |
---|---|
1 | "Drama romantik sing digawé karo direktur Martin Ritt kuwi ora dingertèni, nanging ana momen-momen sing marahi karisma lintang Jane Fonda lan Robert De Niro (kelompok sing luar biasa). Dhèwèké dadi randha sing ora isa mlaku, iso anu anyar lan anyar-inventor-- kowé isa nganggep isiné. Adapsi novel Pat Barker ""Union Street"" (yak titel sing apik!) arep dinggo-back-back it on bland, lan pendidikan film kuwi gampang, nanging isih nyenengké; a rosy-hued-inventor-fantasi. Ora ana sing ngganggu gambar sing sejati ding kok iso dinggo nggawe gambar sing paling nyeneng." |
0 | "Pengalaman wong lanang sing nduwé perasaan sing ora lumrah kanggo babi. Mulai nganggo tuladha sing luar biasa yaiku komedia. Wong orkestra termel digawé dadi wong gila, sing kasar merga nyanyian nyanyi. Sayangé, kuwi tetep absurd wektu WHOLE tanpa ceramah umum sing mung digawé. Malah, sing ana ing jaman kuwi kudu ditinggalké. Diyalog kryptik sing nggawé Shakespeare marah gampang kanggo kelas telu. Pak teknis kuwi luwih apik timbang kowe mikir nganggo cinematografi sing apik sing jenengé Vilmos Zsmond. Masa depan bintang Saly Kirkland lan Frederic Forrest isa ndelok." |
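
The rows above show embedded quotation marks doubled (`""Union Street""`), which is the standard CSV escaping convention. As a minimal sketch of reading that layout with Python's built-in `csv` module: the inline sample, the `label,text` header row, and the `read_reviews` helper are assumptions made for illustration and are not taken from the dataset files themselves.

```python
import csv
import io

# Hypothetical two-row sample in the layout shown above.
# Assumed: a "label,text" header row and double-quoted text fields.
SAMPLE_CSV = '''label,text
1,"Adapsi novel Pat Barker ""Union Street"" (yak titel sing apik!)"
0,"Sayangé, kuwi tetep absurd wektu WHOLE."
'''


def read_reviews(handle):
    """Return a list of (label, text) pairs from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return [(int(row["label"]), row["text"]) for row in reader]


reviews = read_reviews(io.StringIO(SAMPLE_CSV))
for label, text in reviews:
    print(label, text)
```

To read a real file, the same helper could be given `open("javanese_imdb_train.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded and carries the header row assumed here.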
train | unsupervised | test |
---|---|---|
25000 | 50000 | 25000 |
If you use this dataset in your research, please cite:
@inproceedings{wongso2021causal,
  title={Causal and Masked Language Modeling of Javanese Language using Transformer-based Architectures},
  author={Wongso, Wilson and Setiawan, David Samuel and Suhartono, Derwin},
  booktitle={2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS)},
  pages={1--7},
  year={2021},
  organization={IEEE}
}

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month = {June},
  year = {2011},
  address = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages = {142--150},
  url = {http://www.aclweb.org/anthology/P11-1015}
}