Dataset:
acul3/KoPI-NLLB
KoPI (Korpus Perayapan Indonesia)-NLLB contains only the Indonesian-family languages (aceh, bali, banjar, indonesia, jawa, minang, sunda) extracted from the NLLB dataset, allenai/nllb.
Each language subset is also filtered with deduplication techniques such as exact-hash (MD5) deduplication and MinHash LSH near-duplicate removal.
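The exact pipeline used for this dataset is not documented here, so the following is only a minimal sketch of the two techniques named above, written with the Python standard library. The function names (`exact_dedup`, `shingles`, `minhash`, `near_dedup`) and all parameters (5-character shingles, 64 hash functions, 16 bands, a 0.7 similarity threshold) are assumptions chosen for illustration, not the settings used to build KoPI-NLLB.

```python
import hashlib
from collections import defaultdict


def exact_dedup(texts):
    """Keep the first occurrence of each text, keyed by its MD5 digest."""
    seen, kept = set(), []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def shingles(text, k=5):
    """Character k-grams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash(text, num_perm=64):
    """MinHash signature: per seed, the minimum salted hash over all shingles."""
    grams = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature


def near_dedup(texts, num_perm=64, bands=16, threshold=0.7):
    """Drop texts whose estimated Jaccard similarity to a kept text is >= threshold."""
    rows = num_perm // bands
    buckets = defaultdict(list)  # (band index, band values) -> indices into `kept`
    signatures, kept = [], []
    for text in texts:
        sig = minhash(text, num_perm)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]
        # LSH step: only texts sharing at least one band are compared in full.
        candidates = {i for key in keys for i in buckets.get(key, [])}
        is_duplicate = any(
            sum(a == b for a, b in zip(sig, signatures[i])) / num_perm >= threshold
            for i in candidates
        )
        if is_duplicate:
            continue
        index = len(kept)
        kept.append(text)
        signatures.append(sig)
        for key in keys:
            buckets[key].append(index)
    return kept


if __name__ == "__main__":
    docs = [
        "Saya makan nasi goreng",
        "Saya makan nasi goreng",       # exact duplicate
        "saya  makan NASI goreng",      # differs only in case and spacing
        "Cuaca hari ini cerah sekali",
    ]
    print(near_dedup(exact_dedup(docs)))
```

Running exact deduplication first is a common design choice because it is cheap and shrinks the input before the more expensive MinHash pass. At real corpus scale a dedicated library would normally replace this hand-rolled version.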
More details coming soon.