数据集:
scb_mt_enth_2020
scb-mt-en-th-2020: A Large English-Thai Parallel Corpus The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data and government documents. Methodology for gathering data, building parallel texts and removing noisy sentence pairs are presented in a reproducible manner. We train machine translation models based on this dataset. Our models' performance are comparable to that of Google Translation API (as of May 2020) for Thai-English and outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai-English and English-Thai translation. The dataset, pre-trained models, and source code to reproduce our work are available for public use.
machine translation
English, Thai
{'subdataset': 'aqdf', 'translation': {'en': 'FAR LEFT: Indonesian National Police Chief Tito Karnavian, from left, Philippine National Police Chief Ronald Dela Rosa and Royal Malaysian Police Inspector General Khalid Abu Bakar link arms before the Trilateral Security Meeting in Pasay city, southeast of Manila, Philippines, in June 2017. [THE ASSOCIATED PRESS]', 'th': '(ซ้ายสุด) นายติโต คาร์นาเวียน ผู้บัญชาการตํารวจแห่งชาติอินโดนีเซีย (จากซ้าย) นายโรนัลด์ เดลา โรซา ผู้บัญชาการตํารวจแห่งชาติฟิลิปปินส์ และนายคาลิด อาบู บาการ์ ผู้บัญชาการตํารวจแห่งชาติมาเลเซีย ไขว้แขนกันก่อนเริ่มการประชุมความมั่นคงไตรภาคีในเมืองปาเซย์ ซึ่งอยู่ทางตะวันออกเฉียงใต้ของกรุงมะนิลา ประเทศฟิลิปปินส์ ในเดือนมิถุนายน พ.ศ. 2560 ดิแอสโซซิเอทเต็ด เพรส'}} {'subdataset': 'thai_websites', 'translation': {'en': "*Applicants from certain countries may be required to pay a visa issuance fee after their application is approved. The Department of State's website has more information about visa issuance fees and can help you determine if an issuance fee applies to your nationality.", 'th': 'ประเภทวีซ่า รวมถึงค่าธรรมเนียม และข้อกําหนดในการสัมภาษณ์วีซ่า จะขึ้นอยู่กับชนิดของหนังสือเดินทาง และจุดประสงค์ในการเดินทางของท่าน โปรดดูตารางด้านล่างก่อนการสมัครวีซ่า'}} {'subdataset': 'nus_sms', 'translation': {'en': 'Yup... Okay. Cya tmr... So long nvr write already... Dunno whether tmr can come up with 500 words', 'th': 'ใช่...ได้ แล้วเจอกันพรุ่งนี้... นานแล้วไม่เคยเขียน... ไม่รู้ว่าพรุ่งนี้จะทําได้ถึง500คําไหมเลย'}}
Split ratio (train, valid, test) : (0.8, 0.1, 0.1) Number of paris (train, valid, test): 801,402 | 100,173 | 100,177 # Train generated_reviews_yn: 218,637 ( 27.28% ) task_master_1: 185,671 ( 23.17% ) generated_reviews_translator: 105,561 ( 13.17% ) thai_websites: 93,518 ( 11.67% ) paracrawl: 46,802 ( 5.84% ) nus_sms: 34,495 ( 4.30% ) mozilla_common_voice: 2,451 ( 4.05% ) wikipedia: 26,163 ( 3.26% cd) generated_reviews_crowd: 19,769 ( 2.47% ) assorted_government: 19,712 ( 2.46% ) aqdf: 10,466 ( 1.31% ) msr_paraphrase: 8,157 ( 1.02% ) # Valid generated_reviews_yn: 30,786 ( 30.73% ) task_master_1: 18,531 ( 18.50% ) generated_reviews_translator: 13,884 ( 13.86% ) thai_websites: 13,381 ( 13.36% ) paracrawl: 6,618 ( 6.61% ) nus_sms: 4,628 ( 4.62% ) wikipedia: 3,796 ( 3.79% ) assorted_government: 2,842 ( 2.83% ) generated_reviews_crowd: 2,409 ( 2.40% ) aqdf: 1,518 ( 1.52% ) msr_paraphrase: 1,107 ( 1.11% ) mozilla_common_voice: 673 ( 0.67% ) # Test generated_reviews_yn: 30,785 ( 30.73% ) task_master_1: 18,531 ( 18.50% ) generated_reviews_translator: 13,885 ( 13.86% ) thai_websites: 13,381 ( 13.36% ) paracrawl: 6,619 ( 6.61% ) nus_sms: 4,627 ( 4.62% ) wikipedia: 3,797 ( 3.79% ) assorted_government: 2,844 ( 2.83% ) generated_reviews_crowd: 2,409 ( 2.40% ) aqdf: 1,519 ( 1.52% ) msr_paraphrase: 1,107 ( 1.11% ) mozilla_common_voice : 673 ( 0.67% )
AIResearch , funded by VISTEC and depa , curated this dataset as part of public NLP infrastructure. The center releases the dataset and baseline models under CC-BY-SA 4.0.
The sentence pairs are curated from news, Wikipedia articles, SMS messages, task-based dialogs, webcrawled data and government documents. Sentence pairs are generated by:
For detailed explanation of dataset curation, see https://arxiv.org/pdf/2007.03541.pdf
There are risks of personal information to be included in the webcrawled data namely paracrawl and thai_websites .
Unlike English, there is no segment boundary marking in Thai. One segment in Thai may or may not cover all the content of an English segment. Currently, we mitigate this problem by grouping Thai segments together before computing the text similarity scores. We then choose the combination with the highest text similarity score. It can be said that adequacy is the main issue in building this dataset. Quality of Translation from Crawled Websites Some websites use machine translation models such as Google Translate to localize their content. As a result, Thai segments retrieved from web crawling might face issues of fluency since we do not use human annotators to perform quality control.
Quality Control of Crowdsourced TranslatorsWhen we use a crowdsourcing platform to translate the content, we can not fully control the quality of the translation. To combat this, we filter out low-quality segments by using a text similarity threshold, based on cosine similarity of universal sentence encoder vectors. Moreover, some crowdsourced translators might copy and paste source segments to a translation engine and take the results as answers to the platform. To further improve, we can apply techniques such as described in [Zaidan, 2012] to control the quality and avoid fraud on the platform.
Domain Dependence of Machine Tranlsation ModelsWe test domain dependence of machine translation models by comparing models trained and tested on the same dataset, using 80/10/10 train-validation-test split, and models trained on one dataset and tested on the other.
AIResearch , funded by VISTEC and depa
CC-BY-SA 4.0
@article{lowphansirikul2020scb, title={scb-mt-en-th-2020: A Large English-Thai Parallel Corpus}, author={Lowphansirikul, Lalita and Polpanumas, Charin and Rutherford, Attapol T and Nutanong, Sarana}, journal={arXiv preprint arXiv:2007.03541}, year={2020} }
Thanks to @cstorm125 for adding this dataset.