
Dataset Card for generated_reviews_enth

Dataset Summary

generated_reviews_enth was created as part of scb-mt-en-th-2020 for the machine translation task. The dataset (referred to as generated_reviews_yn in scb-mt-en-th-2020) consists of English product reviews generated by CTRL, translated into Thai with the Google Translate API, and annotated by human annotators as accepted or rejected (correct) based on the fluency and adequacy of the translation. It can therefore be used for English-to-Thai translation quality estimation (binary label), machine translation, and sentiment analysis.

Supported Tasks and Leaderboards

English-to-Thai translation quality estimation (binary label) is the intended use. Other uses include machine translation and sentiment analysis.

Languages

English, Thai

Dataset Structure

Data Instances

{'correct': 0, 'review_star': 4, 'translation': {'en': "I had a hard time finding a case for my new LG Lucid 2 but finally found this one on amazon. The colors are really pretty and it works just as well as, if not better than the otterbox. Hopefully there will be more available by next Xmas season. Overall, very cute case. I love cheetah's. :)", 'th': 'ฉันมีปัญหาในการหาเคสสำหรับ LG Lucid 2 ใหม่ของฉัน แต่ในที่สุดก็พบเคสนี้ใน Amazon สีสวยมากและใช้งานได้ดีเช่นเดียวกับถ้าไม่ดีกว่านาก หวังว่าจะมีให้มากขึ้นในช่วงเทศกาลคริสต์มาสหน้า โดยรวมแล้วน่ารักมาก ๆ ฉันรักเสือชีตาห์ :)'}}
{'correct': 0, 'review_star': 1, 'translation': {'en': "This is the second battery charger I bought as a Christmas present, that came from Amazon, after one purchased before for my son. His was still working. The first charger, received in July, broke apart and wouldn't charge anymore. Just found out two days ago they discontinued it without warning. It took quite some time to find the exact replacement charger. Too bad, really liked it. One of these days, will purchase an actual Nikon product, or go back to buying batteries.", 'th': 'นี่เป็นเครื่องชาร์จแบตเตอรี่ก้อนที่สองที่ฉันซื้อเป็นของขวัญคริสต์มาสซึ่งมาจากอเมซอนหลังจากที่ซื้อมาเพื่อลูกชายของฉัน เขายังทำงานอยู่ เครื่องชาร์จแรกที่ได้รับในเดือนกรกฎาคมแตกเป็นชิ้น ๆ และจะไม่ชาร์จอีกต่อไป เพิ่งค้นพบเมื่อสองวันก่อนพวกเขาหยุดมันโดยไม่มีการเตือนล่วงหน้า ใช้เวลาพอสมควรในการหาที่ชาร์จที่ถูกต้อง แย่มากชอบมาก สักวันหนึ่งจะซื้อผลิตภัณฑ์ Nikon จริงหรือกลับไปซื้อแบตเตอรี่'}}
{'correct': 1, 'review_star': 1, 'translation': {'en': 'I loved the idea of having a portable computer to share pictures with family and friends on my big screen. It worked really well for about 3 days, then when i opened it one evening there was water inside where all the wires came out. I cleaned that up and put some tape over that, so far, no leaks. My husband just told me yesterday, however, that this thing is trash.', 'th': 'ฉันชอบไอเดียที่มีคอมพิวเตอร์พกพาเพื่อแชร์รูปภาพกับครอบครัวและเพื่อน ๆ บนหน้าจอขนาดใหญ่ของฉัน มันใช้งานได้ดีจริง ๆ ประมาณ 3 วันจากนั้นเมื่อฉันเปิดมันในเย็นวันหนึ่งมีน้ำอยู่ภายในที่ซึ่งสายไฟทั้งหมดออกมา ฉันทำความสะอาดมันแล้ววางเทปไว้ที่นั่นจนถึงตอนนี้ไม่มีรอยรั่ว สามีของฉันเพิ่งบอกฉันเมื่อวานนี้ว่าสิ่งนี้เป็นขยะ'}}

Data Fields

  • translation:
    • en: English product review generated by CTRL
    • th: Thai translation of en produced by the Google Translate API
  • review_star: Star rating (1-5) of the generated review, given to CTRL as the generation condition
  • correct: 1 if human annotators accepted the English-to-Thai translation based on its fluency and adequacy, else 0
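
As a minimal sketch of how these fields fit together for translation quality estimation, the snippet below flattens one record into a (source, target, label) example. The names QEExample and to_qe_example are hypothetical helpers, and the record is a shortened, made-up instance in the same shape as the data instances, not a row from the dataset.

```python
from typing import NamedTuple


class QEExample(NamedTuple):
    source: str  # English review (translation.en)
    target: str  # Thai machine translation (translation.th)
    label: int   # 1 = translation accepted, 0 = rejected
    stars: int   # star rating used to condition CTRL


def to_qe_example(record: dict) -> QEExample:
    """Flatten one nested record into a quality-estimation example."""
    pair = record["translation"]
    return QEExample(
        source=pair["en"],
        target=pair["th"],
        label=int(record["correct"]),
        stars=int(record["review_star"]),
    )


# Made-up record for illustration only; same structure as the real instances.
record = {
    "correct": 1,
    "review_star": 5,
    "translation": {"en": "Very cute case.", "th": "เคสน่ารักมาก"},
}
example = to_qe_example(record)
```

For machine translation, one plausible choice (an assumption, not something the card prescribes) is to keep only the pairs whose label is 1.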

Data Splits

                  train    valid    test
# samples         141369   15708    17453
# correct:0       99296    10936    12208
# correct:1       42073    4772     5245
# review_star:1   50418    5628     6225
# review_star:2   22876    2596     2852
# review_star:3   22825    2521     2831
# review_star:4   22671    2517     2778
# review_star:5   22579    2446     2767
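
The counts above can be checked for internal consistency with a short script. This is only a sketch: the SPLITS dictionary is a hand-copied transcription of the table (with the five star rows taken in order as ratings 1 through 5), and accepted_share is a hypothetical helper.

```python
# Counts transcribed by hand from the Data Splits table.
SPLITS = {
    "train": {
        "samples": 141369,
        "correct": {0: 99296, 1: 42073},
        "stars": {1: 50418, 2: 22876, 3: 22825, 4: 22671, 5: 22579},
    },
    "valid": {
        "samples": 15708,
        "correct": {0: 10936, 1: 4772},
        "stars": {1: 5628, 2: 2596, 3: 2521, 4: 2517, 5: 2446},
    },
    "test": {
        "samples": 17453,
        "correct": {0: 12208, 1: 5245},
        "stars": {1: 6225, 2: 2852, 3: 2831, 4: 2778, 5: 2767},
    },
}


def accepted_share(split: dict) -> float:
    """Fraction of pairs whose translation was accepted (correct == 1)."""
    return split["correct"][1] / split["samples"]


for name, split in SPLITS.items():
    # Both breakdowns should add up to the number of samples in the split.
    assert sum(split["correct"].values()) == split["samples"]
    assert sum(split["stars"].values()) == split["samples"]
    print(f"{name}: {accepted_share(split):.1%} accepted")
```

In every split roughly 30% of the translations are accepted, so a quality estimation model should be compared against the majority-class baseline rather than against 50% accuracy.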

Dataset Creation

Curation Rationale

generated_reviews_enth was created as part of scb-mt-en-th-2020 for the machine translation task. The dataset (referred to as generated_reviews_yn in scb-mt-en-th-2020) consists of English product reviews generated by CTRL, translated into Thai with the Google Translate API, and annotated by human annotators as accepted or rejected (correct) based on the fluency and adequacy of the translation. It can therefore be used for English-to-Thai translation quality estimation (binary label), machine translation, and sentiment analysis.

Source Data

Initial Data Collection and Normalization

The data generation process is as follows:

  • en is generated with CTRL's conditional generation, conditioning each product review on a star rating.
  • th is translated from en using the Google Translate API.
  • correct is annotated by human annotators as accepted or rejected (1 or 0) based on the fluency and adequacy of the translation.

For this dataset, which targets the translation quality estimation task, we apply the following preprocessing:

  • Drop duplicates on en, th, review_star, and correct; duplicates can exist because the translation checking is done by annotators.
  • Remove reviews whose star rating is not between 1 and 5.
  • Remove reviews whose correct value is not 0 or 1.
  • Deduplicate on en, which contains the source sentences.
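
The four preprocessing steps can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' script: it assumes flat rows with the keys en, th, review_star, and correct, and it assumes the first occurrence is kept when deduplicating, which the description above does not specify. The function name preprocess and the sample rows are hypothetical.

```python
def preprocess(rows):
    """Apply the four filtering steps to an iterable of flat row dicts."""
    seen_rows = set()  # full (en, th, review_star, correct) tuples seen so far
    seen_en = set()    # English sources already kept
    kept = []
    for row in rows:
        key = (row["en"], row["th"], row["review_star"], row["correct"])
        if key in seen_rows:
            continue  # step 1: exact duplicate of an earlier row
        seen_rows.add(key)
        if row["review_star"] not in {1, 2, 3, 4, 5}:
            continue  # step 2: star rating outside 1-5
        if row["correct"] not in {0, 1}:
            continue  # step 3: label is not 0 or 1
        if row["en"] in seen_en:
            continue  # step 4: English source already kept (assumption: keep first)
        seen_en.add(row["en"])
        kept.append(row)
    return kept


# Made-up rows that exercise each rule.
rows = [
    {"en": "a", "th": "ก", "review_star": 1, "correct": 0},
    {"en": "a", "th": "ก", "review_star": 1, "correct": 0},  # exact duplicate
    {"en": "a", "th": "ข", "review_star": 1, "correct": 1},  # same en, other th
    {"en": "b", "th": "ค", "review_star": 6, "correct": 1},  # invalid star
    {"en": "c", "th": "ง", "review_star": 3, "correct": 2},  # invalid label
    {"en": "d", "th": "จ", "review_star": 5, "correct": 1},
]
cleaned = preprocess(rows)
```

Because a row's English source is recorded only after the row passes the star and label checks, the single pass gives the same result as running the four steps one after another.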

Who are the source language producers?

CTRL

Annotations

Annotation process

Annotators are given pairs of English and Thai product reviews. They are asked to label each pair as an acceptable translation or not, based on the fluency and adequacy of the translation.

Who are the annotators?

Human annotators from Hope Data Annotations, hired by AIResearch.in.th

Personal and Sensitive Information

The authors do not expect the generated product reviews to contain personal or sensitive information, although some could slip through from CTRL's pretraining data.

Considerations for Using the Data

Social Impact of Dataset

  • English-Thai translation quality estimation for machine translation
  • Product review classification for Thai

Discussion of Biases

[More Information Needed]

Other Known Limitations

Due to annotation process constraints, the number of one-star reviews is notably higher than that of any other star rating. This makes the dataset slightly imbalanced.
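
One common way to compensate for this when training a star-rating classifier is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is a generic technique and not something the dataset authors prescribe. TRAIN_STAR_COUNTS is transcribed by hand from the train column of Data Splits (star rows taken in order as ratings 1 through 5), and balanced_weights is a hypothetical helper.

```python
# Train-split review_star counts, copied from the Data Splits table.
TRAIN_STAR_COUNTS = {1: 50418, 2: 22876, 3: 22825, 4: 22671, 5: 22579}


def balanced_weights(counts: dict) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_weights(TRAIN_STAR_COUNTS)
for star, weight in sorted(weights.items()):
    print(f"review_star={star}: weight {weight:.3f}")
```

With these counts, one-star reviews receive a weight below 1 and every other rating a weight above 1, so each rating contributes equally to a weighted loss in aggregate.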

Additional Information

Dataset Curators

The dataset was created by AIResearch.in.th

Licensing Information

CC BY-SA 4.0

Citation Information

@article{lowphansirikul2020scb,
  title={scb-mt-en-th-2020: A Large English-Thai Parallel Corpus},
  author={Lowphansirikul, Lalita and Polpanumas, Charin and Rutherford, Attapol T and Nutanong, Sarana},
  journal={arXiv preprint arXiv:2007.03541},
  year={2020}
}

Contributions

Thanks to @cstorm125 for adding this dataset.