数据集:
wisesight1000
wisesight1000 contains Thai social media texts randomly drawn from the full wisesight-sentiment , tokenized by human annotators. Out of the labels neg (negative), neu (neutral), pos (positive), q (question), 250 samples each. Some texts are removed because they look like spam. Because these samples are representative of real world content, we believe having these annotaed samples will allow the community to robustly evaluate tokenization algorithms.
word tokenization
Thai
{'char': ['E', 'u', 'c', 'e', 'r', 'i', 'n', ' ', 'p', 'r', 'o', ' ', 'a', 'c', 'n', 'e', ' ', 'ค', '่', 'ะ', ' ', 'ใ', 'ช', '้', 'แ', 'ล', '้', 'ว', 'ส', 'ิ', 'ว', 'ข', 'ึ', '้', 'น', 'เ', 'พ', 'ิ', '่', 'ม', 'ท', 'ุ', 'ก', 'ว', 'ั', 'น', ' ', 'ม', 'า', 'ด', 'ู', 'ก', 'ั', 'น', 'น', 'ะ', 'ค', 'ะ', ' ', 'ว', '่', 'า', 'จ', 'ั', 'ด', 'ก', 'า', 'ร', 'ป', 'ั', 'ญ', 'ห', 'า', 'ส', 'ิ', 'ว', 'ใ', 'น', '7', 'ว', 'ั', 'น', 'ไ', 'ด', '้', 'ร', 'ึ', 'ม', 'ั', '่', 'ย', 'ย', 'ย', 'ย', 'ย', 'ย', 'ย', 'ย', ' ', 'ล', '่', 'า', 'ส', 'ุ', 'ด', 'ไ', 'ป', 'ล', '้', 'า', 'ง', 'ห', 'น', '้', '…', '\n'], 'char_type': [0, 8, 8, 8, 8, 8, 8, 5, 8, 8, 8, 5, 8, 8, 8, 8, 5, 1, 9, 10, 5, 11, 1, 9, 11, 1, 9, 1, 1, 10, 1, 1, 10, 9, 1, 11, 1, 10, 9, 1, 1, 10, 1, 1, 4, 1, 5, 1, 10, 1, 10, 1, 4, 1, 1, 10, 1, 10, 5, 1, 9, 10, 1, 4, 1, 1, 10, 1, 1, 4, 1, 3, 10, 1, 10, 1, 11, 1, 2, 1, 4, 1, 11, 1, 9, 1, 10, 1, 4, 9, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 9, 10, 1, 10, 1, 11, 1, 1, 9, 10, 1, 3, 1, 9, 4, 4], 'is_beginning': [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]} {'char': ['แ', 'พ', 'ง', 'เ', 'ว', '่', 'อ', 'ร', '์', ' ', 'เ', 'บ', 'ี', 'ย', 'ร', '์', 'ช', '้', 'า', 'ง', 'ต', '้', 'น', 'ท', 'ุ', 'น', 'ข', 'ว', 'ด', 'ล', 'ะ', 'ไ', 'ม', '่', 'ถ', 'ึ', 'ง', ' ', '5', '0', ' ', 'ข', 'า', 'ย', ' ', '1', '2', '0', ' ', '?', '?', '?', '์', '\n'], 'char_type': [11, 1, 1, 11, 1, 9, 1, 1, 7, 5, 11, 1, 10, 1, 1, 7, 1, 9, 10, 1, 1, 9, 1, 1, 10, 1, 1, 1, 1, 1, 10, 11, 1, 9, 1, 10, 1, 5, 2, 2, 5, 1, 10, 1, 5, 2, 2, 2, 5, 4, 4, 4, 7, 4], 'is_beginning': [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]}
No explicit split is given.
The dataset was created from wisesight-sentiment to be a word tokenization benchmark that is closer to texts in the wild, since other Thai word tokenization datasets such as BEST are mostly texts from news articles, which do not have some real-world features like misspellings.
The data are sampled from wisesight-sentiment which has the following data collection and normalization:
Social media users in Thailand
[More Information Needed]
Who are the annotators?The annotation was done by several people, including Nitchakarn Chantarapratin, Pattarawat Chormai , Ponrawee Prasertsom , Jitkapat Sawatphol , Nozomi Yamada , and Attapol Rutherford .
[More Information Needed]
Thanks PyThaiNLP community, Kitsuchart Pasupa (Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang), and Ekapol Chuangsuwanich (Faculty of Engineering, Chulalongkorn University) for advice. The original Kaggle competition, using the first version of this corpus, can be found at https://www.kaggle.com/c/wisesight-sentiment/
CC0
Dataset:
@software{bact_2019_3457447, author = {Suriyawongkul, Arthit and Chuangsuwanich, Ekapol and Chormai, Pattarawat and Polpanumas, Charin}, title = {PyThaiNLP/wisesight-sentiment: First release}, month = sep, year = 2019, publisher = {Zenodo}, version = {v1.0}, doi = {10.5281/zenodo.3457447}, url = {https://doi.org/10.5281/zenodo.3457447} }
Character type features:
@inproceedings{haruechaiyasak2009tlex, title={TLex: Thai lexeme analyser based on the conditional random fields}, author={Haruechaiyasak, Choochart and Kongyoung, Sarawoot}, booktitle={Proceedings of 8th International Symposium on Natural Language Processing}, year={2009} }
Thanks to @cstorm125 for adding this dataset.