Dataset:
thainer
ThaiNER (v1.3) is a 6,456-sentence named entity recognition dataset created by expanding the 2,258-sentence unnamed dataset by Tirasaroj and Aroonmanakun (2012). It is used to train NER taggers in PyThaiNLP. The NER tags were annotated by Tirasaroj and Aroonmanakun (2012) for 2,258 sentences and by @wannaphong for the rest. The POS tags were produced by PyThaiNLP's perceptron engine trained on orchid_ud. @wannaphong is now the only maintainer of this dataset.
Thai
{'id': 100, 'ner_tags': [27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27], 'pos_tags': [6, 12, 13, 1, 6, 5, 11, 7, 11, 6, 5, 13, 6, 6, 6, 11, 6, 6, 11, 6, 6, 11, 6, 6, 13, 6, 11, 11, 6, 11, 6, 11, 6, 11, 6, 11, 11, 6, 6, 11, 12, 6, 13, 5, 11, 7, 11, 6, 3, 11, 12, 3, 13, 6, 1, 6, 12, 13, 1, 6, 6, 5, 11, 3, 11, 5, 4, 6, 13, 6, 13, 6, 10, 3, 13, 13, 12, 13, 12, 0, 1, 10, 11, 6, 6, 11, 6, 11, 6, 12, 13, 5, 12, 3, 13, 13, 1, 6, 1, 6, 13], 'tokens': ['เชื้อโรค', 'ที่', 'ปรากฏ', 'ใน', 'สัตว์', 'ทั้ง', ' ', '4', ' ', 'ชนิด', 'นี้', 'เป็น', 'เชื้อ', 'โรคไข้หวัด', 'นก', ' ', 'เอช', 'พี', ' ', 'เอ', 'เวียน', ' ', 'อิน', 'ฟลู', 'เอน', 'ซา', ' ', '(', 'Hight', ' ', 'Polygenic', ' ', 'Avain', ' ', 'Influenza', ')', ' ', 'ชนิด', 'รุนแรง', ' ', 'ซึ่ง', 'การ', 'ตั้งชื่อ', 'ทั้ง', ' ', '4', ' ', 'ขึ้น', 'มา', ' ', 'เพื่อที่จะ', 'สามารถ', 'ระบุ', 'เชื้อ', 'ของ', 'ไวรัส', 'ที่', 'ทำอันตราย', 'ตาม', 'สิ่งมีชีวิต', 'ประเภท', 'ต่างๆ', ' ', 'ได้', ' ', 'อีก', 'ทั้ง', 'การ', 'ระบุ', 'สถานที่', 'คือ', 'ประเทศ', 'ไทย', 'จะ', 'ทำให้', 'รู้', 'ว่า', 'พบ', 'ที่', 'แรก', 'ใน', 'ไทย', ' ', 'ส่วน', 'วัน', ' ', 'เดือน', ' ', 'ปี', 'ที่', 'พบ', 'นั้น', 'ก็', 'จะ', 'ทำให้', 'ทราบ', 'ถึง', 'ครั้งแรก', 'ของ', 'การ', 'ค้นพบ']} {'id': 107, 'ner_tags': [27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27], 'pos_tags': [0, 1, 6, 5, 11, 12, 3, 3, 13, 6, 13, 12, 0, 2, 12, 11, 6, 5, 13, 6, 5, 1, 
6, 6, 1, 10, 11, 4, 13, 6, 11, 12, 6, 6, 10, 11, 13, 6, 1, 6, 4, 6, 1, 6, 6, 11, 4, 6, 1, 5, 6, 12, 2, 13, 6, 6, 5, 1, 11, 12, 13, 1, 6, 6, 11, 13, 11, 6, 6, 6, 11, 11, 6, 11, 11, 4, 10, 11, 11, 6, 11], 'tokens': ['ล่าสุด', 'ใน', 'เรื่อง', 'นี้', ' ', 'ทั้งนี้', 'คง', 'ต้อง', 'มี', 'การ', 'ตรวจสอบ', 'ให้', 'ชัดเจน', 'อีกครั้ง', 'ว่า', ' ', 'ไวรัส', 'นี้', 'เป็น', 'ชนิด', 'เดียว', 'กับ', 'ไข้หวัด', 'นก', 'ใน', 'ไทย', ' ', 'หรือ', 'เป็น', 'การกลายพันธุ์', ' ', 'โดยที่', 'คณะ', 'สัตวแพทย์', 'มหาวิทยาลัยเกษตรศาสตร์', ' ', 'จัด', 'ระดมสมอง', 'จาก', 'คณบดี', 'และ', 'ผู้เชี่ยวชาญ', 'จาก', 'คณะ', 'สัตวแพทย์', ' ', 'และ', 'ปศุสัตว์', 'ของ', 'หลาย', 'มหาวิทยาลัย', 'เพื่อ', 'ร่วมกัน', 'หา', 'ข้อมูล', 'เรื่อง', 'นี้', 'ด้วย', ' ', 'โดย', 'ประสาน', 'กับ', 'เจ้าหน้าที่', 'ระหว่างประเทศ', ' ', 'คือ', ' ', 'องค์การ', 'สุขภาพ', 'สัตว์โลก', ' ', '(', 'OIE', ')', ' ', 'และ', 'องค์การอนามัยโลก', ' ', '(', 'WHO', ')']}
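Each instance stores `tokens`, `pos_tags` and `ner_tags` as three parallel lists, so position i in each list describes the same token. The sketch below shows one way to check that alignment and walk an instance token by token. It uses a truncated excerpt (the last six tokens of instance 107 shown above), and the helper name `iter_rows` is hypothetical. The mapping from integer tag ids to label names is not given in this section, so the ids are left as integers.

```python
from collections import Counter

# Truncated excerpt: the last six tokens of instance 107 shown above.
instance = {
    "id": 107,
    "tokens": ["และ", "องค์การอนามัยโลก", " ", "(", "WHO", ")"],
    "pos_tags": [4, 10, 11, 11, 6, 11],
    "ner_tags": [27, 27, 27, 27, 27, 27],
}


def iter_rows(inst):
    """Yield (token, pos_id, ner_id) triples after checking the lists are parallel."""
    n = len(inst["tokens"])
    if not (len(inst["pos_tags"]) == n and len(inst["ner_tags"]) == n):
        raise ValueError("tokens, pos_tags and ner_tags must have equal length")
    yield from zip(inst["tokens"], inst["pos_tags"], inst["ner_tags"])


rows = list(iter_rows(instance))

# Count how often each NER tag id occurs in this excerpt.
ner_counts = Counter(ner_id for _, _, ner_id in rows)

for token, pos_id, ner_id in rows:
    print(repr(token), pos_id, ner_id)
```

Note that whitespace is kept as its own token (`' '`), so any downstream tokenizer or tagger needs to preserve those entries to stay aligned with the tag lists.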
No explicit split is given
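Because no split is provided, anyone training or evaluating on this dataset has to define their own. The following is a minimal sketch of a reproducible random split over sentence indices, using only the Python standard library. The 90/10 ratio, the seed, and the function name `split_indices` are assumptions chosen for illustration, not something the dataset prescribes.

```python
import random

NUM_SENTENCES = 6456  # sentence count stated for ThaiNER v1.3


def split_indices(n, test_fraction=0.1, seed=42):
    """Return (train, test) index lists from a seeded shuffle of range(n)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, so global state is untouched
    n_test = int(n * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(NUM_SENTENCES)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split stable across runs, which matters when results are compared between experiments. A purely random split ignores the mix of sources (news, announcements, chat), so a stratified split may be preferable if source labels are available.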
ThaiNER (v1.3) is a 6,456-sentence named entity recognition dataset created by expanding the 2,258-sentence unnamed dataset by Tirasaroj and Aroonmanakun (2012). It is used to train NER taggers in PyThaiNLP.
The earlier part of the dataset consists entirely of news articles, whereas the part added by @wannaphong includes news articles, public announcements, and @wannaphong's own chat messages with personal and sensitive information removed.
Who are the source language producers? News articles and public announcements are created by their respective authors. Chat messages are created by @wannaphong.
Who are the annotators? Tirasaroj and Aroonmanakun (2012) for the earlier 2,258 sentences and @wannaphong for the rest.
News articles and public announcements are not expected to include personal and sensitive information. @wannaphong has removed such information from his own chat messages.
Since almost all of the collection and annotation was done by @wannaphong, his biases are expected to be reflected in the dataset.
Tirasaroj and Aroonmanakun (2012) for the earlier 2,258 sentences and @wannaphong for the rest
CC-BY 3.0
@misc{Wannaphong Phatthiyaphaibun_2019, title={wannaphongcom/thai-ner: ThaiNER 1.3}, url={https://zenodo.org/record/3550546}, DOI={10.5281/ZENODO.3550546}, abstractNote={Thai Named Entity Recognition}, publisher={Zenodo}, author={Wannaphong Phatthiyaphaibun}, year={2019}, month={Nov} }
Work extended from: Tirasaroj, N. and Aroonmanakun, W. 2012. Thai NER using CRF model based on surface features. In Proceedings of SNLP-AOS 2011, 9-10 February, 2012, Bangkok, pages 176-180.
Thanks to @cstorm125 for adding this dataset.