Dataset:
best2009
Tasks:
token-classification
Languages:
th
Multilinguality:
monolingual
Size:
100K<n<1M
Language creators:
found
Annotations creators:
expert-generated
Source datasets:
original
License:
cc-by-nc-sa-3.0
best2009 is a Thai word-tokenization dataset drawn from encyclopedia entries, novels, news and articles, compiled by NECTEC (148,995/2,252 lines of train/test). It was created for the BEST 2010: Word Tokenization Competition. The test set answers are not publicly available.
word tokenization
Thai
{'char': ['?', 'ภ', 'ู', 'ม', 'ิ', 'ป', 'ั', 'ญ', 'ญ', 'า', 'ช', 'า', 'ว', 'บ', '้', 'า', 'น', '\n'], 'char_type': [4, 1, 10, 1, 10, 1, 4, 1, 1, 10, 1, 10, 1, 1, 9, 10, 1, 4], 'fname': 'encyclopedia_00031.txt', 'is_beginning': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]} {'char': ['ภ', 'ู', 'ม', 'ิ', 'ป', 'ั', 'ญ', 'ญ', 'า', 'ช', 'า', 'ว', 'บ', '้', 'า', 'น', ' ', 'ห', 'ม', 'า', 'ย', 'ถ', 'ึ', 'ง', ' ', 'ค', 'ว', 'า', 'ม', 'ร', 'ู', '้', 'ข', 'อ', 'ง', 'ช', 'า', 'ว', 'บ', '้', 'า', 'น', ' ', 'ซ', 'ึ', '่', 'ง', 'เ', 'ร', 'ี', 'ย', 'น', 'ร', 'ู', '้', 'ม', 'า', 'จ', 'า', 'ก', 'พ', '่', 'อ', 'แ', 'ม', '่', ' ', 'ป', 'ู', '่', 'ย', '่', 'า', 'ต', 'า', 'ย', 'า', 'ย', ' ', 'ญ', 'า', 'ต', 'ิ', 'พ', 'ี', '่', 'น', '้', 'อ', 'ง', ' ', 'ห', 'ร', 'ื', 'อ', 'ผ', 'ู', '้', 'ม', 'ี', 'ค', 'ว', 'า', 'ม', 'ร', 'ู', '้', 'ใ', 'น', 'ห', 'ม', 'ู', '่', 'บ', '้', 'า', 'น', 'ใ', 'น', 'ท', '้', 'อ', 'ง', 'ถ', 'ิ', '่', 'น', 'ต', '่', 'า', 'ง', 'ๆ', '\n'], 'char_type': [1, 10, 1, 10, 1, 4, 1, 1, 10, 1, 10, 1, 1, 9, 10, 1, 5, 3, 1, 10, 1, 1, 10, 1, 5, 1, 1, 10, 1, 1, 10, 9, 1, 1, 1, 1, 10, 1, 1, 9, 10, 1, 5, 1, 10, 9, 1, 11, 1, 10, 1, 1, 1, 10, 9, 1, 10, 1, 10, 1, 1, 9, 1, 11, 1, 9, 5, 1, 10, 9, 1, 9, 10, 1, 10, 1, 10, 1, 5, 1, 10, 1, 10, 1, 10, 9, 1, 9, 1, 1, 5, 3, 1, 10, 1, 3, 10, 9, 1, 10, 1, 1, 10, 1, 1, 10, 9, 11, 1, 3, 1, 10, 9, 1, 9, 10, 1, 11, 1, 1, 9, 1, 1, 1, 10, 9, 1, 1, 9, 10, 1, 7, 4], 'fname': 'encyclopedia_00031.txt', 'is_beginning': [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]}
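The instances above store one character per position, with `is_beginning` set to 1 where a new word starts. As a minimal sketch of how word boundaries can be recovered from these two fields, the helper below (`chars_to_words`, a hypothetical name that is not part of the dataset or any library) groups characters into words. It assumes both lists have the same length and uses the first instance shown above as input.

```python
from typing import List


def chars_to_words(chars: List[str], is_beginning: List[int]) -> List[str]:
    """Group characters into words using the is_beginning flags.

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    if len(chars) != len(is_beginning):
        raise ValueError("char and is_beginning must have the same length")
    words: List[str] = []
    for ch, flag in zip(chars, is_beginning):
        # Start a new word on a 1 flag (or at the very first character);
        # otherwise extend the current word.
        if flag == 1 or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


# First data instance from the card (only the two fields needed here).
example = {
    "char": ['?', 'ภ', 'ู', 'ม', 'ิ', 'ป', 'ั', 'ญ', 'ญ', 'า',
             'ช', 'า', 'ว', 'บ', '้', 'า', 'น', '\n'],
    "is_beginning": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
}

words = chars_to_words(example["char"], example["is_beginning"])
print(len(words))  # 5 tokens: '?', three Thai words, and the trailing newline
print(words)
```

Because the flags are per character, concatenating the recovered words reproduces the original line exactly, including spaces and the final newline.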
| | train | test |
|---|---|---|
| # lines | 148,995 | 2,252 |
| avg words per line | 39.05 | NA |
| total words | 5,818,521 | NA |
| avg characters per line | 140.39 | 202.79 |
| total characters | 20,918,132 | 456,684 |
| # lines articles | 16,990 | NA |
| # lines encyclopedia | 50,631 | NA |
| # lines novels | 50,140 | NA |
| # lines news | 31,234 | NA |
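Statistics of this kind can be approximated from the instance fields. The sketch below is an assumption about how such counts might be computed, not the method actually used for the table: it counts a word for every `is_beginning` flag equal to 1, counts every entry of `char` as a character (including spaces and newlines), and takes the source category from the part of `fname` before the first underscore. The function name `corpus_stats` and the toy instances are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def corpus_stats(instances: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate simple per-split statistics (illustrative only)."""
    n_lines = 0
    total_words = 0
    total_chars = 0
    lines_per_source: Counter = Counter()
    for inst in instances:
        n_lines += 1
        total_words += sum(inst["is_beginning"])  # one word per 1 flag
        total_chars += len(inst["char"])          # every character counts
        # e.g. 'encyclopedia_00031.txt' -> 'encyclopedia'
        lines_per_source[inst["fname"].split("_", 1)[0]] += 1
    return {
        "lines": n_lines,
        "total_words": total_words,
        "total_chars": total_chars,
        "avg_words_per_line": total_words / n_lines if n_lines else 0.0,
        "avg_chars_per_line": total_chars / n_lines if n_lines else 0.0,
        "lines_per_source": dict(lines_per_source),
    }


# Toy instances with made-up ASCII content, used only to exercise the function.
toy = [
    {"char": list("ab cd\n"), "is_beginning": [1, 0, 1, 1, 0, 1],
     "fname": "news_00001.txt"},
    {"char": list("xyz\n"), "is_beginning": [1, 0, 0, 1],
     "fname": "novel_00002.txt"},
]
stats = corpus_stats(toy)
print(stats)
```

Whether the published figures count whitespace and newline tokens the same way is not stated in the card, so results from this sketch may differ slightly from the table.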
The dataset was created by NECTEC for the BEST 2010: Word Tokenization Competition.
[More Information Needed]
Who are the source language producers?
The respective authors of the articles, news, encyclopedia entries and novels.
Detailed annotation guidelines can be found in BEST_Guideline_Release1.pdf, which is included in the uncompressed files. The word tokenization standard used was InterBEST2009.
Who are the annotators?
[More Information Needed]
All data are curated from public sources. No personal or sensitive information is expected to be included.
CC-BY-NC-SA 3.0
Dataset:
@inproceedings{kosawat2009best,
  title={BEST 2009: Thai word segmentation software contest},
  author={Kosawat, Krit and Boriboon, Monthika and Chootrakool, Patcharika and Chotimongkol, Ananlada and Klaithin, Supon and Kongyoung, Sarawoot and Kriengket, Kanyanut and Phaholphinyo, Sitthaa and Purodakananda, Sumonmas and Thanakulwarapas, Tipraporn and others},
  booktitle={2009 Eighth International Symposium on Natural Language Processing},
  pages={83--88},
  year={2009},
  organization={IEEE}
}
@inproceedings{boriboon2009best,
  title={Best corpus development and analysis},
  author={Boriboon, Monthika and Kriengket, Kanyanut and Chootrakool, Patcharika and Phaholphinyo, Sitthaa and Purodakananda, Sumonmas and Thanakulwarapas, Tipraporn and Kosawat, Krit},
  booktitle={2009 International Conference on Asian Language Processing},
  pages={322--327},
  year={2009},
  organization={IEEE}
}
Character type features:
@inproceedings{haruechaiyasak2009tlex,
  title={TLex: Thai lexeme analyser based on the conditional random fields},
  author={Haruechaiyasak, Choochart and Kongyoung, Sarawoot},
  booktitle={Proceedings of 8th International Symposium on Natural Language Processing},
  year={2009}
}
Thanks to @cstorm125 for adding this dataset.