数据集:
lst20
任务:
标记分类语言:
th计算机处理:
monolingual大小:
10K<n<100K语言创建人:
found批注创建人:
expert-generated源数据集:
original许可:
otherLST20 Corpus is a dataset for Thai language processing developed by National Electronics and Computer Technology Center (NECTEC), Thailand. It offers five layers of linguistic annotation: word boundaries, POS tagging, named entities, clause boundaries, and sentence boundaries. At a large scale, it consists of 3,164,002 words, 288,020 named entities, 248,181 clauses, and 74,180 sentences, while it is annotated with 16 distinct POS tags. All 3,745 documents are also annotated with one of 15 news genres. Regarding its sheer size, this dataset is considered large enough for developing joint neural models for NLP. Manually download at https://aiforthai.in.th/corpus.php See LST20 Annotation Guideline.pdf and LST20 Brief Specification.pdf within the downloaded AIFORTHAI-LST20Corpus.tar.gz for more details.
Thai
{'clause_tags': [1, 2, 2, 2, 2, 2, 2, 2, 3], 'fname': 'T11964.txt', 'id': '0', 'ner_tags': [8, 0, 0, 0, 0, 0, 0, 0, 25], 'pos_tags': [0, 0, 0, 1, 0, 8, 8, 8, 0], 'tokens': ['ธรรมนูญ', 'แชมป์', 'สิงห์คลาสสิก', 'กวาด', 'รางวัล', 'แสน', 'สี่', 'หมื่น', 'บาท']} {'clause_tags': [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3], 'fname': 'T11964.txt', 'id': '1', 'ner_tags': [8, 18, 28, 0, 0, 0, 0, 6, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 15, 25, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 6], 'pos_tags': [0, 2, 0, 2, 1, 1, 2, 8, 2, 10, 2, 8, 2, 1, 0, 1, 0, 4, 7, 1, 0, 2, 8, 2, 10, 1, 10, 4, 2, 8, 2, 4, 0, 4, 0, 2, 8, 2, 10, 2, 8], 'tokens': ['ธรรมนูญ', '_', 'ศรีโรจน์', '_', 'เก็บ', 'เพิ่ม', '_', '4', '_', 'อันเดอร์พาร์', '_', '68', '_', 'เข้า', 'ป้าย', 'รับ', 'แชมป์', 'ใน', 'การ', 'เล่น', 'อาชีพ', '_', '19', '_', 'ปี', 'เป็น', 'ครั้ง', 'ที่', '_', '8', '_', 'ใน', 'ชีวิต', 'ด้วย', 'สกอร์', '_', '18', '_', 'อันเดอร์พาร์', '_', '270']}
train | eval | test | all | |
---|---|---|---|---|
words | 2,714,848 | 240,891 | 207,295 | 3,163,034 |
named entities | 246,529 | 23,176 | 18,315 | 288,020 |
clauses | 214,645 | 17,486 | 16,050 | 246,181 |
sentences | 63,310 | 5,620 | 5,250 | 74,180 |
distinct words | 42,091 | (oov) 2,595 | (oov) 2,006 | 46,692 |
breaking spaces※ | 63,310 | 5,620 | 5,250 | 74,180 |
non-breaking spaces※※ | 402,380 | 39,920 | 32,204 | 475,504 |
※ Breaking space = space that is used as a sentence boundary marker ※※ Non-breaking space = space that is not used as a sentence boundary marker
[More Information Needed]
[More Information Needed]
Who are the source language producers?Respective authors of the news articles
Detailed annotation guideline can be found in LST20 Annotation Guideline.pdf .
Who are the annotators?[More Information Needed]
All texts are from public news. No personal and sensitive information is expected to be included.
Any non-commercial use of the dataset for research and open-sourced projects is encouraged and free of charge. Please cite our technical report for reference.
If you want to perpetuate your models trained on our dataset and share them to the research community in Thailand, please send your models, code, and APIs to the AI for Thai Project. Please contact Dr. Thepchai Supnithi via thepchai@nectec.or.th for more information.
Note that modification and redistribution of the dataset by any means are strictly prohibited unless authorized by the corpus authors.
In any commercial use of the dataset, there are two options.
Option 1 (in kind): Contributing a dataset of 50,000 words completely annotated with our annotation scheme within 1 year. Your data will also be shared and recognized as a dataset co-creator in the research community in Thailand.
Option 2 (in cash): Purchasing a lifetime license for the entire dataset is required. The purchased rights of use cover only this dataset.
In both options, please contact Dr. Thepchai Supnithi via thepchai@nectec.or.th for more information.
@article{boonkwan2020annotation, title={The Annotation Guideline of LST20 Corpus}, author={Boonkwan, Prachya and Luantangsrisuk, Vorapon and Phaholphinyo, Sitthaa and Kriengket, Kanyanat and Leenoi, Dhanon and Phrombut, Charun and Boriboon, Monthika and Kosawat, Krit and Supnithi, Thepchai}, journal={arXiv preprint arXiv:2008.05055}, year={2020} }
Thanks to @cstorm125 for adding this dataset.