Dataset:
tlc
Thai Literature Corpora (TLC): Corpora of machine-ingestible Thai classical literature texts.
It consists of two datasets:
- TLC: texts from the Vajirayana Digital Library, stored by chapters and stanzas (non-tokenized).
  - tlc v.2.0 (6/17/19): a total of 34 documents, 292,270 lines, 31,790,734 characters
  - tlc v.1.0 (6/11/19): a total of 25 documents, 113,981 lines, 28,775,761 characters
- TNHC: texts from the Thai National Historical Corpus, stored by lines (manually tokenized).
  - tnhc v.1.0 (6/25/19): a total of 47 documents, 756,478 lines, 13,361,142 characters
Language Modeling, Language Generation
Thai
{
"ch_num": "๑",
"title": "กากี กลอนสุภาพ",
"text": [
[
"๏ จักกล่าวอดีตนิทานแต่ปางก่อน\n",
"เมื่อครั้งองค์สมเด็จพระชินวร\tยังสัญจรแสวงหาโพธิญาณ\n",
"เสวยชาติเป็นสกุณาพระยานก\tจึงชักเรื่องชาดกมาบรรหาร\n",
"หวังแสดงแห่งจิตหญิงพาล\tให้ชายชาญรู้เชิงกระสัตรี ฯ\n"
]
]
}
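The nested layout of the `text` field can be easier to see in code. The following is a minimal sketch, not part of the dataset's own tooling: it assumes each record's `text` is a list of stanzas, each stanza is a list of lines ending in `\n`, and a tab separates the segments within a line, as the sample above suggests. The helper names `split_stanza` and `split_record` are hypothetical.

```python
# Sample record copied from the data instance above (one stanza).
record = {
    "ch_num": "๑",
    "title": "กากี กลอนสุภาพ",
    "text": [
        [
            "๏ จักกล่าวอดีตนิทานแต่ปางก่อน\n",
            "เมื่อครั้งองค์สมเด็จพระชินวร\tยังสัญจรแสวงหาโพธิญาณ\n",
            "เสวยชาติเป็นสกุณาพระยานก\tจึงชักเรื่องชาดกมาบรรหาร\n",
            "หวังแสดงแห่งจิตหญิงพาล\tให้ชายชาญรู้เชิงกระสัตรี ฯ\n",
        ]
    ],
}


def split_stanza(stanza):
    """Strip the trailing newline of each line and split it on tabs.

    Assumption: a tab marks the boundary between segments of one line.
    """
    return [line.rstrip("\n").split("\t") for line in stanza]


def split_record(rec):
    """Hypothetical helper: apply split_stanza to every stanza of a record."""
    return [split_stanza(stanza) for stanza in rec["text"]]


stanzas = split_record(record)
for segments in stanzas[0]:
    # Show the segment boundary explicitly when printing.
    print(" | ".join(segments))
```

Keeping the stanza and line nesting intact, rather than flattening everything into one string, preserves the verse structure for tasks such as poetry generation.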
| | tlc2.0 | tlc1.0 | tnhc |
|---|---|---|---|
| # documents | 34 | 25 | 47 |
| # lines | 292,270 | 113,981 | 756,478 |
Originally, the dataset was compiled for the Thai Poetry Generator at Chulalongkorn University as the final project for 2209372 Introduction to Computational Linguistics by Jitkapat Sawatphol (Faculty of Engineering, Chulalongkorn University).
[More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
There is no personal information.
[More Information Needed]
[More Information Needed]
[More Information Needed]
Thanks to Jitkapat Sawatphol (Faculty of Arts, Chulalongkorn University) and Attapol Rutherford (Faculty of Arts, Chulalongkorn University).
[More Information Needed]
Please cite the following if you make use of the dataset:
Jitkapat Sawatphol and Attapol Rutherford. 2019. Thai Literature Corpora (TLC).
BibTeX:
@misc{
author={Sawatphol, Jitkapat},
title={Thai Literature Corpora},
year={2019},
howpublished={\url{https://attapol.github.io/tlc.html}}
}
Thanks to @chameleonTK for adding this dataset.