Dataset:
alt
The ALT project aims to advance the state-of-the-art Asian natural language processing (NLP) techniques through the open collaboration for developing and using ALT. It was first conducted by NICT and UCSY as described in Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch and Eiichiro Sumita (2016). Then, it was developed under ASEAN IVO as described in this Web page.
The process of building ALT began with sampling about 20,000 sentences from English Wikinews, and then these sentences were translated into the other languages.
Machine Translation, Dependency Parsing
It supports 13 languages:
{ "SNT.URLID": "80188", "SNT.URLID.SNTID": "1", "url": "http://en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal", "bg": "[translated sentence]", "en": "[translated sentence]", "en_tok": "[translated sentence]", "fil": "[translated sentence]", "hi": "[translated sentence]", "id": "[translated sentence]", "ja": "[translated sentence]", "khm": "[translated sentence]", "lo": "[translated sentence]", "ms": "[translated sentence]", "my": "[translated sentence]", "th": "[translated sentence]", "vi": "[translated sentence]", "zh": "[translated sentence]" }
ALT Treebank
{ "SNT.URLID": "80188", "SNT.URLID.SNTID": "1", "url": "http://en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal", "status": "draft/reviewed", "value": "(S (S (BASENP (NNP Italy)) (VP (VBP have) (VP (VP (VP (VBN defeated) (BASENP (NNP Portugal))) (ADVP (RB 31-5))) (PP (IN in) (NP (BASENP (NNP Pool) (NNP C)) (PP (IN of) (NP (BASENP (DT the) (NN 2007) (NNP Rugby) (NNP World) (NNP Cup)) (PP (IN at) (NP (BASENP (NNP Parc) (FW des) (NNP Princes)) (COMMA ,) (BASENP (NNP Paris) (COMMA ,) (NNP France))))))))))) (PERIOD .))" }
ALT Myanmar transliteration
{ "en": "CASINO", "my": [ "ကက်စီနို", "ကစီနို", "ကာစီနို", "ကာဆီနို" ] }
In the parallel corpus record, the keys bg, en, fil, hi, id, ja, khm, lo, ms, my, th, vi, and zh each hold the sentence in the corresponding language.
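The record layout above can be turned into sentence pairs for machine translation. The following is a minimal sketch, not the project's own tooling: the helper `extract_pair` and the constant `LANG_KEYS` are hypothetical names, and the sample record reuses the placeholder text from the instance shown earlier rather than real sentences.

```python
import json

# Language keys as listed in this card (assumed to be the full set).
LANG_KEYS = ["bg", "en", "fil", "hi", "id", "ja", "khm",
             "lo", "ms", "my", "th", "vi", "zh"]


def extract_pair(record, src, tgt):
    """Return (source, target) sentences, or None if either one is missing or empty.

    Hypothetical helper for illustration; it only reads the per-language keys.
    """
    for key in (src, tgt):
        if key not in LANG_KEYS:
            raise ValueError(f"unknown language key: {key}")
    src_text, tgt_text = record.get(src), record.get(tgt)
    if not src_text or not tgt_text:
        return None
    return src_text, tgt_text


# A trimmed copy of the sample record, with placeholder sentences.
raw = ('{"SNT.URLID": "80188", "SNT.URLID.SNTID": "1", '
       '"en": "[translated sentence]", "ja": "[translated sentence]"}')
record = json.loads(raw)

pair = extract_pair(record, "en", "ja")      # both keys present
missing = extract_pair(record, "en", "vi")   # "vi" absent in this trimmed record
print(pair, missing)
```

Returning `None` for an absent language, instead of raising, is one possible design choice; it lets a caller skip incomplete records while iterating over a split.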
ALT Treebank
The annotation differs from language to language; please see their guidelines for more detail.
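The `value` field of a treebank record is a bracketed constituency string. As an illustration only, the sketch below parses such a string into nested lists and reads off the (tag, word) pairs at the leaves. It assumes the bracketing style of the English sample shown earlier; since annotation differs across languages, other treebanks may need adjustments. The names `parse_tree` and `leaves` are hypothetical, and the input is a shortened excerpt of the sample, not the full sentence.

```python
import re


def parse_tree(text):
    """Parse a bracketed string such as "(S (NNP Italy))" into nested lists.

    Each node becomes [label, child, child, ...]; a word is a plain string.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return [label, *children]

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root node")
    return tree


def leaves(node):
    """Yield (tag, word) pairs from preterminal nodes, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        yield label, children[0]
    else:
        for child in children:
            if isinstance(child, list):
                yield from leaves(child)


# Shortened excerpt of the sample "value" string (not the full sentence).
value = ("(S (S (BASENP (NNP Italy)) (VP (VBP have) "
         "(VP (VBN defeated) (BASENP (NNP Portugal))))) (PERIOD .))")
tree = parse_tree(value)
tagged = list(leaves(tree))
words = [word for _, word in tagged]
print(words)
```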
| | train | valid | test |
|---|---|---|---|
| # articles | 1698 | 98 | 97 |
| # sentences | 18088 | 1000 | 1018 |
The ALT project was initiated by the National Institute of Information and Communications Technology, Japan (NICT) in 2014. NICT started to build Japanese and English ALT and worked with the University of Computer Studies, Yangon, Myanmar (UCSY) to build Myanmar ALT in 2014. Then, the Badan Pengkajian dan Penerapan Teknologi, Indonesia (BPPT), the Institute for Infocomm Research, Singapore (I2R), the Institute of Information Technology, Vietnam (IOIT), and the National Institute of Posts, Telecoms and ICT, Cambodia (NIPTICT) joined to make ALT for Indonesian, Malay, Vietnamese, and Khmer in 2015.
Who are the source language producers?
The dataset is sampled from the English Wikinews in 2014. These will be annotated with word segmentation, POS tags, and syntax information, in addition to the word alignment information by linguistic experts from
Who are the annotators?
[More Information Needed]
Creative Commons Attribution 4.0 International (CC BY 4.0)
Please cite the following if you make use of the dataset:
Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Vichet Chea, Rapid Sun, Sethserey Sam, Sopheap Seng, Khin Mar Soe, Khin Thandar Nwet, Masao Utiyama, Chenchen Ding. (2016) "Introduction of the Asian Language Treebank" Oriental COCOSDA.
BibTeX:
@inproceedings{riza2016introduction,
  title={Introduction of the asian language treebank},
  author={Riza, Hammam and Purwoadi, Michael and Uliniansyah, Teduh and Ti, Aw Ai and Aljunied, Sharifah Mahani and Mai, Luong Chi and Thang, Vu Tat and Thai, Nguyen Phuong and Chea, Vichet and Sam, Sethserey and others},
  booktitle={2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA)},
  pages={1--6},
  year={2016},
  organization={IEEE}
}
Thanks to @chameleonTK for adding this dataset.