Dataset Card for Asian Language Treebank (ALT)

Dataset Summary

The ALT project aims to advance state-of-the-art natural language processing (NLP) techniques for Asian languages through open collaboration on developing and using ALT. The work was first carried out by NICT and UCSY, as described in Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch and Eiichiro Sumita (2016). It was then continued under ASEAN IVO, as described on the ALT project Web page.

Building ALT began with sampling about 20,000 sentences from English Wikinews; these sentences were then translated into the other languages.

Supported Tasks and Leaderboards

Machine Translation, Dependency Parsing

Languages

The dataset covers 13 languages:

  • Bengali
  • English
  • Filipino
  • Hindi
  • Bahasa Indonesia
  • Japanese
  • Khmer
  • Lao
  • Malay
  • Myanmar (Burmese)
  • Thai
  • Vietnamese
  • Chinese (Simplified Chinese).

Dataset Structure

Data Instances

ALT Parallel Corpus
{
    "SNT.URLID": "80188",
    "SNT.URLID.SNTID": "1",
    "url": "http://en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal",
    "bg": "[translated sentence]",
    "en": "[translated sentence]",
    "en_tok": "[translated sentence]",
    "fil": "[translated sentence]",
    "hi": "[translated sentence]",
    "id": "[translated sentence]",
    "ja": "[translated sentence]",
    "khm": "[translated sentence]",
    "lo": "[translated sentence]",
    "ms": "[translated sentence]",
    "my": "[translated sentence]",
    "th": "[translated sentence]",
    "vi": "[translated sentence]",
    "zh": "[translated sentence]"
}
ALT Treebank
{
    "SNT.URLID": "80188",
    "SNT.URLID.SNTID": "1",
    "url": "http://en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal",
    "status": "draft/reviewed",
    "value": "(S (S (BASENP (NNP Italy)) (VP (VBP have) (VP (VP (VP (VBN defeated) (BASENP (NNP Portugal))) (ADVP (RB 31-5))) (PP (IN in) (NP (BASENP (NNP Pool) (NNP C)) (PP (IN of) (NP (BASENP (DT the) (NN 2007) (NNP Rugby) (NNP World) (NNP Cup)) (PP (IN at) (NP (BASENP (NNP Parc) (FW des) (NNP Princes)) (COMMA ,) (BASENP (NNP Paris) (COMMA ,) (NNP France))))))))))) (PERIOD .))"
}
ALT Myanmar transliteration
{
    "en": "CASINO",
    "my": [
      "ကက်စီနို",
      "ကစီနို",
      "ကာစီနို",
      "ကာဆီနို"
    ]
}
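As a minimal sketch of how a transliteration record like the one above might be used, the snippet below expands one record into one (English, Myanmar) pair per variant. The helper name `to_pairs` is hypothetical and not part of the dataset; the record is copied from the instance shown above.

```python
# Record copied from the "ALT Myanmar transliteration" instance above.
record = {
    "en": "CASINO",
    "my": ["ကက်စီနို", "ကစီနို", "ကာစီနို", "ကာဆီနို"],
}


def to_pairs(rec):
    """Expand one record into (English, Myanmar) pairs, one per variant.

    Hypothetical helper: it only assumes the two keys shown in the card,
    "en" (a string) and "my" (a list of strings).
    """
    return [(rec["en"], variant) for variant in rec["my"]]


pairs = to_pairs(record)
for en, my in pairs:
    print(en, my)
```

Flattening the list this way gives one training example per accepted spelling, which is one plausible choice when a single English word has several valid Myanmar transliterations.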

Data Fields

ALT Parallel Corpus
  • SNT.URLID: URL link to the source article listed in URL.txt
  • SNT.URLID.SNTID: index number from 1 to 20000, identifying a sentence selected from SNT.URLID

The fields bg, en, fil, hi, id, ja, khm, lo, ms, my, th, vi, and zh hold the sentence in the corresponding language.
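The parallel-corpus fields can be combined into a translation pair as in the sketch below. The function names `sentence_key` and `translation_pair` are hypothetical, the example record uses placeholder text rather than real sentences, and the check for empty or missing translations is an assumption about the data, not something the card states.

```python
# Language keys as they appear in the ALT Parallel Corpus instance.
LANG_KEYS = ["bg", "en", "fil", "hi", "id", "ja", "khm",
             "lo", "ms", "my", "th", "vi", "zh"]


def sentence_key(rec):
    """Join the article ID and the sentence index into one identifier."""
    return f'{rec["SNT.URLID"]}.{rec["SNT.URLID.SNTID"]}'


def translation_pair(rec, src, tgt):
    """Return a source/target pair, or None if either side is absent.

    Assumption: a translation may be missing or empty for some sentences,
    so both sides are checked before a pair is built.
    """
    if src == tgt or src not in LANG_KEYS or tgt not in LANG_KEYS:
        raise ValueError(f"unsupported language pair: {src}-{tgt}")
    source_text, target_text = rec.get(src), rec.get(tgt)
    if not source_text or not target_text:
        return None
    return {"id": sentence_key(rec), src: source_text, tgt: target_text}


# Placeholder record with the same keys as the instance in the card.
example = {
    "SNT.URLID": "80188",
    "SNT.URLID.SNTID": "1",
    "en": "[English sentence]",
    "ja": "[Japanese sentence]",
    "th": "",
}

print(translation_pair(example, "en", "ja"))
print(translation_pair(example, "en", "th"))
```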

ALT Treebank
  • status: indicates how a sentence was annotated; draft sentences were annotated by one annotator, and reviewed sentences were annotated by two annotators

The annotation differs from language to language; please see the guidelines for each language for more detail.
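The value field of an ALT Treebank instance holds a bracketed parse, such as the one in the Data Instances example. A minimal sketch of reading such a string into a nested structure follows. The names `parse_tree` and `tagged_words` are hypothetical, and the tokenizer assumes that words never contain whitespace or raw parentheses, which may not hold for every language in the treebank.

```python
import re

# Tokens are "(", ")", or any run of characters that is neither whitespace
# nor a parenthesis (assumption: words never contain raw parentheses).
TOKEN_RE = re.compile(r"[()]|[^\s()]+")


def parse_tree(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        if pos >= len(tokens) or tokens[pos] in ("(", ")"):
            raise ValueError("expected a node label")
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root node")
    return tree


def tagged_words(tree):
    """Return (word, tag) pairs in sentence order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))
        else:
            pairs.extend(tagged_words(child))
    return pairs


# Fragment copied from the "value" string in the ALT Treebank instance.
fragment = "(BASENP (NNP Paris) (COMMA ,) (NNP France))"
tree = parse_tree(fragment)
print(tagged_words(tree))
# prints [('Paris', 'NNP'), (',', 'COMMA'), ('France', 'NNP')]
```

Keeping the tree as plain tuples avoids any dependency; a library tree type could be substituted if richer traversal is needed.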

Data Splits

              train   valid   test
# articles     1698      98     97
# sentences   18088    1000   1018
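The split sizes above can be checked with a few lines of arithmetic. The sketch below only restates the counts from the table and derives totals and per-split shares; the dictionary layout is an illustrative choice, not a structure provided by the dataset.

```python
# Counts copied from the Data Splits table.
splits = {
    "train": {"articles": 1698, "sentences": 18088},
    "valid": {"articles": 98, "sentences": 1000},
    "test": {"articles": 97, "sentences": 1018},
}

total_articles = sum(s["articles"] for s in splits.values())    # 1893
total_sentences = sum(s["sentences"] for s in splits.values())  # 20106

# Share of all sentences that falls into each split.
shares = {name: s["sentences"] / total_sentences for name, s in splits.items()}

for name, counts in splits.items():
    per_article = counts["sentences"] / counts["articles"]
    print(f"{name}: {counts['sentences']} sentences "
          f"({shares[name]:.1%} of total), {per_article:.1f} per article")
```

The totals come to 1,893 articles and 20,106 sentences, consistent with the "about 20,000 sentences" stated in the summary.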

Dataset Creation

Curation Rationale

The ALT project was initiated by the National Institute of Information and Communications Technology, Japan (NICT) in 2014. NICT started to build Japanese and English ALT and worked with the University of Computer Studies, Yangon, Myanmar (UCSY) to build Myanmar ALT in 2014. Then, the Badan Pengkajian dan Penerapan Teknologi, Indonesia (BPPT), the Institute for Infocomm Research, Singapore (I2R), the Institute of Information Technology, Vietnam (IOIT), and the National Institute of Posts, Telecoms and ICT, Cambodia (NIPTICT) joined to make ALT for Indonesian, Malay, Vietnamese, and Khmer in 2015.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

The sentences were sampled from English Wikinews in 2014. They will be annotated with word segmentation, POS tags, and syntax information, in addition to word alignment information, by linguistic experts from:

  • National Institute of Information and Communications Technology, Japan (NICT) for Japanese and English
  • University of Computer Studies, Yangon, Myanmar (UCSY) for Myanmar
  • the Badan Pengkajian dan Penerapan Teknologi, Indonesia (BPPT) for Indonesian
  • the Institute for Infocomm Research, Singapore (I2R) for Malay
  • the Institute of Information Technology, Vietnam (IOIT) for Vietnamese
  • the National Institute of Posts, Telecoms and ICT, Cambodia (NIPTICT) for Khmer

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

  • National Institute of Information and Communications Technology, Japan (NICT) for Japanese and English
  • University of Computer Studies, Yangon, Myanmar (UCSY) for Myanmar
  • the Badan Pengkajian dan Penerapan Teknologi, Indonesia (BPPT) for Indonesian
  • the Institute for Infocomm Research, Singapore (I2R) for Malay
  • the Institute of Information Technology, Vietnam (IOIT) for Vietnamese
  • the National Institute of Posts, Telecoms and ICT, Cambodia (NIPTICT) for Khmer

Licensing Information

Creative Commons Attribution 4.0 International (CC BY 4.0)

Citation Information

Please cite the following if you make use of the dataset:

Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Vichet Chea, Rapid Sun, Sethserey Sam, Sopheap Seng, Khin Mar Soe, Khin Thandar Nwet, Masao Utiyama, Chenchen Ding. (2016) "Introduction of the Asian Language Treebank" Oriental COCOSDA.

BibTeX:

@inproceedings{riza2016introduction,
  title={Introduction of the asian language treebank},
  author={Riza, Hammam and Purwoadi, Michael and Uliniansyah, Teduh and Ti, Aw Ai and Aljunied, Sharifah Mahani and Mai, Luong Chi and Thang, Vu Tat and Thai, Nguyen Phuong and Chea, Vichet and Sam, Sethserey and others},
  booktitle={2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA)},
  pages={1--6},
  year={2016},
  organization={IEEE}
}

Contributions

Thanks to @chameleonTK for adding this dataset.