Dataset:

id_panl_bppt

Tasks:

Translation

Languages:

en id

Multilinguality:

translation

Size:

10K<n<100K

Language creators:

expert-generated

Annotation creators:

expert-generated

Source datasets:

original

Dataset Card for id_panl_bppt

Dataset Summary

Parallel text corpora for a multi-domain translation system, created by BPPT (the Indonesian Agency for the Assessment and Application of Technology) for the PAN Localization Project (a regional initiative to develop local language computing capacity in Asia). The dataset contains around 24K sentences divided into four topics: Economic, International, Science and Technology, and Sport.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English (en) and Indonesian (id)

Dataset Structure

[More Information Needed]

Data Instances

An example of the dataset:

{ 
  'id': '0',
  'topic': 0,
  'translation':
    { 
      'en': 'Minister of Finance Sri Mulyani Indrawati said that a sharp correction of the composite
inde x by up to 4 pct in Wedenesday?s trading was a mere temporary effect of regional factors like
decline in plantation commodity prices and the financial crisis in Thailand.',
      'id': 'Menteri Keuangan Sri Mulyani mengatakan koreksi tajam pada Indeks Harga Saham Gabungan
IHSG hingga sekitar 4 persen dalam perdagangan Rabu 10/1 hanya efek sesaat dari faktor-faktor regional
seperti penurunan harga komoditi perkebunan dan krisis finansial di Thailand.'
    }
}

Data Fields

  • id : identifier of the sample
  • translation : the parallel English-Indonesian sentence pair, stored under the keys en and id
  • topic : the topic of the sentence. It is one of the following:
    • Economic
    • International
    • Science and Technology
    • Sport
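As a minimal sketch of how these fields fit together, the snippet below unpacks one record shaped like the instance shown earlier. The `TOPIC_NAMES` list and the `unpack` helper are hypothetical names introduced for illustration, and the mapping from the integer `topic` value to a name assumes the indices follow the order in which the topics are listed above. The sample text is a shortened placeholder, not data from the corpus.

```python
from typing import Tuple

# Assumption: topic indices follow the order of the list in this card
# (0 = Economic, ..., 3 = Sport). Check the dataset's own feature
# definition before relying on this mapping.
TOPIC_NAMES = ["Economic", "International", "Science and Technology", "Sport"]


def unpack(record: dict) -> Tuple[str, str, str]:
    """Return (english, indonesian, topic_name) for one record."""
    pair = record["translation"]
    return pair["en"], pair["id"], TOPIC_NAMES[record["topic"]]


# Placeholder record with the same structure as the documented instance.
sample = {
    "id": "0",
    "topic": 0,
    "translation": {
        "en": "An English sentence.",
        "id": "Sebuah kalimat bahasa Indonesia.",
    },
}

english, indonesian, topic = unpack(sample)
print(f"[{topic}] {english} -> {indonesian}")
```

Note that the key `id` appears twice with different meanings: at the top level it is the sample identifier, and inside `translation` it is the Indonesian language code.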

Data Splits

The dataset is split into train, validation, and test sets.
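The sketch below shows one way to load the splits and check that all three are present. It assumes the Hugging Face `datasets` library is installed and that the identifier `id_panl_bppt` resolves on the Hub as written; `load_corpus` and `split_sizes` are hypothetical helper names, not part of any library. No split sizes are stated here because this card does not give them.

```python
from typing import Dict, Mapping, Sized

# Split names as listed in this card.
EXPECTED_SPLITS = ("train", "validation", "test")


def load_corpus():
    """Load all splits with the `datasets` library (requires network access)."""
    # Imported lazily so split_sizes() can be used without the library installed.
    from datasets import load_dataset

    # Assumption: the dataset is published under this identifier.
    return load_dataset("id_panl_bppt")


def split_sizes(corpus: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of rows per split, raising if a split is missing."""
    missing = [name for name in EXPECTED_SPLITS if name not in corpus]
    if missing:
        raise KeyError(f"missing splits: {missing}")
    return {name: len(corpus[name]) for name in EXPECTED_SPLITS}
```

A typical call would be `split_sizes(load_corpus())`. The helper accepts any mapping from split name to a sized collection, so it can also be exercised on plain dictionaries without downloading anything.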

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{id_panl_bppt,
  author    = {PAN Localization - BPPT},
  title     = {Parallel Text Corpora, English Indonesian},
  year      = {2009},
  url       = {http://digilib.bppt.go.id/sampul/p92-budiono.pdf},
}

Contributions

Thanks to @cahya-wirawan for adding this dataset.