Datasets:

id_puisi

Languages:

id

Multilinguality:

monolingual

Size Categories:

1K<n<10K

Language Creators:

found

Annotations Creators:

no-annotation

Source Datasets:

original

License:

mit

Dataset Card for id_puisi

Dataset Summary

Puisi (poem) is an Indonesian poetic form. The dataset contains 7223 Indonesian puisi, each with its title and author.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Indonesian

Dataset Structure

Data Instances

{
  'puisi_with_header': 'TEPERANGKAP 
 Oleh Mangku Langit Jingga 
  
 Mungkin kau membiarkan aku 
 Membiarkan perasaan ini larut 
 Memberi ruang jiwaku hampa 
 Agar tetap terbiasa nikmati 
  
 Perangkap yang kau buat 
 Perisai yang kau banggakan 
 Takkan jadi tameng bagimu 
 Aku mengerti betapa hebatnya 
  
 Perangkap mu hei sang dewi 
 Ku akan terus merasa terbiasa 
 Dengan pesona indahmu 
 Ku masih akan nikmati hadirmu 
  
 Berjalanlah pada hati yang sama 
 Satu hati denganku 
 Walau ku terperangkap 
 Namunku nikmati dan jalani',

  'title': 'TEPERANGKAP',

  'author': 'Oleh Mangku Langit Jingga',

  'puisi': 'Mungkin kau membiarkan aku 
 Membiarkan perasaan ini larut 
 Memberi ruang jiwaku hampa 
 Agar tetap terbiasa nikmati 
  
 Perangkap yang kau buat 
 Perisai yang kau banggakan 
 Takkan jadi tameng bagimu 
 Aku mengerti betapa hebatnya 
  
 Perangkap mu hei sang dewi 
 Ku akan terus merasa terbiasa 
 Dengan pesona indahmu 
 Ku masih akan nikmati hadirmu 
  
 Berjalanlah pada hati yang sama 
 Satu hati denganku 
 Walau ku terperangkap 
 Namunku nikmati dan jalani',
}

Data Fields

  • puisi_with_header : the raw text from scraping
  • title : the title extracted from the raw text using regex
  • author : the author extracted from the raw text using regex
  • puisi : the poem text with the title and author removed using regex
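
The card does not give the regex that was actually used, so the following is only a minimal sketch of how the three derived fields could be split out of puisi_with_header. It assumes the layout seen in the sample instance: the title on the first line, an author line beginning with "Oleh" ("By") on the second, and the poem body after that. The pattern `HEADER_RE` and the helper `split_header` are hypothetical names, not part of the dataset's tooling.

```python
import re

# Hypothetical pattern (an assumption, not the curator's regex):
# line 1 = title, line 2 = author line starting with "Oleh", rest = poem body.
HEADER_RE = re.compile(
    r"^\s*(?P<title>[^\n]+?)\s*\n\s*(?P<author>Oleh[^\n]+?)\s*\n(?P<puisi>.*)$",
    re.DOTALL,
)

def split_header(puisi_with_header):
    """Split raw scraped text into title, author and poem body."""
    match = HEADER_RE.match(puisi_with_header)
    if match is None:
        # No recognizable header: leave title/author empty, keep the raw text.
        return {"title": "", "author": "", "puisi": puisi_with_header.strip()}
    return {
        "title": match.group("title"),
        "author": match.group("author"),
        "puisi": match.group("puisi").strip(),
    }

# Shortened version of the sample instance shown under Data Instances.
raw = (
    "TEPERANGKAP \n Oleh Mangku Langit Jingga \n \n"
    " Mungkin kau membiarkan aku \n Membiarkan perasaan ini larut"
)
record = split_header(raw)
```

As in the sample instance, the author field produced here keeps the leading "Oleh". Text that does not follow the assumed layout falls through to the fallback branch with empty title and author.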

Data Splits

The dataset contains only a train set.

Dataset Creation

Curation Rationale

The dataset was initially collected as an experiment in generating Indonesian poems with GPT-2.

Source Data

Initial Data Collection and Normalization

The dataset was scraped using BeautifulSoup from lokerpuisi.web.id (the data no longer exists on the original blog). The title and author columns were produced by regex matching on the puisi_with_header column.

Who are the source language producers?

The poems were written by humans. Users of the original blog voluntarily submitted their original poems for publication on the blog.

Annotations

Annotation process

[N/A]

Who are the annotators?

[N/A]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

The regex match used to extract the title and author from the raw text is not perfect. For some entries, the title and text were not extracted correctly.
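
One way to work around this limitation is to filter out records whose header fields look unparsed before using them. The sketch below rests on two assumptions that the card does not confirm: that a failed extraction leaves the title or author empty, and that a correctly extracted author field begins with "Oleh", as in the sample instance. The helper name `looks_unparsed` is hypothetical.

```python
def looks_unparsed(record):
    """Return True if title/author extraction probably failed for this record."""
    title = (record.get("title") or "").strip()
    author = (record.get("author") or "").strip()
    # Assumption: a failed match leaves a field empty, and a valid author
    # field starts with "Oleh" as in the sample instance.
    return not title or not author.startswith("Oleh")

# Illustrative records only; the second mimics a failed extraction.
records = [
    {"title": "TEPERANGKAP", "author": "Oleh Mangku Langit Jingga", "puisi": "..."},
    {"title": "", "author": "", "puisi": "..."},
]
clean = [r for r in records if not looks_unparsed(r)]
```

Flagged records still carry the full raw text in puisi_with_header, so they can be inspected or re-parsed rather than discarded.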

Additional Information

Dataset Curators

Ilham Firdausi Putra

Licensing Information

MIT License

Citation Information

[N/A]

Contributions

Thanks to @ilhamfp for adding this dataset.