Dataset: id_liputan6
Task: summarization
Language: id
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotation creators: no-annotation
Source datasets: original
Preprint: arxiv:2011.00679
License: license:unknown

In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by examining machine-generated summaries that have low ROUGE scores, and expose both issues with ROUGE itself, as well as with extractive and abstractive summarization models.
The dataset has two variants: "canonical" and "xtreme". The "xtreme" variant discards development and test document–summary pairs where the summary has fewer than 90% novel 4-grams (the training data remains the same as the canonical variant).
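The "novel 4-grams" criterion can be illustrated with a short sketch. This is not the authors' filtering script: it assumes lowercased, whitespace-split tokens, counts every 4-gram occurrence in the summary, and treats a summary too short to contain any 4-gram as 0% novel. The function names `novel_ngram_ratio` and `kept_in_xtreme` are hypothetical.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(article, summary, n=4):
    """Fraction of summary n-grams that do not occur in the article.

    Assumption: tokens are lowercased and split on whitespace; the
    tokenization used to build the real dataset may differ.
    """
    article_ngrams = set(ngrams(article.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        # Assumption: a summary with fewer than n tokens counts as 0% novel.
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in article_ngrams)
    return novel / len(summary_ngrams)


def kept_in_xtreme(article, summary, threshold=0.9):
    """True if the pair would survive a 90% novel 4-gram filter."""
    return novel_ngram_ratio(article, summary, n=4) >= threshold
```

Under these assumptions, a summary that copies long spans from its article scores low and would be dropped from the "xtreme" development and test sets, while a heavily paraphrased summary scores close to 1.0 and would be kept.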
You need to manually request the liputan6 dataset using the form at https://github.com/fajri91/sum_liputan6/ and uncompress it. The dataset can then be loaded with either datasets.load_dataset("id_liputan6", 'canonical', data_dir="<path/to/uncompressed_folder>") or datasets.load_dataset("id_liputan6", 'xtreme', data_dir="<path/to/uncompressed_folder>").
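The loading step above can be wrapped in a small helper that catches the two most likely mistakes (a misspelled variant name or a wrong folder) before calling the library. This is a minimal sketch: the helper name `load_liputan6` is hypothetical, and it assumes the `datasets` package is installed in a version that still supports this script-based dataset, which may not hold for recent releases.

```python
from pathlib import Path

# The two variants named in this card.
VARIANTS = ("canonical", "xtreme")


def load_liputan6(variant, data_dir):
    """Load one variant of id_liputan6 from a manually downloaded folder."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")

    folder = Path(data_dir).expanduser()
    if not folder.is_dir():
        raise FileNotFoundError(f"uncompressed dataset folder not found: {folder}")

    # Imported lazily so the argument checks above work without the package.
    import datasets

    return datasets.load_dataset("id_liputan6", variant, data_dir=str(folder))
```

A call such as `load_liputan6("xtreme", "<path/to/uncompressed_folder>")` would then return the loaded dataset, from which the train, validation and test splits can be read.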
The text in the dataset is in Indonesian.
{ 'id': 'string', 'url': 'string', 'clean_article': 'string', 'clean_summary': 'string', 'extractive_summary': 'string' }
An example of the dataset:
{ 'clean_article': 'Liputan6.com, Ambon: Partai Bulan Bintang wilayah Maluku bertekad membantu pemerintah menyelesaikan konflik di provinsi tersebut. Syaratnya, penanganan penyelesaian konflik Maluku harus dimulai dari awal kerusuhan, yakni 19 Januari 1999. Demikian hasil Musyawarah Wilayah I PBB Maluku yang dimulai Sabtu pekan silam dan berakhir Senin (31/12) di Ambon. Menurut seorang fungsionaris PBB Ridwan Hasan, persoalan di Maluku bisa selesai asalkan pemerintah dan aparat keamanan serius menangani setiap persoalan di Maluku secara komprehensif dan bijaksana. Itulah sebabnya, PBB wilayah Maluku akan menjadikan penyelesaian konflik sebagai agenda utama partai. PBB Maluku juga akan mendukung penegakan hukum secara terpadu dan tanpa pandang bulu. Siapa saja yang melanggar hukum harus ditindak. Ridwan berharap, Ketua PBB Maluku yang baru, Ali Fauzi, dapat menindak lanjuti agenda politik partai yang telah diamanatkan dan mau mendukung penegakan hukum di Maluku. (ULF/Sahlan Heluth).', 'clean_summary': 'Konflik Ambon telah berlangsung selama tiga tahun. Partai Bulan Bintang wilayah Maluku siap membantu pemerintah menyelesaikan kasus di provinsi tersebut.', 'extractive_summary': 'Liputan6.com, Ambon: Partai Bulan Bintang wilayah Maluku bertekad membantu pemerintah menyelesaikan konflik di provinsi tersebut. Siapa saja yang melanggar hukum harus ditindak.', 'id': '26408', 'url': 'https://www.liputan6.com/news/read/26408/pbb-siap-membantu-penyelesaian-konflik-ambon' }
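A record like the one above can be checked against the five string fields it contains. The following is a minimal sketch, not part of the dataset's tooling; the names `EXPECTED_FIELDS` and `missing_or_invalid_fields` are hypothetical, and the sample record is abbreviated from the example.

```python
# Field names taken from the example record in this card.
EXPECTED_FIELDS = (
    "id",
    "url",
    "clean_article",
    "clean_summary",
    "extractive_summary",
)


def missing_or_invalid_fields(record):
    """Return the expected fields that are absent or not strings."""
    return [
        name for name in EXPECTED_FIELDS
        if not isinstance(record.get(name), str)
    ]


# Abbreviated copy of the example record, for illustration only.
sample = {
    "id": "26408",
    "url": "https://www.liputan6.com/news/read/26408/pbb-siap-membantu-penyelesaian-konflik-ambon",
    "clean_article": "Liputan6.com, Ambon: Partai Bulan Bintang wilayah Maluku bertekad ...",
    "clean_summary": "Konflik Ambon telah berlangsung selama tiga tahun. ...",
    "extractive_summary": "Liputan6.com, Ambon: Partai Bulan Bintang wilayah Maluku bertekad ...",
}

problems = missing_or_invalid_fields(sample)
```

An empty `problems` list means the record has every expected field as a string.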
The dataset is split into train, validation, and test sets.
@inproceedings{Koto2020Liputan6AL,
  title={Liputan6: A Large-scale Indonesian Dataset for Text Summarization},
  author={Fajri Koto and Jey Han Lau and Timothy Baldwin},
  booktitle={AACL/IJCNLP},
  year={2020}
}
Thanks to @cahya-wirawan for adding this dataset.