Urdu_DW-BBC-512

Dataset Summary

Urdu summarization dataset containing 76,637 article and summary pairs scraped from the BBC Urdu and DW Urdu news websites. This is the preprocessed version: articles are limited to 512 tokens (approximately words), and URLs, picture captions, etc. have been removed.

Supported Tasks and Leaderboards

Summarization
- Extractive and abstractive.
- urT5, adapted from mT5 with its own monolingual Urdu vocabulary of 40k tokens, was fine-tuned.
- ROUGE-1 F score: 40.03 combined, 46.35 on BBC Urdu data points only, and 36.91 on DW Urdu data points only.
- BERTScore: 75.1 combined, 77.0 on BBC Urdu data points only, and 74.16 on DW Urdu data points only.
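
For readers unfamiliar with the metric, ROUGE-1 F is the harmonic mean of unigram precision and unigram recall between a generated summary and its reference. The following is a minimal sketch only, assuming plain whitespace tokenization. The card does not say which ROUGE implementation or tokenizer produced the scores above, so this function (the name `rouge1_f` is illustrative) will not necessarily reproduce them.

```python
from collections import Counter


def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F score between a reference and a candidate summary.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())

    # Clipped overlap: each token counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

The scores reported on this card are on a 0 to 100 scale, so the value returned here would need to be multiplied by 100 to be comparable.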

Languages

Urdu.

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

- url: URL of the article from which the record was scraped (BBC Urdu URLs contain English topic text with a number; DW Urdu URLs contain Urdu topic text).
  dtype: {string}
- Summary: Short summary of the article, written by the article's author, similar to highlights.
  dtype: {string}
- Text: Complete text of the article, intelligently truncated to 512 tokens.
  dtype: {string}
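
The field layout above can be sketched as a typed record with a simple sanity check. This is an illustration under stated assumptions, not the authors' preprocessing code: it assumes that "tokens" means whitespace-separated words (the summary describes them as approximately words), and the names `Record`, `token_count`, `is_valid`, and the placeholder values are hypothetical.

```python
from typing import TypedDict


class Record(TypedDict):
    # Field names and string dtypes as listed in Data Fields.
    url: str
    Summary: str
    Text: str


MAX_TOKENS = 512  # truncation limit stated on this card


def token_count(text: str) -> int:
    """Count tokens, assuming whitespace-separated words."""
    return len(text.split())


def is_valid(record: Record) -> bool:
    """Check that all three fields are strings and Text is within the limit."""
    fields_ok = all(
        isinstance(record.get(key), str) for key in ("url", "Summary", "Text")
    )
    return fields_ok and token_count(record["Text"]) <= MAX_TOKENS


# Placeholder values for illustration only; not a real record from the dataset.
example: Record = {
    "url": "https://example.com/article",
    "Summary": "placeholder summary",
    "Text": "placeholder article text",
}
```

A check like `is_valid(example)` can be run over every record after loading to confirm that the 512-token limit holds under the tokenization you choose.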

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Considerations for Using the Data

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

[More Information Needed]

Author:

mbshr

Dataset size:

198.87 MB