Dataset Card for CVIT MKB

Dataset Summary

A parallel corpus of the Indian Prime Minister's Mann Ki Baat speeches, broadcast on All India Radio and translated into several Indian languages and English.

Supported Tasks and Leaderboards

[MORE INFORMATION NEEDED]

Languages

Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English

Dataset Structure

Data Instances

[MORE INFORMATION NEEDED]

Data Fields

  • src_tag: string, the text in the source language
  • tgt_tag: string, the translation of the source text into the target language
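
To make the two fields concrete, here is a minimal sketch of one sentence pair in Python. It assumes that src_tag and tgt_tag are placeholders for language codes such as "hi" and "en"; the card does not confirm this, and the helper name make_pair and the sample sentences are hypothetical, not taken from the dataset.

```python
from typing import Dict


def make_pair(src_tag: str, src_text: str, tgt_tag: str, tgt_text: str) -> Dict[str, str]:
    """Build one hypothetical parallel record keyed by language tag.

    Assumption: src_tag and tgt_tag stand for language codes (e.g. "hi", "en").
    """
    if src_tag == tgt_tag:
        raise ValueError("source and target tags must differ")
    if not isinstance(src_text, str) or not isinstance(tgt_text, str):
        raise TypeError("both fields must be strings")
    return {src_tag: src_text, tgt_tag: tgt_text}


# Illustrative values only; these sentences are not from the corpus.
example = make_pair("hi", "नमस्ते", "en", "Hello")
```

Under this reading, each record maps one source tag to the source text and one target tag to its translation, so a Hindi to English pair carries the keys "hi" and "en".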

Data Splits

[MORE INFORMATION NEEDED]

Dataset Creation

Curation Rationale

[MORE INFORMATION NEEDED]

Source Data

[MORE INFORMATION NEEDED]

Initial Data Collection and Normalization

[MORE INFORMATION NEEDED]

Who are the source language producers?

[MORE INFORMATION NEEDED]

Annotations

Annotation process

[MORE INFORMATION NEEDED]

Who are the annotators?

[MORE INFORMATION NEEDED]

Personal and Sensitive Information

[MORE INFORMATION NEEDED]

Considerations for Using the Data

Social Impact of Dataset

[MORE INFORMATION NEEDED]

Discussion of Biases

[MORE INFORMATION NEEDED]

Other Known Limitations

[MORE INFORMATION NEEDED]

Additional Information

Dataset Curators

[MORE INFORMATION NEEDED]

Licensing Information

The datasets and pretrained models provided here are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Citation Information

@misc{siripragada2020multilingual,
      title={A Multilingual Parallel Corpora Collection Effort for Indian Languages},
      author={Shashank Siripragada and Jerin Philip and Vinay P. Namboodiri and C V Jawahar},
      year={2020},
      eprint={2007.07691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @vasudevgupta7 for adding this dataset.