Datasets:

mkb

Tasks:

Text Generation

Fill-Mask

Sub-tasks:

language-modeling masked-language-modeling

Languages:

Multilinguality:

translation

Size categories:

1K<n<10K n<1K

Annotations creators:

no-annotation

Source datasets:

original

Preprint:

arxiv:2007.07691

License:

cc-by-4.0

Dataset Card for CVIT MKB

Dataset Summary

A parallel corpus of the Indian Prime Minister's Mann Ki Baat speeches, broadcast on All India Radio and translated into many languages.

Supported Tasks and Leaderboards

[MORE INFORMATION NEEDED]

Languages

Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi, and English

Dataset Structure

Data Instances

[MORE INFORMATION NEEDED]

Data Fields

src_tag: string, the text in the source language
tgt_tag: string, the translation of the source text into the target language
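
The two fields above can be illustrated with a minimal sketch. It assumes that each record is a flat mapping from a language tag to a string, and that "hi" and "en" are valid tags; neither detail is confirmed by this card. The record contents are placeholders and the helper name split_pair is hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical record for illustration only: the card does not show a real
# instance. The keys stand in for src_tag and tgt_tag; "hi" and "en" are
# assumed language tags, and the values are placeholders, not dataset text.
example: Dict[str, str] = {
    "hi": "<text in the source language>",
    "en": "<translation in the target language>",
}


def split_pair(record: Dict[str, str], src_tag: str, tgt_tag: str) -> Tuple[str, str]:
    """Return (source_text, target_text) for the given pair of tags.

    Raises KeyError if either tag is missing from the record.
    """
    return record[src_tag], record[tgt_tag]


src_text, tgt_text = split_pair(example, "hi", "en")
```

Keeping the tags as parameters rather than hard-coding them lets the same helper serve any language pair, provided the records really are keyed by tag.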

Data Splits

[MORE INFORMATION NEEDED]

Dataset Creation

Curation Rationale

[MORE INFORMATION NEEDED]

Source Data

[MORE INFORMATION NEEDED]

Initial Data Collection and Normalization

[MORE INFORMATION NEEDED]

Who are the source language producers?

[MORE INFORMATION NEEDED]

Annotations

Annotation process

[MORE INFORMATION NEEDED]

Who are the annotators?

[MORE INFORMATION NEEDED]

Personal and Sensitive Information

[MORE INFORMATION NEEDED]

Considerations for Using the Data

Social Impact of Dataset

[MORE INFORMATION NEEDED]

Discussion of Biases

[MORE INFORMATION NEEDED]

Other Known Limitations

[MORE INFORMATION NEEDED]

Additional Information

Dataset Curators

[MORE INFORMATION NEEDED]

Licensing Information

The datasets and pretrained models provided here are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Citation Information

@misc{siripragada2020multilingual,
      title={A Multilingual Parallel Corpora Collection Effort for Indian Languages},
      author={Shashank Siripragada and Jerin Philip and Vinay P. Namboodiri and C V Jawahar},
      year={2020},
      eprint={2007.07691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @vasudevgupta7 for adding this dataset.

Author:

Anonymous

Dataset size:

76.26 KB