Datasets: pib
Multilinguality: translation
Language creators: other
Annotations creators: no-annotation
Source datasets: original
ArXiv: arxiv:2008.04860
License: cc-by-4.0

This dataset is a large-scale, sentence-aligned corpus in 11 Indian languages, namely the CVIT-PIB corpus, which is the largest multilingual corpus available for Indian languages.
Parallel data is provided for the following languages: en, bn, gu, hi, ml, mr, pa, or, ta, te, ur.
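As a rough illustration of how the 11 language codes combine into pairs, the sketch below enumerates every unordered pair. It assumes that pair names follow the "xx-yy" pattern with the two codes in alphabetical order, as the "gu-pa" name used in this card suggests, and that every combination is available; neither assumption is confirmed here. The names `LANGUAGES` and `language_pairs` are hypothetical.

```python
from itertools import combinations

# Language codes listed in the dataset description.
LANGUAGES = ["en", "bn", "gu", "hi", "ml", "mr", "pa", "or", "ta", "te", "ur"]


def language_pairs(codes):
    """Return every unordered pair as 'xx-yy', with codes sorted alphabetically.

    The alphabetical ordering is an assumption inferred from the "gu-pa" name.
    """
    return [f"{a}-{b}" for a, b in combinations(sorted(codes), 2)]


pairs = language_pairs(LANGUAGES)
print(len(pairs))        # 55
print("gu-pa" in pairs)  # True
```

With 11 codes there are 11 × 10 / 2 = 55 possible pairs, so this is an upper bound on the number of pair configurations.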
An example for the "gu-pa" language pair:
{ 'translation': { 'gu': 'એવો નિર્ણય લેવાયો હતો કે ખંતપૂર્વકની કામગીરી હાથ ધરવા, કાયદેસર અને ટેકનિકલ મૂલ્યાંકન કરવા, વેન્ચર કેપિટલ ઇન્વેસ્ટમેન્ટ સમિતિની બેઠક યોજવા વગેરે એઆઇએફને કરવામાં આવેલ પ્રતિબદ્ધતાના 0.50 ટકા સુધી અને બાકીની રકમ એફએફએસને પૂર્ણ કરવામાં આવશે.', 'pa': 'ਇਹ ਵੀ ਫੈਸਲਾ ਕੀਤਾ ਗਿਆ ਕਿ ਐੱਫਆਈਆਈ ਅਤੇ ਬਕਾਏ ਲਈ ਕੀਤੀਆਂ ਗਈਆਂ ਵਚਨਬੱਧਤਾਵਾਂ ਦੇ 0.50 % ਦੀ ਸੀਮਾ ਤੱਕ ਐੱਫਈਐੱਸ ਨੂੰ ਮਿਲਿਆ ਜਾਏਗਾ, ਇਸ ਨਾਲ ਉੱਦਮ ਪੂੰਜੀ ਨਿਵੇਸ਼ ਕਮੇਟੀ ਦੀ ਬੈਠਕ ਦਾ ਆਯੋਜਨ ਉਚਿਤ ਸਾਵਧਾਨੀ, ਕਾਨੂੰਨੀ ਅਤੇ ਤਕਨੀਕੀ ਮੁੱਲਾਂਕਣ ਲਈ ਸੰਚਾਲਨ ਖਰਚ ਆਦਿ ਦੀ ਪੂਰਤੀ ਹੋਵੇਗੀ।' } }
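A record shaped like the example above is a dictionary with a single 'translation' key that maps each language code to its sentence. The following minimal sketch pulls the two sides of a pair out of such a record. The helper name `split_pair` is hypothetical, and the placeholder strings stand in for real sentences.

```python
def split_pair(record, src, tgt):
    """Return (source_text, target_text) from a record with a 'translation' dict."""
    translation = record["translation"]
    return translation[src], translation[tgt]


# Placeholder strings stand in for the real Gujarati and Punjabi sentences.
record = {
    "translation": {
        "gu": "<Gujarati sentence>",
        "pa": "<Punjabi sentence>",
    }
}

src_text, tgt_text = split_pair(record, "gu", "pa")
print(src_text)
print(tgt_text)
```

Swapping the `src` and `tgt` arguments gives the reverse translation direction from the same record.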
The dataset is in a single "train" split.
Who are the source language producers?
[More Information Needed]
Who are the annotators?
[More Information Needed]
Creative Commons Attribution-ShareAlike 4.0 International license.
@inproceedings{siripragada-etal-2020-multilingual,
    title = "A Multilingual Parallel Corpora Collection Effort for {I}ndian Languages",
    author = "Siripragada, Shashank and Philip, Jerin and Namboodiri, Vinay P. and Jawahar, C V",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.462",
    pages = "3743--3751",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

@article{2020,
    title={Revisiting Low Resource Status of Indian Languages in Machine Translation},
    url={http://dx.doi.org/10.1145/3430984.3431026},
    DOI={10.1145/3430984.3431026},
    journal={8th ACM IKDD CODS and 26th COMAD},
    publisher={ACM},
    author={Philip, Jerin and Siripragada, Shashank and Namboodiri, Vinay P. and Jawahar, C. V.},
    year={2020},
    month={Dec}
}
Thanks to @vasudevgupta7 for adding this dataset, and @albertvillanova for updating its version.