Dataset:
ccaligned_multilingual
CCAligned consists of parallel or comparable web-document pairs in 137 languages aligned with English. These pairs were constructed by performing language identification on raw web documents and checking that the URLs of paired documents contained the corresponding language codes. This pattern-matching approach yielded more than 100 million aligned documents paired with English. Because each English document was often aligned to multiple documents in different target languages, we can join on English documents to obtain aligned documents that directly pair two non-English documents (e.g., Arabic-French). This corpus was created from 68 Commoncrawl snapshots.
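The join on English documents described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `(english_url, language_code, target_url)` tuple layout, the function name `pivot_through_english`, and the example URLs are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def pivot_through_english(pairs):
    """Pair non-English documents that share the same English document.

    `pairs` is an iterable of (english_url, language_code, target_url)
    tuples; this layout is hypothetical and chosen only for the sketch.
    """
    by_english = defaultdict(list)
    for english_url, language_code, target_url in pairs:
        by_english[english_url].append((language_code, target_url))

    aligned = []
    for targets in by_english.values():
        # Every pair of documents in two different target languages that
        # share this English document becomes a direct alignment.
        for (lang_a, url_a), (lang_b, url_b) in combinations(sorted(targets), 2):
            if lang_a != lang_b:
                aligned.append(((lang_a, url_a), (lang_b, url_b)))
    return aligned


# Hypothetical example data (not taken from the corpus).
pairs = [
    ("https://example.com/en/page1", "ar_AR", "https://example.com/ar/page1"),
    ("https://example.com/en/page1", "fr_XX", "https://example.com/fr/page1"),
    ("https://example.com/en/page2", "fr_XX", "https://example.com/fr/page2"),
]
direct_pairs = pivot_through_english(pairs)
print(direct_pairs)
```

Only `page1` has both an Arabic and a French counterpart, so it is the only English document that produces a direct Arabic-French pair.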
To load a language that isn't part of the config, all you need to do is specify the language code. You can find the valid languages at http://www.statmt.org/cc-aligned/. For example:
from datasets import load_dataset
dataset = load_dataset("ccaligned_multilingual", language_code="fr_XX", type="documents")
or
dataset = load_dataset("ccaligned_multilingual", language_code="fr_XX", type="sentences")
The text in the dataset is in 137 languages, each aligned with English.
An instance of the documents type for language ak_GH:
{'Domain': 'islamhouse.com', 'Source_URL': 'https://islamhouse.com/en/audios/373088/', 'Target_URL': 'https://islamhouse.com/ak/audios/373088/', 'translation': {'ak_GH': "Ntwatiaa / wɔabɔ no tɔfa wɔ mu no te ase ma Umrah - Arab kasa|Islamhouse.com|Follow us:|facebook|twitter|taepe|Titles All|Fie wibesite|kasa nyina|Buukuu edi adanse ma prente|Nhyehyɛmu|Nyim/sua Islam|Curriculums|Nyina ndeɛma|Nyina ndeɛma (295)|Buukuu/ nwoma (2)|sini / muuvi (31)|ɔdio (262)|Aɛn websideNew!|Kɔ wura kramosom mu seisei|Ebio|figa/kaasɛ|Farebae|AKAkan|Kratafa titriw|kasa interface( anyimu) : Akan|Kasa ma no mu-nsɛm : Arab kasa|ɔdio|Ntwatiaa / wɔabɔ no tɔfa wɔ mu no te ase ma Umrah|play|pause|stop|mute|unmute|max volume|Kasakyerɛ ni :|Farebae:|17 / 11 / 1432 , 15/10/2011|Nhyehyɛmu:|Jurisprudence/ Esum Nimdea|Som|Hajj na Umrah|Jurisprudence/ Esum Nimdea|Som|Hajj na Umrah|Mmira ma Hajj na Umrah|nkyerɛmu|kasamu /sɛntɛns ma te ase na Umrah wɔ ... mu no hann ma no Quran na Sunnah na te ase ma no nana na no kasamu /sɛntɛns ma bi ma no emerging yi adu obusuani|Akenkane we ye di ko kasa bi su (36)|Afar - Qafár afa|Akan|Amhari ne - አማርኛ|Arab kasa - عربي|Assamese - অসমীয়া|Bengali - বাংলা|Maldive - ދިވެހި|Greek - Ελληνικά|English ( brofo kasa) - English|Persian - فارسی|Fula - pulla|French - Français|Hausa - Hausa|Kurdish - كوردی سۆرانی|Uganda ne - Oluganda|Mandinka - Mandinko|Malayalam - മലയാളം|Nepali - नेपाली|Portuguese - Português|Russian - Русский|Sango - Sango|Sinhalese - සිංහල|Somali - Soomaali|Albania ne - Shqip|Swahili - Kiswahili|Telugu - తెలుగు ప్రజలు|Tajik - Тоҷикӣ|Thai - ไทย|Tagalog - Tagalog|Turkish - Türkçe|Uyghur - ئۇيغۇرچە|Urdu - اردو|Uzbeck ne - Ўзбек тили|Vietnamese - Việt Nam|Wolof - Wolof|Chine ne - 中文|Soma kɔ bi kyerɛ adwen kɔ wɛb ebusuapanin|Soma kɔ ne kɔ hom adamfo|Soma kɔ bi kyerɛ adwen kɔ wɛb ebusuapanin|Nsɔwso fael (1)|1|الموجز في فقه العمرة|MP3 14.7 MB|Enoumah ebatahu|Rituals/Esom ajomadie ewu Hajji mmire .. 
1434 AH [01] no fapemso Enum|Fiidbak/ Ye hiya wu jun kyiri|Lenke de yɛe|kɔntakt yɛn|Aɛn webside|Qura'an Kro kronkrom|Balagh|wɔ mfinimfin Dowload faele|Yɛ atuu bra Islam mu afei|Tsin de yɛe ewu|Anaa bomu/combine hɛn melin liste|© Islamhouse Website/ Islam dan webi site|×|×|Yi mu kasa|", 'en_XX': 'SUMMARY in the jurisprudence of Umrah - Arabic - Abdul Aziz Bin Marzooq Al-Turaifi|Islamhouse.com|Follow us:|facebook|twitter|QuranEnc.com|HadeethEnc.com|Type|Titles All|Home Page|All Languages|Categories|Know about Islam|All items|All items (4057)|Books (701)|Articles (548)|Fatawa (370)|Videos (1853)|Audios (416)|Posters (98)|Greeting cards (22)|Favorites (25)|Applications (21)|Desktop Applications (3)|To convert to Islam now !|More|Figures|Sources|Curriculums|Our Services|QuranEnc.com|HadeethEnc.com|ENEnglish|Main Page|Interface Language : English|Language of the content : Arabic|Audios|تعريب عنوان المادة|SUMMARY in the jurisprudence of Umrah|play|pause|stop|mute|unmute|max volume|Lecturer : Abdul Aziz Bin Marzooq Al-Turaifi|Sources:|AlRaya Islamic Recoding in Riyadh|17 / 11 / 1432 , 15/10/2011|Categories:|Islamic Fiqh|Fiqh of Worship|Hajj and Umrah|Islamic Fiqh|Fiqh of Worship|Hajj and Umrah|Pilgrimage and Umrah|Description|SUMMARY in jurisprudence of Umrah: A statement of jurisprudence and Umrah in the light of the Quran and Sunnah and understanding of the Ancestors and the statement of some of the emerging issues related to them.|This page translated into (36)|Afar - Qafár afa|Akane - Akan|Amharic - አማርኛ|Arabic - عربي|Assamese - অসমীয়া|Bengali - বাংলা|Maldivi - ދިވެހި|Greek - Ελληνικά|English|Persian - فارسی|Fula - pulla|French - Français|Hausa - Hausa|kurdish - كوردی سۆرانی|Ugandan - Oluganda|Mandinka - Mandinko|Malayalam - മലയാളം|Nepali - नेपाली|Portuguese - Português|Russian - Русский|Sango - Yanga ti Sango|Sinhalese - සිංහල|Somali - Soomaali|Albanian - Shqip|Swahili - Kiswahili|Telugu - తెలుగు|Tajik - Тоҷикӣ|Thai - ไทย|Tagalog - Tagalog|Turkish - Türkçe|Uyghur - 
ئۇيغۇرچە|Urdu - اردو|Uzbek - Ўзбек тили|Vietnamese - Việt Nam|Wolof - Wolof|Chinese - 中文|Send a comment to Webmaster|Send to a friend?|Send a comment to Webmaster|Attachments (1)|1|الموجز في فقه العمرة|MP3 14.7 MB|The relevant Material|The rituals of the pilgrimage season .. 1434 AH [ 01] the fifth pillar|The Quality of the Accepted Hajj (Piligrimage) and Its Limitations|Easy Path to the Rules of the Rites of Hajj|A Call to the Pilgrims of the Scared House of Allah|More|feedback|Important links|Contact us|Privacy policy|Islam Q&A|Learning Arabic Language|About Us|Convert To Islam|Noble Quran encyclopedia|IslamHouse.com Reader|Encyclopedia of Translated Prophetic Hadiths|Our Services|The Quran|Balagh|Center for downloading files|To embrace Islam now...|Follow us through|Or join our mailing list.|© Islamhouse Website|×|×|Choose language|'}}
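In the instance above, `Source_URL` and `Target_URL` differ only in the language segment of the path (`en` versus `ak`). The following is a simplified sketch of that kind of URL check, under the assumption that the language code appears as its own path segment. The function name `urls_match_except_language` is hypothetical, and the matching rules actually used to build the corpus may differ.

```python
from urllib.parse import urlsplit


def urls_match_except_language(source_url, target_url, source_lang, target_lang):
    """Return True if the URLs differ only by one language-code path segment."""
    src = urlsplit(source_url)
    tgt = urlsplit(target_url)
    if src.netloc != tgt.netloc:
        return False
    src_parts = src.path.split("/")
    tgt_parts = tgt.path.split("/")
    if len(src_parts) != len(tgt_parts):
        return False
    # Collect the path segments that differ; exactly one is expected,
    # and it must be the language-code pair.
    differing = [(a, b) for a, b in zip(src_parts, tgt_parts) if a != b]
    return differing == [(source_lang, target_lang)]


# URLs taken from the documents instance above.
matches = urls_match_except_language(
    "https://islamhouse.com/en/audios/373088/",
    "https://islamhouse.com/ak/audios/373088/",
    "en",
    "ak",
)
print(matches)
```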
An instance of the sentences type for language ak_GH:
{'LASER_similarity': 1.4549942016601562, 'translation': {'ak_GH': 'Salah (nyamefere) ye Mmerebeia', 'en_XX': 'What he dislikes when fasting (10)'}}
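Each sentence pair carries a `LASER_similarity` score, so one plausible use is to keep only pairs above a chosen threshold. The sketch below works on plain dictionaries shaped like the instance above. The threshold of 1.0, the second record, and the function name `filter_by_similarity` are illustrative assumptions, and the sketch assumes that higher scores indicate closer pairs.

```python
def filter_by_similarity(records, min_similarity):
    """Keep sentence-pair records whose LASER_similarity is at least min_similarity."""
    return [r for r in records if r["LASER_similarity"] >= min_similarity]


records = [
    # The sentences instance shown above.
    {
        "LASER_similarity": 1.4549942016601562,
        "translation": {
            "ak_GH": "Salah (nyamefere) ye Mmerebeia",
            "en_XX": "What he dislikes when fasting (10)",
        },
    },
    # Hypothetical record, for illustration only.
    {
        "LASER_similarity": 0.9,
        "translation": {"ak_GH": "hypothetical source", "en_XX": "hypothetical target"},
    },
]

# The threshold is arbitrary; a suitable value depends on the language pair.
kept = filter_by_similarity(records, min_similarity=1.0)
print(len(kept))
```

A score threshold alone does not guarantee a correct translation, so spot-checking the retained pairs is advisable.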
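For documents type: each instance has the fields Domain, Source_URL, Target_URL and translation (a dictionary keyed by language code), as in the documents instance above.
For sentences type: each instance has the fields LASER_similarity and translation (a dictionary keyed by language code), as in the sentences instance above.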
Split sizes of some small configurations:
| name | train |
|---|---|
| documents-zz_TR | 41 |
| sentences-zz_TR | 34 |
| documents-tz_MA | 4 |
| sentences-tz_MA | 33 |
| documents-ak_GH | 249 |
| sentences-ak_GH | 478 |
Who are the source language producers? [Needs More Information]
Who are the annotators? [Needs More Information]
The dataset consists of text extracted from publicly crawled web pages. You agree to not attempt to determine the identity of individuals in this dataset.
@inproceedings{elkishky_ccaligned_2020,
  author = {El-Kishky, Ahmed and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Koehn, Philipp},
  title = {{CCAligned}: A Massive Collection of Cross-lingual Web-Document Pairs},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)},
  month = {November},
  year = {2020},
  address = {Online},
  publisher = {Association for Computational Linguistics},
  url = {https://www.aclweb.org/anthology/2020.emnlp-main.480},
  doi = {10.18653/v1/2020.emnlp-main.480},
  pages = {5960--5969}
}
Thanks to @gchhablani for adding this dataset.