Dataset: oscar
Preprint: arxiv:2010.14571
License: cc0-1.0
Source datasets: original
Annotations creators: no-annotation
Language creators: found
Multilinguality: multilingual
OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form.
The version here is the original OSCAR 2019 release: https://oscar-project.org/post/oscar-2019/
For more recent versions, visit the oscar-corpus organization on the Hub.
OSCAR is mainly intended to pretrain language models and word representations.
All the data is distributed by language, and both the original and the deduplicated versions are available. 166 different languages are covered. The table in the subsection "Data Splits Sample Size" provides the language code for each subcorpus, as well as the number of words (space-separated tokens), the number of lines, and the size of both the original and the deduplicated versions of OSCAR.
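The per-language configuration names used in this section and the word and line counts mentioned above can be sketched in Python. This is a minimal illustration, not the project's own tooling: `config_name` and `corpus_stats` are hypothetical helpers, the `unshuffled_original_` prefix for the non-deduplicated variant is an assumption (only `unshuffled_deduplicated_` names appear in this section), and tokens are split on any whitespace rather than on the space character alone.

```python
from typing import Dict, Iterable


def config_name(lang: str, deduplicated: bool = True) -> str:
    """Build a configuration name such as 'unshuffled_deduplicated_als'.

    The 'original' variant name is an assumption, see the note above.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang}"


def corpus_stats(documents: Iterable[str]) -> Dict[str, int]:
    """Count whitespace-separated tokens and newline-delimited lines."""
    words = 0
    lines = 0
    for text in documents:
        words += len(text.split())
        # An empty document contributes no lines.
        lines += text.count("\n") + 1 if text else 0
    return {"words": words, "lines": lines}


if __name__ == "__main__":
    print(config_name("als"))
    print(corpus_stats(["first line\nsecond line", "single"]))
```

Counting on whitespace is only an approximation for languages written without spaces between words, so such figures are best read as rough size indicators.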
We show detailed information for all the configurations of the dataset.
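Every example in this section is a JSON object with an integer `id` and a string `text`. The following Python sketch checks that shape for one record; `parse_record` is a hypothetical helper written for illustration, and the sample string is the short `{ "id": 0, "text": " vo" }` example that appears later in this section.

```python
import json


def parse_record(raw: str) -> dict:
    """Parse one example and check it has an integer 'id' and a string 'text'."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    if not isinstance(record.get("id"), int):
        raise ValueError("'id' must be an integer")
    if not isinstance(record.get("text"), str):
        raise ValueError("'text' must be a string")
    return record


if __name__ == "__main__":
    sample = '{ "id": 0, "text": " vo" }'
    record = parse_record(sample)
    print(record["id"], repr(record["text"]))
```

Examples marked as cropped end in "..." and hold only a prefix of the document, so they are suitable for inspecting the format but not for recovering the full text.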
An example of 'train' looks as follows.
{ "id": 0, "text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel" }
unshuffled_deduplicated_als
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"De Nazionalpark hät e Flächi vo 170,3 km² und isch dodemit s grösti Naturschutzgebiet vo de Schwiz. Er ligt uf em Gebiet vo de ..." }
unshuffled_deduplicated_am
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"አየር መንገዱ ከአዲስ አበባ ወደ ሮም ጣሊያን በማምራት ላይ በነበረበት ጊዜ ረዳት አብራሪው የጉዞውን አቅጣጫ በመቀየር ጄኔቭ አውሮፓላን ማረፊያ በማሳረፍ እጁን ለፖሊስ ሰጥቷል።\\nየኢትዮጵያ መንግስት የ..." }
unshuffled_deduplicated_an
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"واااااااأسفاه الأمم تفتخر ب 0 أمي ووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووو..." }
unshuffled_deduplicated_ar
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد بكل جديد\\nأهلا وسهلا بك زا..." }
unshuffled_deduplicated_arz
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"بنى عجل : قبيلة من عجل بن لجيم بن صعب بن على بن بكر بن وائل انتقل اغلبهم الى البصرة فى العراق و اصفهان و خراسان فى ايران و اذرب..." }
unshuffled_deduplicated_as
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"আমি, এই সংগঠনৰ সদস্য সকলে একেলগ হৈ অসমকে ধৰি ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ অমূল্য কলা-সাংস্কৃতিক সম্পদৰাজি বৃহত্তৰ অষ্ট্ৰেলিয়াৰ সন্মু..." }
unshuffled_deduplicated_ast
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"The Killers llanzaron el so álbum debú, Hot Fuss, en xunu de 2004 nel Reinu Xuníu, al traviés de la discográfica Lizard King, y..." }
unshuffled_deduplicated_av
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Жинда малъараб ва божизе бегьулеб рагІудаса кьуризе бегьуларо гьев. Гьес насихІат гьабизе кколелъул бацІцІадаб диналъул рахъалъ..." }
unshuffled_deduplicated_az
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün xərclərini yerli daxilolmalar hesabına maliyyələşdirir.\\nDünən, 10..." }
unshuffled_deduplicated_azb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"لعلی ١٣-جو عصرده یاشاییب یاراتمیش گؤرکملی آذربایجان شاعرلریندندیر. ١٢٢٤-جی ایلده تبریزده آنادان اولموشدور، گنج یاشلاریندا تیجار..." }
unshuffled_deduplicated_ba
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Күҙәтеү ҡуласаһы моделен хәҙер Мифтахетдин Аҡмулла исемендәге Башҡорт дәүләт педагогия университетында ла эшләргә мөмкин\\t\\nКүҙ..." }
unshuffled_deduplicated_bar
An example of 'train' looks as follows.
{ "id": 0, "text": " vo" }
unshuffled_deduplicated_bcl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"& ÿ ó / í 0 - ø û ù ö ú ð ï ú \\u0014 ù þ ô ö í ÷ ò \\u0014 ÷ í ù û ö í \\u0001 û ñ ç þ \\u0001 ð \\u0007 þ ò ñ ñ ò ô \\u0017 û ö ô ÷..." }
unshuffled_deduplicated_be
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Брэсцкія ўлады не дазволілі прафсаюзу РЭП правесці пікетаванне ў парку Воінаў-інтэрнацыяналістаў 30 мая 2018 года.\\nСітуацыю пр..." }
unshuffled_deduplicated_bg
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ЖАЛБОПОДАТЕЛЯТ директор на Дирекция „ Обжалване и данъчно-осигурителна практика“- Бургас, редовно призован, се представлява от ..." }
unshuffled_deduplicated_bh
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"सुकमा जिला भारत के छत्तीसगढ़ राज्य में एगो जिला बाटे। एकर मुख्यालय सुकमा शहर बाटे। एकर कुल रकबा 5636 वर्ग कि॰मी॰ बाटे।\"..." }
unshuffled_deduplicated_bn
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ভড়ং সর্বস্ব বাংলা আর্ট অ্যান্ড কালচারের হিসাব গুলিয়ে দেওয়ার ম্যাজিকের নাম ব্রাত্য রাইসু November 23, 2017\\nTagged with ডায়োজিনি..." }
unshuffled_deduplicated_bo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"བོད་མི་འདི་དག་ནི་རང་རྒྱུད་སྒོ་རུ་ཕུད་དེ་གཞན་རྒྱུད་པང་དུ་ཉར་ནས་གསོ་སྐྱོང་བྱེད་དགོས་ཟེར་བ་དང་གཅིག་མཚུངས་རེད།\\nཚན་རིག་ནི་དང་ཐོག་རང..." }
unshuffled_deduplicated_bpy
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"পৌরসভা এহার আয়তন (লয়াহান) ২,৭৩০,.৬৩ বর্গ কিলোমিটার। পৌরসভা এহার মাপাহানর অক্ষাংশ বারো দ্রাঘিমাংশ ইলতাই 18.63° S 48.18° W ।[১]..." }
unshuffled_deduplicated_br
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Ar mank Magalhães(Daveoù a vank) a zo ur spesad evned, Spheniscus magellanicus an anv skiantel anezhañ.\\nGallout a reer implijo..." }
unshuffled_deduplicated_bs
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ž šř é ú šř šř ě šř ž é č ě ž ů ě ď éé ýš ě ě Ž č š ý ě ď é ýš ě ď ě éé ýš ě č ž ě š ý ď ě ýš é ú č ž č š ý ď ý ž é éě ď é č ýš..." }
unshuffled_deduplicated_bxr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"2002 оной хабар буряад хэлэ бэшэгэй һалбари Үндэһэтэнэй хүмүүнлиг ухаанай дээдэ һургуули болгогдожо өөршэлэгдөө.\\nХарин мүнөө б..." }
unshuffled_deduplicated_ca
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Daniel Vendrell, conegut com Vandrell, ha sigut un dels il•lustradors contemporanis més influents, representant a la nova onada..." }
unshuffled_deduplicated_cbk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano..." }
unshuffled_deduplicated_ce
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Шаьш анархисташ ду бохучу жигархойн дIахьедарехь дуьйцу, оьрсийн ницкъаллийн структурийн а, федералан каналан а Iалашонаш \\\"мар..." }
unshuffled_deduplicated_ceb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Si Isko walay pupamilok nga nagtan-aw sa unahan, natugaw. “Naunsa ka gud diha Isko nga layo man kaayo ang imong panan-aw?” ni I..." }
unshuffled_deduplicated_ckb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"رسی رۆژ - ساڵێک دوای بومەلەرزەی کرماشان میوانی بەرنامە : کاک سیاوەش حەیاتی چالاکی مەدەنی -قەسری شیرین\\nپارچە موزیک 30 / 10 / 20..." }
unshuffled_deduplicated_cs
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Akce anarchistů proti připravovanému novému služební řádu a nízkým mzdám 1903 – Historie českého anarchismu (1880 – 1939)\\nRost..." }
unshuffled_deduplicated_cv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Шыранӑ чухне ӑнсӑртран латин кирилл саспаллисем вырӑнне латин саспаллисене ҫырсан, сайт эсир ҫырнине юсама тӑрӑшӗ.\\nКу сайтра ч..." }
unshuffled_deduplicated_cy
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Mae capeli Cymreig yr Andes ym Mhatagonia wedi cyhoeddi na fydd gwasanaethau yno weddill y mis, oherwydd yr eira trwm sydd wedi..." }
unshuffled_deduplicated_da
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Den 2.-5. februar 2016 løb det tredje kursus i uddannelsen af 4kommunesamarbejdets Local Impact Coaches, af stablen i Gentofte ..." }
unshuffled_deduplicated_de
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Auf dieser Seite gibt es mind. ein YouTube Video. Cookies für diese Website wurden abgelehnt. Dadurch können keine YouTube Vide..." }
unshuffled_deduplicated_diq
An example of 'train' looks as follows.
{ "id": 0, "text": "Zıwanê Slawki, zıwano merdumanê Slawano. Zıwanê Slawki yew lızgeyê Zıwananê Hind u Ewropao. Keyeyê Zıwananê Slawki beno hirê letey:" }
unshuffled_deduplicated_dsb
An example of 'train' looks as follows.
{ "id": 1, "text": "Pśiklaskaju južo pśed pśedstajenim... 1500 źiśi njamóžo wěcej docakaś, měsćańska hala w Chóśebuzu - wupśedana." }
unshuffled_deduplicated_dv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ބ. އަތޮޅުގައި ހުޅުވަން ތައްޔާރުވަމުން އަންނަ ވައްކަރު ރިސޯޓުގައި ވަޒީފާ އަދާކުރަން ޝައުގުވެރިވާ ފަރާތްތަކަށް ކުރިމަތިލުމުގެ ފުރ..." }
unshuffled_deduplicated_el
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Νεκρός εντοπίστηκε μέσα στο σπίτι του στην οδό Ηρώδου Αττικού στον αριθμό 7 ο επικεφαλής του προξενικού τμήματος της Ρωσικής πρ..." }
unshuffled_deduplicated_eml
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"A séguit dal prucès ad rubutiśasiòṅ di abitànt dal pòpul ad Mikenes, Angoras 'l è finî dènt'r a 'n robot cun la tèsta dna rana ..." }
unshuffled_deduplicated_en
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visi..." }
unshuffled_deduplicated_eo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon\\nTemas pri kolekto d..." }
unshuffled_deduplicated_es
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazamiento, Los bailes fáciles pa..." }
unshuffled_deduplicated_et
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"MTÜ AB Video järgib oma tegevuses kodanikuühenduste eetilise tegevuse üldtunnustatud põhimõtteid, mis on lühidalt kokkuvõetud 7..." }
unshuffled_deduplicated_eu
An example of 'train' looks as follows.
{ "id": 0, "text": "Gure jarduerek eraikuntzarekin, elkarbizitzarekin, hirigintzarekin eta ekologiarekin dute harremana, baita ideia eta konponbideak irudikatu eta garatzearekin ere, eraikuntza sektorea hobetuz, pertsonen erosotasuna eta bizi-kalitatea hobetzeko." }
unshuffled_deduplicated_fa
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"قـــــــــــــــــرار بود با هم کنـــــــــــــار بیایم نه اینکه از کنــــــــــــار هم رد بشیم...!!!\\nاگر روزی دلت لبریز غم بو..." }
unshuffled_deduplicated_fi
An example of 'train' looks as follows.
{ "id": 1, "text": "Kiitos Deelle kaikesta - 1,5 viikkoa kulunut, kun Dee ei ole enää ollut omani. Reilu viikko sitten sunnuntaina vein Deen uuteen kotiinsa. Itselläni on ollut niin ristiriitaiset t..." }
unshuffled_deduplicated_fr
An example of 'train' looks as follows.
{ "id": 0, "text": "Média de débat d'idées, de culture et de littérature. Récits, décryptages, analyses, portraits et critiques autour de la vie des idées. Magazine engagé, ouvert aux autres et au monde.. Bring up to date in french" }
unshuffled_deduplicated_frr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Hiragana’ Practice’Sheet’1’(A -O)’ ’ Name:’________ __________________________’Section:’_______________ _’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ..." }
unshuffled_deduplicated_fy
An example of 'train' looks as follows.
{ "id": 1, "text": "Nim in sêfte ride op Holmsjön, yn ien fan 'e lytse marren yn de omkriten, of nim se op avontueren lykas nonresidential. lâns Indalsälven wetter. Holm Sportklubb hawwe kano 's te huur, yn gearwurking mei de Baltyske Power konferinsje." }
unshuffled_deduplicated_ga
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Is fóram é seo chun plé a dhéanamh ar an leabhar atá roghnaithe do mhí na Samhna 2013 amháin. Ní féidir ach le baill chláraithe..." }
unshuffled_deduplicated_gd
An example of 'train' looks as follows.
{ "id": 0, "text": "Zhou Yujun, a 'phàrtaidh Rùnaire Comataidh Sgìre Yanfeng ann Hengyang bhaile agus a Sgìre pàrtaidh agus an riaghaltas a' bhuidheann-riochdachaidh a 'tighinn a chèilidh air ar companaidh air Apr. 14, 2017." }
unshuffled_deduplicated_gl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"O persoal de Inditex da provincia de Pontevedra segue a reclamar iguais condicións laborais no conxunto do país - CIG: Confeder..." }
unshuffled_deduplicated_gn
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"º ÑÆÚÓ À Ã Ð É Æ ¾ ÄÂ Î À ¼ Æ É ÄÛ = Ü Ý\\\"Þ ßà á â ã ä å æçè ã é ê â å àë ì æê íî é á ë ï í çì àð í Ü à ñ ê é ò ä ì\"..." }
unshuffled_deduplicated_gom
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"दुष्ट शीळ हें कौरवांचें । रामें सविस्तर देखूनि साचें । बोलिले वचनें जें दुर्वाचे । करी तयांचें अनुस्मरण ॥२२०॥\"..." }
unshuffled_deduplicated_gu
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"અધિક માસ ચાલે છે. સમગ્ર ભારતમાં અને તેમાંય ખાસ કરીને પવિત્ર કે ધાર્મિક કહેવાય છે તેવા સ્થાનક પર કથાનો દોર ચાલે છે. ઉનાળાની કાળઝ..." }
unshuffled_deduplicated_he
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"זקוקים לרשתות נגד יתושים? מחפשים רשת מתאימה לחלון צר וקטן? רשתות נגד יתושים אקורדיון של חברת קליר-מש הן הפתרון.\\nרשתות לחלונות ..." }
unshuffled_deduplicated_hi
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"'आइटम गर्ल' बनकर हिट हुई थीं राखी सावंत, आज करीना-कटरीना तक फॉलो कर रही हैं ट्रेंड नक्सलियों का दम निकालेगा बाइक ग्रेनेड लॉन्च..." }
unshuffled_deduplicated_hr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"U raspravi je sudjelovao i HSS-ov saborski zastupnik rekavši kako poljoprivrednici ne osjete mjere o kojima ministar govori jer..." }
unshuffled_deduplicated_hsb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Budyšin (SN/BŠe). Elektronikarjo mějachu lětsa cyle hinaši zazběh do swojeho wukubłanja. Wokrjesne rjemjeslnistwo bě mjenujcy w..." }
unshuffled_deduplicated_ht
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan..." }
unshuffled_deduplicated_hu
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"monster - Amatőr, házi szex videók és kezdő csjaok pornó filmjei. - Free amateur, home made sex videos and online porn movies. ..." }
unshuffled_deduplicated_hy
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Արցախի Հանրապետության հռչակման 26-րդ տարեդարձի կապակցությամբ Շուշիի Արվեստի կենտրոնում կազմակերպվել է մոսկվաբնակ նկարիչներ՝ հայ..." }
unshuffled_deduplicated_ia
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha h..." }
unshuffled_deduplicated_id
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Perihal dari itu, kalau kunci hal yang demikian hilang, pemilik wajib melapor ke bengkel sah untuk dibuatkan kunci baru dengan ..." }
unshuffled_deduplicated_ie
An example of 'train' looks as follows.
{ "id": 0, "text": "Plastic Yo Yo Metal Yo Yos Wooden Yo Yo Keychain Yo Yo Translucent Yo Yo Light Up Yo Yo Globe Yo Yo Stress Reliever Yo Yo Jellyfish Yo Yo Sports Ball Yo Yo Sound Yo Yo Miniature Yo Yo Promotional Yo Yo Novelty Yo Yo Video Game Yo Yo ECO Recycled Yo Yo" }
unshuffled_deduplicated_ilo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Segun ken ni Ping-ay, ti yellow corn ti maysa kadagiti nadakamat a liberalized agricultural commodity iti daytoy a free trade k..." }
unshuffled_deduplicated_io
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Chekia esas parlamentala republiko. La chefo di stato esas la prezidanto. Til 2013 lu elektesis dal parlamento. Pos ta yaro, ol..." }
unshuffled_deduplicated_is
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Eyjar.net - upplýsinga- og fréttamiðill um Vestmannaeyjar - Fréttir - Nái núverandi stefna stjórnvalda fram að ganga mun það va..." }
unshuffled_deduplicated_it
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Jaundice - causes, treatment & pathology massaggio a osteochondrosis dellindizio di una controindicazione\\nTrattamento su un co..." }
unshuffled_deduplicated_ja
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します!お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です!お子様の晴れ姿を、緊張していない自然な笑顔で残しませんか?\\n※七五三の..." }
unshuffled_deduplicated_jbo
An example of 'train' looks as follows.
{ "id": 1, "text": "ni'o 23 la cimast. cu 23moi djedi fi'o masti la cimast. noi ke'a cu cimoi masti .i 22 la cimast. cu purlamdei .ije 24 la cimast. cu bavlamdei" }
unshuffled_deduplicated_jv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"José Mourinho (diwaca: [ʒuˈzɛ moˈɾiɲu]; lair ing Setubal, Portugal, 26 Januari 1963; umur 55 taun) iku salah siji pelatih bal k..." }
unshuffled_deduplicated_ka
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"წამიყვანე შენთან ერთად (ქართულად) / Возьми меня с собой (картулад) / (რუსული სერიალები ქართულად) (რუსების პორნო ონლაინში) (ruse..." }
unshuffled_deduplicated_kk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Түлкібас ауданында «Латын негізді әліпби мен емле ережесі туралы насихат» жобасының тобы семинар өткізді\\nЕлорданың «Қазақстан»..." }
unshuffled_deduplicated_km
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ខ្សឹបដាក់ត្រចៀក៖ លោក សួស សុផានិត នាយផ្នែករដ្ឋបាលព្រៃឈើ ស្រុកភ្នំក្រវាញ់ ដែលទើបឡើងកាន់តំណែងថ្មី បើកដៃឲ្យឈ្នួញ ប្រព្រឹត្តបទល្មើស ..." }
unshuffled_deduplicated_kn
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ರಾಷ್ಟ್ರಪತಿ ಪ್ರಣಬ್ ಮುಖರ್ಜಿಯಿಂದ ಪದ್ಮ ಪ್ರಶಸ್ತಿ ಪ್ರದಾನ | President Pranab Mukherjee Confers Padma Awards | Photo Gallery on Kannada..." }
unshuffled_deduplicated_ko
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"CIA 프로젝트에서는 데이터베이스로 들어오는 요청을 중간에 수집(Sniffing)하고 수집한 데이터를 분석(Parsing)하여 그로 인한 결과를 판단하여 알릴 수 있는 시스템(Push Service)이 필요하다. 그리고 연구를 ..." }
unshuffled_deduplicated_krc
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Шамханланы, Бийлени къаршысына ябушуп, Батыр уланларыбызны къоллары булан «ортакъ ожакъ» къургъанбыз. Шо иш уллу зараллы иш бол..." }
unshuffled_deduplicated_ku
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Me di 114 bernameyên xwe yên berê da perçeyên ji berhemên zanyarî yên kurdzanên mezin bi wergera kurdî da ...\\nMe di 114 bernam..." }
unshuffled_deduplicated_kv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Коми кытшыслӧн ыджытжык тор вӧр увтын куйлӧ, сійӧн и фаунасӧ татӧн аркмӧтӧны вӧрын олісь подаэз. Ассямаӧн лоӧ сія, мый кытшас с..." }
unshuffled_deduplicated_kw
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"????????????????????????????????????????????????????????????????????Pray without ceasing???????????????????????????????????????..." }
unshuffled_deduplicated_ky
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Turmush: Бишкек шаардык кеңешинин кезексиз отурумунда мэрге ишенбөөчүлүк көрсөтүү маселеси каралат, - депутат Т.Сагынов\\nБишкек..." }
unshuffled_deduplicated_la
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\\nEcce ego adducam aqua..." }
unshuffled_deduplicated_lb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Während dem Gaardefestival \\\"Ambiance Jardins\\\" vum 15. bis de 17. Mee huet den SNJ nees zesumme mam Groupe Animateur en Inform..." }
unshuffled_deduplicated_lez
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Ахцегь хуьр, виридалай ч1ехи лезги хуьрерикая я. Ам Урусатдин виридалай къиблепатавай хуьрерикай я. Ин хуьр...\"..." }
unshuffled_deduplicated_li
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"'t Good Goedenraad aan de Ezerbaek besjteit oet 'n kesjtièl mèt gesjlote haof en 'n park van 26 hectare. Hie in sjtoon väól beu..." }
unshuffled_deduplicated_lmo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Serét (en tortonés: Sregh; en piemontés: Srèj) l'è 'n cümü italià, de la regiù del Piemónt, en Pruvìncia de Alessandria. El g'h..." }
unshuffled_deduplicated_lo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"ຜູ້ພິພາກສາ ປະຈຳເຂດ ສຫລ ທ່ານນຶ່ງ ຕັດສິນວ່າ ໂຄງການເກັບກຳຂໍ້ມູນ ທາງໂທລະສັບ ຂອງອົງການ ຄວາມໝັ້ນຄົງແຫ່ງຊາດ ແມ່ນຖືກຕ້ອງ ຕາມກົດໝາຍ.\\nກະ..." }
unshuffled_deduplicated_lrc
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"آرلینگتون یئ گئل د شأریا ڤولاتچە ڤیرجینیا و یئ گئل د شأریا ڤولات ڤولاتچە یا یأکاگئرئتە ئمریکاە. ئی شأر دویومی کألوٙن شأر د راسا..." }
unshuffled_deduplicated_lt
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Čir vir vir pavasaris! Čia čia čia… dalinamės labai simpatiška video pamokėle, kurią pristato ab888art galerija.\\nBe galo papra..." }
unshuffled_deduplicated_lv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Dekoratīvi sliekšņi MITSUBISHI OUTLANDER 2007, izgatavoti no ovālas formas, pulētas nerūsējošā tērauda caurules...\\ndažādas tūn..." }
unshuffled_deduplicated_mai
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"१ · २ · ३ · ४ · ५ · ६ · ७ · ८ · ९ · १० · ११ · १२ · १३ · १४ · १५ · १६ · १७ · १८ · १९ · २० · २१ · २२ · २३ · २४ · २५ · २६ · २७ · २..." }
unshuffled_deduplicated_mg
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Nanamboatra taratasy apetaka sy soso-kevitra ho an'ny olona te-hanatevin-daharana ity fihetsiketsehana ity i Anocrena.\\nNosorat..." }
unshuffled_deduplicated_mhr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Акрет жап годым Уганда кундемым Пигмей племена- влак айлен шогеныт. мемнан эран 1 курым гыч Банту племена влакат тиде кундемышк..." }
unshuffled_deduplicated_min
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\" ..." }
unshuffled_deduplicated_mk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"„Филм плус“ е насловен првиот филмски месечник во Македонија, чиј прв број ќе биде промовиран вечер во „Менада“. Новото македон..." }
unshuffled_deduplicated_ml
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"സ്ത്രീ പ്രവേശനം സര്ക്കാര് പൂര്ണമായും അംഗീകരിക്കുന്നുവെന്നും ശബരിമലയുടെ സുരക്ഷയില് ഇടപെടുമെന്നും സര്ക്കാര് ഹൈക്കോടതിയില്\\..." }
unshuffled_deduplicated_mn
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"МУБИС-ын багш мэргэжлийн хөрвөх сургалтыг төгссөн багшид багшлах эрх олгох тухай ~ БМДИ-ийн захирлын тушаал - Багшийн мэргэжил ..." }
unshuffled_deduplicated_mr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Home / motivational marathi story / उद्योजकता (Entrepreneurship) / यांना हे जमलय, तर आपल्याला का नाही जमणार ?\\nयापैकी कोणाचीही ..." }
unshuffled_deduplicated_mrj
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Лӹпӹвлӓ (латинлӓ Lepidoptera ; алыкмарла лыве-влак) — капшангывлӓ йыхыш пырышы сӱмӓн нӹл шылдыран капшангывлӓ. Цилӓжӹ 180000 тӹ..." }
unshuffled_deduplicated_ms
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Sanad pertama daripada Zuhair bin Harb daripada ‘Affan daripada Hammad daripada Thabit daripada Anas.\\nSanad kedua daripada ‘Ab..." }
unshuffled_deduplicated_mt
An example of 'train' looks as follows.
{ "id": 0, "text": "tibgħat il-kawża lura lill-Qorti Ġenerali għall-annullament jew għat-tnaqqis tal-penalità imposta mill-Kummissjoni bid-deċiżjoni inizjali kif emendata bid-deċiżjoni ta’ rettifika;" }
unshuffled_deduplicated_mwl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Deciplina social i outónoma que angloba atebidades de ouserbaçon, de análeze, de çcriçon, cumparaçon, de sistematizaçon i de sp..." }
unshuffled_deduplicated_my
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ျမ၀တီ - ရန္ကုန္တိုင္းေဒသႀကီး ေျမာက္ဥကၠလာပႏွင္႕ ဗဟန္းၿမိဳ႔နယ္ မေကြးတိုင္း ေဒသႀကီး ပခုကၠဴၿမိဳ႔နယ္တို႔၌ ျမန္မာ႕တပ္မေတာ္အား ေထာက္ခံ..." }
unshuffled_deduplicated_myv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"2018 иень умарьковонь 6-це чистэ сась паро куля! Россиянь культурань Министерствась макссь невтемань конёв (прокатной удостовер..." }
unshuffled_deduplicated_mzn
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"قرآن یا قوران اسلام ِآسمونی کتاب هسته. مسلمونون گانّّه قرآن ره خدا، وحی جه برسنییه، «محمد معجزه» هسته و ثقلین حدیث دله ونه خَو..." }
unshuffled_deduplicated_nah
An example of 'train' looks as follows.
{ "id": 0, "text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl." }
unshuffled_deduplicated_nap
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ò AUDIT í Ç è î ÿ å å 30 ò ÿ ÿ é, õ ñ ì ÿ, ê ã- ò à ì. å â å í ç â à à é ñ è å é ó ó ë. å å å û è å î é è à. à è à AUDIT 1-7 â ..." }
unshuffled_deduplicated_nds
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Dor kann sik vun nu af an de hele plattdüütsche Welt – vun Niebüll bit New York, vun Helgoland bit Honolulu – drapen. Allens, w..." }
unshuffled_deduplicated_ne
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"बर्दिबास नगरपालिकाको तेस्रो नगर परिषदबाट पारित आ.व.२०७३।७४ को संशोधित र २०७४।७५ को प्रस्तावित नीति, कार्यक्रम तथा बजेट\\nअार्थिक..." }
unshuffled_deduplicated_new
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"थ्व शहरयागु अक्षांश ३४.७००१६४ उत्तर व देशान्तर ८६.३७६४६९ पश्चिम खः (34.700164° N 86.376469° W)। थ्व थासे ७२२६७३२ वर्ग मिटर (२.७..." }
unshuffled_deduplicated_nl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Op vrijdag 31 augustus wordt het nieuwe studiejaar van de masteropleiding architectuur geopend met een dagexcursie naar Venlo.\\..." }
unshuffled_deduplicated_nn
An example of 'train' looks as follows.
{ "id": 0, "text": "Planomtale krav til innhald Bakgrunn: Spørsmål frå fleire kommunar om kva ein planomtale/planbeskrivelse bør innehalde Fylkeskommunen og fylkesmannen har i ein del saker reist motsegn på formelt grunnlag" }
unshuffled_deduplicated_no
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Ytterligere aktører i primærhelsetjenesten og andre NHS-virksomheter ble infisert, inkludert legekontor.Læreren vår er så attra..." }
unshuffled_deduplicated_oc
An example of 'train' looks as follows.
{ "id": 1, "text": ".рф (rf, còdi punycode: .xn--p1ai)[1] es lo nom de domeni en rus per Russia. Foguèt activat lo 12 de mai de 2010. Lo còdi latin es .ru." }
unshuffled_deduplicated_or
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ଭୁବନେଶ୍ୱର, ୨୭/୧– (ଓଡ଼ିଆ ପୁଅ) ସିପିଆଇ ଜାତୀୟ ପରିଷଦର ଆହ୍ୱାନକ୍ରମେ ଗତକାଲି ଜାନୁୟାରୀ ୨୬ ସାଧାରଣତନ୍ତ୍ର ଦିବସକୁ ଦେଶ ବ୍ୟାପୀ ସମ୍ବିଧାନ ସୁରକ୍ଷା ..." }
unshuffled_deduplicated_os
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"1. Лæппу æмæ чызг казрæдзийы зæрдæмæ куы фæцæуынц æмæ, куы сфæнд кæнынц сæ цард баиу кæнын, уæд лæппу бар ракуры чызгæй, цæмæй ..." }
unshuffled_deduplicated_pa
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ਰਜਿ: ਨੰ: PB/JL-138/2018-20 ਜਿਲਦ 63, ਬਾਨੀ ਸੰਪਾਦਕ (ਸਵ:) ਡਾ: ਸਾਧੂ ਸਿੰਘ ਹਮਦਰਦ ਫ਼ੋਨ : 0181-2455961-62-63, 5032400, ਫੈਕਸ : 2455960, 2..." }
unshuffled_deduplicated_pam
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Áku pu i Anak ning Aláya at ngeni ipákit kó kékayu ngan nûng makanánu lang susúlat détinang kulit a mágkas. Lauan ya ing tarátu..." }
unshuffled_deduplicated_pl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"System informatyczny - Załącznik nr 1 do zarządzenia Wójta Gminy Podegrodzie Nr 530/2013 z dnia 27 maja 2013 r\\nSystem informat..." }unshuffled_deduplicated_pms
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Louvigné-du-Désert a l'é na comun-a fransèisa ant la region aministrativa dla Brëtagna, ant ël dipartiment d'Ille-et-Vilaine. A..." }unshuffled_deduplicated_pnb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"ایہ فائل Wikimedia Commons توں اے تے دوجیاں ویونتاں تے وی ورتی جاےکدی اے۔ گل بات اس دے فائل گل بات صفہ تے تھلے دتی گئی۔\"..." }unshuffled_deduplicated_ps
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Many people usually use the time period ‘business to business (B2B) advertising,’ however most of them do not know precisely wh..." }unshuffled_deduplicated_pt
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem int..." }unshuffled_deduplicated_qu
An example of 'train' looks as follows.
{ "id": 1, "text": "Warayu wichay (kastilla simipi: Ascensión de Guarayos) nisqaqa Buliwya mama llaqtapi, Santa Krus suyupi, huk llaqtam, Warayu pruwinsyap uma llaqtanmi." }unshuffled_deduplicated_rm
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"practicists agrars / practicistas agraras AFP pon far ina furmaziun da basa scursanida per cuntanscher in attestat federal da q..." }unshuffled_deduplicated_ro
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"“În viață, oportunitatea nu este totul. Cine atrage Lumina, cineva bun în umbră. Timpul ne creează.” maestru\\nLyn.Evans: Ce mar..." }unshuffled_deduplicated_ru
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Доступ к данному профилю для публичного просмотра закрыт администрацией сайта - профиль находится на модерации.\\nРазработчикам ..." }unshuffled_deduplicated_sa
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"अनिरुद्धनगरे क्रीडिता रामलीला सम्प्रति समाप्ता अस्ति । तस्य कानिचन् चित्राणि पूर्वमेव प्रकाशितानि सन्ति । द्वौ चलचित्रौ अपि ..." }unshuffled_deduplicated_sah
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████..." }unshuffled_deduplicated_scn
An example of 'train' looks as follows.
{ "id": 0, "text": "La gilusìa è nu sintimentu dulurusu ca nasci d'un disideriu di pussessu sclusivu ntê cunfrunti dâ pirsuna amata e dû timuri, dû suspettu o dâ cirtizza dâ sò nfidiltati." }unshuffled_deduplicated_sd
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"هر ڪو ڄاڻي ٿو ته جڏهن توهان هڪ وڏي خريد ڪرڻ چاهيون ٿا, توهان پڄي ضروري حڪم ۾ ان جي ڪم ڪرڻ جي هٿ ۾ لاڳاپو ڪيو آهي. جي شيء آهي ته..." }unshuffled_deduplicated_sh
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Opština Gornja Radgona se nalazi u sjeveroistočnoj Sloveniji i graniči s susjednom Austriji duž rijeke Mure. Sa tridesetim nase..." }unshuffled_deduplicated_si
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"ලාංකීය සිතිවිලි සිංහල බ්ලොග් කියවනය කොත්තු සින්ඩිය ලංකා Blogger හත්මාළුව ලංකා බ්ලොග් කියවනය මාතලන්ගේ සින්ඩිය මොබයිල්lk\\nඅවකාශය ..." }unshuffled_deduplicated_sk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Aktivity | Agentúra podporovaného zamestnávania | vzdelávanie pre klientov, vzdelávanie pre odborníkov, kurzy\\nŠpecializované k..." }unshuffled_deduplicated_sl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Če Creatures, ki je želel, da pridejo na čas, predvsem je povedlo – razlikuje od ljubosumja začel grizenja kolen (ali zadnjica)..." }unshuffled_deduplicated_so
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт ттттттттттттттттуууууууууууу..." }unshuffled_deduplicated_sq
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Çfarë do të më pëlqente tek një femër ose çfarë do të më shndërronte në një shpërthim drite? – Albert Vataj\\nTë gjithëve një zo..." }unshuffled_deduplicated_sr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Корисни савети за сваки дан. На сајту су разне категорије, као што су љепота, мода, кување и поправка властитим рукама.\\nШколск..." }unshuffled_deduplicated_su
An example of 'train' looks as follows.
{ "id": 1, "text": "Kartu krédit nyaéta \"duit plastik\" anu dikaluarkeun ku bank pikeun alat pambayaran di tempat-tempat nu tangtu samisal jiga di hotél, réstoran, tempat rékréasi jeung sajabana.[1]" }unshuffled_deduplicated_sv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"1783 är ett viktigt årtal i den nya tidens historia. Det året slöts en fred i Paris och därmed blev de 13 brittiska kolonierna ..." }unshuffled_deduplicated_sw
An example of 'train' looks as follows.
{ "id": 1, "text": "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu." }unshuffled_deduplicated_ta
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"பொழுது சாய்ந்து வெகு நேரமாகிவிட்டது. கூலி வேலைக்குப் போயிருந்த 'சித்தாள் ' பெண்கள் எல்லோரும் வீடு திரும்பி விட்டார்கள். இன்னும்..." }unshuffled_deduplicated_te
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"హర్యానాలో టోల్ దగ్గర సిబ్బంది.. స్థానిక ప్రజలు కొట్టుకున్నారు. కర్నాల్ అనే గ్రామానికి సమీపంలో టోల్ గేట్ ఉంది. అయితే సాధారణంగా స..." }unshuffled_deduplicated_tg
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Ҳумайро гуфтааст, мухолифи низом аст, низоме, ки дар Тоҷикистон вуҷуд дорад. Ба ин маънӣ, худро мухолифи давлату ҳукумати Тоҷик..." }unshuffled_deduplicated_th
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ฟันที่แลดูขาวสะอาดไม่มีเศษอาหารติดอยู่ เหงือกสีชมพู ไม่เจ็บ หรือมีเลือดออกเวลาแปรงฟันหรือขัดฟัน ไม่มีปัญหาเรื่องกลิ่นปาก ทำให้ก..." }unshuffled_deduplicated_tk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Türkmenistanyň Prezidenti agyr atletika boýunça dünýä çempionatyna taýýarlyk işleriniň barşy bilen tanyşdy\\nHalallykdan kemal t..." }unshuffled_deduplicated_tl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"“Gusto ko manawagan sa mga Unit Head ng Chanel 2 Salve. Kasi napapansin ko iyon mga alaga ko ang taping halos once a week lang,..." }unshuffled_deduplicated_tr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Son yıllarda görülen ay tutulmalarına göre daha etkili olacağı söylenen Kanlı veya Kırmızı Ay Tutulmasına saatler kaldı. Bu akş..." }unshuffled_deduplicated_tt
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"\\\"Иремнең вафатына 40 көн узгач, Алмаз да безнең өйгә кереп үлде\\\". Арчада 35 яшьлек ир өстенә кондызлар ега башлаган агач төшк..." }unshuffled_deduplicated_tyv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Экии, хүндүлуг аалчылар болгаш тыва дылдың деткикчилери! Тыва дылдың болгаш чогаалдың ховар бир башкызынга, Менги Ооржакка, ажы..." }unshuffled_deduplicated_ug
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"زاڭ-ءتۇزىم | عىلىم-تەحنيكا | ءتىل-ادەبيەت | تۇرمىس | دەنە تاربيە | ساياحات-ورتا | سۋرەتتى حابار | سىر سۇحبات | ارناۋلى تاقىرىپ ..." }unshuffled_deduplicated_uk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Про надання роз'яснення (щодо форми письмового зобов'язання громадян про зворотне ввезення/вивезення товарів), Державна митна с..." }unshuffled_deduplicated_ur
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"آئیے اہم اسلامی کتب کو یونیکوڈ میں انٹرنیٹ پر پیش کرنے کے لئے مل جل کر آن لائن ٹائپنگ کریں۔ محدث ٹائپنگ پراجیکٹ کے ذریعے آپ روز..." }unshuffled_deduplicated_uz
An example of 'train' looks as follows.
{ "id": 1, "text": "Qurama tog'lari tizmasining Toshkentdan 154 km uzoqlikdagi Toshkent-Ush yo'li yeqasidaxushmanzara tabiat qo'ynida joylashgan maydoni 30 ga.\nBolalarni sog'lomlashtirish oromgohi Bo'stonliq tumani Oqtosh muntaqasining soy-salqin gushasida joylashgan." }unshuffled_deduplicated_vec
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Par ogni pónto, ła derivada ła xe ła pendensa de ła reta tangente a ła curva de ła funsion f. Ła reta de cołor róso l'è senpre ..." }unshuffled_deduplicated_vi
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Canh chua cá bông lau không chỉ là món ăn giải nhiệt, thanh mát ngày hè mà còn là món siêu bổ dưỡng, rất tốt cho người gầy ốm. ..." }unshuffled_deduplicated_vo
An example of 'train' looks as follows.
{ "id": 1, "text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées, in topäd: Midi-Pyrénées, in Fransän. Sarniguet topon videtü 43°19’ 7’’ N e lunetü 0°5’ 19’’ L." }unshuffled_deduplicated_wa
An example of 'train' looks as follows.
{ "id": 1, "text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete." }unshuffled_deduplicated_war
An example of 'train' looks as follows.
{ "id": 1, "text": "An Honce amo in usa ka baryo ngan munisipalidad ha distrito han Rožňava ha rehiyon han Košice ha nasod han Slovakia.\nAn Rumegies amo in usa ka komyun ha departamento han Nord ngan ha rehiyon han Nord-Pas-de-Calais ha nasod han Fransya." }unshuffled_deduplicated_wuu
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"伊春元旦天气 伊春腊八天气 伊春春节天气 伊春情人节天气 伊春元宵节天气 伊春愚人节天气 伊春清明节天气 伊春劳动节天气 伊春母亲节天气 伊春端午节天气 伊春七夕节天气 伊春教师节天气 伊春中秋节天气 伊春国庆节天气 伊春重阳节天气 伊春万圣节天气 伊春..." }unshuffled_deduplicated_xal
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Арнгудин Орн гисн Европд бәәдг һазр. 2007 җилин тooһaр эн орн нутгт 3,600,523 әмтн бәәдг билә. Арнгудин Орнин хотл балһсна нерн..." }unshuffled_deduplicated_xmf
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"მოჩამილი ტექსტი წჷმორინელი რე Creative Commons Attribution-ShareAlike ლიცენზიათ; შილებე გეძინელი პირობეფიშ არსებუა. კილიშკილიშა..." }unshuffled_deduplicated_yi
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ממשותדיק - חבֿרה, איך אַרבעט איצט אױף אַ זשורנאַל. טאָמער איר האָט עפּעס צוצוגעבן זאָלט איר שיקן מיר אַן אָנזאָג. ס'װעט הײסן \\\"..." }unshuffled_deduplicated_yo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Copyright © 2018 BBC. BBC kò mọ̀ nípa àwọn ohun tí ó wà ní àwọn ojú òpó tí ó wà ní ìta. Ọwọ́ tí a fi mú ìbáṣepọ̀ ti ìta.\"..." }unshuffled_deduplicated_yue
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 你還不爆 我累了 投降輸一半可以嗎\"..." }unshuffled_deduplicated_zh
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"中国铝灰网 中国有色金属矿产网 中国黄莲网 中国水轮发电机网 中国抽油泵网 中国数控雕刻机网 中国不锈钢抛光网 中国磨具加工网 中国压铸铝网 中国耐水腻子网 中国手机摄像头网 中国粗粮网 中国车门锁网 中国钛粉网 中国轮圈网\\n天天中奖彩票图 天天中彩票..." }
Data/size information for each language (original)
unshuffled_original_af
An example of 'train' looks as follows.
{ "id": 0, "text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel" }unshuffled_original_als
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"De Nazionalpark hät e Flächi vo 170,3 km² und isch dodemit s grösti Naturschutzgebiet vo de Schwiz. Er ligt uf em Gebiet vo de ..." }unshuffled_original_am
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"አየር መንገዱ ከአዲስ አበባ ወደ ሮም ጣሊያን በማምራት ላይ በነበረበት ጊዜ ረዳት አብራሪው የጉዞውን አቅጣጫ በመቀየር ጄኔቭ አውሮፓላን ማረፊያ በማሳረፍ እጁን ለፖሊስ ሰጥቷል።\\nየኢትዮጵያ መንግስት የ..." }unshuffled_original_an
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"واااااااأسفاه الأمم تفتخر ب 0 أمي ووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووو..." }unshuffled_original_ar
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد بكل جديد\\nأهلا وسهلا بك زا..." }unshuffled_original_arz
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"بنى عجل : قبيلة من عجل بن لجيم بن صعب بن على بن بكر بن وائل انتقل اغلبهم الى البصرة فى العراق و اصفهان و خراسان فى ايران و اذرب..." }unshuffled_original_as
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"আমি, এই সংগঠনৰ সদস্য সকলে একেলগ হৈ অসমকে ধৰি ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ অমূল্য কলা-সাংস্কৃতিক সম্পদৰাজি বৃহত্তৰ অষ্ট্ৰেলিয়াৰ সন্মু..." }unshuffled_original_ast
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"The Killers llanzaron el so álbum debú, Hot Fuss, en xunu de 2004 nel Reinu Xuníu, al traviés de la discográfica Lizard King, y..." }unshuffled_original_av
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Жинда малъараб ва божизе бегьулеб рагІудаса кьуризе бегьуларо гьев. Гьес насихІат гьабизе кколелъул бацІцІадаб диналъул рахъалъ..." }unshuffled_original_az
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün xərclərini yerli daxilolmalar hesabına maliyyələşdirir.\\nDünən, 10..." }unshuffled_original_azb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"لعلی ١٣-جو عصرده یاشاییب یاراتمیش گؤرکملی آذربایجان شاعرلریندندیر. ١٢٢٤-جی ایلده تبریزده آنادان اولموشدور، گنج یاشلاریندا تیجار..." }unshuffled_original_ba
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Күҙәтеү ҡуласаһы моделен хәҙер Мифтахетдин Аҡмулла исемендәге Башҡорт дәүләт педагогия университетында ла эшләргә мөмкин\\t\\nКүҙ..." }unshuffled_original_bar
An example of 'train' looks as follows.
{ "id": 0, "text": " vo" }unshuffled_original_bcl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"& ÿ ó / í 0 - ø û ù ö ú ð ï ú \\u0014 ù þ ô ö í ÷ ò \\u0014 ÷ í ù û ö í \\u0001 û ñ ç þ \\u0001 ð \\u0007 þ ò ñ ñ ò ô \\u0017 û ö ô ÷..." }unshuffled_original_be
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Брэсцкія ўлады не дазволілі прафсаюзу РЭП правесці пікетаванне ў парку Воінаў-інтэрнацыяналістаў 30 мая 2018 года.\\nСітуацыю пр..." }unshuffled_original_bg
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ЖАЛБОПОДАТЕЛЯТ директор на Дирекция „ Обжалване и данъчно-осигурителна практика“- Бургас, редовно призован, се представлява от ..." }unshuffled_original_bh
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"सुकमा जिला भारत के छत्तीसगढ़ राज्य में एगो जिला बाटे। एकर मुख्यालय सुकमा शहर बाटे। एकर कुल रकबा 5636 वर्ग कि॰मी॰ बाटे।\"..." }unshuffled_original_bn
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ভড়ং সর্বস্ব বাংলা আর্ট অ্যান্ড কালচারের হিসাব গুলিয়ে দেওয়ার ম্যাজিকের নাম ব্রাত্য রাইসু November 23, 2017\\nভড়ং সর্বস্ব বাংলা আর..." }unshuffled_original_bo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"བོད་མི་འདི་དག་ནི་རང་རྒྱུད་སྒོ་རུ་ཕུད་དེ་གཞན་རྒྱུད་པང་དུ་ཉར་ནས་གསོ་སྐྱོང་བྱེད་དགོས་ཟེར་བ་དང་གཅིག་མཚུངས་རེད།\\nཚན་རིག་ནི་དང་ཐོག་རང..." }unshuffled_original_bpy
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"পৌরসভা এহার আয়তন (লয়াহান) ২,৭৩০,.৬৩ বর্গ কিলোমিটার। পৌরসভা এহার মাপাহানর অক্ষাংশ বারো দ্রাঘিমাংশ ইলতাই 18.63° S 48.18° W ।[১]..." }unshuffled_original_br
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Ar mank Magalhães(Daveoù a vank) a zo ur spesad evned, Spheniscus magellanicus an anv skiantel anezhañ.\\nGallout a reer implijo..." }unshuffled_original_bs
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ž šř é ú šř šř ě šř ž é č ě ž ů ě ď éé ýš ě ě Ž č š ý ě ď é ýš ě ď ě éé ýš ě č ž ě š ý ď ě ýš é ú č ž č š ý ď ý ž é éě ď é č ýš..." }unshuffled_original_bxr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"2002 оной хабар буряад хэлэ бэшэгэй һалбари Үндэһэтэнэй хүмүүнлиг ухаанай дээдэ һургуули болгогдожо өөршэлэгдөө.\\nХарин мүнөө б..." }unshuffled_original_ca
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Daniel Vendrell, conegut com Vandrell, ha sigut un dels il•lustradors contemporanis més influents, representant a la nova onada..." }unshuffled_original_cbk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano..." }unshuffled_original_ce
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Шаьш анархисташ ду бохучу жигархойн дIахьедарехь дуьйцу, оьрсийн ницкъаллийн структурийн а, федералан каналан а Iалашонаш \\\"мар..." }unshuffled_original_ceb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Si Isko walay pupamilok nga nagtan-aw sa unahan, natugaw. “Naunsa ka gud diha Isko nga layo man kaayo ang imong panan-aw?” ni I..." }unshuffled_original_ckb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"رسی رۆژ - ساڵێک دوای بومەلەرزەی کرماشان میوانی بەرنامە : کاک سیاوەش حەیاتی چالاکی مەدەنی -قەسری شیرین\\nپارچە موزیک 30 / 10 / 20..." }unshuffled_original_cs
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Akce anarchistů proti připravovanému novému služební řádu a nízkým mzdám 1903 – Historie českého anarchismu (1880 – 1939)\\nRost..." }unshuffled_original_cv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Шыранӑ чухне ӑнсӑртран латин кирилл саспаллисем вырӑнне латин саспаллисене ҫырсан, сайт эсир ҫырнине юсама тӑрӑшӗ.\\nКу сайтра ч..." }unshuffled_original_cy
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Mae capeli Cymreig yr Andes ym Mhatagonia wedi cyhoeddi na fydd gwasanaethau yno weddill y mis, oherwydd yr eira trwm sydd wedi..." }unshuffled_original_da
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Den 2.-5. februar 2016 løb det tredje kursus i uddannelsen af 4kommunesamarbejdets Local Impact Coaches, af stablen i Gentofte ..." }unshuffled_original_de
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Auf dieser Seite gibt es mind. ein YouTube Video. Cookies für diese Website wurden abgelehnt. Dadurch können keine YouTube Vide..." }unshuffled_original_diq
An example of 'train' looks as follows.
{ "id": 0, "text": "Zıwanê Slawki, zıwano merdumanê Slawano. Zıwanê Slawki yew lızgeyê Zıwananê Hind u Ewropao. Keyeyê Zıwananê Slawki beno hirê letey:" }unshuffled_original_dsb
An example of 'train' looks as follows.
{ "id": 1, "text": "Pśiklaskaju južo pśed pśedstajenim... 1500 źiśi njamóžo wěcej docakaś, měsćańska hala w Chóśebuzu - wupśedana." }unshuffled_original_dv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ބ. އަތޮޅުގައި ހުޅުވަން ތައްޔާރުވަމުން އަންނަ ވައްކަރު ރިސޯޓުގައި ވަޒީފާ އަދާކުރަން ޝައުގުވެރިވާ ފަރާތްތަކަށް ކުރިމަތިލުމުގެ ފުރ..." }unshuffled_original_el
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Νεκρός εντοπίστηκε μέσα στο σπίτι του στην οδό Ηρώδου Αττικού στον αριθμό 7 ο επικεφαλής του προξενικού τμήματος της Ρωσικής πρ..." }unshuffled_original_eml
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"A séguit dal prucès ad rubutiśasiòṅ di abitànt dal pòpul ad Mikenes, Angoras 'l è finî dènt'r a 'n robot cun la tèsta dna rana ..." }unshuffled_original_en
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visi..." }unshuffled_original_eo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon\\nTemas pri kolekto d..." }unshuffled_original_es
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazamiento, Los bailes fáciles pa..." }unshuffled_original_et
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"MTÜ AB Video järgib oma tegevuses kodanikuühenduste eetilise tegevuse üldtunnustatud põhimõtteid, mis on lühidalt kokkuvõetud 7..." }unshuffled_original_eu
An example of 'train' looks as follows.
{ "id": 0, "text": "Gure jarduerek eraikuntzarekin, elkarbizitzarekin, hirigintzarekin eta ekologiarekin dute harremana, baita ideia eta konponbideak irudikatu eta garatzearekin ere, eraikuntza sektorea hobetuz, pertsonen erosotasuna eta bizi-kalitatea hobetzeko." }unshuffled_original_fa
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"قـــــــــــــــــرار بود با هم کنـــــــــــــار بیایم نه اینکه از کنــــــــــــار هم رد بشیم...!!!\\nاگر روزی دلت لبریز غم بو..." }unshuffled_original_fi
An example of 'train' looks as follows.
{ "id": 1, "text": "Kiitos Deelle kaikesta - 1,5 viikkoa kulunut, kun Dee ei ole enää ollut omani. Reilu viikko sitten sunnuntaina vein Deen uuteen kotiinsa. Itselläni on ollut niin ristiriitaiset t..." }unshuffled_original_fr
An example of 'train' looks as follows.
{ "id": 0, "text": "Média de débat d'idées, de culture et de littérature. Récits, décryptages, analyses, portraits et critiques autour de la vie des idées. Magazine engagé, ouvert aux autres et au monde.. Bring up to date in french" }unshuffled_original_frr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Hiragana’ Practice’Sheet’1’(A -O)’ ’ Name:’________ __________________________’Section:’_______________ _’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ..." }unshuffled_original_fy
An example of 'train' looks as follows.
{ "id": 1, "text": "Nim in sêfte ride op Holmsjön, yn ien fan 'e lytse marren yn de omkriten, of nim se op avontueren lykas nonresidential. lâns Indalsälven wetter. Holm Sportklubb hawwe kano 's te huur, yn gearwurking mei de Baltyske Power konferinsje." }unshuffled_original_ga
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Is fóram é seo chun plé a dhéanamh ar an leabhar atá roghnaithe do mhí na Samhna 2013 amháin. Ní féidir ach le baill chláraithe..." }unshuffled_original_gd
An example of 'train' looks as follows.
{ "id": 0, "text": "Zhou Yujun, a 'phàrtaidh Rùnaire Comataidh Sgìre Yanfeng ann Hengyang bhaile agus a Sgìre pàrtaidh agus an riaghaltas a' bhuidheann-riochdachaidh a 'tighinn a chèilidh air ar companaidh air Apr. 14, 2017." }unshuffled_original_gl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"O persoal de Inditex da provincia de Pontevedra segue a reclamar iguais condicións laborais no conxunto do país - CIG: Confeder..." }unshuffled_original_gn
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"º ÑÆÚÓ À Ã Ð É Æ ¾ ÄÂ Î À ¼ Æ É ÄÛ = Ü Ý\\\"Þ ßà á â ã ä å æçè ã é ê â å àë ì æê íî é á ë ï í çì àð í Ü à ñ ê é ò ä ì\"..." }unshuffled_original_gom
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"दुष्ट शीळ हें कौरवांचें । रामें सविस्तर देखूनि साचें । बोलिले वचनें जें दुर्वाचे । करी तयांचें अनुस्मरण ॥२२०॥\"..." }unshuffled_original_gu
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"અધિક માસ ચાલે છે. સમગ્ર ભારતમાં અને તેમાંય ખાસ કરીને પવિત્ર કે ધાર્મિક કહેવાય છે તેવા સ્થાનક પર કથાનો દોર ચાલે છે. ઉનાળાની કાળઝ..." }unshuffled_original_he
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"זקוקים לרשתות נגד יתושים? מחפשים רשת מתאימה לחלון צר וקטן? רשתות נגד יתושים אקורדיון של חברת קליר-מש הן הפתרון.\\nרשתות לחלונות ..." }unshuffled_original_hi
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"'आइटम गर्ल' बनकर हिट हुई थीं राखी सावंत, आज करीना-कटरीना तक फॉलो कर रही हैं ट्रेंड नक्सलियों का दम निकालेगा बाइक ग्रेनेड लॉन्च..." }unshuffled_original_hr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"U raspravi je sudjelovao i HSS-ov saborski zastupnik rekavši kako poljoprivrednici ne osjete mjere o kojima ministar govori jer..." }unshuffled_original_hsb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Budyšin (SN/BŠe). Elektronikarjo mějachu lětsa cyle hinaši zazběh do swojeho wukubłanja. Wokrjesne rjemjeslnistwo bě mjenujcy w..." }unshuffled_original_ht
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan..." }unshuffled_original_hu
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"monster - Amatőr, házi szex videók és kezdő csjaok pornó filmjei. - Free amateur, home made sex videos and online porn movies. ..." }unshuffled_original_hy
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Արցախի Հանրապետության հռչակման 26-րդ տարեդարձի կապակցությամբ Շուշիի Արվեստի կենտրոնում կազմակերպվել է մոսկվաբնակ նկարիչներ՝ հայ..." }unshuffled_original_ia
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha h..." }unshuffled_original_id
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Perihal dari itu, kalau kunci hal yang demikian hilang, pemilik wajib melapor ke bengkel sah untuk dibuatkan kunci baru dengan ..." }unshuffled_original_ie
An example of 'train' looks as follows.
{ "id": 0, "text": "Plastic Yo Yo Metal Yo Yos Wooden Yo Yo Keychain Yo Yo Translucent Yo Yo Light Up Yo Yo Globe Yo Yo Stress Reliever Yo Yo Jellyfish Yo Yo Sports Ball Yo Yo Sound Yo Yo Miniature Yo Yo Promotional Yo Yo Novelty Yo Yo Video Game Yo Yo ECO Recycled Yo Yo" }unshuffled_original_ilo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Segun ken ni Ping-ay, ti yellow corn ti maysa kadagiti nadakamat a liberalized agricultural commodity iti daytoy a free trade k..." }unshuffled_original_io
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Chekia esas parlamentala republiko. La chefo di stato esas la prezidanto. Til 2013 lu elektesis dal parlamento. Pos ta yaro, ol..." }unshuffled_original_is
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Eyjar.net - upplýsinga- og fréttamiðill um Vestmannaeyjar - Fréttir - Nái núverandi stefna stjórnvalda fram að ganga mun það va..." }unshuffled_original_it
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Jaundice - causes, treatment & pathology massaggio a osteochondrosis dellindizio di una controindicazione\\nTrattamento su un co..." }unshuffled_original_ja
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します!お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です!お子様の晴れ姿を、緊張していない自然な笑顔で残しませんか?\\n※七五三の..." }unshuffled_original_jbo
An example of 'train' looks as follows.
{ "id": 1, "text": "ni'o 23 la cimast. cu 23moi djedi fi'o masti la cimast. noi ke'a cu cimoi masti .i 22 la cimast. cu purlamdei .ije 24 la cimast. cu bavlamdei" }unshuffled_original_jv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"José Mourinho (diwaca: [ʒuˈzɛ moˈɾiɲu]; lair ing Setubal, Portugal, 26 Januari 1963; umur 55 taun) iku salah siji pelatih bal k..." }unshuffled_original_ka
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"წამიყვანე შენთან ერთად (ქართულად) / Возьми меня с собой (картулад) / (რუსული სერიალები ქართულად) (რუსების პორნო ონლაინში) (ruse..." }unshuffled_original_kk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Түлкібас ауданында «Латын негізді әліпби мен емле ережесі туралы насихат» жобасының тобы семинар өткізді\\nЕлорданың «Қазақстан»..." }unshuffled_original_km
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ខ្សឹបដាក់ត្រចៀក៖ លោក សួស សុផានិត នាយផ្នែករដ្ឋបាលព្រៃឈើ ស្រុកភ្នំក្រវាញ់ ដែលទើបឡើងកាន់តំណែងថ្មី បើកដៃឲ្យឈ្នួញ ប្រព្រឹត្តបទល្មើស ..." }unshuffled_original_kn
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ರಾಷ್ಟ್ರಪತಿ ಪ್ರಣಬ್ ಮುಖರ್ಜಿಯಿಂದ ಪದ್ಮ ಪ್ರಶಸ್ತಿ ಪ್ರದಾನ | President Pranab Mukherjee Confers Padma Awards | Photo Gallery on Kannada..." }
unshuffled_original_ko
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"CIA 프로젝트에서는 데이터베이스로 들어오는 요청을 중간에 수집(Sniffing)하고 수집한 데이터를 분석(Parsing)하여 그로 인한 결과를 판단하여 알릴 수 있는 시스템(Push Service)이 필요하다. 그리고 연구를 ..." }
unshuffled_original_krc
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Шамханланы, Бийлени къаршысына ябушуп, Батыр уланларыбызны къоллары булан «ортакъ ожакъ» къургъанбыз. Шо иш уллу зараллы иш бол..." }
unshuffled_original_ku
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Me di 114 bernameyên xwe yên berê da perçeyên ji berhemên zanyarî yên kurdzanên mezin bi wergera kurdî da ...\\nMe di 114 bernam..." }
unshuffled_original_kv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Коми кытшыслӧн ыджытжык тор вӧр увтын куйлӧ, сійӧн и фаунасӧ татӧн аркмӧтӧны вӧрын олісь подаэз. Ассямаӧн лоӧ сія, мый кытшас с..." }
unshuffled_original_kw
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"????????????????????????????????????????????????????????????????????Pray without ceasing???????????????????????????????????????..." }
unshuffled_original_ky
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Turmush: Бишкек шаардык кеңешинин кезексиз отурумунда мэрге ишенбөөчүлүк көрсөтүү маселеси каралат, - депутат Т.Сагынов\\nБишкек..." }
unshuffled_original_la
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\\nEcce ego adducam aqua..." }
unshuffled_original_lb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Während dem Gaardefestival \\\"Ambiance Jardins\\\" vum 15. bis de 17. Mee huet den SNJ nees zesumme mam Groupe Animateur en Inform..." }
unshuffled_original_lez
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Ахцегь хуьр, виридалай ч1ехи лезги хуьрерикая я. Ам Урусатдин виридалай къиблепатавай хуьрерикай я. Ин хуьр...\"..." }
unshuffled_original_li
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"'t Good Goedenraad aan de Ezerbaek besjteit oet 'n kesjtièl mèt gesjlote haof en 'n park van 26 hectare. Hie in sjtoon väól beu..." }
unshuffled_original_lmo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Serét (en tortonés: Sregh; en piemontés: Srèj) l'è 'n cümü italià, de la regiù del Piemónt, en Pruvìncia de Alessandria. El g'h..." }
unshuffled_original_lo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"ຜູ້ພິພາກສາ ປະຈຳເຂດ ສຫລ ທ່ານນຶ່ງ ຕັດສິນວ່າ ໂຄງການເກັບກຳຂໍ້ມູນ ທາງໂທລະສັບ ຂອງອົງການ ຄວາມໝັ້ນຄົງແຫ່ງຊາດ ແມ່ນຖືກຕ້ອງ ຕາມກົດໝາຍ.\\nກະ..." }
unshuffled_original_lrc
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"آرلینگتون یئ گئل د شأریا ڤولاتچە ڤیرجینیا و یئ گئل د شأریا ڤولات ڤولاتچە یا یأکاگئرئتە ئمریکاە. ئی شأر دویومی کألوٙن شأر د راسا..." }
unshuffled_original_lt
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Čir vir vir pavasaris! Čia čia čia… dalinamės labai simpatiška video pamokėle, kurią pristato ab888art galerija.\\nBe galo papra..." }
unshuffled_original_lv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Dekoratīvi sliekšņi MITSUBISHI OUTLANDER 2007, izgatavoti no ovālas formas, pulētas nerūsējošā tērauda caurules...\\ndažādas tūn..." }
unshuffled_original_mai
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"१ · २ · ३ · ४ · ५ · ६ · ७ · ८ · ९ · १० · ११ · १२ · १३ · १४ · १५ · १६ · १७ · १८ · १९ · २० · २१ · २२ · २३ · २४ · २५ · २६ · २७ · २..." }
unshuffled_original_mg
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Nanamboatra taratasy apetaka sy soso-kevitra ho an'ny olona te-hanatevin-daharana ity fihetsiketsehana ity i Anocrena.\\nNosorat..." }
unshuffled_original_mhr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Акрет жап годым Уганда кундемым Пигмей племена- влак айлен шогеныт. мемнан эран 1 курым гыч Банту племена влакат тиде кундемышк..." }
unshuffled_original_min
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\" ..." }
unshuffled_original_mk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"„Филм плус“ е насловен првиот филмски месечник во Македонија, чиј прв број ќе биде промовиран вечер во „Менада“. Новото македон..." }
unshuffled_original_ml
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"സ്ത്രീ പ്രവേശനം സര്ക്കാര് പൂര്ണമായും അംഗീകരിക്കുന്നുവെന്നും ശബരിമലയുടെ സുരക്ഷയില് ഇടപെടുമെന്നും സര്ക്കാര് ഹൈക്കോടതിയില്\\..." }
unshuffled_original_mn
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Монгол улс, Улаанбаатар хот - 14191 Энхтайваны өргөн чөлөө - 10, Багш хөгжлийн ордон, Багшийн мэргэжил дээшлүүлэх институт\\nБаг..." }
unshuffled_original_mr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Home / motivational marathi story / उद्योजकता (Entrepreneurship) / यांना हे जमलय, तर आपल्याला का नाही जमणार ?\\nयापैकी कोणाचीही ..." }
unshuffled_original_mrj
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Лӹпӹвлӓ (латинлӓ Lepidoptera ; алыкмарла лыве-влак) — капшангывлӓ йыхыш пырышы сӱмӓн нӹл шылдыран капшангывлӓ. Цилӓжӹ 180000 тӹ..." }
unshuffled_original_ms
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Sanad pertama daripada Zuhair bin Harb daripada ‘Affan daripada Hammad daripada Thabit daripada Anas.\\nSanad kedua daripada ‘Ab..." }
unshuffled_original_mt
An example of 'train' looks as follows.
{ "id": 0, "text": "tibgħat il-kawża lura lill-Qorti Ġenerali għall-annullament jew għat-tnaqqis tal-penalità imposta mill-Kummissjoni bid-deċiżjoni inizjali kif emendata bid-deċiżjoni ta’ rettifika;" }
unshuffled_original_mwl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Deciplina social i outónoma que angloba atebidades de ouserbaçon, de análeze, de çcriçon, cumparaçon, de sistematizaçon i de sp..." }
unshuffled_original_my
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ျမ၀တီ - ရန္ကုန္တိုင္းေဒသႀကီး ေျမာက္ဥကၠလာပႏွင္႕ ဗဟန္းၿမိဳ႔နယ္ မေကြးတိုင္း ေဒသႀကီး ပခုကၠဴၿမိဳ႔နယ္တို႔၌ ျမန္မာ႕တပ္မေတာ္အား ေထာက္ခံ..." }
unshuffled_original_myv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"2018 иень умарьковонь 6-це чистэ сась паро куля! Россиянь культурань Министерствась макссь невтемань конёв (прокатной удостовер..." }
unshuffled_original_mzn
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"قرآن یا قوران اسلام ِآسمونی کتاب هسته. مسلمونون گانّّه قرآن ره خدا، وحی جه برسنییه، «محمد معجزه» هسته و ثقلین حدیث دله ونه خَو..." }
unshuffled_original_nah
An example of 'train' looks as follows.
{ "id": 0, "text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl." }
unshuffled_original_nap
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ò AUDIT í Ç è î ÿ å å 30 ò ÿ ÿ é, õ ñ ì ÿ, ê ã- ò à ì. å â å í ç â à à é ñ è å é ó ó ë. å å å û è å î é è à. à è à AUDIT 1-7 â ..." }
unshuffled_original_nds
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Dor kann sik vun nu af an de hele plattdüütsche Welt – vun Niebüll bit New York, vun Helgoland bit Honolulu – drapen. Allens, w..." }
unshuffled_original_ne
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"बर्दिबास नगरपालिकाको तेस्रो नगर परिषदबाट पारित आ.व.२०७३।७४ को संशोधित र २०७४।७५ को प्रस्तावित नीति, कार्यक्रम तथा बजेट\\nअार्थिक..." }
unshuffled_original_new
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"थ्व शहरयागु अक्षांश ३४.७००१६४ उत्तर व देशान्तर ८६.३७६४६९ पश्चिम खः (34.700164° N 86.376469° W)। थ्व थासे ७२२६७३२ वर्ग मिटर (२.७..." }
unshuffled_original_nl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Op vrijdag 31 augustus wordt het nieuwe studiejaar van de masteropleiding architectuur geopend met een dagexcursie naar Venlo.\\..." }
unshuffled_original_nn
An example of 'train' looks as follows.
{ "id": 0, "text": "Planomtale krav til innhald Bakgrunn: Spørsmål frå fleire kommunar om kva ein planomtale/planbeskrivelse bør innehalde Fylkeskommunen og fylkesmannen har i ein del saker reist motsegn på formelt grunnlag" }
unshuffled_original_no
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Ytterligere aktører i primærhelsetjenesten og andre NHS-virksomheter ble infisert, inkludert legekontor.Læreren vår er så attra..." }
unshuffled_original_oc
An example of 'train' looks as follows.
{ "id": 1, "text": ".рф (rf, còdi punycode: .xn--p1ai)[1] es lo nom de domeni en rus per Russia. Foguèt activat lo 12 de mai de 2010. Lo còdi latin es .ru." }
unshuffled_original_or
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ଭୁବନେଶ୍ୱର, ୨୭/୧– (ଓଡ଼ିଆ ପୁଅ) ସିପିଆଇ ଜାତୀୟ ପରିଷଦର ଆହ୍ୱାନକ୍ରମେ ଗତକାଲି ଜାନୁୟାରୀ ୨୬ ସାଧାରଣତନ୍ତ୍ର ଦିବସକୁ ଦେଶ ବ୍ୟାପୀ ସମ୍ବିଧାନ ସୁରକ୍ଷା ..." }
unshuffled_original_os
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"1. Лæппу æмæ чызг казрæдзийы зæрдæмæ куы фæцæуынц æмæ, куы сфæнд кæнынц сæ цард баиу кæнын, уæд лæппу бар ракуры чызгæй, цæмæй ..." }
unshuffled_original_pa
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ਰਜਿ: ਨੰ: PB/JL-138/2018-20 ਜਿਲਦ 63, ਬਾਨੀ ਸੰਪਾਦਕ (ਸਵ:) ਡਾ: ਸਾਧੂ ਸਿੰਘ ਹਮਦਰਦ ਫ਼ੋਨ : 0181-2455961-62-63, 5032400, ਫੈਕਸ : 2455960, 2..." }
unshuffled_original_pam
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Áku pu i Anak ning Aláya at ngeni ipákit kó kékayu ngan nûng makanánu lang susúlat détinang kulit a mágkas. Lauan ya ing tarátu..." }
unshuffled_original_pl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"System informatyczny - Załącznik nr 1 do zarządzenia Wójta Gminy Podegrodzie Nr 530/2013 z dnia 27 maja 2013 r\\nSystem informat..." }
unshuffled_original_pms
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Louvigné-du-Désert a l'é na comun-a fransèisa ant la region aministrativa dla Brëtagna, ant ël dipartiment d'Ille-et-Vilaine. A..." }
unshuffled_original_pnb
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"ایہ فائل Wikimedia Commons توں اے تے دوجیاں ویونتاں تے وی ورتی جاےکدی اے۔ گل بات اس دے فائل گل بات صفہ تے تھلے دتی گئی۔\"..." }
unshuffled_original_ps
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Many people usually use the time period ‘business to business (B2B) advertising,’ however most of them do not know precisely wh..." }
unshuffled_original_pt
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem int..." }
unshuffled_original_qu
An example of 'train' looks as follows.
{ "id": 1, "text": "Warayu wichay (kastilla simipi: Ascensión de Guarayos) nisqaqa Buliwya mama llaqtapi, Santa Krus suyupi, huk llaqtam, Warayu pruwinsyap uma llaqtanmi." }
unshuffled_original_rm
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"practicists agrars / practicistas agraras AFP pon far ina furmaziun da basa scursanida per cuntanscher in attestat federal da q..." }
unshuffled_original_ro
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"“În viață, oportunitatea nu este totul. Cine atrage Lumina, cineva bun în umbră. Timpul ne creează.” maestru\\nLyn.Evans: Ce mar..." }
unshuffled_original_ru
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Доступ к данному профилю для публичного просмотра закрыт администрацией сайта - профиль находится на модерации.\\nРазработчикам ..." }
unshuffled_original_sa
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"अनिरुद्धनगरे क्रीडिता रामलीला सम्प्रति समाप्ता अस्ति । तस्य कानिचन् चित्राणि पूर्वमेव प्रकाशितानि सन्ति । द्वौ चलचित्रौ अपि ..." }
unshuffled_original_sah
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████..." }
unshuffled_original_scn
An example of 'train' looks as follows.
{ "id": 0, "text": "La gilusìa è nu sintimentu dulurusu ca nasci d'un disideriu di pussessu sclusivu ntê cunfrunti dâ pirsuna amata e dû timuri, dû suspettu o dâ cirtizza dâ sò nfidiltati." }
unshuffled_original_sd
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"هر ڪو ڄاڻي ٿو ته جڏهن توهان هڪ وڏي خريد ڪرڻ چاهيون ٿا, توهان پڄي ضروري حڪم ۾ ان جي ڪم ڪرڻ جي هٿ ۾ لاڳاپو ڪيو آهي. جي شيء آهي ته..." }
unshuffled_original_sh
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Opština Gornja Radgona se nalazi u sjeveroistočnoj Sloveniji i graniči s susjednom Austriji duž rijeke Mure. Sa tridesetim nase..." }
unshuffled_original_si
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"ලාංකීය සිතිවිලි සිංහල බ්ලොග් කියවනය කොත්තු සින්ඩිය ලංකා Blogger හත්මාළුව ලංකා බ්ලොග් කියවනය මාතලන්ගේ සින්ඩිය මොබයිල්lk\\nඅවකාශය ..." }
unshuffled_original_sk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Aktivity | Agentúra podporovaného zamestnávania | vzdelávanie pre klientov, vzdelávanie pre odborníkov, kurzy\\nŠpecializované k..." }
unshuffled_original_sl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Če Creatures, ki je želel, da pridejo na čas, predvsem je povedlo – razlikuje od ljubosumja začel grizenja kolen (ali zadnjica)..." }
unshuffled_original_so
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт ттттттттттттттттуууууууууууу..." }
unshuffled_original_sq
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Çfarë do të më pëlqente tek një femër ose çfarë do të më shndërronte në një shpërthim drite? – Albert Vataj\\nTë gjithëve një zo..." }
unshuffled_original_sr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Корисни савети за сваки дан. На сајту су разне категорије, као што су љепота, мода, кување и поправка властитим рукама.\\nШколск..." }
unshuffled_original_su
An example of 'train' looks as follows.
{ "id": 1, "text": "Kartu krédit nyaéta \"duit plastik\" anu dikaluarkeun ku bank pikeun alat pambayaran di tempat-tempat nu tangtu samisal jiga di hotél, réstoran, tempat rékréasi jeung sajabana.[1]" }
unshuffled_original_sv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"1783 är ett viktigt årtal i den nya tidens historia. Det året slöts en fred i Paris och därmed blev de 13 brittiska kolonierna ..." }
unshuffled_original_sw
An example of 'train' looks as follows.
{ "id": 1, "text": "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu." }
unshuffled_original_ta
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"பொழுது சாய்ந்து வெகு நேரமாகிவிட்டது. கூலி வேலைக்குப் போயிருந்த 'சித்தாள் ' பெண்கள் எல்லோரும் வீடு திரும்பி விட்டார்கள். இன்னும்..." }
unshuffled_original_te
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"హర్యానాలో టోల్ దగ్గర సిబ్బంది.. స్థానిక ప్రజలు కొట్టుకున్నారు. కర్నాల్ అనే గ్రామానికి సమీపంలో టోల్ గేట్ ఉంది. అయితే సాధారణంగా స..." }
unshuffled_original_tg
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Ҳумайро гуфтааст, мухолифи низом аст, низоме, ки дар Тоҷикистон вуҷуд дорад. Ба ин маънӣ, худро мухолифи давлату ҳукумати Тоҷик..." }
unshuffled_original_th
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ฟันที่แลดูขาวสะอาดไม่มีเศษอาหารติดอยู่ เหงือกสีชมพู ไม่เจ็บ หรือมีเลือดออกเวลาแปรงฟันหรือขัดฟัน ไม่มีปัญหาเรื่องกลิ่นปาก ทำให้ก..." }
unshuffled_original_tk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"Türkmenistanyň Prezidenti agyr atletika boýunça dünýä çempionatyna taýýarlyk işleriniň barşy bilen tanyşdy\\nHalallykdan kemal t..." }
unshuffled_original_tl
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"“Gusto ko manawagan sa mga Unit Head ng Chanel 2 Salve. Kasi napapansin ko iyon mga alaga ko ang taping halos once a week lang,..." }
unshuffled_original_tr
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Son yıllarda görülen ay tutulmalarına göre daha etkili olacağı söylenen Kanlı veya Kırmızı Ay Tutulmasına saatler kaldı. Bu akş..." }
unshuffled_original_tt
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"\\\"Иремнең вафатына 40 көн узгач, Алмаз да безнең өйгә кереп үлде\\\". Арчада 35 яшьлек ир өстенә кондызлар ега башлаган агач төшк..." }
unshuffled_original_tyv
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Экии, хүндүлуг аалчылар болгаш тыва дылдың деткикчилери! Тыва дылдың болгаш чогаалдың ховар бир башкызынга, Менги Ооржакка, ажы..." }
unshuffled_original_ug
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"زاڭ-ءتۇزىم | عىلىم-تەحنيكا | ءتىل-ادەبيەت | تۇرمىس | دەنە تاربيە | ساياحات-ورتا | سۋرەتتى حابار | سىر سۇحبات | ارناۋلى تاقىرىپ ..." }
unshuffled_original_uk
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Про надання роз'яснення (щодо форми письмового зобов'язання громадян про зворотне ввезення/вивезення товарів), Державна митна с..." }
unshuffled_original_ur
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"آئیے اہم اسلامی کتب کو یونیکوڈ میں انٹرنیٹ پر پیش کرنے کے لئے مل جل کر آن لائن ٹائپنگ کریں۔ محدث ٹائپنگ پراجیکٹ کے ذریعے آپ روز..." }
unshuffled_original_uz
An example of 'train' looks as follows.
{ "id": 1, "text": "Qurama tog'lari tizmasining Toshkentdan 154 km uzoqlikdagi Toshkent-Ush yo'li yeqasidaxushmanzara tabiat qo'ynida joylashgan maydoni 30 ga.\nBolalarni sog'lomlashtirish oromgohi Bo'stonliq tumani Oqtosh muntaqasining soy-salqin gushasida joylashgan." }
unshuffled_original_vec
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Par ogni pónto, ła derivada ła xe ła pendensa de ła reta tangente a ła curva de ła funsion f. Ła reta de cołor róso l'è senpre ..." }
unshuffled_original_vi
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Canh chua cá bông lau không chỉ là món ăn giải nhiệt, thanh mát ngày hè mà còn là món siêu bổ dưỡng, rất tốt cho người gầy ốm. ..." }
unshuffled_original_vo
An example of 'train' looks as follows.
{ "id": 1, "text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées, in topäd: Midi-Pyrénées, in Fransän. Sarniguet topon videtü 43°19’ 7’’ N e lunetü 0°5’ 19’’ L." }
unshuffled_original_wa
An example of 'train' looks as follows.
{ "id": 1, "text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete." }
unshuffled_original_war
An example of 'train' looks as follows.
{ "id": 1, "text": "An Honce amo in usa ka baryo ngan munisipalidad ha distrito han Rožňava ha rehiyon han Košice ha nasod han Slovakia.\nAn Rumegies amo in usa ka komyun ha departamento han Nord ngan ha rehiyon han Nord-Pas-de-Calais ha nasod han Fransya." }
unshuffled_original_wuu
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"伊春元旦天气 伊春腊八天气 伊春春节天气 伊春情人节天气 伊春元宵节天气 伊春愚人节天气 伊春清明节天气 伊春劳动节天气 伊春母亲节天气 伊春端午节天气 伊春七夕节天气 伊春教师节天气 伊春中秋节天气 伊春国庆节天气 伊春重阳节天气 伊春万圣节天气 伊春..." }
unshuffled_original_xal
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Арнгудин Орн гисн Европд бәәдг һазр. 2007 җилин тooһaр эн орн нутгт 3,600,523 әмтн бәәдг билә. Арнгудин Орнин хотл балһсна нерн..." }
unshuffled_original_xmf
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"მოჩამილი ტექსტი წჷმორინელი რე Creative Commons Attribution-ShareAlike ლიცენზიათ; შილებე გეძინელი პირობეფიშ არსებუა. კილიშკილიშა..." }
unshuffled_original_yi
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"ממשותדיק - חבֿרה, איך אַרבעט איצט אױף אַ זשורנאַל. טאָמער איר האָט עפּעס צוצוגעבן זאָלט איר שיקן מיר אַן אָנזאָג. ס'װעט הײסן \\\"..." }
unshuffled_original_yo
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 0, "text": "\"Copyright © 2018 BBC. BBC kò mọ̀ nípa àwọn ohun tí ó wà ní àwọn ojú òpó tí ó wà ní ìta. Ọwọ́ tí a fi mú ìbáṣepọ̀ ti ìta.\"..." }
unshuffled_original_yue
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 你還不爆 我累了 投降輸一半可以嗎\"..." }
unshuffled_original_zh
An example of 'train' looks as follows.
This example was too long and was cropped: { "id": 1, "text": "\"中国铝灰网 中国有色金属矿产网 中国黄莲网 中国水轮发电机网 中国抽油泵网 中国数控雕刻机网 中国不锈钢抛光网 中国磨具加工网 中国压铸铝网 中国耐水腻子网 中国手机摄像头网 中国粗粮网 中国车门锁网 中国钛粉网 中国轮圈网\\n天天中奖彩票图 天天中彩票..." }
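The configuration names that label the examples above follow the pattern `unshuffled_<original|deduplicated>_<language code>`. A minimal sketch of building such a name and loading one subcorpus, assuming the Hugging Face `datasets` library is installed and that this script-based dataset still loads with your library version (newer releases may need extra flags or may not support it). The helper names `config_name` and `load_oscar_subcorpus` are hypothetical and exist only for illustration.

```python
def config_name(language_code: str, deduplicated: bool = False) -> str:
    """Build an OSCAR 2019 config name such as 'unshuffled_original_jv'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{language_code}"


def load_oscar_subcorpus(language_code: str, deduplicated: bool = False):
    """Load the 'train' split of one subcorpus (downloads data on first use)."""
    # Imported lazily so config_name() works even without `datasets` installed.
    from datasets import load_dataset

    # "oscar" is the dataset id shown on this card; the call needs network access.
    return load_dataset(
        "oscar", config_name(language_code, deduplicated), split="train"
    )


if __name__ == "__main__":
    # Only the name helper is exercised here; loading is left to the caller.
    print(config_name("jv"))
    print(config_name("als", deduplicated=True))
```

Small subcorpora such as `jv` or `als` are a sensible first target, since the largest configurations run to hundreds of gigabytes.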
The data fields are the same among all configs.
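Judging from the examples above, every record carries an integer `id` and a string `text`. The sketch below checks that shape for one uncropped example; the `OscarRecord` type and the `parse_record` helper are hypothetical names used for illustration, not part of any library.

```python
import json
from typing import TypedDict


class OscarRecord(TypedDict):
    """Shape of one record as seen in the examples: an integer id and a text string."""
    id: int
    text: str


def parse_record(raw: str) -> OscarRecord:
    """Parse a JSON-encoded record and verify it has the two expected fields."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("record must be a JSON object")
    if not isinstance(obj.get("id"), int) or not isinstance(obj.get("text"), str):
        raise ValueError("record must contain an integer 'id' and a string 'text'")
    return {"id": obj["id"], "text": obj["text"]}


# The uncropped 'nah' example from this card, used as sample input.
sample = (
    '{ "id": 0, "text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in '
    'mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl." }'
)
record = parse_record(sample)

# Word counts on this card are space-separated tokens, so str.split() mirrors that.
print(record["id"], len(record["text"].split()))
```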
Language | Language code | Config name (original) | Train examples (original) | Words (original) | Size (original) | Config name (deduplicated) | Train examples (deduplicated) | Words (deduplicated) | Size (deduplicated) |
---|---|---|---|---|---|---|---|---|---|
Afrikaans | af | unshuffled_original_af | 201117 | 43,482,801 | 241M | unshuffled_deduplicated_af | 130640 | 29,533,437 | 163M |
Albanian | sq | unshuffled_original_sq | 672077 | 374,196,110 | 2.3G | unshuffled_deduplicated_sq | 461598 | 186,856,699 | 1.2G |
Alemannic | als | unshuffled_original_als | 7324 | 841,750 | 5.0M | unshuffled_deduplicated_als | 4518 | 459,001 | 2.8M |
Amharic | am | unshuffled_original_am | 83663 | 28,301,601 | 360M | unshuffled_deduplicated_am | 43102 | 16,086,628 | 206M |
Arabic | ar | unshuffled_original_ar | 16365602 | 8,117,162,828 | 82G | unshuffled_deduplicated_ar | 9006977 | 3,171,221,354 | 32G |
Aragonese | an | unshuffled_original_an | 2449 | 52,896 | 1.3M | unshuffled_deduplicated_an | 2025 | 45,669 | 801K |
Armenian | hy | unshuffled_original_hy | 659430 | 273,919,388 | 3.7G | unshuffled_deduplicated_hy | 396093 | 110,196,043 | 1.5G |
Assamese | as | unshuffled_original_as | 14985 | 6,956,663 | 113M | unshuffled_deduplicated_as | 9212 | 4,366,570 | 71M |
Asturian | ast | unshuffled_original_ast | 6999 | 381,005 | 2.4M | unshuffled_deduplicated_ast | 5343 | 325,237 | 2.0M |
Avaric | av | unshuffled_original_av | 456 | 24,720 | 409K | unshuffled_deduplicated_av | 360 | 19,478 | 324K |
Azerbaijani | az | unshuffled_original_az | 912330 | 322,641,710 | 2.8G | unshuffled_deduplicated_az | 626796 | 167,742,296 | 1.5G |
Bashkir | ba | unshuffled_original_ba | 42551 | 9,796,764 | 128M | unshuffled_deduplicated_ba | 27050 | 6,922,589 | 90M |
Basque | eu | unshuffled_original_eu | 506883 | 120,456,652 | 848M | unshuffled_deduplicated_eu | 256513 | 45,359,710 | 342M |
Bavarian | bar | unshuffled_original_bar | 4 | 399 | 503 | unshuffled_deduplicated_bar | 4 | 399 | 503 |
Belarusian | be | unshuffled_original_be | 586031 | 144,579,630 | 1.8G | unshuffled_deduplicated_be | 307405 | 83,499,037 | 1.1G |
Bengali | bn | unshuffled_original_bn | 1675515 | 623,575,733 | 11G | unshuffled_deduplicated_bn | 1114481 | 363,766,143 | 5.8G |
Bihari | bh | unshuffled_original_bh | 336 | 8,848 | 110K | unshuffled_deduplicated_bh | 82 | 2,875 | 34K |
Bishnupriya | bpy | unshuffled_original_bpy | 6046 | 198,286 | 4.1M | unshuffled_deduplicated_bpy | 1770 | 96,940 | 1.7M |
Bosnian | bs | unshuffled_original_bs | 2143 | 106,448 | 447K | unshuffled_deduplicated_bs | 702 | 20,485 | 116K |
Breton | br | unshuffled_original_br | 37085 | 5,013,241 | 29M | unshuffled_deduplicated_br | 14724 | 2,890,384 | 16M |
Bulgarian | bg | unshuffled_original_bg | 5869686 | 2,947,648,106 | 32G | unshuffled_deduplicated_bg | 3398679 | 1,268,114,977 | 14G |
Burmese | my | unshuffled_original_my | 232329 | 56,111,184 | 1.9G | unshuffled_deduplicated_my | 136639 | 30,102,173 | 1.1G |
Catalan | ca | unshuffled_original_ca | 4390754 | 1,360,212,450 | 8.0G | unshuffled_deduplicated_ca | 2458067 | 729,333,440 | 4.3G |
Cebuano | ceb | unshuffled_original_ceb | 56248 | 6,603,567 | 39M | unshuffled_deduplicated_ceb | 26145 | 3,675,024 | 24M |
Central Bikol | bcl | unshuffled_original_bcl | 1 | 312 | 885 | unshuffled_deduplicated_bcl | 1 | 312 | 885 |
Central Khmer | km | unshuffled_original_km | 159363 | 20,690,610 | 1.1G | unshuffled_deduplicated_km | 108346 | 10,082,245 | 581M |
Central Kurdish | ckb | unshuffled_original_ckb | 103639 | 48,478,334 | 487M | unshuffled_deduplicated_ckb | 68210 | 18,726,721 | 226M |
Chavacano | cbk | unshuffled_original_cbk | 1 | 130 | 520 | unshuffled_deduplicated_cbk | 1 | 130 | 520 |
Chechen | ce | unshuffled_original_ce | 4042 | 711,051 | 8.3M | unshuffled_deduplicated_ce | 2984 | 568,146 | 6.7M |
Chinese | zh | unshuffled_original_zh | 60137667 | 14,986,424,850 | 508G | unshuffled_deduplicated_zh | 41708901 | 6,350,215,113 | 249G |
Chuvash | cv | unshuffled_original_cv | 20281 | 3,041,614 | 39M | unshuffled_deduplicated_cv | 10130 | 2,054,810 | 26M |
Cornish | kw | unshuffled_original_kw | 203 | 8,329 | 44K | unshuffled_deduplicated_kw | 68 | 2,704 | 14K |
Croatian | hr | unshuffled_original_hr | 582219 | 34,232,765 | 226M | unshuffled_deduplicated_hr | 321484 | 16,727,640 | 110M |
Czech | cs | unshuffled_original_cs | 21001388 | 7,715,977,441 | 53G | unshuffled_deduplicated_cs | 12308039 | 3,540,997,509 | 24G |
Danish | da | unshuffled_original_da | 7664010 | 2,637,463,889 | 16G | unshuffled_deduplicated_da | 4771098 | 1,620,091,317 | 9.5G |
Dhivehi | dv | unshuffled_original_dv | 21018 | 7,559,472 | 126M | unshuffled_deduplicated_dv | 17024 | 4,726,660 | 79M |
Dimli | diq | unshuffled_original_diq | 1 | 19 | 146 | unshuffled_deduplicated_diq | 1 | 19 | 146 |
Dutch | nl | unshuffled_original_nl | 34682142 | 13,020,136,373 | 78G | unshuffled_deduplicated_nl | 20812149 | 6,598,786,137 | 39G |
Eastern Mari | mhr | unshuffled_original_mhr | 3212 | 565,992 | 7.2M | unshuffled_deduplicated_mhr | 2515 | 469,297 | 6.0M |
Egyptian Arabic | arz | unshuffled_original_arz | 158113 | 7,305,151 | 66M | unshuffled_deduplicated_arz | 79928 | 3,659,419 | 33M |
Emilian-Romagnol | eml | unshuffled_original_eml | 84 | 6,376 | 25K | unshuffled_deduplicated_eml | 80 | 6,121 | 24K |
English | en | unshuffled_original_en | 455994980 | 418,187,793,408 | 2.3T | unshuffled_deduplicated_en | 304230423 | 215,841,256,971 | 1.2T |
Erzya | myv | unshuffled_original_myv | 6 | 90 | 1.4K | unshuffled_deduplicated_myv | 5 | 78 | 1.2K |
Esperanto | eo | unshuffled_original_eo | 121171 | 48,486,161 | 299M | unshuffled_deduplicated_eo | 84752 | 37,324,446 | 228M |
Estonian | et | unshuffled_original_et | 2093621 | 643,163,730 | 4.8G | unshuffled_deduplicated_et | 1172041 | 309,931,463 | 2.3G |
Finnish | fi | unshuffled_original_fi | 8557453 | 3,196,666,419 | 27G | unshuffled_deduplicated_fi | 5326443 | 1,597,855,468 | 13G |
French | fr | unshuffled_original_fr | 96742378 | 46,896,036,417 | 282G | unshuffled_deduplicated_fr | 59448891 | 23,206,776,649 | 138G |
Galician | gl | unshuffled_original_gl | 544388 | 102,011,291 | 620M | unshuffled_deduplicated_gl | 284320 | 63,600,602 | 384M |
Georgian | ka | unshuffled_original_ka | 563916 | 171,950,621 | 3.6G | unshuffled_deduplicated_ka | 372158 | 91,569,739 | 1.9G |
German | de | unshuffled_original_de | 104913504 | 44,878,908,446 | 308G | unshuffled_deduplicated_de | 62398034 | 21,529,164,172 | 145G |
Goan Konkani | gom | unshuffled_original_gom | 640 | 124,277 | 2.2M | unshuffled_deduplicated_gom | 484 | 102,306 | 1.8M |
Guarani | gn | unshuffled_original_gn | 106 | 7,382 | 36K | unshuffled_deduplicated_gn | 68 | 4,680 | 24K |
Gujarati | gu | unshuffled_original_gu | 240691 | 72,045,701 | 1.1G | unshuffled_deduplicated_gu | 169834 | 50,023,432 | 722M |
Haitian | ht | unshuffled_original_ht | 13 | 1,014 | 3.9K | unshuffled_deduplicated_ht | 9 | 832 | 3.3K |
Hebrew | he | unshuffled_original_he | 3808397 | 2,067,753,528 | 20G | unshuffled_deduplicated_he | 2375030 | 1,032,018,056 | 9.8G |
Hindi | hi | unshuffled_original_hi | 3264660 | 1,372,234,782 | 17G | unshuffled_deduplicated_hi | 1909387 | 745,774,934 | 8.9G |
Hungarian | hu | unshuffled_original_hu | 11197780 | 5,163,936,345 | 40G | unshuffled_deduplicated_hu | 6582908 | 2,339,127,555 | 18G |
Icelandic | is | unshuffled_original_is | 625673 | 219,900,094 | 1.5G | unshuffled_deduplicated_is | 389515 | 129,818,331 | 846M |
Ido | io | unshuffled_original_io | 694 | 25,702 | 147K | unshuffled_deduplicated_io | 617 | 22,773 | 130K |
Iloko | ilo | unshuffled_original_ilo | 2638 | 142,942 | 874K | unshuffled_deduplicated_ilo | 1578 | 105,564 | 636K |
Indonesian | id | unshuffled_original_id | 16236463 | 4,574,692,265 | 30G | unshuffled_deduplicated_id | 9948521 | 2,394,957,629 | 16G |
Interlingua | ia | unshuffled_original_ia | 1040 | 180,231 | 662K | unshuffled_deduplicated_ia | 529 | 100,019 | 360K |
Interlingue | ie | unshuffled_original_ie | 101 | 5,352 | 24K | unshuffled_deduplicated_ie | 11 | 602 | 1.6K |
Irish | ga | unshuffled_original_ga | 83223 | 14,483,593 | 88M | unshuffled_deduplicated_ga | 46493 | 10,017,303 | 60M |
Italian | it | unshuffled_original_it | 46981781 | 22,248,707,341 | 137G | unshuffled_deduplicated_it | 28522082 | 11,250,012,896 | 69G |
Japanese | ja | unshuffled_original_ja | 62721527 | 4,962,979,182 | 216G | unshuffled_deduplicated_ja | 39496439 | 1,123,067,063 | 106G |
Javanese | jv | unshuffled_original_jv | 1445 | 104,896 | 659K | unshuffled_deduplicated_jv | 1163 | 86,654 | 583K |
Kalmyk | xal | unshuffled_original_xal | 39 | 10,277 | 113K | unshuffled_deduplicated_xal | 36 | 10,155 | 112K |
Kannada | kn | unshuffled_original_kn | 350363 | 81,186,863 | 1.7G | unshuffled_deduplicated_kn | 251064 | 49,343,462 | 1.1G |
Karachay-Balkar | krc | unshuffled_original_krc | 1581 | 185,436 | 2.6M | unshuffled_deduplicated_krc | 1377 | 166,496 | 2.3M |
Kazakh | kk | unshuffled_original_kk | 524591 | 191,126,469 | 2.7G | unshuffled_deduplicated_kk | 338073 | 108,388,743 | 1.5G |
Kirghiz | ky | unshuffled_original_ky | 146993 | 44,194,823 | 600M | unshuffled_deduplicated_ky | 86561 | 28,982,620 | 388M |
Komi | kv | unshuffled_original_kv | 1549 | 201,404 | 2.3M | unshuffled_deduplicated_kv | 924 | 95,243 | 1.2M |
Korean | ko | unshuffled_original_ko | 7345075 | 2,368,765,142 | 24G | unshuffled_deduplicated_ko | 3675420 | 1,120,375,149 | 12G |
Kurdish | ku | unshuffled_original_ku | 46535 | 15,561,003 | 94M | unshuffled_deduplicated_ku | 29054 | 9,946,440 | 60M |
Lao | lo | unshuffled_original_lo | 52910 | 4,133,311 | 174M | unshuffled_deduplicated_lo | 32652 | 2,583,342 | 114M |
Latin | la | unshuffled_original_la | 94588 | 4,122,201 | 26M | unshuffled_deduplicated_la | 18808 | 1,328,038 | 8.3M |
Latvian | lv | unshuffled_original_lv | 1593820 | 520,761,977 | 4.0G | unshuffled_deduplicated_lv | 843195 | 236,428,905 | 1.8G |
Lezghian | lez | unshuffled_original_lez | 1485 | 247,646 | 3.3M | unshuffled_deduplicated_lez | 1381 | 224,871 | 3.0M |
Limburgan | li | unshuffled_original_li | 137 | 4,730 | 29K | unshuffled_deduplicated_li | 118 | 4,283 | 27K |
Lithuanian | lt | unshuffled_original_lt | 2977757 | 1,159,661,742 | 8.8G | unshuffled_deduplicated_lt | 1737411 | 516,183,525 | 3.9G |
Lojban | jbo | unshuffled_original_jbo | 832 | 154,330 | 736K | unshuffled_deduplicated_jbo | 617 | 141,973 | 678K |
Lombard | lmo | unshuffled_original_lmo | 1401 | 75,229 | 443K | unshuffled_deduplicated_lmo | 1374 | 73,665 | 433K |
Low German | nds | unshuffled_original_nds | 18174 | 2,906,347 | 18M | unshuffled_deduplicated_nds | 8714 | 2,146,417 | 13M |
Lower Sorbian | dsb | unshuffled_original_dsb | 65 | 1,787 | 13K | unshuffled_deduplicated_dsb | 37 | 966 | 7.1K |
Luxembourgish | lb | unshuffled_original_lb | 34807 | 4,403,577 | 29M | unshuffled_deduplicated_lb | 21735 | 3,087,650 | 21M |
Macedonian | mk | unshuffled_original_mk | 437871 | 189,289,873 | 2.1G | unshuffled_deduplicated_mk | 299457 | 102,849,595 | 1.2G |
Maithili | mai | unshuffled_original_mai | 123 | 69,161 | 317K | unshuffled_deduplicated_mai | 25 | 874 | 11K |
Malagasy | mg | unshuffled_original_mg | 17957 | 3,068,360 | 21M | unshuffled_deduplicated_mg | 13343 | 1,872,044 | 13M |
Malay | ms | unshuffled_original_ms | 534016 | 16,696,882 | 111M | unshuffled_deduplicated_ms | 183443 | 6,045,753 | 42M |
Malayalam | ml | unshuffled_original_ml | 603937 | 189,534,472 | 4.9G | unshuffled_deduplicated_ml | 453904 | 95,892,551 | 2.5G |
Maltese | mt | unshuffled_original_mt | 26598 | 2,995,654 | 24M | unshuffled_deduplicated_mt | 16383 | 2,163,358 | 17M |
Marathi | mr | unshuffled_original_mr | 326804 | 162,609,404 | 2.7G | unshuffled_deduplicated_mr | 212556 | 82,130,803 | 1.4G |
Mazanderani | mzn | unshuffled_original_mzn | 1055 | 73,870 | 691K | unshuffled_deduplicated_mzn | 917 | 64,481 | 602K |
Minangkabau | min | unshuffled_original_min | 220 | 5,682 | 608K | unshuffled_deduplicated_min | 166 | 4,825 | 310K |
Mingrelian | xmf | unshuffled_original_xmf | 3783 | 299,098 | 5.8M | unshuffled_deduplicated_xmf | 2418 | 228,629 | 4.4M |
Mirandese | mwl | unshuffled_original_mwl | 8 | 171 | 1.2K | unshuffled_deduplicated_mwl | 7 | 152 | 1.1K |
Modern Greek | el | unshuffled_original_el | 10425596 | 5,479,180,137 | 62G | unshuffled_deduplicated_el | 6521169 | 2,412,419,435 | 27G |
Mongolian | mn | unshuffled_original_mn | 395605 | 181,307,167 | 2.2G | unshuffled_deduplicated_mn | 197878 | 68,362,013 | 838M |
Nahuatl languages | nah | unshuffled_original_nah | 61 | 1,234 | 12K | unshuffled_deduplicated_nah | 58 | 1,193 | 11K |
Neapolitan | nap | unshuffled_original_nap | 73 | 5,282 | 17K | unshuffled_deduplicated_nap | 55 | 4,147 | 13K |
Nepali | ne | unshuffled_original_ne | 299938 | 107,448,208 | 1.8G | unshuffled_deduplicated_ne | 219334 | 71,628,317 | 1.2G |
Newari | new | unshuffled_original_new | 4696 | 564,697 | 5.5M | unshuffled_deduplicated_new | 2126 | 288,995 | 4.1M |
Northern Frisian | frr | unshuffled_original_frr | 7 | 1,516 | 4.4K | unshuffled_deduplicated_frr | 7 | 1,516 | 4.4K |
Northern Luri | lrc | unshuffled_original_lrc | 88 | 8,022 | 76K | unshuffled_deduplicated_lrc | 72 | 6,740 | 63K |
Norwegian | no | unshuffled_original_no | 5546211 | 1,344,326,388 | 8.0G | unshuffled_deduplicated_no | 3229940 | 804,894,377 | 4.7G |
Norwegian Nynorsk | nn | unshuffled_original_nn | 185884 | 14,764,980 | 85M | unshuffled_deduplicated_nn | 109118 | 9,435,139 | 54M |
Occitan | oc | unshuffled_original_oc | 10709 | 750,301 | 5.8M | unshuffled_deduplicated_oc | 6485 | 512,678 | 3.7M |
Oriya | or | unshuffled_original_or | 59463 | 14,938,567 | 248M | unshuffled_deduplicated_or | 44230 | 11,321,740 | 188M |
Ossetian | os | unshuffled_original_os | 5213 | 1,031,268 | 13M | unshuffled_deduplicated_os | 2559 | 878,765 | 11M |
Pampanga | pam | unshuffled_original_pam | 3 | 130 | 760 | unshuffled_deduplicated_pam | 1 | 52 | 304 |
Panjabi | pa | unshuffled_original_pa | 127467 | 61,847,806 | 763M | unshuffled_deduplicated_pa | 87235 | 37,555,835 | 460M |
Persian | fa | unshuffled_original_fa | 13704702 | 9,096,554,121 | 79G | unshuffled_deduplicated_fa | 8203495 | 4,363,505,319 | 38G |
Piemontese | pms | unshuffled_original_pms | 3225 | 362,013 | 2.1M | unshuffled_deduplicated_pms | 2859 | 337,246 | 1.9M |
Polish | pl | unshuffled_original_pl | 35440972 | 15,277,255,137 | 109G | unshuffled_deduplicated_pl | 20682611 | 6,708,709,674 | 47G |
Portuguese | pt | unshuffled_original_pt | 42114520 | 20,641,903,898 | 124G | unshuffled_deduplicated_pt | 26920397 | 10,751,156,918 | 64G |
Pushto | ps | unshuffled_original_ps | 98216 | 46,559,441 | 361M | unshuffled_deduplicated_ps | 67921 | 31,347,348 | 242M |
Quechua | qu | unshuffled_original_qu | 452 | 10,186 | 78K | unshuffled_deduplicated_qu | 411 | 8,691 | 67K |
Romanian | ro | unshuffled_original_ro | 9387265 | 3,984,317,058 | 25G | unshuffled_deduplicated_ro | 5044757 | 1,741,794,069 | 11G |
Romansh | rm | unshuffled_original_rm | 41 | 1,093 | 7.4K | unshuffled_deduplicated_rm | 34 | 960 | 6.5K |
Russia Buriat | bxr | unshuffled_original_bxr | 42 | 963 | 13K | unshuffled_deduplicated_bxr | 36 | 809 | 11K |
Russian | ru | unshuffled_original_ru | 161836003 | 92,522,407,837 | 1.2T | unshuffled_deduplicated_ru | 115954598 | 46,692,691,520 | 568G |
Sanskrit | sa | unshuffled_original_sa | 14291 | 4,331,569 | 93M | unshuffled_deduplicated_sa | 7121 | 1,713,930 | 37M |
Scottish Gaelic | gd | unshuffled_original_gd | 5799 | 310,689 | 1.9M | unshuffled_deduplicated_gd | 3883 | 207,110 | 1.3M |
Serbian | sr | unshuffled_original_sr | 1013619 | 364,395,411 | 3.9G | unshuffled_deduplicated_sr | 645747 | 207,561,168 | 2.2G |
Serbo-Croatian | sh | unshuffled_original_sh | 36700 | 5,292,184 | 25M | unshuffled_deduplicated_sh | 17610 | 1,040,573 | 5.8M |
Sicilian | scn | unshuffled_original_scn | 21 | 554 | 3.3K | unshuffled_deduplicated_scn | 17 | 468 | 2.8K |
Sindhi | sd | unshuffled_original_sd | 44280 | 43,530,158 | 347M | unshuffled_deduplicated_sd | 33925 | 33,028,015 | 263M |
Sinhala | si | unshuffled_original_si | 203082 | 93,053,465 | 1.4G | unshuffled_deduplicated_si | 120684 | 50,864,857 | 802M |
Slovak | sk | unshuffled_original_sk | 5492194 | 1,322,247,763 | 9.1G | unshuffled_deduplicated_sk | 2820821 | 656,346,179 | 4.5G |
Slovenian | sl | unshuffled_original_sl | 1746604 | 387,399,700 | 2.5G | unshuffled_deduplicated_sl | 886223 | 193,926,684 | 1.3G |
Somali | so | unshuffled_original_so | 156 | 1,202 | 61K | unshuffled_deduplicated_so | 42 | 472 | 16K |
South Azerbaijani | azb | unshuffled_original_azb | 15446 | 2,175,054 | 27M | unshuffled_deduplicated_azb | 9985 | 1,528,709 | 19M |
Spanish | es | unshuffled_original_es | 88199221 | 47,545,122,279 | 278G | unshuffled_deduplicated_es | 56326016 | 25,928,290,729 | 149G |
Sundanese | su | unshuffled_original_su | 805 | 30,321 | 211K | unshuffled_deduplicated_su | 511 | 20,278 | 141K |
Swahili | sw | unshuffled_original_sw | 41986 | 2,211,927 | 13M | unshuffled_deduplicated_sw | 24803 | 1,376,963 | 8.1M |
Swedish | sv | unshuffled_original_sv | 17395625 | 7,155,994,312 | 44G | unshuffled_deduplicated_sv | 11014487 | 4,106,120,608 | 25G |
Tagalog | tl | unshuffled_original_tl | 458206 | 98,949,299 | 573M | unshuffled_deduplicated_tl | 294132 | 70,121,601 | 407M |
Tajik | tg | unshuffled_original_tg | 89002 | 31,758,142 | 379M | unshuffled_deduplicated_tg | 56259 | 21,029,893 | 249M |
Tamil | ta | unshuffled_original_ta | 1263280 | 420,537,132 | 9.3G | unshuffled_deduplicated_ta | 833101 | 226,013,330 | 5.1G |
Tatar | tt | unshuffled_original_tt | 135923 | 51,034,893 | 670M | unshuffled_deduplicated_tt | 82738 | 23,825,695 | 305M |
Telugu | te | unshuffled_original_te | 475703 | 123,711,517 | 2.5G | unshuffled_deduplicated_te | 312644 | 79,094,167 | 1.6G |
Thai | th | unshuffled_original_th | 6064129 | 951,743,087 | 36G | unshuffled_deduplicated_th | 3749826 | 368,965,202 | 16G |
Tibetan | bo | unshuffled_original_bo | 26795 | 1,483,589 | 187M | unshuffled_deduplicated_bo | 15762 | 936,556 | 138M |
Turkish | tr | unshuffled_original_tr | 18535253 | 7,577,388,700 | 60G | unshuffled_deduplicated_tr | 11596446 | 3,365,734,289 | 27G |
Turkmen | tk | unshuffled_original_tk | 6456 | 1,113,869 | 11M | unshuffled_deduplicated_tk | 4694 | 752,326 | 6.8M |
Tuvinian | tyv | unshuffled_original_tyv | 34 | 759 | 12K | unshuffled_deduplicated_tyv | 24 | 540 | 7.9K |
Uighur | ug | unshuffled_original_ug | 22255 | 8,657,141 | 122M | unshuffled_deduplicated_ug | 15503 | 5,852,225 | 83M |
Ukrainian | uk | unshuffled_original_uk | 12973467 | 4,204,381,276 | 53G | unshuffled_deduplicated_uk | 7782375 | 2,252,380,351 | 28G |
Upper Sorbian | hsb | unshuffled_original_hsb | 7959 | 545,351 | 4.2M | unshuffled_deduplicated_hsb | 3084 | 236,867 | 1.8M |
Urdu | ur | unshuffled_original_ur | 638596 | 331,817,982 | 2.7G | unshuffled_deduplicated_ur | 428674 | 218,030,228 | 1.7G |
Uzbek | uz | unshuffled_original_uz | 27537 | 2,450,256 | 21M | unshuffled_deduplicated_uz | 15074 | 1,381,644 | 12M |
Venetian | vec | unshuffled_original_vec | 73 | 3,492 | 18K | unshuffled_deduplicated_vec | 64 | 3,199 | 17K |
Vietnamese | vi | unshuffled_original_vi | 14898250 | 12,036,845,359 | 68G | unshuffled_deduplicated_vi | 9897709 | 5,577,159,843 | 32G |
Volapük | vo | unshuffled_original_vo | 3366 | 321,121 | 2.0M | unshuffled_deduplicated_vo | 3317 | 318,568 | 2.0M |
Walloon | wa | unshuffled_original_wa | 1001 | 50,720 | 273K | unshuffled_deduplicated_wa | 677 | 37,543 | 203K |
Waray | war | unshuffled_original_war | 9760 | 397,315 | 2.5M | unshuffled_deduplicated_war | 9161 | 336,311 | 2.2M |
Welsh | cy | unshuffled_original_cy | 157698 | 37,422,441 | 213M | unshuffled_deduplicated_cy | 98225 | 23,574,673 | 133M |
Western Frisian | fy | unshuffled_original_fy | 33053 | 5,691,077 | 35M | unshuffled_deduplicated_fy | 20661 | 4,223,816 | 26M |
Western Mari | mrj | unshuffled_original_mrj | 757 | 93,338 | 1.2M | unshuffled_deduplicated_mrj | 669 | 87,780 | 1.1M |
Western Panjabi | pnb | unshuffled_original_pnb | 4599 | 1,426,986 | 12M | unshuffled_deduplicated_pnb | 3463 | 1,111,112 | 9.0M |
Wu Chinese | wuu | unshuffled_original_wuu | 214 | 11,189 | 109K | unshuffled_deduplicated_wuu | 64 | 4,333 | 32K |
Yakut | sah | unshuffled_original_sah | 22301 | 2,547,623 | 42M | unshuffled_deduplicated_sah | 8555 | 1,789,174 | 26M |
Yiddish | yi | unshuffled_original_yi | 59364 | 13,834,320 | 141M | unshuffled_deduplicated_yi | 32919 | 8,212,970 | 84M |
Yoruba | yo | unshuffled_original_yo | 214 | 8,906 | 55K | unshuffled_deduplicated_yo | 49 | 3,518 | 27K |
Yue Chinese | yue | unshuffled_original_yue | 11 | 186 | 3.7K | unshuffled_deduplicated_yue | 7 | 128 | 2.2K |
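The rows above can be compared programmatically. The following is a minimal sketch, not part of the OSCAR tooling. It assumes each row has ten cells in this order: language, code, original config name, number of examples, words, size, deduplicated config name, number of examples, words, size. The header row is not shown in this part of the table, so that column meaning is an assumption. `parse_row` and `dedup_word_ratio` are hypothetical helper names.

```python
def parse_row(row: str) -> dict:
    """Split one pipe-separated table row into named fields (assumed column order)."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    if len(cells) != 10:
        raise ValueError(f"expected 10 cells, got {len(cells)}")

    def to_int(s: str) -> int:
        # Counts in the table use commas as thousands separators.
        return int(s.replace(",", ""))

    return {
        "language": cells[0],
        "code": cells[1],
        "original_config": cells[2],
        "original_examples": to_int(cells[3]),
        "original_words": to_int(cells[4]),
        "original_size": cells[5],   # kept as text, e.g. "620M"
        "dedup_config": cells[6],
        "dedup_examples": to_int(cells[7]),
        "dedup_words": to_int(cells[8]),
        "dedup_size": cells[9],
    }


def dedup_word_ratio(row: dict) -> float:
    """Fraction of words that remain after deduplication."""
    return row["dedup_words"] / row["original_words"]


galician = parse_row(
    "Galician | gl | unshuffled_original_gl | 544388 | 102,011,291 | 620M | "
    "unshuffled_deduplicated_gl | 284320 | 63,600,602 | 384M |"
)
print(galician["dedup_config"], round(dedup_word_ratio(galician), 2))
```

The config names in the third and seventh cells are the identifiers that select a subcorpus, so a parsed row can also be used to pick which configuration to load.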
OSCAR was constructed using a new pipeline derived from fastText's, called goclassy. Goclassy reuses the fastText linear classifier and the pre-trained fastText model for language recognition, but it completely rewrites and parallelises the fastText pipeline in an asynchronous manner.
The order of operations is more or less the same as in the fastText pre-processing pipeline. However, instead of clustering multiple operations into a single blocking process, goclassy launches a worker for each operation and bounds the number of parallel operations at any given time by the number of available threads rather than the number of CPUs. Goclassy is implemented in the Go programming language, so it lets the Go runtime handle the scheduling of the processes. Thus, the goclassy pipeline does not have to wait for a whole WET file to be downloaded, decompressed and classified before it starts downloading and processing the next one: a new file starts downloading and processing as soon as the scheduler is able to allocate a new process.
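The scheduling idea described above can be illustrated with a short sketch. This is written in Python for illustration only; goclassy itself is written in Go and relies on the Go runtime scheduler, so this is an analogy under stated assumptions rather than the actual implementation. `process_file` is a hypothetical stand-in for downloading, decompressing and classifying one file, and `run_pipeline` is a hypothetical name.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_file(name):
    """Hypothetical stand-in for download + decompress + classify of one file."""
    return name, len(name)


def run_pipeline(names, max_workers=None):
    """Process files concurrently, with at most `max_workers` in flight."""
    # Assumption: fall back to the CPU count when no bound is given.
    max_workers = max_workers or (os.cpu_count() or 1)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Every file is submitted up front; the pool starts the next one
        # as soon as a worker becomes free, without waiting for a batch.
        futures = [pool.submit(process_file, n) for n in names]
        for fut in as_completed(futures):
            name, value = fut.result()
            results[name] = value
    return results


print(run_pipeline(["a.wet.gz", "bb.wet.gz"], max_workers=2))
```

The point of the design is that no file waits on another: work is pulled by whichever worker is idle, and the bound on concurrent workers keeps resource usage predictable.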
Filtering and cleaning at line level are done before each line is fed to the classifier. Lines shorter than 100 UTF-8 characters and lines containing invalid UTF-8 characters are discarded and are not classified. After all files are processed, the deduplicated versions are constructed, and everything is then split into shards and compressed.
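The line-level filter and the deduplication step can be sketched as follows. This is a minimal illustration in Python, not the goclassy code. It assumes that length is counted in decoded characters with the trailing newline excluded, and that deduplication removes exact duplicate lines; the text above does not specify either detail. `keep_line` and `deduplicate` are hypothetical names.

```python
import hashlib

MIN_CHARS = 100  # threshold stated in the text above


def keep_line(raw: bytes) -> bool:
    """Return True if a raw line is valid UTF-8 and long enough to classify."""
    try:
        text = raw.decode("utf-8")  # strict decoding raises on invalid UTF-8
    except UnicodeDecodeError:
        return False
    # Assumption: count characters (code points), ignoring the line ending.
    return len(text.rstrip("\r\n")) >= MIN_CHARS


def deduplicate(lines):
    """Yield each distinct line once, keeping first occurrences in order."""
    seen = set()
    for line in lines:
        # Store a fixed-size digest rather than the full line to save memory.
        digest = hashlib.sha1(line.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


raw_lines = [("a" * 120 + "\n").encode("utf-8"), b"too short\n", b"\xff" * 120]
kept = [r.decode("utf-8") for r in raw_lines if keep_line(r)]
print(len(kept), len(list(deduplicate(kept + kept))))
```

Counting characters rather than bytes matters for non-Latin scripts, where a single character can occupy several bytes in UTF-8.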
Common Crawl is a non-profit foundation which produces and maintains an open repository of web crawled data that is both accessible and analysable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The repository contains raw web page HTML data (WARC files), metadata extracts (WAT files) and plain text extracts (WET files). The organisation's crawlers have always respected nofollow and robots.txt policies.
Each monthly Common Crawl snapshot is in itself a massive multilingual corpus, where every single file contains data coming from multiple web pages written in a large variety of languages and covering all possible types of topics.
To construct OSCAR, the WET files of Common Crawl were used. These contain the extracted plain texts from the websites, mostly converted to UTF-8, as well as headers containing the metadata of each crawled document. Each WET file comes compressed in gzip format and is stored on Amazon Web Services. In the case of OSCAR, the November 2018 snapshot was used. It surpasses 20TB of uncompressed data and contains more than 50 thousand plain text files, where each file consists of the plain text from multiple websites along with its metadata header.
Who are the source language producers?
The data comes from multiple web pages in a large variety of languages.
The dataset does not contain any additional annotations.
Annotation process
N/A
Who are the annotators?
N/A
Because OSCAR is constructed from Common Crawl, personal and sensitive information might be present. This must be considered before training deep learning models with OSCAR, especially in the case of text-generation models.
OSCAR is intended to bring more data to a wide variety of languages. The aim of the corpus is to make large amounts of data available to lower-resource languages in order to facilitate the pre-training of state-of-the-art language modeling architectures.
OSCAR is not properly filtered yet, and this can be reflected in the models trained with it. Care is advised, especially concerning biases of the resulting models.
The fastText linear classifier is limited both in performance and in the variety of languages it can recognize, so the quality of some OSCAR sub-corpora might be lower than expected, especially for the lowest-resource languages. Some audits have already been done by third parties.
The corpus was put together by Pedro J. Ortiz, Benoît Sagot, and Laurent Romary, during work done at Inria, particularly at the ALMAnaCH team.
These data are released under this licensing scheme:
We do not own any of the text from which these data have been extracted.
We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/
To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR.
This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
@inproceedings{ortiz-suarez-etal-2020-monolingual,
    title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
    author = "Ortiz Su{\'a}rez, Pedro Javier and Romary, Laurent and Sagot, Benoit",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.156",
    pages = "1703--1714",
    abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}

@inproceedings{OrtizSuarezSagotRomary2019,
    author = {Pedro Javier {Ortiz Su{\'a}rez} and Benoit Sagot and Laurent Romary},
    title = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
    series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
    editor = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
    publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
    address = {Mannheim},
    doi = {10.14618/ids-pub-9021},
    url = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
    pages = {9 -- 16},
    year = {2019},
    abstract = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
    language = {en}
}