Datasets: oscar
Preprint: arxiv:2010.14571
License:
Source datasets: original
Annotation creators: no-annotation
Language creators: found
Multilinguality: multilingual
OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form.
The version here is the original OSCAR 2019 release: https://oscar-project.org/post/oscar-2019/
For more recent versions, visit the oscar-corpus organization on the Hub.
OSCAR is mainly intended for pretraining language models and word representations.
All the data is distributed by language; both the original and the deduplicated versions are available. 166 different languages are available. The table in the subsection Data Splits Sample Size provides the language code for each subcorpus, as well as the number of words (space-separated tokens), the number of lines, and the size of both the original and the deduplicated versions of OSCAR.
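Because each language comes in an original and a deduplicated version, selecting a subcorpus amounts to choosing a variant and a language code. The sketch below is illustrative only. It assumes the configuration names follow the pattern `unshuffled_<variant>_<language code>`, that `af` is one of the language codes in the table, and that the Hugging Face `datasets` library is installed. The helper names `oscar_config_name` and `stream_subcorpus` are hypothetical, and whether the `load_dataset` call works unchanged may depend on the installed library version.

```python
def oscar_config_name(language: str, deduplicated: bool = True) -> str:
    """Build a configuration name (assumed pattern, not confirmed by this card)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{language}"


def stream_subcorpus(language: str, deduplicated: bool = True):
    """Return the 'train' split of one subcorpus as a streaming dataset.

    The import is done lazily so that the name helper above can be used
    without the `datasets` library being installed.
    """
    from datasets import load_dataset

    return load_dataset(
        "oscar",
        oscar_config_name(language, deduplicated),
        split="train",
        streaming=True,  # avoid downloading a whole subcorpus up front
    )


# Only the name is built here; nothing is downloaded.
print(oscar_config_name("af"))  # prints unshuffled_deduplicated_af
```

Streaming is used in the sketch because some subcorpora are very large; for small languages a regular, non-streaming load may be just as practical.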
We show detailed information for all the configurations of the dataset.
An example of 'train' looks as follows.
{
"id": 0,
"text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel"
}
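Each record has two fields, an integer `id` and a string `text`. As a minimal sketch, the per-record quantities behind the corpus statistics (words as space-separated tokens, lines, size) could be computed as follows. The helper name `record_stats` is hypothetical, and counting lines with `splitlines()` and size as UTF-8 bytes are assumptions; the exact method used for the published figures is not stated here.

```python
import json

# The first example record, copied verbatim from above.
record_json = '''
{
  "id": 0,
  "text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel"
}
'''


def record_stats(record: dict) -> dict:
    """Count words, lines and bytes for one record's text field."""
    text = record["text"]
    return {
        "words": len(text.split()),         # whitespace-separated tokens
        "lines": len(text.splitlines()),    # assumption: one document may span several lines
        "bytes": len(text.encode("utf-8")), # assumption: size measured as UTF-8 bytes
    }


record = json.loads(record_json)
print(record["id"], record_stats(record))
```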
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"De Nazionalpark hät e Flächi vo 170,3 km² und isch dodemit s grösti Naturschutzgebiet vo de Schwiz. Er ligt uf em Gebiet vo de ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"አየር መንገዱ ከአዲስ አበባ ወደ ሮም ጣሊያን በማምራት ላይ በነበረበት ጊዜ ረዳት አብራሪው የጉዞውን አቅጣጫ በመቀየር ጄኔቭ አውሮፓላን ማረፊያ በማሳረፍ እጁን ለፖሊስ ሰጥቷል።\\nየኢትዮጵያ መንግስት የ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"واااااااأسفاه الأمم تفتخر ب 0 أمي ووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووو..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد بكل جديد\\nأهلا وسهلا بك زا..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"بنى عجل : قبيلة من عجل بن لجيم بن صعب بن على بن بكر بن وائل انتقل اغلبهم الى البصرة فى العراق و اصفهان و خراسان فى ايران و اذرب..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"আমি, এই সংগঠনৰ সদস্য সকলে একেলগ হৈ অসমকে ধৰি ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ অমূল্য কলা-সাংস্কৃতিক সম্পদৰাজি বৃহত্তৰ অষ্ট্ৰেলিয়াৰ সন্মু..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"The Killers llanzaron el so álbum debú, Hot Fuss, en xunu de 2004 nel Reinu Xuníu, al traviés de la discográfica Lizard King, y..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Жинда малъараб ва божизе бегьулеб рагІудаса кьуризе бегьуларо гьев. Гьес насихІат гьабизе кколелъул бацІцІадаб диналъул рахъалъ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün xərclərini yerli daxilolmalar hesabına maliyyələşdirir.\\nDünən, 10..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"لعلی ١٣-جو عصرده یاشاییب یاراتمیش گؤرکملی آذربایجان شاعرلریندندیر. ١٢٢٤-جی ایلده تبریزده آنادان اولموشدور، گنج یاشلاریندا تیجار..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Күҙәтеү ҡуласаһы моделен хәҙер Мифтахетдин Аҡмулла исемендәге Башҡорт дәүләт педагогия университетында ла эшләргә мөмкин\\t\\nКүҙ..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": " vo"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"& ÿ ó / í 0 - ø û ù ö ú ð ï ú \\u0014 ù þ ô ö í ÷ ò \\u0014 ÷ í ù û ö í \\u0001 û ñ ç þ \\u0001 ð \\u0007 þ ò ñ ñ ò ô \\u0017 û ö ô ÷..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Брэсцкія ўлады не дазволілі прафсаюзу РЭП правесці пікетаванне ў парку Воінаў-інтэрнацыяналістаў 30 мая 2018 года.\\nСітуацыю пр..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ЖАЛБОПОДАТЕЛЯТ директор на Дирекция „ Обжалване и данъчно-осигурителна практика“- Бургас, редовно призован, се представлява от ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"सुकमा जिला भारत के छत्तीसगढ़ राज्य में एगो जिला बाटे। एकर मुख्यालय सुकमा शहर बाटे। एकर कुल रकबा 5636 वर्ग कि॰मी॰ बाटे।\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ভড়ং সর্বস্ব বাংলা আর্ট অ্যান্ড কালচারের হিসাব গুলিয়ে দেওয়ার ম্যাজিকের নাম ব্রাত্য রাইসু November 23, 2017\\nTagged with ডায়োজিনি..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"བོད་མི་འདི་དག་ནི་རང་རྒྱུད་སྒོ་རུ་ཕུད་དེ་གཞན་རྒྱུད་པང་དུ་ཉར་ནས་གསོ་སྐྱོང་བྱེད་དགོས་ཟེར་བ་དང་གཅིག་མཚུངས་རེད།\\nཚན་རིག་ནི་དང་ཐོག་རང..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"পৌরসভা এহার আয়তন (লয়াহান) ২,৭৩০,.৬৩ বর্গ কিলোমিটার। পৌরসভা এহার মাপাহানর অক্ষাংশ বারো দ্রাঘিমাংশ ইলতাই 18.63° S 48.18° W ।[১]..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ar mank Magalhães(Daveoù a vank) a zo ur spesad evned, Spheniscus magellanicus an anv skiantel anezhañ.\\nGallout a reer implijo..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ž šř é ú šř šř ě šř ž é č ě ž ů ě ď éé ýš ě ě Ž č š ý ě ď é ýš ě ď ě éé ýš ě č ž ě š ý ď ě ýš é ú č ž č š ý ď ý ž é éě ď é č ýš..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"2002 оной хабар буряад хэлэ бэшэгэй һалбари Үндэһэтэнэй хүмүүнлиг ухаанай дээдэ һургуули болгогдожо өөршэлэгдөө.\\nХарин мүнөө б..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Daniel Vendrell, conegut com Vandrell, ha sigut un dels il•lustradors contemporanis més influents, representant a la nova onada..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Шаьш анархисташ ду бохучу жигархойн дIахьедарехь дуьйцу, оьрсийн ницкъаллийн структурийн а, федералан каналан а Iалашонаш \\\"мар..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Si Isko walay pupamilok nga nagtan-aw sa unahan, natugaw. “Naunsa ka gud diha Isko nga layo man kaayo ang imong panan-aw?” ni I..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"رسی رۆژ - ساڵێک دوای بومەلەرزەی کرماشان میوانی بەرنامە : کاک سیاوەش حەیاتی چالاکی مەدەنی -قەسری شیرین\\nپارچە موزیک 30 / 10 / 20..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Akce anarchistů proti připravovanému novému služební řádu a nízkým mzdám 1903 – Historie českého anarchismu (1880 – 1939)\\nRost..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Шыранӑ чухне ӑнсӑртран латин кирилл саспаллисем вырӑнне латин саспаллисене ҫырсан, сайт эсир ҫырнине юсама тӑрӑшӗ.\\nКу сайтра ч..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Mae capeli Cymreig yr Andes ym Mhatagonia wedi cyhoeddi na fydd gwasanaethau yno weddill y mis, oherwydd yr eira trwm sydd wedi..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Den 2.-5. februar 2016 løb det tredje kursus i uddannelsen af 4kommunesamarbejdets Local Impact Coaches, af stablen i Gentofte ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Auf dieser Seite gibt es mind. ein YouTube Video. Cookies für diese Website wurden abgelehnt. Dadurch können keine YouTube Vide..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Zıwanê Slawki, zıwano merdumanê Slawano. Zıwanê Slawki yew lızgeyê Zıwananê Hind u Ewropao. Keyeyê Zıwananê Slawki beno hirê letey:"
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Pśiklaskaju južo pśed pśedstajenim... 1500 źiśi njamóžo wěcej docakaś, měsćańska hala w Chóśebuzu - wupśedana."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ބ. އަތޮޅުގައި ހުޅުވަން ތައްޔާރުވަމުން އަންނަ ވައްކަރު ރިސޯޓުގައި ވަޒީފާ އަދާކުރަން ޝައުގުވެރިވާ ފަރާތްތަކަށް ކުރިމަތިލުމުގެ ފުރ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Νεκρός εντοπίστηκε μέσα στο σπίτι του στην οδό Ηρώδου Αττικού στον αριθμό 7 ο επικεφαλής του προξενικού τμήματος της Ρωσικής πρ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"A séguit dal prucès ad rubutiśasiòṅ di abitànt dal pòpul ad Mikenes, Angoras 'l è finî dènt'r a 'n robot cun la tèsta dna rana ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visi..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon\\nTemas pri kolekto d..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazamiento, Los bailes fáciles pa..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"MTÜ AB Video järgib oma tegevuses kodanikuühenduste eetilise tegevuse üldtunnustatud põhimõtteid, mis on lühidalt kokkuvõetud 7..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Gure jarduerek eraikuntzarekin, elkarbizitzarekin, hirigintzarekin eta ekologiarekin dute harremana, baita ideia eta konponbideak irudikatu eta garatzearekin ere, eraikuntza sektorea hobetuz, pertsonen erosotasuna eta bizi-kalitatea hobetzeko."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"قـــــــــــــــــرار بود با هم کنـــــــــــــار بیایم نه اینکه از کنــــــــــــار هم رد بشیم...!!!\\nاگر روزی دلت لبریز غم بو..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Kiitos Deelle kaikesta - 1,5 viikkoa kulunut, kun Dee ei ole enää ollut omani. Reilu viikko sitten sunnuntaina vein Deen uuteen kotiinsa. Itselläni on ollut niin ristiriitaiset t..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Média de débat d'idées, de culture et de littérature. Récits, décryptages, analyses, portraits et critiques autour de la vie des idées. Magazine engagé, ouvert aux autres et au monde.. Bring up to date in french"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Hiragana’ Practice’Sheet’1’(A -O)’ ’ Name:’________ __________________________’Section:’_______________ _’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Nim in sêfte ride op Holmsjön, yn ien fan 'e lytse marren yn de omkriten, of nim se op avontueren lykas nonresidential. lâns Indalsälven wetter. Holm Sportklubb hawwe kano 's te huur, yn gearwurking mei de Baltyske Power konferinsje."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Is fóram é seo chun plé a dhéanamh ar an leabhar atá roghnaithe do mhí na Samhna 2013 amháin. Ní féidir ach le baill chláraithe..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Zhou Yujun, a 'phàrtaidh Rùnaire Comataidh Sgìre Yanfeng ann Hengyang bhaile agus a Sgìre pàrtaidh agus an riaghaltas a' bhuidheann-riochdachaidh a 'tighinn a chèilidh air ar companaidh air Apr. 14, 2017."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"O persoal de Inditex da provincia de Pontevedra segue a reclamar iguais condicións laborais no conxunto do país - CIG: Confeder..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"º ÑÆÚÓ À Ã Ð É Æ ¾ ÄÂ Î À ¼ Æ É ÄÛ = Ü Ý\\\"Þ ßà á â ã ä å æçè ã é ê â å àë ì æê íî é á ë ï í çì àð í Ü à ñ ê é ò ä ì\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"दुष्ट शीळ हें कौरवांचें । रामें सविस्तर देखूनि साचें । बोलिले वचनें जें दुर्वाचे । करी तयांचें अनुस्मरण ॥२२०॥\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"અધિક માસ ચાલે છે. સમગ્ર ભારતમાં અને તેમાંય ખાસ કરીને પવિત્ર કે ધાર્મિક કહેવાય છે તેવા સ્થાનક પર કથાનો દોર ચાલે છે. ઉનાળાની કાળઝ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"זקוקים לרשתות נגד יתושים? מחפשים רשת מתאימה לחלון צר וקטן? רשתות נגד יתושים אקורדיון של חברת קליר-מש הן הפתרון.\\nרשתות לחלונות ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"'आइटम गर्ल' बनकर हिट हुई थीं राखी सावंत, आज करीना-कटरीना तक फॉलो कर रही हैं ट्रेंड नक्सलियों का दम निकालेगा बाइक ग्रेनेड लॉन्च..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"U raspravi je sudjelovao i HSS-ov saborski zastupnik rekavši kako poljoprivrednici ne osjete mjere o kojima ministar govori jer..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Budyšin (SN/BŠe). Elektronikarjo mějachu lětsa cyle hinaši zazběh do swojeho wukubłanja. Wokrjesne rjemjeslnistwo bě mjenujcy w..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"monster - Amatőr, házi szex videók és kezdő csjaok pornó filmjei. - Free amateur, home made sex videos and online porn movies. ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Արցախի Հանրապետության հռչակման 26-րդ տարեդարձի կապակցությամբ Շուշիի Արվեստի կենտրոնում կազմակերպվել է մոսկվաբնակ նկարիչներ՝ հայ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha h..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Perihal dari itu, kalau kunci hal yang demikian hilang, pemilik wajib melapor ke bengkel sah untuk dibuatkan kunci baru dengan ..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Plastic Yo Yo Metal Yo Yos Wooden Yo Yo Keychain Yo Yo Translucent Yo Yo Light Up Yo Yo Globe Yo Yo Stress Reliever Yo Yo Jellyfish Yo Yo Sports Ball Yo Yo Sound Yo Yo Miniature Yo Yo Promotional Yo Yo Novelty Yo Yo Video Game Yo Yo ECO Recycled Yo Yo"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Segun ken ni Ping-ay, ti yellow corn ti maysa kadagiti nadakamat a liberalized agricultural commodity iti daytoy a free trade k..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Chekia esas parlamentala republiko. La chefo di stato esas la prezidanto. Til 2013 lu elektesis dal parlamento. Pos ta yaro, ol..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Eyjar.net - upplýsinga- og fréttamiðill um Vestmannaeyjar - Fréttir - Nái núverandi stefna stjórnvalda fram að ganga mun það va..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Jaundice - causes, treatment & pathology massaggio a osteochondrosis dellindizio di una controindicazione\\nTrattamento su un co..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します!お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です!お子様の晴れ姿を、緊張していない自然な笑顔で残しませんか?\\n※七五三の..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "ni'o 23 la cimast. cu 23moi djedi fi'o masti la cimast. noi ke'a cu cimoi masti .i 22 la cimast. cu purlamdei .ije 24 la cimast. cu bavlamdei"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"José Mourinho (diwaca: [ʒuˈzɛ moˈɾiɲu]; lair ing Setubal, Portugal, 26 Januari 1963; umur 55 taun) iku salah siji pelatih bal k..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"წამიყვანე შენთან ერთად (ქართულად) / Возьми меня с собой (картулад) / (რუსული სერიალები ქართულად) (რუსების პორნო ონლაინში) (ruse..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Түлкібас ауданында «Латын негізді әліпби мен емле ережесі туралы насихат» жобасының тобы семинар өткізді\\nЕлорданың «Қазақстан»..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ខ្សឹបដាក់ត្រចៀក៖ លោក សួស សុផានិត នាយផ្នែករដ្ឋបាលព្រៃឈើ ស្រុកភ្នំក្រវាញ់ ដែលទើបឡើងកាន់តំណែងថ្មី បើកដៃឲ្យឈ្នួញ ប្រព្រឹត្តបទល្មើស ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ರಾಷ್ಟ್ರಪತಿ ಪ್ರಣಬ್ ಮುಖರ್ಜಿಯಿಂದ ಪದ್ಮ ಪ್ರಶಸ್ತಿ ಪ್ರದಾನ | President Pranab Mukherjee Confers Padma Awards | Photo Gallery on Kannada..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"CIA 프로젝트에서는 데이터베이스로 들어오는 요청을 중간에 수집(Sniffing)하고 수집한 데이터를 분석(Parsing)하여 그로 인한 결과를 판단하여 알릴 수 있는 시스템(Push Service)이 필요하다. 그리고 연구를 ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Шамханланы, Бийлени къаршысына ябушуп, Батыр уланларыбызны къоллары булан «ортакъ ожакъ» къургъанбыз. Шо иш уллу зараллы иш бол..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Me di 114 bernameyên xwe yên berê da perçeyên ji berhemên zanyarî yên kurdzanên mezin bi wergera kurdî da ...\\nMe di 114 bernam..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Коми кытшыслӧн ыджытжык тор вӧр увтын куйлӧ, сійӧн и фаунасӧ татӧн аркмӧтӧны вӧрын олісь подаэз. Ассямаӧн лоӧ сія, мый кытшас с..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"????????????????????????????????????????????????????????????????????Pray without ceasing???????????????????????????????????????..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Turmush: Бишкек шаардык кеңешинин кезексиз отурумунда мэрге ишенбөөчүлүк көрсөтүү маселеси каралат, - депутат Т.Сагынов\\nБишкек..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\\nEcce ego adducam aqua..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Während dem Gaardefestival \\\"Ambiance Jardins\\\" vum 15. bis de 17. Mee huet den SNJ nees zesumme mam Groupe Animateur en Inform..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Ахцегь хуьр, виридалай ч1ехи лезги хуьрерикая я. Ам Урусатдин виридалай къиблепатавай хуьрерикай я. Ин хуьр...\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"'t Good Goedenraad aan de Ezerbaek besjteit oet 'n kesjtièl mèt gesjlote haof en 'n park van 26 hectare. Hie in sjtoon väól beu..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Serét (en tortonés: Sregh; en piemontés: Srèj) l'è 'n cümü italià, de la regiù del Piemónt, en Pruvìncia de Alessandria. El g'h..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ຜູ້ພິພາກສາ ປະຈຳເຂດ ສຫລ ທ່ານນຶ່ງ ຕັດສິນວ່າ ໂຄງການເກັບກຳຂໍ້ມູນ ທາງໂທລະສັບ ຂອງອົງການ ຄວາມໝັ້ນຄົງແຫ່ງຊາດ ແມ່ນຖືກຕ້ອງ ຕາມກົດໝາຍ.\\nກະ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"آرلینگتون یئ گئل د شأریا ڤولاتچە ڤیرجینیا و یئ گئل د شأریا ڤولات ڤولاتچە یا یأکاگئرئتە ئمریکاە. ئی شأر دویومی کألوٙن شأر د راسا..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Čir vir vir pavasaris! Čia čia čia… dalinamės labai simpatiška video pamokėle, kurią pristato ab888art galerija.\\nBe galo papra..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Dekoratīvi sliekšņi MITSUBISHI OUTLANDER 2007, izgatavoti no ovālas formas, pulētas nerūsējošā tērauda caurules...\\ndažādas tūn..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"१ · २ · ३ · ४ · ५ · ६ · ७ · ८ · ९ · १० · ११ · १२ · १३ · १४ · १५ · १६ · १७ · १८ · १९ · २० · २१ · २२ · २३ · २४ · २५ · २६ · २७ · २..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Nanamboatra taratasy apetaka sy soso-kevitra ho an'ny olona te-hanatevin-daharana ity fihetsiketsehana ity i Anocrena.\\nNosorat..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Акрет жап годым Уганда кундемым Пигмей племена- влак айлен шогеныт. мемнан эран 1 курым гыч Банту племена влакат тиде кундемышк..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\" ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"„Филм плус“ е насловен првиот филмски месечник во Македонија, чиј прв број ќе биде промовиран вечер во „Менада“. Новото македон..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"സ്ത്രീ പ്രവേശനം സര്ക്കാര് പൂര്ണമായും അംഗീകരിക്കുന്നുവെന്നും ശബരിമലയുടെ സുരക്ഷയില് ഇടപെടുമെന്നും സര്ക്കാര് ഹൈക്കോടതിയില്\\..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"МУБИС-ын багш мэргэжлийн хөрвөх сургалтыг төгссөн багшид багшлах эрх олгох тухай ~ БМДИ-ийн захирлын тушаал - Багшийн мэргэжил ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Home / motivational marathi story / उद्योजकता (Entrepreneurship) / यांना हे जमलय, तर आपल्याला का नाही जमणार ?\\nयापैकी कोणाचीही ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Лӹпӹвлӓ (латинлӓ Lepidoptera ; алыкмарла лыве-влак) — капшангывлӓ йыхыш пырышы сӱмӓн нӹл шылдыран капшангывлӓ. Цилӓжӹ 180000 тӹ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Sanad pertama daripada Zuhair bin Harb daripada ‘Affan daripada Hammad daripada Thabit daripada Anas.\\nSanad kedua daripada ‘Ab..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "tibgħat il-kawża lura lill-Qorti Ġenerali għall-annullament jew għat-tnaqqis tal-penalità imposta mill-Kummissjoni bid-deċiżjoni inizjali kif emendata bid-deċiżjoni ta’ rettifika;"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Deciplina social i outónoma que angloba atebidades de ouserbaçon, de análeze, de çcriçon, cumparaçon, de sistematizaçon i de sp..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ျမ၀တီ - ရန္ကုန္တိုင္းေဒသႀကီး ေျမာက္ဥကၠလာပႏွင္႕ ဗဟန္းၿမိဳ႔နယ္ မေကြးတိုင္း ေဒသႀကီး ပခုကၠဴၿမိဳ႔နယ္တို႔၌ ျမန္မာ႕တပ္မေတာ္အား ေထာက္ခံ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"2018 иень умарьковонь 6-це чистэ сась паро куля! Россиянь культурань Министерствась макссь невтемань конёв (прокатной удостовер..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"قرآن یا قوران اسلام ِآسمونی کتاب هسته. مسلمونون گانّّه قرآن ره خدا، وحی جه برسنییه، «محمد معجزه» هسته و ثقلین حدیث دله ونه خَو..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ò AUDIT í Ç è î ÿ å å 30 ò ÿ ÿ é, õ ñ ì ÿ, ê ã- ò à ì. å â å í ç â à à é ñ è å é ó ó ë. å å å û è å î é è à. à è à AUDIT 1-7 â ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Dor kann sik vun nu af an de hele plattdüütsche Welt – vun Niebüll bit New York, vun Helgoland bit Honolulu – drapen. Allens, w..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"बर्दिबास नगरपालिकाको तेस्रो नगर परिषदबाट पारित आ.व.२०७३।७४ को संशोधित र २०७४।७५ को प्रस्तावित नीति, कार्यक्रम तथा बजेट\\nअार्थिक..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"थ्व शहरयागु अक्षांश ३४.७००१६४ उत्तर व देशान्तर ८६.३७६४६९ पश्चिम खः (34.700164° N 86.376469° W)। थ्व थासे ७२२६७३२ वर्ग मिटर (२.७..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Op vrijdag 31 augustus wordt het nieuwe studiejaar van de masteropleiding architectuur geopend met een dagexcursie naar Venlo.\\..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Planomtale krav til innhald Bakgrunn: Spørsmål frå fleire kommunar om kva ein planomtale/planbeskrivelse bør innehalde Fylkeskommunen og fylkesmannen har i ein del saker reist motsegn på formelt grunnlag"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Ytterligere aktører i primærhelsetjenesten og andre NHS-virksomheter ble infisert, inkludert legekontor.Læreren vår er så attra..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": ".рф (rf, còdi punycode: .xn--p1ai)[1] es lo nom de domeni en rus per Russia. Foguèt activat lo 12 de mai de 2010. Lo còdi latin es .ru."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ଭୁବନେଶ୍ୱର, ୨୭/୧– (ଓଡ଼ିଆ ପୁଅ) ସିପିଆଇ ଜାତୀୟ ପରିଷଦର ଆହ୍ୱାନକ୍ରମେ ଗତକାଲି ଜାନୁୟାରୀ ୨୬ ସାଧାରଣତନ୍ତ୍ର ଦିବସକୁ ଦେଶ ବ୍ୟାପୀ ସମ୍ବିଧାନ ସୁରକ୍ଷା ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"1. Лæппу æмæ чызг казрæдзийы зæрдæмæ куы фæцæуынц æмæ, куы сфæнд кæнынц сæ цард баиу кæнын, уæд лæппу бар ракуры чызгæй, цæмæй ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ਰਜਿ: ਨੰ: PB/JL-138/2018-20 ਜਿਲਦ 63, ਬਾਨੀ ਸੰਪਾਦਕ (ਸਵ:) ਡਾ: ਸਾਧੂ ਸਿੰਘ ਹਮਦਰਦ ਫ਼ੋਨ : 0181-2455961-62-63, 5032400, ਫੈਕਸ : 2455960, 2..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Áku pu i Anak ning Aláya at ngeni ipákit kó kékayu ngan nûng makanánu lang susúlat détinang kulit a mágkas. Lauan ya ing tarátu..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"System informatyczny - Załącznik nr 1 do zarządzenia Wójta Gminy Podegrodzie Nr 530/2013 z dnia 27 maja 2013 r\\nSystem informat..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Louvigné-du-Désert a l'é na comun-a fransèisa ant la region aministrativa dla Brëtagna, ant ël dipartiment d'Ille-et-Vilaine. A..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ایہ فائل Wikimedia Commons توں اے تے دوجیاں ویونتاں تے وی ورتی جاےکدی اے۔ گل بات اس دے فائل گل بات صفہ تے تھلے دتی گئی۔\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Many people usually use the time period ‘business to business (B2B) advertising,’ however most of them do not know precisely wh..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem int..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Warayu wichay (kastilla simipi: Ascensión de Guarayos) nisqaqa Buliwya mama llaqtapi, Santa Krus suyupi, huk llaqtam, Warayu pruwinsyap uma llaqtanmi."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"practicists agrars / practicistas agraras AFP pon far ina furmaziun da basa scursanida per cuntanscher in attestat federal da q..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"“În viață, oportunitatea nu este totul. Cine atrage Lumina, cineva bun în umbră. Timpul ne creează.” maestru\\nLyn.Evans: Ce mar..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Доступ к данному профилю для публичного просмотра закрыт администрацией сайта - профиль находится на модерации.\\nРазработчикам ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"अनिरुद्धनगरे क्रीडिता रामलीला सम्प्रति समाप्ता अस्ति । तस्य कानिचन् चित्राणि पूर्वमेव प्रकाशितानि सन्ति । द्वौ चलचित्रौ अपि ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "La gilusìa è nu sintimentu dulurusu ca nasci d'un disideriu di pussessu sclusivu ntê cunfrunti dâ pirsuna amata e dû timuri, dû suspettu o dâ cirtizza dâ sò nfidiltati."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"هر ڪو ڄاڻي ٿو ته جڏهن توهان هڪ وڏي خريد ڪرڻ چاهيون ٿا, توهان پڄي ضروري حڪم ۾ ان جي ڪم ڪرڻ جي هٿ ۾ لاڳاپو ڪيو آهي. جي شيء آهي ته..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Opština Gornja Radgona se nalazi u sjeveroistočnoj Sloveniji i graniči s susjednom Austriji duž rijeke Mure. Sa tridesetim nase..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ලාංකීය සිතිවිලි සිංහල බ්ලොග් කියවනය කොත්තු සින්ඩිය ලංකා Blogger හත්මාළුව ලංකා බ්ලොග් කියවනය මාතලන්ගේ සින්ඩිය මොබයිල්lk\\nඅවකාශය ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Aktivity | Agentúra podporovaného zamestnávania | vzdelávanie pre klientov, vzdelávanie pre odborníkov, kurzy\\nŠpecializované k..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Če Creatures, ki je želel, da pridejo na čas, predvsem je povedlo – razlikuje od ljubosumja začel grizenja kolen (ali zadnjica)..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт ттттттттттттттттуууууууууууу..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Çfarë do të më pëlqente tek një femër ose çfarë do të më shndërronte në një shpërthim drite? – Albert Vataj\\nTë gjithëve një zo..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Корисни савети за сваки дан. На сајту су разне категорије, као што су љепота, мода, кување и поправка властитим рукама.\\nШколск..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Kartu krédit nyaéta \"duit plastik\" anu dikaluarkeun ku bank pikeun alat pambayaran di tempat-tempat nu tangtu samisal jiga di hotél, réstoran, tempat rékréasi jeung sajabana.[1]"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"1783 är ett viktigt årtal i den nya tidens historia. Det året slöts en fred i Paris och därmed blev de 13 brittiska kolonierna ..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"பொழுது சாய்ந்து வெகு நேரமாகிவிட்டது. கூலி வேலைக்குப் போயிருந்த 'சித்தாள் ' பெண்கள் எல்லோரும் வீடு திரும்பி விட்டார்கள். இன்னும்..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"హర్యానాలో టోల్ దగ్గర సిబ్బంది.. స్థానిక ప్రజలు కొట్టుకున్నారు. కర్నాల్ అనే గ్రామానికి సమీపంలో టోల్ గేట్ ఉంది. అయితే సాధారణంగా స..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ҳумайро гуфтааст, мухолифи низом аст, низоме, ки дар Тоҷикистон вуҷуд дорад. Ба ин маънӣ, худро мухолифи давлату ҳукумати Тоҷик..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ฟันที่แลดูขาวสะอาดไม่มีเศษอาหารติดอยู่ เหงือกสีชมพู ไม่เจ็บ หรือมีเลือดออกเวลาแปรงฟันหรือขัดฟัน ไม่มีปัญหาเรื่องกลิ่นปาก ทำให้ก..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Türkmenistanyň Prezidenti agyr atletika boýunça dünýä çempionatyna taýýarlyk işleriniň barşy bilen tanyşdy\\nHalallykdan kemal t..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"“Gusto ko manawagan sa mga Unit Head ng Chanel 2 Salve. Kasi napapansin ko iyon mga alaga ko ang taping halos once a week lang,..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Son yıllarda görülen ay tutulmalarına göre daha etkili olacağı söylenen Kanlı veya Kırmızı Ay Tutulmasına saatler kaldı. Bu akş..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"\\\"Иремнең вафатына 40 көн узгач, Алмаз да безнең өйгә кереп үлде\\\". Арчада 35 яшьлек ир өстенә кондызлар ега башлаган агач төшк..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Экии, хүндүлуг аалчылар болгаш тыва дылдың деткикчилери! Тыва дылдың болгаш чогаалдың ховар бир башкызынга, Менги Ооржакка, ажы..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"زاڭ-ءتۇزىم | عىلىم-تەحنيكا | ءتىل-ادەبيەت | تۇرمىس | دەنە تاربيە | ساياحات-ورتا | سۋرەتتى حابار | سىر سۇحبات | ارناۋلى تاقىرىپ ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Про надання роз'яснення (щодо форми письмового зобов'язання громадян про зворотне ввезення/вивезення товарів), Державна митна с..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"آئیے اہم اسلامی کتب کو یونیکوڈ میں انٹرنیٹ پر پیش کرنے کے لئے مل جل کر آن لائن ٹائپنگ کریں۔ محدث ٹائپنگ پراجیکٹ کے ذریعے آپ روز..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Qurama tog'lari tizmasining Toshkentdan 154 km uzoqlikdagi Toshkent-Ush yo'li yeqasidaxushmanzara tabiat qo'ynida joylashgan maydoni 30 ga.\nBolalarni sog'lomlashtirish oromgohi Bo'stonliq tumani Oqtosh muntaqasining soy-salqin gushasida joylashgan."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Par ogni pónto, ła derivada ła xe ła pendensa de ła reta tangente a ła curva de ła funsion f. Ła reta de cołor róso l'è senpre ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Canh chua cá bông lau không chỉ là món ăn giải nhiệt, thanh mát ngày hè mà còn là món siêu bổ dưỡng, rất tốt cho người gầy ốm. ..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées, in topäd: Midi-Pyrénées, in Fransän. Sarniguet topon videtü 43°19’ 7’’ N e lunetü 0°5’ 19’’ L."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "An Honce amo in usa ka baryo ngan munisipalidad ha distrito han Rožňava ha rehiyon han Košice ha nasod han Slovakia.\nAn Rumegies amo in usa ka komyun ha departamento han Nord ngan ha rehiyon han Nord-Pas-de-Calais ha nasod han Fransya."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"伊春元旦天气 伊春腊八天气 伊春春节天气 伊春情人节天气 伊春元宵节天气 伊春愚人节天气 伊春清明节天气 伊春劳动节天气 伊春母亲节天气 伊春端午节天气 伊春七夕节天气 伊春教师节天气 伊春中秋节天气 伊春国庆节天气 伊春重阳节天气 伊春万圣节天气 伊春..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Арнгудин Орн гисн Европд бәәдг һазр. 2007 җилин тooһaр эн орн нутгт 3,600,523 әмтн бәәдг билә. Арнгудин Орнин хотл балһсна нерн..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"მოჩამილი ტექსტი წჷმორინელი რე Creative Commons Attribution-ShareAlike ლიცენზიათ; შილებე გეძინელი პირობეფიშ არსებუა. კილიშკილიშა..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ממשותדיק - חבֿרה, איך אַרבעט איצט אױף אַ זשורנאַל. טאָמער איר האָט עפּעס צוצוגעבן זאָלט איר שיקן מיר אַן אָנזאָג. ס'װעט הײסן \\\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Copyright © 2018 BBC. BBC kò mọ̀ nípa àwọn ohun tí ó wà ní àwọn ojú òpó tí ó wà ní ìta. Ọwọ́ tí a fi mú ìbáṣepọ̀ ti ìta.\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 你還不爆 我累了 投降輸一半可以嗎\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"中国铝灰网 中国有色金属矿产网 中国黄莲网 中国水轮发电机网 中国抽油泵网 中国数控雕刻机网 中国不锈钢抛光网 中国磨具加工网 中国压铸铝网 中国耐水腻子网 中国手机摄像头网 中国粗粮网 中国车门锁网 中国钛粉网 中国轮圈网\\n天天中奖彩票图 天天中彩票..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"De Nazionalpark hät e Flächi vo 170,3 km² und isch dodemit s grösti Naturschutzgebiet vo de Schwiz. Er ligt uf em Gebiet vo de ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"አየር መንገዱ ከአዲስ አበባ ወደ ሮም ጣሊያን በማምራት ላይ በነበረበት ጊዜ ረዳት አብራሪው የጉዞውን አቅጣጫ በመቀየር ጄኔቭ አውሮፓላን ማረፊያ በማሳረፍ እጁን ለፖሊስ ሰጥቷል።\\nየኢትዮጵያ መንግስት የ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"واااااااأسفاه الأمم تفتخر ب 0 أمي ووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووووو..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد بكل جديد\\nأهلا وسهلا بك زا..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"بنى عجل : قبيلة من عجل بن لجيم بن صعب بن على بن بكر بن وائل انتقل اغلبهم الى البصرة فى العراق و اصفهان و خراسان فى ايران و اذرب..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"আমি, এই সংগঠনৰ সদস্য সকলে একেলগ হৈ অসমকে ধৰি ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ অমূল্য কলা-সাংস্কৃতিক সম্পদৰাজি বৃহত্তৰ অষ্ট্ৰেলিয়াৰ সন্মু..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"The Killers llanzaron el so álbum debú, Hot Fuss, en xunu de 2004 nel Reinu Xuníu, al traviés de la discográfica Lizard King, y..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Жинда малъараб ва божизе бегьулеб рагІудаса кьуризе бегьуларо гьев. Гьес насихІат гьабизе кколелъул бацІцІадаб диналъул рахъалъ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün xərclərini yerli daxilolmalar hesabına maliyyələşdirir.\\nDünən, 10..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"لعلی ١٣-جو عصرده یاشاییب یاراتمیش گؤرکملی آذربایجان شاعرلریندندیر. ١٢٢٤-جی ایلده تبریزده آنادان اولموشدور، گنج یاشلاریندا تیجار..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Күҙәтеү ҡуласаһы моделен хәҙер Мифтахетдин Аҡмулла исемендәге Башҡорт дәүләт педагогия университетында ла эшләргә мөмкин\\t\\nКүҙ..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": " vo"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"& ÿ ó / í 0 - ø û ù ö ú ð ï ú \\u0014 ù þ ô ö í ÷ ò \\u0014 ÷ í ù û ö í \\u0001 û ñ ç þ \\u0001 ð \\u0007 þ ò ñ ñ ò ô \\u0017 û ö ô ÷..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Брэсцкія ўлады не дазволілі прафсаюзу РЭП правесці пікетаванне ў парку Воінаў-інтэрнацыяналістаў 30 мая 2018 года.\\nСітуацыю пр..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ЖАЛБОПОДАТЕЛЯТ директор на Дирекция „ Обжалване и данъчно-осигурителна практика“- Бургас, редовно призован, се представлява от ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"सुकमा जिला भारत के छत्तीसगढ़ राज्य में एगो जिला बाटे। एकर मुख्यालय सुकमा शहर बाटे। एकर कुल रकबा 5636 वर्ग कि॰मी॰ बाटे।\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ভড়ং সর্বস্ব বাংলা আর্ট অ্যান্ড কালচারের হিসাব গুলিয়ে দেওয়ার ম্যাজিকের নাম ব্রাত্য রাইসু November 23, 2017\\nভড়ং সর্বস্ব বাংলা আর..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"བོད་མི་འདི་དག་ནི་རང་རྒྱུད་སྒོ་རུ་ཕུད་དེ་གཞན་རྒྱུད་པང་དུ་ཉར་ནས་གསོ་སྐྱོང་བྱེད་དགོས་ཟེར་བ་དང་གཅིག་མཚུངས་རེད།\\nཚན་རིག་ནི་དང་ཐོག་རང..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"পৌরসভা এহার আয়তন (লয়াহান) ২,৭৩০,.৬৩ বর্গ কিলোমিটার। পৌরসভা এহার মাপাহানর অক্ষাংশ বারো দ্রাঘিমাংশ ইলতাই 18.63° S 48.18° W ।[১]..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ar mank Magalhães(Daveoù a vank) a zo ur spesad evned, Spheniscus magellanicus an anv skiantel anezhañ.\\nGallout a reer implijo..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ž šř é ú šř šř ě šř ž é č ě ž ů ě ď éé ýš ě ě Ž č š ý ě ď é ýš ě ď ě éé ýš ě č ž ě š ý ď ě ýš é ú č ž č š ý ď ý ž é éě ď é č ýš..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"2002 оной хабар буряад хэлэ бэшэгэй һалбари Үндэһэтэнэй хүмүүнлиг ухаанай дээдэ һургуули болгогдожо өөршэлэгдөө.\\nХарин мүнөө б..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Daniel Vendrell, conegut com Vandrell, ha sigut un dels il•lustradors contemporanis més influents, representant a la nova onada..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano yo gano..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Шаьш анархисташ ду бохучу жигархойн дIахьедарехь дуьйцу, оьрсийн ницкъаллийн структурийн а, федералан каналан а Iалашонаш \\\"мар..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Si Isko walay pupamilok nga nagtan-aw sa unahan, natugaw. “Naunsa ka gud diha Isko nga layo man kaayo ang imong panan-aw?” ni I..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"رسی رۆژ - ساڵێک دوای بومەلەرزەی کرماشان میوانی بەرنامە : کاک سیاوەش حەیاتی چالاکی مەدەنی -قەسری شیرین\\nپارچە موزیک 30 / 10 / 20..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Akce anarchistů proti připravovanému novému služební řádu a nízkým mzdám 1903 – Historie českého anarchismu (1880 – 1939)\\nRost..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Шыранӑ чухне ӑнсӑртран латин кирилл саспаллисем вырӑнне латин саспаллисене ҫырсан, сайт эсир ҫырнине юсама тӑрӑшӗ.\\nКу сайтра ч..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Mae capeli Cymreig yr Andes ym Mhatagonia wedi cyhoeddi na fydd gwasanaethau yno weddill y mis, oherwydd yr eira trwm sydd wedi..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Den 2.-5. februar 2016 løb det tredje kursus i uddannelsen af 4kommunesamarbejdets Local Impact Coaches, af stablen i Gentofte ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Auf dieser Seite gibt es mind. ein YouTube Video. Cookies für diese Website wurden abgelehnt. Dadurch können keine YouTube Vide..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Zıwanê Slawki, zıwano merdumanê Slawano. Zıwanê Slawki yew lızgeyê Zıwananê Hind u Ewropao. Keyeyê Zıwananê Slawki beno hirê letey:"
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Pśiklaskaju južo pśed pśedstajenim... 1500 źiśi njamóžo wěcej docakaś, měsćańska hala w Chóśebuzu - wupśedana."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ބ. އަތޮޅުގައި ހުޅުވަން ތައްޔާރުވަމުން އަންނަ ވައްކަރު ރިސޯޓުގައި ވަޒީފާ އަދާކުރަން ޝައުގުވެރިވާ ފަރާތްތަކަށް ކުރިމަތިލުމުގެ ފުރ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Νεκρός εντοπίστηκε μέσα στο σπίτι του στην οδό Ηρώδου Αττικού στον αριθμό 7 ο επικεφαλής του προξενικού τμήματος της Ρωσικής πρ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"A séguit dal prucès ad rubutiśasiòṅ di abitànt dal pòpul ad Mikenes, Angoras 'l è finî dènt'r a 'n robot cun la tèsta dna rana ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, which he shared with John Blanchard during his first visi..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ĉu ... preĝi | mediti | ricevi instigojn || kanti | muziki || informiĝi | legi | studi || prepari Diservon\\nTemas pri kolekto d..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazamiento, Los bailes fáciles pa..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"MTÜ AB Video järgib oma tegevuses kodanikuühenduste eetilise tegevuse üldtunnustatud põhimõtteid, mis on lühidalt kokkuvõetud 7..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Gure jarduerek eraikuntzarekin, elkarbizitzarekin, hirigintzarekin eta ekologiarekin dute harremana, baita ideia eta konponbideak irudikatu eta garatzearekin ere, eraikuntza sektorea hobetuz, pertsonen erosotasuna eta bizi-kalitatea hobetzeko."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"قـــــــــــــــــرار بود با هم کنـــــــــــــار بیایم نه اینکه از کنــــــــــــار هم رد بشیم...!!!\\nاگر روزی دلت لبریز غم بو..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Kiitos Deelle kaikesta - 1,5 viikkoa kulunut, kun Dee ei ole enää ollut omani. Reilu viikko sitten sunnuntaina vein Deen uuteen kotiinsa. Itselläni on ollut niin ristiriitaiset t..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Média de débat d'idées, de culture et de littérature. Récits, décryptages, analyses, portraits et critiques autour de la vie des idées. Magazine engagé, ouvert aux autres et au monde.. Bring up to date in french"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Hiragana’ Practice’Sheet’1’(A -O)’ ’ Name:’________ __________________________’Section:’_______________ _’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ’ ..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Nim in sêfte ride op Holmsjön, yn ien fan 'e lytse marren yn de omkriten, of nim se op avontueren lykas nonresidential. lâns Indalsälven wetter. Holm Sportklubb hawwe kano 's te huur, yn gearwurking mei de Baltyske Power konferinsje."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Is fóram é seo chun plé a dhéanamh ar an leabhar atá roghnaithe do mhí na Samhna 2013 amháin. Ní féidir ach le baill chláraithe..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Zhou Yujun, a 'phàrtaidh Rùnaire Comataidh Sgìre Yanfeng ann Hengyang bhaile agus a Sgìre pàrtaidh agus an riaghaltas a' bhuidheann-riochdachaidh a 'tighinn a chèilidh air ar companaidh air Apr. 14, 2017."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"O persoal de Inditex da provincia de Pontevedra segue a reclamar iguais condicións laborais no conxunto do país - CIG: Confeder..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"º ÑÆÚÓ À Ã Ð É Æ ¾ ÄÂ Î À ¼ Æ É ÄÛ = Ü Ý\\\"Þ ßà á â ã ä å æçè ã é ê â å àë ì æê íî é á ë ï í çì àð í Ü à ñ ê é ò ä ì\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"दुष्ट शीळ हें कौरवांचें । रामें सविस्तर देखूनि साचें । बोलिले वचनें जें दुर्वाचे । करी तयांचें अनुस्मरण ॥२२०॥\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"અધિક માસ ચાલે છે. સમગ્ર ભારતમાં અને તેમાંય ખાસ કરીને પવિત્ર કે ધાર્મિક કહેવાય છે તેવા સ્થાનક પર કથાનો દોર ચાલે છે. ઉનાળાની કાળઝ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"זקוקים לרשתות נגד יתושים? מחפשים רשת מתאימה לחלון צר וקטן? רשתות נגד יתושים אקורדיון של חברת קליר-מש הן הפתרון.\\nרשתות לחלונות ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"'आइटम गर्ल' बनकर हिट हुई थीं राखी सावंत, आज करीना-कटरीना तक फॉलो कर रही हैं ट्रेंड नक्सलियों का दम निकालेगा बाइक ग्रेनेड लॉन्च..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"U raspravi je sudjelovao i HSS-ov saborski zastupnik rekavši kako poljoprivrednici ne osjete mjere o kojima ministar govori jer..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Budyšin (SN/BŠe). Elektronikarjo mějachu lětsa cyle hinaši zazběh do swojeho wukubłanja. Wokrjesne rjemjeslnistwo bě mjenujcy w..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"monster - Amatőr, házi szex videók és kezdő csjaok pornó filmjei. - Free amateur, home made sex videos and online porn movies. ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Արցախի Հանրապետության հռչակման 26-րդ տարեդարձի կապակցությամբ Շուշիի Արվեստի կենտրոնում կազմակերպվել է մոսկվաբնակ նկարիչներ՝ հայ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha ha h..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Perihal dari itu, kalau kunci hal yang demikian hilang, pemilik wajib melapor ke bengkel sah untuk dibuatkan kunci baru dengan ..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Plastic Yo Yo Metal Yo Yos Wooden Yo Yo Keychain Yo Yo Translucent Yo Yo Light Up Yo Yo Globe Yo Yo Stress Reliever Yo Yo Jellyfish Yo Yo Sports Ball Yo Yo Sound Yo Yo Miniature Yo Yo Promotional Yo Yo Novelty Yo Yo Video Game Yo Yo ECO Recycled Yo Yo"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Segun ken ni Ping-ay, ti yellow corn ti maysa kadagiti nadakamat a liberalized agricultural commodity iti daytoy a free trade k..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Chekia esas parlamentala republiko. La chefo di stato esas la prezidanto. Til 2013 lu elektesis dal parlamento. Pos ta yaro, ol..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Eyjar.net - upplýsinga- og fréttamiðill um Vestmannaeyjar - Fréttir - Nái núverandi stefna stjórnvalda fram að ganga mun það va..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Jaundice - causes, treatment & pathology massaggio a osteochondrosis dellindizio di una controindicazione\\nTrattamento su un co..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します!お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です!お子様の晴れ姿を、緊張していない自然な笑顔で残しませんか?\\n※七五三の..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "ni'o 23 la cimast. cu 23moi djedi fi'o masti la cimast. noi ke'a cu cimoi masti .i 22 la cimast. cu purlamdei .ije 24 la cimast. cu bavlamdei"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"José Mourinho (diwaca: [ʒuˈzɛ moˈɾiɲu]; lair ing Setubal, Portugal, 26 Januari 1963; umur 55 taun) iku salah siji pelatih bal k..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"წამიყვანე შენთან ერთად (ქართულად) / Возьми меня с собой (картулад) / (რუსული სერიალები ქართულად) (რუსების პორნო ონლაინში) (ruse..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Түлкібас ауданында «Латын негізді әліпби мен емле ережесі туралы насихат» жобасының тобы семинар өткізді\\nЕлорданың «Қазақстан»..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ខ្សឹបដាក់ត្រចៀក៖ លោក សួស សុផានិត នាយផ្នែករដ្ឋបាលព្រៃឈើ ស្រុកភ្នំក្រវាញ់ ដែលទើបឡើងកាន់តំណែងថ្មី បើកដៃឲ្យឈ្នួញ ប្រព្រឹត្តបទល្មើស ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ರಾಷ್ಟ್ರಪತಿ ಪ್ರಣಬ್ ಮುಖರ್ಜಿಯಿಂದ ಪದ್ಮ ಪ್ರಶಸ್ತಿ ಪ್ರದಾನ | President Pranab Mukherjee Confers Padma Awards | Photo Gallery on Kannada..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"CIA 프로젝트에서는 데이터베이스로 들어오는 요청을 중간에 수집(Sniffing)하고 수집한 데이터를 분석(Parsing)하여 그로 인한 결과를 판단하여 알릴 수 있는 시스템(Push Service)이 필요하다. 그리고 연구를 ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Шамханланы, Бийлени къаршысына ябушуп, Батыр уланларыбызны къоллары булан «ортакъ ожакъ» къургъанбыз. Шо иш уллу зараллы иш бол..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Me di 114 bernameyên xwe yên berê da perçeyên ji berhemên zanyarî yên kurdzanên mezin bi wergera kurdî da ...\\nMe di 114 bernam..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Коми кытшыслӧн ыджытжык тор вӧр увтын куйлӧ, сійӧн и фаунасӧ татӧн аркмӧтӧны вӧрын олісь подаэз. Ассямаӧн лоӧ сія, мый кытшас с..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"????????????????????????????????????????????????????????????????????Pray without ceasing???????????????????????????????????????..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Turmush: Бишкек шаардык кеңешинин кезексиз отурумунда мэрге ишенбөөчүлүк көрсөтүү маселеси каралат, - депутат Т.Сагынов\\nБишкек..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Hæ sunt generationes Noë: Noë vir justus atque perfectus fuit in generationibus suis; cum Deo ambulavit.\\nEcce ego adducam aqua..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Während dem Gaardefestival \\\"Ambiance Jardins\\\" vum 15. bis de 17. Mee huet den SNJ nees zesumme mam Groupe Animateur en Inform..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Ахцегь хуьр, виридалай ч1ехи лезги хуьрерикая я. Ам Урусатдин виридалай къиблепатавай хуьрерикай я. Ин хуьр...\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"'t Good Goedenraad aan de Ezerbaek besjteit oet 'n kesjtièl mèt gesjlote haof en 'n park van 26 hectare. Hie in sjtoon väól beu..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Serét (en tortonés: Sregh; en piemontés: Srèj) l'è 'n cümü italià, de la regiù del Piemónt, en Pruvìncia de Alessandria. El g'h..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ຜູ້ພິພາກສາ ປະຈຳເຂດ ສຫລ ທ່ານນຶ່ງ ຕັດສິນວ່າ ໂຄງການເກັບກຳຂໍ້ມູນ ທາງໂທລະສັບ ຂອງອົງການ ຄວາມໝັ້ນຄົງແຫ່ງຊາດ ແມ່ນຖືກຕ້ອງ ຕາມກົດໝາຍ.\\nກະ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"آرلینگتون یئ گئل د شأریا ڤولاتچە ڤیرجینیا و یئ گئل د شأریا ڤولات ڤولاتچە یا یأکاگئرئتە ئمریکاە. ئی شأر دویومی کألوٙن شأر د راسا..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Čir vir vir pavasaris! Čia čia čia… dalinamės labai simpatiška video pamokėle, kurią pristato ab888art galerija.\\nBe galo papra..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Dekoratīvi sliekšņi MITSUBISHI OUTLANDER 2007, izgatavoti no ovālas formas, pulētas nerūsējošā tērauda caurules...\\ndažādas tūn..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"१ · २ · ३ · ४ · ५ · ६ · ७ · ८ · ९ · १० · ११ · १२ · १३ · १४ · १५ · १६ · १७ · १८ · १९ · २० · २१ · २२ · २३ · २४ · २५ · २६ · २७ · २..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Nanamboatra taratasy apetaka sy soso-kevitra ho an'ny olona te-hanatevin-daharana ity fihetsiketsehana ity i Anocrena.\\nNosorat..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Акрет жап годым Уганда кундемым Пигмей племена- влак айлен шогеныт. мемнан эран 1 курым гыч Банту племена влакат тиде кундемышк..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\" ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"„Филм плус“ е насловен првиот филмски месечник во Македонија, чиј прв број ќе биде промовиран вечер во „Менада“. Новото македон..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"സ്ത്രീ പ്രവേശനം സര്ക്കാര് പൂര്ണമായും അംഗീകരിക്കുന്നുവെന്നും ശബരിമലയുടെ സുരക്ഷയില് ഇടപെടുമെന്നും സര്ക്കാര് ഹൈക്കോടതിയില്\\..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Монгол улс, Улаанбаатар хот - 14191 Энхтайваны өргөн чөлөө - 10, Багш хөгжлийн ордон, Багшийн мэргэжил дээшлүүлэх институт\\nБаг..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Home / motivational marathi story / उद्योजकता (Entrepreneurship) / यांना हे जमलय, तर आपल्याला का नाही जमणार ?\\nयापैकी कोणाचीही ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Лӹпӹвлӓ (латинлӓ Lepidoptera ; алыкмарла лыве-влак) — капшангывлӓ йыхыш пырышы сӱмӓн нӹл шылдыран капшангывлӓ. Цилӓжӹ 180000 тӹ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Sanad pertama daripada Zuhair bin Harb daripada ‘Affan daripada Hammad daripada Thabit daripada Anas.\\nSanad kedua daripada ‘Ab..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "tibgħat il-kawża lura lill-Qorti Ġenerali għall-annullament jew għat-tnaqqis tal-penalità imposta mill-Kummissjoni bid-deċiżjoni inizjali kif emendata bid-deċiżjoni ta’ rettifika;"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Deciplina social i outónoma que angloba atebidades de ouserbaçon, de análeze, de çcriçon, cumparaçon, de sistematizaçon i de sp..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ျမ၀တီ - ရန္ကုန္တိုင္းေဒသႀကီး ေျမာက္ဥကၠလာပႏွင္႕ ဗဟန္းၿမိဳ႔နယ္ မေကြးတိုင္း ေဒသႀကီး ပခုကၠဴၿမိဳ႔နယ္တို႔၌ ျမန္မာ႕တပ္မေတာ္အား ေထာက္ခံ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"2018 иень умарьковонь 6-це чистэ сась паро куля! Россиянь культурань Министерствась макссь невтемань конёв (прокатной удостовер..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"قرآن یا قوران اسلام ِآسمونی کتاب هسته. مسلمونون گانّّه قرآن ره خدا، وحی جه برسنییه، «محمد معجزه» هسته و ثقلین حدیث دله ونه خَو..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "In mācuīlpōhualxihuitl VI (inic chicuacē) in mācuīlpōhualli xiuhitl cāhuitl īhuīcpa 501 xihuitl oc 600 xihuitl."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ò AUDIT í Ç è î ÿ å å 30 ò ÿ ÿ é, õ ñ ì ÿ, ê ã- ò à ì. å â å í ç â à à é ñ è å é ó ó ë. å å å û è å î é è à. à è à AUDIT 1-7 â ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Dor kann sik vun nu af an de hele plattdüütsche Welt – vun Niebüll bit New York, vun Helgoland bit Honolulu – drapen. Allens, w..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"बर्दिबास नगरपालिकाको तेस्रो नगर परिषदबाट पारित आ.व.२०७३।७४ को संशोधित र २०७४।७५ को प्रस्तावित नीति, कार्यक्रम तथा बजेट\\nअार्थिक..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"थ्व शहरयागु अक्षांश ३४.७००१६४ उत्तर व देशान्तर ८६.३७६४६९ पश्चिम खः (34.700164° N 86.376469° W)। थ्व थासे ७२२६७३२ वर्ग मिटर (२.७..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Op vrijdag 31 augustus wordt het nieuwe studiejaar van de masteropleiding architectuur geopend met een dagexcursie naar Venlo.\\..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "Planomtale krav til innhald Bakgrunn: Spørsmål frå fleire kommunar om kva ein planomtale/planbeskrivelse bør innehalde Fylkeskommunen og fylkesmannen har i ein del saker reist motsegn på formelt grunnlag"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Ytterligere aktører i primærhelsetjenesten og andre NHS-virksomheter ble infisert, inkludert legekontor.Læreren vår er så attra..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": ".рф (rf, còdi punycode: .xn--p1ai)[1] es lo nom de domeni en rus per Russia. Foguèt activat lo 12 de mai de 2010. Lo còdi latin es .ru."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ଭୁବନେଶ୍ୱର, ୨୭/୧– (ଓଡ଼ିଆ ପୁଅ) ସିପିଆଇ ଜାତୀୟ ପରିଷଦର ଆହ୍ୱାନକ୍ରମେ ଗତକାଲି ଜାନୁୟାରୀ ୨୬ ସାଧାରଣତନ୍ତ୍ର ଦିବସକୁ ଦେଶ ବ୍ୟାପୀ ସମ୍ବିଧାନ ସୁରକ୍ଷା ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"1. Лæппу æмæ чызг казрæдзийы зæрдæмæ куы фæцæуынц æмæ, куы сфæнд кæнынц сæ цард баиу кæнын, уæд лæппу бар ракуры чызгæй, цæмæй ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ਰਜਿ: ਨੰ: PB/JL-138/2018-20 ਜਿਲਦ 63, ਬਾਨੀ ਸੰਪਾਦਕ (ਸਵ:) ਡਾ: ਸਾਧੂ ਸਿੰਘ ਹਮਦਰਦ ਫ਼ੋਨ : 0181-2455961-62-63, 5032400, ਫੈਕਸ : 2455960, 2..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Áku pu i Anak ning Aláya at ngeni ipákit kó kékayu ngan nûng makanánu lang susúlat détinang kulit a mágkas. Lauan ya ing tarátu..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"System informatyczny - Załącznik nr 1 do zarządzenia Wójta Gminy Podegrodzie Nr 530/2013 z dnia 27 maja 2013 r\\nSystem informat..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Louvigné-du-Désert a l'é na comun-a fransèisa ant la region aministrativa dla Brëtagna, ant ël dipartiment d'Ille-et-Vilaine. A..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ایہ فائل Wikimedia Commons توں اے تے دوجیاں ویونتاں تے وی ورتی جاےکدی اے۔ گل بات اس دے فائل گل بات صفہ تے تھلے دتی گئی۔\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Many people usually use the time period ‘business to business (B2B) advertising,’ however most of them do not know precisely wh..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Você pode estar lendo este texto no sofá, levantar pra pegar uma breja na geladeira, dar uma cagada e sentar novamente, sem int..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Warayu wichay (kastilla simipi: Ascensión de Guarayos) nisqaqa Buliwya mama llaqtapi, Santa Krus suyupi, huk llaqtam, Warayu pruwinsyap uma llaqtanmi."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"practicists agrars / practicistas agraras AFP pon far ina furmaziun da basa scursanida per cuntanscher in attestat federal da q..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"“În viață, oportunitatea nu este totul. Cine atrage Lumina, cineva bun în umbră. Timpul ne creează.” maestru\\nLyn.Evans: Ce mar..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Доступ к данному профилю для публичного просмотра закрыт администрацией сайта - профиль находится на модерации.\\nРазработчикам ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"अनिरुद्धनगरे क्रीडिता रामलीला सम्प्रति समाप्ता अस्ति । तस्य कानिचन् चित्राणि पूर्वमेव प्रकाशितानि सन्ति । द्वौ चलचित्रौ अपि ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████..."
}
An example of 'train' looks as follows.
{
"id": 0,
"text": "La gilusìa è nu sintimentu dulurusu ca nasci d'un disideriu di pussessu sclusivu ntê cunfrunti dâ pirsuna amata e dû timuri, dû suspettu o dâ cirtizza dâ sò nfidiltati."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"هر ڪو ڄاڻي ٿو ته جڏهن توهان هڪ وڏي خريد ڪرڻ چاهيون ٿا, توهان پڄي ضروري حڪم ۾ ان جي ڪم ڪرڻ جي هٿ ۾ لاڳاپو ڪيو آهي. جي شيء آهي ته..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Opština Gornja Radgona se nalazi u sjeveroistočnoj Sloveniji i graniči s susjednom Austriji duž rijeke Mure. Sa tridesetim nase..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"ලාංකීය සිතිවිලි සිංහල බ්ලොග් කියවනය කොත්තු සින්ඩිය ලංකා Blogger හත්මාළුව ලංකා බ්ලොග් කියවනය මාතලන්ගේ සින්ඩිය මොබයිල්lk\\nඅවකාශය ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Aktivity | Agentúra podporovaného zamestnávania | vzdelávanie pre klientov, vzdelávanie pre odborníkov, kurzy\\nŠpecializované k..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Če Creatures, ki je želel, da pridejo na čas, predvsem je povedlo – razlikuje od ljubosumja začel grizenja kolen (ali zadnjica)..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт тттттттттттттттттттттттттттттттт ттттттттттттттттуууууууууууу..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Çfarë do të më pëlqente tek një femër ose çfarë do të më shndërronte në një shpërthim drite? – Albert Vataj\\nTë gjithëve një zo..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Корисни савети за сваки дан. На сајту су разне категорије, као што су љепота, мода, кување и поправка властитим рукама.\\nШколск..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Kartu krédit nyaéta \"duit plastik\" anu dikaluarkeun ku bank pikeun alat pambayaran di tempat-tempat nu tangtu samisal jiga di hotél, réstoran, tempat rékréasi jeung sajabana.[1]"
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"1783 är ett viktigt årtal i den nya tidens historia. Det året slöts en fred i Paris och därmed blev de 13 brittiska kolonierna ..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Miripuko hiyo inakuja mwanzoni mwa Wiki Takatifu kuelekea Pasaka na ikiwa ni wiki chache tu kabla ya Papa Francis kuanza ziara yake katika nchi hiyo yenye idadi kubwa kabisa ya watu katika ulimwengu wa nchi za Kiarabu."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"பொழுது சாய்ந்து வெகு நேரமாகிவிட்டது. கூலி வேலைக்குப் போயிருந்த 'சித்தாள் ' பெண்கள் எல்லோரும் வீடு திரும்பி விட்டார்கள். இன்னும்..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"హర్యానాలో టోల్ దగ్గర సిబ్బంది.. స్థానిక ప్రజలు కొట్టుకున్నారు. కర్నాల్ అనే గ్రామానికి సమీపంలో టోల్ గేట్ ఉంది. అయితే సాధారణంగా స..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Ҳумайро гуфтааст, мухолифи низом аст, низоме, ки дар Тоҷикистон вуҷуд дорад. Ба ин маънӣ, худро мухолифи давлату ҳукумати Тоҷик..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ฟันที่แลดูขาวสะอาดไม่มีเศษอาหารติดอยู่ เหงือกสีชมพู ไม่เจ็บ หรือมีเลือดออกเวลาแปรงฟันหรือขัดฟัน ไม่มีปัญหาเรื่องกลิ่นปาก ทำให้ก..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"Türkmenistanyň Prezidenti agyr atletika boýunça dünýä çempionatyna taýýarlyk işleriniň barşy bilen tanyşdy\\nHalallykdan kemal t..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"“Gusto ko manawagan sa mga Unit Head ng Chanel 2 Salve. Kasi napapansin ko iyon mga alaga ko ang taping halos once a week lang,..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Son yıllarda görülen ay tutulmalarına göre daha etkili olacağı söylenen Kanlı veya Kırmızı Ay Tutulmasına saatler kaldı. Bu akş..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"\\\"Иремнең вафатына 40 көн узгач, Алмаз да безнең өйгә кереп үлде\\\". Арчада 35 яшьлек ир өстенә кондызлар ега башлаган агач төшк..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Экии, хүндүлуг аалчылар болгаш тыва дылдың деткикчилери! Тыва дылдың болгаш чогаалдың ховар бир башкызынга, Менги Ооржакка, ажы..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"زاڭ-ءتۇزىم | عىلىم-تەحنيكا | ءتىل-ادەبيەت | تۇرمىس | دەنە تاربيە | ساياحات-ورتا | سۋرەتتى حابار | سىر سۇحبات | ارناۋلى تاقىرىپ ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Про надання роз'яснення (щодо форми письмового зобов'язання громадян про зворотне ввезення/вивезення товарів), Державна митна с..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"آئیے اہم اسلامی کتب کو یونیکوڈ میں انٹرنیٹ پر پیش کرنے کے لئے مل جل کر آن لائن ٹائپنگ کریں۔ محدث ٹائپنگ پراجیکٹ کے ذریعے آپ روز..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Qurama tog'lari tizmasining Toshkentdan 154 km uzoqlikdagi Toshkent-Ush yo'li yeqasidaxushmanzara tabiat qo'ynida joylashgan maydoni 30 ga.\nBolalarni sog'lomlashtirish oromgohi Bo'stonliq tumani Oqtosh muntaqasining soy-salqin gushasida joylashgan."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Par ogni pónto, ła derivada ła xe ła pendensa de ła reta tangente a ła curva de ła funsion f. Ła reta de cołor róso l'è senpre ..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Canh chua cá bông lau không chỉ là món ăn giải nhiệt, thanh mát ngày hè mà còn là món siêu bổ dưỡng, rất tốt cho người gầy ốm. ..."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Sarniguet binon zif in ziläk: Hautes-Pyrénées, in topäd: Midi-Pyrénées, in Fransän. Sarniguet topon videtü 43°19’ 7’’ N e lunetü 0°5’ 19’’ L."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "Cisse pådje ci n' est co k' on djermon, dj' ô bén k' el pådje est djusse sibåtcheye, eyet co trop tene; et s' divreut ele ecråxhî ene miete."
}
An example of 'train' looks as follows.
{
"id": 1,
"text": "An Honce amo in usa ka baryo ngan munisipalidad ha distrito han Rožňava ha rehiyon han Košice ha nasod han Slovakia.\nAn Rumegies amo in usa ka komyun ha departamento han Nord ngan ha rehiyon han Nord-Pas-de-Calais ha nasod han Fransya."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"伊春元旦天气 伊春腊八天气 伊春春节天气 伊春情人节天气 伊春元宵节天气 伊春愚人节天气 伊春清明节天气 伊春劳动节天气 伊春母亲节天气 伊春端午节天气 伊春七夕节天气 伊春教师节天气 伊春中秋节天气 伊春国庆节天气 伊春重阳节天气 伊春万圣节天气 伊春..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Арнгудин Орн гисн Европд бәәдг һазр. 2007 җилин тooһaр эн орн нутгт 3,600,523 әмтн бәәдг билә. Арнгудин Орнин хотл балһсна нерн..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"მოჩამილი ტექსტი წჷმორინელი რე Creative Commons Attribution-ShareAlike ლიცენზიათ; შილებე გეძინელი პირობეფიშ არსებუა. კილიშკილიშა..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"ממשותדיק - חבֿרה, איך אַרבעט איצט אױף אַ זשורנאַל. טאָמער איר האָט עפּעס צוצוגעבן זאָלט איר שיקן מיר אַן אָנזאָג. ס'װעט הײסן \\\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 0,
"text": "\"Copyright © 2018 BBC. BBC kò mọ̀ nípa àwọn ohun tí ó wà ní àwọn ojú òpó tí ó wà ní ìta. Ọwọ́ tí a fi mú ìbáṣepọ̀ ti ìta.\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 我 灌 我 灌 我 灌 灌 灌 你還不爆 我累了 投降輸一半可以嗎\"..."
}
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"id": 1,
"text": "\"中国铝灰网 中国有色金属矿产网 中国黄莲网 中国水轮发电机网 中国抽油泵网 中国数控雕刻机网 中国不锈钢抛光网 中国磨具加工网 中国压铸铝网 中国耐水腻子网 中国手机摄像头网 中国粗粮网 中国车门锁网 中国钛粉网 中国轮圈网\\n天天中奖彩票图 天天中彩票..."
}
The data fields are the same across all configurations: id (an integer) and text (a string), as seen in the examples above.
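As a rough illustration of how the configuration names and record layout fit together, the sketch below builds a configuration name from a language code and checks that a record has the two fields shown in the examples. The helper names (`config_name`, `is_valid_record`, `retained_fraction`) are hypothetical and not part of any library; the naming pattern `unshuffled_<variant>_<code>` is inferred from the configuration names listed in the table.

```python
# Hypothetical helpers; none of these names come from the OSCAR tooling.
VARIANTS = ("original", "deduplicated")


def config_name(language_code: str, variant: str = "deduplicated") -> str:
    """Build a configuration name such as 'unshuffled_deduplicated_af'.

    Assumes the 'unshuffled_<variant>_<code>' pattern seen in the table.
    """
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    return f"unshuffled_{variant}_{language_code}"


def is_valid_record(record: dict) -> bool:
    """Check that a record has an integer 'id' and a string 'text'."""
    return isinstance(record.get("id"), int) and isinstance(record.get("text"), str)


def retained_fraction(train_original: int, train_deduplicated: int) -> float:
    """Fraction of training examples that remain after deduplication."""
    return train_deduplicated / train_original


if __name__ == "__main__":
    print(config_name("af"))              # deduplicated Afrikaans configuration
    print(config_name("sq", "original"))  # original Albanian configuration
    print(is_valid_record({"id": 0, "text": "aanlyn markte as gevolg van ..."}))
    # Afrikaans counts taken from the table: 201117 original, 130640 deduplicated.
    print(round(retained_fraction(201117, 130640), 2))
```

Assuming the Hugging Face `datasets` library is installed and still serves this dataset, the resulting name would typically be passed as the second argument, for example `load_dataset("oscar", config_name("af"))`. For Afrikaans, the counts in the table imply that roughly 65% of the examples survive deduplication.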
Language | Language code | Config name (original) | Train examples (original) | Words (original) | Size (original) | Config name (deduplicated) | Train examples (deduplicated) | Words (deduplicated) | Size (deduplicated) |
---|---|---|---|---|---|---|---|---|---|
Afrikaans | af | unshuffled_original_af | 201117 | 43,482,801 | 241M | unshuffled_deduplicated_af | 130640 | 29,533,437 | 163M |
Albanian | sq | unshuffled_original_sq | 672077 | 374,196,110 | 2.3G | unshuffled_deduplicated_sq | 461598 | 186,856,699 | 1.2G |
Alemannic | als | unshuffled_original_als | 7324 | 841,750 | 5.0M | unshuffled_deduplicated_als | 4518 | 459,001 | 2.8M |
Amharic | am | unshuffled_original_am | 83663 | 28,301,601 | 360M | unshuffled_deduplicated_am | 43102 | 16,086,628 | 206M |
Arabic | ar | unshuffled_original_ar | 16365602 | 8,117,162,828 | 82G | unshuffled_deduplicated_ar | 9006977 | 3,171,221,354 | 32G |
Aragonese | an | unshuffled_original_an | 2449 | 52,896 | 1.3M | unshuffled_deduplicated_an | 2025 | 45,669 | 801K |
Armenian | hy | unshuffled_original_hy | 659430 | 273,919,388 | 3.7G | unshuffled_deduplicated_hy | 396093 | 110,196,043 | 1.5G |
Assamese | as | unshuffled_original_as | 14985 | 6,956,663 | 113M | unshuffled_deduplicated_as | 9212 | 4,366,570 | 71M |
Asturian | ast | unshuffled_original_ast | 6999 | 381,005 | 2.4M | unshuffled_deduplicated_ast | 5343 | 325,237 | 2.0M |
Avaric | av | unshuffled_original_av | 456 | 24,720 | 409K | unshuffled_deduplicated_av | 360 | 19,478 | 324K |
Azerbaijani | az | unshuffled_original_az | 912330 | 322,641,710 | 2.8G | unshuffled_deduplicated_az | 626796 | 167,742,296 | 1.5G |
Bashkir | ba | unshuffled_original_ba | 42551 | 9,796,764 | 128M | unshuffled_deduplicated_ba | 27050 | 6,922,589 | 90M |
Basque | eu | unshuffled_original_eu | 506883 | 120,456,652 | 848M | unshuffled_deduplicated_eu | 256513 | 45,359,710 | 342M |
Bavarian | bar | unshuffled_original_bar | 4 | 399 | 503 | unshuffled_deduplicated_bar | 4 | 399 | 503 |
Belarusian | be | unshuffled_original_be | 586031 | 144,579,630 | 1.8G | unshuffled_deduplicated_be | 307405 | 83,499,037 | 1.1G |
Bengali | bn | unshuffled_original_bn | 1675515 | 623,575,733 | 11G | unshuffled_deduplicated_bn | 1114481 | 363,766,143 | 5.8G |
Bihari | bh | unshuffled_original_bh | 336 | 8,848 | 110K | unshuffled_deduplicated_bh | 82 | 2,875 | 34K |
Bishnupriya | bpy | unshuffled_original_bpy | 6046 | 198,286 | 4.1M | unshuffled_deduplicated_bpy | 1770 | 96,940 | 1.7M |
Bosnian | bs | unshuffled_original_bs | 2143 | 106,448 | 447K | unshuffled_deduplicated_bs | 702 | 20,485 | 116K |
Breton | br | unshuffled_original_br | 37085 | 5,013,241 | 29M | unshuffled_deduplicated_br | 14724 | 2,890,384 | 16M |
Bulgarian | bg | unshuffled_original_bg | 5869686 | 2,947,648,106 | 32G | unshuffled_deduplicated_bg | 3398679 | 1,268,114,977 | 14G |
Burmese | my | unshuffled_original_my | 232329 | 56,111,184 | 1.9G | unshuffled_deduplicated_my | 136639 | 30,102,173 | 1.1G |
Catalan | ca | unshuffled_original_ca | 4390754 | 1,360,212,450 | 8.0G | unshuffled_deduplicated_ca | 2458067 | 729,333,440 | 4.3G |
Cebuano | ceb | unshuffled_original_ceb | 56248 | 6,603,567 | 39M | unshuffled_deduplicated_ceb | 26145 | 3,675,024 | 24M |
Central Bikol | bcl | unshuffled_original_bcl | 1 | 312 | 885 | unshuffled_deduplicated_bcl | 1 | 312 | 885 |
Central Khmer | km | unshuffled_original_km | 159363 | 20,690,610 | 1.1G | unshuffled_deduplicated_km | 108346 | 10,082,245 | 581M |
Central Kurdish | ckb | unshuffled_original_ckb | 103639 | 48,478,334 | 487M | unshuffled_deduplicated_ckb | 68210 | 18,726,721 | 226M |
Chavacano | cbk | unshuffled_original_cbk | 1 | 130 | 520 | unshuffled_deduplicated_cbk | 1 | 130 | 520 |
Chechen | ce | unshuffled_original_ce | 4042 | 711,051 | 8.3M | unshuffled_deduplicated_ce | 2984 | 568,146 | 6.7M |
Chinese | zh | unshuffled_original_zh | 60137667 | 14,986,424,850 | 508G | unshuffled_deduplicated_zh | 41708901 | 6,350,215,113 | 249G |
Chuvash | cv | unshuffled_original_cv | 20281 | 3,041,614 | 39M | unshuffled_deduplicated_cv | 10130 | 2,054,810 | 26M |
Cornish | kw | unshuffled_original_kw | 203 | 8,329 | 44K | unshuffled_deduplicated_kw | 68 | 2,704 | 14K |
Croatian | hr | unshuffled_original_hr | 582219 | 34,232,765 | 226M | unshuffled_deduplicated_hr | 321484 | 16,727,640 | 110M |
Czech | cs | unshuffled_original_cs | 21001388 | 7,715,977,441 | 53G | unshuffled_deduplicated_cs | 12308039 | 3,540,997,509 | 24G |
Danish | da | unshuffled_original_da | 7664010 | 2,637,463,889 | 16G | unshuffled_deduplicated_da | 4771098 | 1,620,091,317 | 9.5G |
Dhivehi | dv | unshuffled_original_dv | 21018 | 7,559,472 | 126M | unshuffled_deduplicated_dv | 17024 | 4,726,660 | 79M |
Dimli | diq | unshuffled_original_diq | 1 | 19 | 146 | unshuffled_deduplicated_diq | 1 | 19 | 146 |
Dutch | nl | unshuffled_original_nl | 34682142 | 13,020,136,373 | 78G | unshuffled_deduplicated_nl | 20812149 | 6,598,786,137 | 39G |
Eastern Mari | mhr | unshuffled_original_mhr | 3212 | 565,992 | 7.2M | unshuffled_deduplicated_mhr | 2515 | 469,297 | 6.0M |
Egyptian Arabic | arz | unshuffled_original_arz | 158113 | 7,305,151 | 66M | unshuffled_deduplicated_arz | 79928 | 3,659,419 | 33M |
Emilian-Romagnol | eml | unshuffled_original_eml | 84 | 6,376 | 25K | unshuffled_deduplicated_eml | 80 | 6,121 | 24K |
English | en | unshuffled_original_en | 455994980 | 418,187,793,408 | 2.3T | unshuffled_deduplicated_en | 304230423 | 215,841,256,971 | 1.2T |
Erzya | myv | unshuffled_original_myv | 6 | 90 | 1.4K | unshuffled_deduplicated_myv | 5 | 78 | 1.2K |
Esperanto | eo | unshuffled_original_eo | 121171 | 48,486,161 | 299M | unshuffled_deduplicated_eo | 84752 | 37,324,446 | 228M |
Estonian | et | unshuffled_original_et | 2093621 | 643,163,730 | 4.8G | unshuffled_deduplicated_et | 1172041 | 309,931,463 | 2.3G |
Finnish | fi | unshuffled_original_fi | 8557453 | 3,196,666,419 | 27G | unshuffled_deduplicated_fi | 5326443 | 1,597,855,468 | 13G |
French | fr | unshuffled_original_fr | 96742378 | 46,896,036,417 | 282G | unshuffled_deduplicated_fr | 59448891 | 23,206,776,649 | 138G |
Galician | gl | unshuffled_original_gl | 544388 | 102,011,291 | 620M | unshuffled_deduplicated_gl | 284320 | 63,600,602 | 384M |
Georgian | ka | unshuffled_original_ka | 563916 | 171,950,621 | 3.6G | unshuffled_deduplicated_ka | 372158 | 91,569,739 | 1.9G |
German | de | unshuffled_original_de | 104913504 | 44,878,908,446 | 308G | unshuffled_deduplicated_de | 62398034 | 21,529,164,172 | 145G |
Goan Konkani | gom | unshuffled_original_gom | 640 | 124,277 | 2.2M | unshuffled_deduplicated_gom | 484 | 102,306 | 1.8M |
Guarani | gn | unshuffled_original_gn | 106 | 7,382 | 36K | unshuffled_deduplicated_gn | 68 | 4,680 | 24K |
Gujarati | gu | unshuffled_original_gu | 240691 | 72,045,701 | 1.1G | unshuffled_deduplicated_gu | 169834 | 50,023,432 | 722M |
Haitian | ht | unshuffled_original_ht | 13 | 1,014 | 3.9K | unshuffled_deduplicated_ht | 9 | 832 | 3.3K |
Hebrew | he | unshuffled_original_he | 3808397 | 2,067,753,528 | 20G | unshuffled_deduplicated_he | 2375030 | 1,032,018,056 | 9.8G |
Hindi | hi | unshuffled_original_hi | 3264660 | 1,372,234,782 | 17G | unshuffled_deduplicated_hi | 1909387 | 745,774,934 | 8.9G |
Hungarian | hu | unshuffled_original_hu | 11197780 | 5,163,936,345 | 40G | unshuffled_deduplicated_hu | 6582908 | 2,339,127,555 | 18G |
Icelandic | is | unshuffled_original_is | 625673 | 219,900,094 | 1.5G | unshuffled_deduplicated_is | 389515 | 129,818,331 | 846M |
Ido | io | unshuffled_original_io | 694 | 25,702 | 147K | unshuffled_deduplicated_io | 617 | 22,773 | 130K |
Iloko | ilo | unshuffled_original_ilo | 2638 | 142,942 | 874K | unshuffled_deduplicated_ilo | 1578 | 105,564 | 636K |
Indonesian | id | unshuffled_original_id | 16236463 | 4,574,692,265 | 30G | unshuffled_deduplicated_id | 9948521 | 2,394,957,629 | 16G |
Interlingua | ia | unshuffled_original_ia | 1040 | 180,231 | 662K | unshuffled_deduplicated_ia | 529 | 100,019 | 360K |
Interlingue | ie | unshuffled_original_ie | 101 | 5,352 | 24K | unshuffled_deduplicated_ie | 11 | 602 | 1.6K |
Irish | ga | unshuffled_original_ga | 83223 | 14,483,593 | 88M | unshuffled_deduplicated_ga | 46493 | 10,017,303 | 60M |
Italian | it | unshuffled_original_it | 46981781 | 22,248,707,341 | 137G | unshuffled_deduplicated_it | 28522082 | 11,250,012,896 | 69G |
Japanese | ja | unshuffled_original_ja | 62721527 | 4,962,979,182 | 216G | unshuffled_deduplicated_ja | 39496439 | 1,123,067,063 | 106G |
Javanese | jv | unshuffled_original_jv | 1445 | 104,896 | 659K | unshuffled_deduplicated_jv | 1163 | 86,654 | 583K |
Kalmyk | xal | unshuffled_original_xal | 39 | 10,277 | 113K | unshuffled_deduplicated_xal | 36 | 10,155 | 112K |
Kannada | kn | unshuffled_original_kn | 350363 | 81,186,863 | 1.7G | unshuffled_deduplicated_kn | 251064 | 49,343,462 | 1.1G |
Karachay-Balkar | krc | unshuffled_original_krc | 1581 | 185,436 | 2.6M | unshuffled_deduplicated_krc | 1377 | 166,496 | 2.3M |
Kazakh | kk | unshuffled_original_kk | 524591 | 191,126,469 | 2.7G | unshuffled_deduplicated_kk | 338073 | 108,388,743 | 1.5G |
Kirghiz | ky | unshuffled_original_ky | 146993 | 44,194,823 | 600M | unshuffled_deduplicated_ky | 86561 | 28,982,620 | 388M |
Komi | kv | unshuffled_original_kv | 1549 | 201,404 | 2.3M | unshuffled_deduplicated_kv | 924 | 95,243 | 1.2M |
Korean | ko | unshuffled_original_ko | 7345075 | 2,368,765,142 | 24G | unshuffled_deduplicated_ko | 3675420 | 1,120,375,149 | 12G |
Kurdish | ku | unshuffled_original_ku | 46535 | 15,561,003 | 94M | unshuffled_deduplicated_ku | 29054 | 9,946,440 | 60M |
Lao | lo | unshuffled_original_lo | 52910 | 4,133,311 | 174M | unshuffled_deduplicated_lo | 32652 | 2,583,342 | 114M |
Latin | la | unshuffled_original_la | 94588 | 4,122,201 | 26M | unshuffled_deduplicated_la | 18808 | 1,328,038 | 8.3M |
Latvian | lv | unshuffled_original_lv | 1593820 | 520,761,977 | 4.0G | unshuffled_deduplicated_lv | 843195 | 236,428,905 | 1.8G |
Lezghian | lez | unshuffled_original_lez | 1485 | 247,646 | 3.3M | unshuffled_deduplicated_lez | 1381 | 224,871 | 3.0M |
Limburgan | li | unshuffled_original_li | 137 | 4,730 | 29K | unshuffled_deduplicated_li | 118 | 4,283 | 27K |
Lithuanian | lt | unshuffled_original_lt | 2977757 | 1,159,661,742 | 8.8G | unshuffled_deduplicated_lt | 1737411 | 516,183,525 | 3.9G |
Lojban | jbo | unshuffled_original_jbo | 832 | 154,330 | 736K | unshuffled_deduplicated_jbo | 617 | 141,973 | 678K |
Lombard | lmo | unshuffled_original_lmo | 1401 | 75,229 | 443K | unshuffled_deduplicated_lmo | 1374 | 73,665 | 433K |
Low German | nds | unshuffled_original_nds | 18174 | 2,906,347 | 18M | unshuffled_deduplicated_nds | 8714 | 2,146,417 | 13M |
Lower Sorbian | dsb | unshuffled_original_dsb | 65 | 1,787 | 13K | unshuffled_deduplicated_dsb | 37 | 966 | 7.1K |
Luxembourgish | lb | unshuffled_original_lb | 34807 | 4,403,577 | 29M | unshuffled_deduplicated_lb | 21735 | 3,087,650 | 21M |
Macedonian | mk | unshuffled_original_mk | 437871 | 189,289,873 | 2.1G | unshuffled_deduplicated_mk | 299457 | 102,849,595 | 1.2G |
Maithili | mai | unshuffled_original_mai | 123 | 69,161 | 317K | unshuffled_deduplicated_mai | 25 | 874 | 11K |
Malagasy | mg | unshuffled_original_mg | 17957 | 3,068,360 | 21M | unshuffled_deduplicated_mg | 13343 | 1,872,044 | 13M |
Malay | ms | unshuffled_original_ms | 534016 | 16,696,882 | 111M | unshuffled_deduplicated_ms | 183443 | 6,045,753 | 42M |
Malayalam | ml | unshuffled_original_ml | 603937 | 189,534,472 | 4.9G | unshuffled_deduplicated_ml | 453904 | 95,892,551 | 2.5G |
Maltese | mt | unshuffled_original_mt | 26598 | 2,995,654 | 24M | unshuffled_deduplicated_mt | 16383 | 2,163,358 | 17M |
Marathi | mr | unshuffled_original_mr | 326804 | 162,609,404 | 2.7G | unshuffled_deduplicated_mr | 212556 | 82,130,803 | 1.4G |
Mazanderani | mzn | unshuffled_original_mzn | 1055 | 73,870 | 691K | unshuffled_deduplicated_mzn | 917 | 64,481 | 602K |
Minangkabau | min | unshuffled_original_min | 220 | 5,682 | 608K | unshuffled_deduplicated_min | 166 | 4,825 | 310K |
Mingrelian | xmf | unshuffled_original_xmf | 3783 | 299,098 | 5.8M | unshuffled_deduplicated_xmf | 2418 | 228,629 | 4.4M |
Mirandese | mwl | unshuffled_original_mwl | 8 | 171 | 1.2K | unshuffled_deduplicated_mwl | 7 | 152 | 1.1K |
Modern Greek | el | unshuffled_original_el | 10425596 | 5,479,180,137 | 62G | unshuffled_deduplicated_el | 6521169 | 2,412,419,435 | 27G |
Mongolian | mn | unshuffled_original_mn | 395605 | 181,307,167 | 2.2G | unshuffled_deduplicated_mn | 197878 | 68,362,013 | 838M |
Nahuatl languages | nah | unshuffled_original_nah | 61 | 1,234 | 12K | unshuffled_deduplicated_nah | 58 | 1,193 | 11K |
Neapolitan | nap | unshuffled_original_nap | 73 | 5,282 | 17K | unshuffled_deduplicated_nap | 55 | 4,147 | 13K |
Nepali | ne | unshuffled_original_ne | 299938 | 107,448,208 | 1.8G | unshuffled_deduplicated_ne | 219334 | 71,628,317 | 1.2G |
Newari | new | unshuffled_original_new | 4696 | 564,697 | 5.5M | unshuffled_deduplicated_new | 2126 | 288,995 | 4.1M |
Northern Frisian | frr | unshuffled_original_frr | 7 | 1,516 | 4.4K | unshuffled_deduplicated_frr | 7 | 1,516 | 4.4K |
Northern Luri | lrc | unshuffled_original_lrc | 88 | 8,022 | 76K | unshuffled_deduplicated_lrc | 72 | 6,740 | 63K |
Norwegian | no | unshuffled_original_no | 5546211 | 1,344,326,388 | 8.0G | unshuffled_deduplicated_no | 3229940 | 804,894,377 | 4.7G |
Norwegian Nynorsk | nn | unshuffled_original_nn | 185884 | 14,764,980 | 85M | unshuffled_deduplicated_nn | 109118 | 9,435,139 | 54M |
Occitan | oc | unshuffled_original_oc | 10709 | 750,301 | 5.8M | unshuffled_deduplicated_oc | 6485 | 512,678 | 3.7M |
Oriya | or | unshuffled_original_or | 59463 | 14,938,567 | 248M | unshuffled_deduplicated_or | 44230 | 11,321,740 | 188M |
Ossetian | os | unshuffled_original_os | 5213 | 1,031,268 | 13M | unshuffled_deduplicated_os | 2559 | 878,765 | 11M |
Pampanga | pam | unshuffled_original_pam | 3 | 130 | 760 | unshuffled_deduplicated_pam | 1 | 52 | 304 |
Panjabi | pa | unshuffled_original_pa | 127467 | 61,847,806 | 763M | unshuffled_deduplicated_pa | 87235 | 37,555,835 | 460M |
Persian | fa | unshuffled_original_fa | 13704702 | 9,096,554,121 | 79G | unshuffled_deduplicated_fa | 8203495 | 4,363,505,319 | 38G |
Piemontese | pms | unshuffled_original_pms | 3225 | 362,013 | 2.1M | unshuffled_deduplicated_pms | 2859 | 337,246 | 1.9M |
Polish | pl | unshuffled_original_pl | 35440972 | 15,277,255,137 | 109G | unshuffled_deduplicated_pl | 20682611 | 6,708,709,674 | 47G |
Portuguese | pt | unshuffled_original_pt | 42114520 | 20,641,903,898 | 124G | unshuffled_deduplicated_pt | 26920397 | 10,751,156,918 | 64G |
Pushto | ps | unshuffled_original_ps | 98216 | 46,559,441 | 361M | unshuffled_deduplicated_ps | 67921 | 31,347,348 | 242M |
Quechua | qu | unshuffled_original_qu | 452 | 10,186 | 78K | unshuffled_deduplicated_qu | 411 | 8,691 | 67K |
Romanian | ro | unshuffled_original_ro | 9387265 | 3,984,317,058 | 25G | unshuffled_deduplicated_ro | 5044757 | 1,741,794,069 | 11G |
Romansh | rm | unshuffled_original_rm | 41 | 1,093 | 7.4K | unshuffled_deduplicated_rm | 34 | 960 | 6.5K |
Russia Buriat | bxr | unshuffled_original_bxr | 42 | 963 | 13K | unshuffled_deduplicated_bxr | 36 | 809 | 11K |
Russian | ru | unshuffled_original_ru | 161836003 | 92,522,407,837 | 1.2T | unshuffled_deduplicated_ru | 115954598 | 46,692,691,520 | 568G |
Sanskrit | sa | unshuffled_original_sa | 14291 | 4,331,569 | 93M | unshuffled_deduplicated_sa | 7121 | 1,713,930 | 37M |
Scottish Gaelic | gd | unshuffled_original_gd | 5799 | 310,689 | 1.9M | unshuffled_deduplicated_gd | 3883 | 207,110 | 1.3M |
Serbian | sr | unshuffled_original_sr | 1013619 | 364,395,411 | 3.9G | unshuffled_deduplicated_sr | 645747 | 207,561,168 | 2.2G |
Serbo-Croatian | sh | unshuffled_original_sh | 36700 | 5,292,184 | 25M | unshuffled_deduplicated_sh | 17610 | 1,040,573 | 5.8M |
Sicilian | scn | unshuffled_original_scn | 21 | 554 | 3.3K | unshuffled_deduplicated_scn | 17 | 468 | 2.8K |
Sindhi | sd | unshuffled_original_sd | 44280 | 43,530,158 | 347M | unshuffled_deduplicated_sd | 33925 | 33,028,015 | 263M |
Sinhala | si | unshuffled_original_si | 203082 | 93,053,465 | 1.4G | unshuffled_deduplicated_si | 120684 | 50,864,857 | 802M |
Slovak | sk | unshuffled_original_sk | 5492194 | 1,322,247,763 | 9.1G | unshuffled_deduplicated_sk | 2820821 | 656,346,179 | 4.5G |
Slovenian | sl | unshuffled_original_sl | 1746604 | 387,399,700 | 2.5G | unshuffled_deduplicated_sl | 886223 | 193,926,684 | 1.3G |
Somali | so | unshuffled_original_so | 156 | 1,202 | 61K | unshuffled_deduplicated_so | 42 | 472 | 16K |
South Azerbaijani | azb | unshuffled_original_azb | 15446 | 2,175,054 | 27M | unshuffled_deduplicated_azb | 9985 | 1,528,709 | 19M |
Spanish | es | unshuffled_original_es | 88199221 | 47,545,122,279 | 278G | unshuffled_deduplicated_es | 56326016 | 25,928,290,729 | 149G |
Sundanese | su | unshuffled_original_su | 805 | 30,321 | 211K | unshuffled_deduplicated_su | 511 | 20,278 | 141K |
Swahili | sw | unshuffled_original_sw | 41986 | 2,211,927 | 13M | unshuffled_deduplicated_sw | 24803 | 1,376,963 | 8.1M |
Swedish | sv | unshuffled_original_sv | 17395625 | 7,155,994,312 | 44G | unshuffled_deduplicated_sv | 11014487 | 4,106,120,608 | 25G |
Tagalog | tl | unshuffled_original_tl | 458206 | 98,949,299 | 573M | unshuffled_deduplicated_tl | 294132 | 70,121,601 | 407M |
Tajik | tg | unshuffled_original_tg | 89002 | 31,758,142 | 379M | unshuffled_deduplicated_tg | 56259 | 21,029,893 | 249M |
Tamil | ta | unshuffled_original_ta | 1263280 | 420,537,132 | 9.3G | unshuffled_deduplicated_ta | 833101 | 226,013,330 | 5.1G |
Tatar | tt | unshuffled_original_tt | 135923 | 51,034,893 | 670M | unshuffled_deduplicated_tt | 82738 | 23,825,695 | 305M |
Telugu | te | unshuffled_original_te | 475703 | 123,711,517 | 2.5G | unshuffled_deduplicated_te | 312644 | 79,094,167 | 1.6G |
Thai | th | unshuffled_original_th | 6064129 | 951,743,087 | 36G | unshuffled_deduplicated_th | 3749826 | 368,965,202 | 16G |
Tibetan | bo | unshuffled_original_bo | 26795 | 1,483,589 | 187M | unshuffled_deduplicated_bo | 15762 | 936,556 | 138M |
Turkish | tr | unshuffled_original_tr | 18535253 | 7,577,388,700 | 60G | unshuffled_deduplicated_tr | 11596446 | 3,365,734,289 | 27G |
Turkmen | tk | unshuffled_original_tk | 6456 | 1,113,869 | 11M | unshuffled_deduplicated_tk | 4694 | 752,326 | 6.8M |
Tuvinian | tyv | unshuffled_original_tyv | 34 | 759 | 12K | unshuffled_deduplicated_tyv | 24 | 540 | 7.9K |
Uighur | ug | unshuffled_original_ug | 22255 | 8,657,141 | 122M | unshuffled_deduplicated_ug | 15503 | 5,852,225 | 83M |
Ukrainian | uk | unshuffled_original_uk | 12973467 | 4,204,381,276 | 53G | unshuffled_deduplicated_uk | 7782375 | 2,252,380,351 | 28G |
Upper Sorbian | hsb | unshuffled_original_hsb | 7959 | 545,351 | 4.2M | unshuffled_deduplicated_hsb | 3084 | 236,867 | 1.8M |
Urdu | ur | unshuffled_original_ur | 638596 | 331,817,982 | 2.7G | unshuffled_deduplicated_ur | 428674 | 218,030,228 | 1.7G |
Uzbek | uz | unshuffled_original_uz | 27537 | 2,450,256 | 21M | unshuffled_deduplicated_uz | 15074 | 1,381,644 | 12M |
Venetian | vec | unshuffled_original_vec | 73 | 3,492 | 18K | unshuffled_deduplicated_vec | 64 | 3,199 | 17K |
Vietnamese | vi | unshuffled_original_vi | 14898250 | 12,036,845,359 | 68G | unshuffled_deduplicated_vi | 9897709 | 5,577,159,843 | 32G |
Volapük | vo | unshuffled_original_vo | 3366 | 321,121 | 2.0M | unshuffled_deduplicated_vo | 3317 | 318,568 | 2.0M |
Walloon | wa | unshuffled_original_wa | 1001 | 50,720 | 273K | unshuffled_deduplicated_wa | 677 | 37,543 | 203K |
Waray | war | unshuffled_original_war | 9760 | 397,315 | 2.5M | unshuffled_deduplicated_war | 9161 | 336,311 | 2.2M |
Welsh | cy | unshuffled_original_cy | 157698 | 37,422,441 | 213M | unshuffled_deduplicated_cy | 98225 | 23,574,673 | 133M |
Western Frisian | fy | unshuffled_original_fy | 33053 | 5,691,077 | 35M | unshuffled_deduplicated_fy | 20661 | 4,223,816 | 26M |
Western Mari | mrj | unshuffled_original_mrj | 757 | 93,338 | 1.2M | unshuffled_deduplicated_mrj | 669 | 87,780 | 1.1M |
Western Panjabi | pnb | unshuffled_original_pnb | 4599 | 1,426,986 | 12M | unshuffled_deduplicated_pnb | 3463 | 1,111,112 | 9.0M |
Wu Chinese | wuu | unshuffled_original_wuu | 214 | 11,189 | 109K | unshuffled_deduplicated_wuu | 64 | 4,333 | 32K |
Yakut | sah | unshuffled_original_sah | 22301 | 2,547,623 | 42M | unshuffled_deduplicated_sah | 8555 | 1,789,174 | 26M |
Yiddish | yi | unshuffled_original_yi | 59364 | 13,834,320 | 141M | unshuffled_deduplicated_yi | 32919 | 8,212,970 | 84M |
Yoruba | yo | unshuffled_original_yo | 214 | 8,906 | 55K | unshuffled_deduplicated_yo | 49 | 3,518 | 27K |
Yue Chinese | yue | unshuffled_original_yue | 11 | 186 | 3.7K | unshuffled_deduplicated_yue | 7 | 128 | 2.2K |
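The configuration names in the table follow a regular pattern: `unshuffled_original_<code>` or `unshuffled_deduplicated_<code>`. As a minimal sketch, the helper below builds such a name and passes it to `load_dataset`. The function names `config_name` and `load_subcorpus` are hypothetical and not part of any library. Loading assumes the Hugging Face `datasets` package is installed, that the installed version can still load this dataset, and that network access is available.

```python
def config_name(lang_code, deduplicated=True):
    """Build a configuration name following the pattern used in the table."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang_code}"


def load_subcorpus(lang_code, deduplicated=True):
    """Load one sub-corpus (assumption: `datasets` is installed and online)."""
    # Imported lazily so that config_name() works without the dependency.
    from datasets import load_dataset

    return load_dataset("oscar", config_name(lang_code, deduplicated), split="train")


print(config_name("qu"))
print(config_name("yue", deduplicated=False))
```

The smallest sub-corpora, such as Quechua or Yue Chinese, are convenient for a first try because they are only a few kilobytes in size.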
OSCAR was constructed using a new pipeline, called goclassy, derived from the fastText pipeline. Goclassy reuses the fastText linear classifier and the pre-trained fastText model for language recognition, but it completely rewrites and parallelises the pipeline in an asynchronous manner.
The order of operations is more or less the same as in the fastText pre-processing pipeline, but instead of clustering multiple operations into a single blocking process, a worker is launched for each operation, with the number of parallel operations at any given time bounded by the number of available threads rather than the number of CPUs. Goclassy is implemented in the Go programming language, so it lets the Go runtime handle the scheduling of the processes. Thus, goclassy's pipeline does not have to wait for a whole WET file to download, decompress and classify before it starts downloading and processing the next one; a new file starts downloading and processing as soon as the scheduler is able to allocate a new process.
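The bounded-worker idea can be illustrated with a short sketch. Goclassy itself is written in Go and relies on the Go runtime; the Python analogue below only mirrors the scheduling pattern, using a thread pool in place of Go's scheduler. `process_file` and `run_pipeline` are hypothetical names, and the body of `process_file` is a placeholder for the real download, decompress and classify steps.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_file(name):
    """Placeholder for downloading, decompressing and classifying one WET file."""
    return name, len(name)


def run_pipeline(file_names, max_workers=None):
    """Process files concurrently with at most `max_workers` tasks in flight."""
    # os.cpu_count() reports logical CPUs; it is only a default for this sketch.
    max_workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # All tasks are queued at once; a queued task starts as soon as a
        # worker becomes free, so no file waits for a whole batch to finish.
        return list(pool.map(process_file, file_names))


results = run_pipeline(["a.wet.gz", "bb.wet.gz", "ccc.wet.gz"], max_workers=2)
print(results)
```

`pool.map` returns results in input order even though the files may finish in a different order.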
Filtering and cleaning at line level are done before each line is fed to the classifier. Lines shorter than 100 UTF-8 characters and lines containing invalid UTF-8 characters are discarded and are not classified. After all files are processed, the deduplicated versions are constructed, and everything is then split into shards and compressed.
Common Crawl is a non-profit foundation which produces and maintains an open repository of web crawled data that is both accessible and analysable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The repository contains raw web page HTML data (WARC files), metadata extracts (WAT files) and plain text extracts (WET files). The organisation's crawlers have always respected nofollow and robots.txt policies.
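The line-level filter described above can be sketched as follows. This is an illustration, not goclassy's actual code: it assumes that "100 UTF-8 characters" means 100 decoded code points rather than 100 bytes, and that lines arrive as raw bytes without a trailing newline. `keep_line` and `filter_lines` are hypothetical names.

```python
MIN_CHARS = 100  # threshold stated in the text


def keep_line(raw):
    """Return True if a raw line is valid UTF-8 and at least MIN_CHARS long."""
    try:
        text = raw.decode("utf-8")  # strict decoding raises on invalid bytes
    except UnicodeDecodeError:
        return False
    # Assumption: length is counted in code points, not bytes.
    return len(text) >= MIN_CHARS


def filter_lines(raw_lines):
    """Yield the decoded lines that pass the filter, in their original order."""
    for raw in raw_lines:
        if keep_line(raw):
            yield raw.decode("utf-8")


lines = [b"too short", ("x" * 120).encode("utf-8"), b"\xff" + b"y" * 150]
print(len(list(filter_lines(lines))))
```

Only lines that pass this filter would then be handed to the language classifier.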
Each monthly Common Crawl snapshot is in itself a massive multilingual corpus, where every single file contains data coming from multiple web pages written in a large variety of languages and covering all possible types of topics.
To construct OSCAR, the WET files of Common Crawl were used. These contain the extracted plain text from the websites, mostly converted to UTF-8, as well as headers containing the metadata of each crawled document. Each WET file comes compressed in gzip format and is stored on Amazon Web Services. In the case of OSCAR, the November 2018 snapshot was used. It surpasses 20TB of uncompressed data and contains more than 50 thousand plain text files, where each file consists of the plain text from multiple websites along with its metadata header.
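To make the file layout concrete, here is a minimal sketch of reading a gzip-compressed WET file record by record. It assumes the usual WARC-style layout, in which each record starts with a `WARC/1.0` version line, followed by `Key: value` headers, a blank line, and then the extracted text. `iter_wet_records` is a hypothetical helper; a production reader should use a proper WARC library and honour `Content-Length`, since this sketch would misread a body line that happens to start with `WARC/`.

```python
import gzip
import io


def iter_wet_records(stream):
    """Yield (headers, text) pairs from a gzip-compressed WET byte stream."""
    headers, body, in_headers = {}, [], False
    with gzip.open(stream, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.rstrip("\r\n")
            if line.startswith("WARC/"):
                # A new record begins: emit the previous one, if any.
                if headers:
                    yield headers, "\n".join(body).strip()
                headers, body, in_headers = {}, [], True
            elif in_headers:
                if line == "":
                    in_headers = False  # blank line ends the header block
                else:
                    key, _, value = line.partition(":")
                    headers[key.strip()] = value.strip()
            else:
                body.append(line)
    if headers:
        yield headers, "\n".join(body).strip()


# Tiny in-memory example standing in for a real WET file.
sample = (
    "WARC/1.0\r\nWARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.com/a\r\n\r\nfirst page text\r\n\r\n\r\n"
    "WARC/1.0\r\nWARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.com/b\r\n\r\nsecond page\r\nmore text\r\n\r\n\r\n"
)
for hdrs, text in iter_wet_records(io.BytesIO(gzip.compress(sample.encode("utf-8")))):
    print(hdrs["WARC-Target-URI"], repr(text))
```

Because one WET file mixes pages in many languages, each record's text still has to be split into lines and classified before it can be assigned to a language sub-corpus.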
Who are the source language producers?
The data comes from multiple web pages in a large variety of languages.
The dataset does not contain any additional annotations.
Annotation process
N/A
Who are the annotators?
N/A
Because OSCAR is constructed from Common Crawl, personal and sensitive information might be present. This must be considered before training deep learning models with OSCAR, especially in the case of text-generation models.
OSCAR is intended to bring more data to a wide variety of languages. The aim of the corpus is to make large amounts of data available to lower-resource languages in order to facilitate the pre-training of state-of-the-art language modeling architectures.
OSCAR is not properly filtered yet, and this can be reflected in the models trained with it. Care is advised, especially concerning biases of the resulting models.
The fastText linear classifier is limited both in performance and in the variety of languages it can recognize, so the quality of some OSCAR sub-corpora might be lower than expected, especially for the lowest-resource languages. Some audits have already been done by third parties.
The corpus was put together by Pedro J. Ortiz, Benoît Sagot, and Laurent Romary, during work done at Inria, particularly in the ALMAnaCH team.
These data are released under this licensing scheme:
We do not own any of the text from which these data have been extracted.
We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/
To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR.
This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
@inproceedings{ortiz-suarez-etal-2020-monolingual,
title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
author = "Ortiz Su{\'a}rez, Pedro Javier and
Romary, Laurent and
Sagot, Benoit",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.156",
pages = "1703--1714",
abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}
@inproceedings{OrtizSuarezSagotRomary2019,
author = {Pedro Javier {Ortiz Su{\'a}rez} and Benoit Sagot and Laurent Romary},
title = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
editor = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
address = {Mannheim},
doi = {10.14618/ids-pub-9021},
url = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
pages = {9 -- 16},
year = {2019},
abstract = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
language = {en}
}