Dataset:
Sakonii/nepalitext-language-model-dataset
The "NepaliText" language modeling dataset is a collection of over 13 million Nepali text sequences (phrases, sentences, and paragraphs) extracted by combining three sources: OSCAR, cc100, and a set of Nepali articles scraped from Wikipedia.
This dataset is intended for pre-training language models and word representations for the Nepali language.
The data is primarily in Nepali, but may contain instances of other languages as well.
An example:
{'text': 'घरेलु मैदानमा भएको च्याम्पियन्स लिगको दोस्रो लेगमा एथ्लेटिको मड्रिडले आर्सनललाई एक शून्यले हराउँदै समग्रमा दुई एकको अग्रताका साथ फाइनलमा प्रवेश गरेको हो ।\n'}
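Since records may include text in other languages, a simple script-based check can flag sequences that are not mostly Devanagari. The following is a minimal sketch, not part of the dataset's own tooling: the function names and the 0.8 threshold are assumptions chosen for illustration. It cannot tell Nepali apart from other Devanagari-script languages such as Hindi.

```python
import unicodedata


def devanagari_ratio(text: str) -> float:
    """Fraction of letter/mark characters that lie in the Devanagari block.

    Digits, punctuation (including the danda) and whitespace are ignored.
    """
    script_chars = [c for c in text if unicodedata.category(c)[0] in ("L", "M")]
    if not script_chars:
        return 0.0
    devanagari = sum(1 for c in script_chars if "\u0900" <= c <= "\u097F")
    return devanagari / len(script_chars)


def looks_devanagari(text: str, threshold: float = 0.8) -> bool:
    """Heuristic only: the threshold is an assumed value, not a tuned one."""
    return devanagari_ratio(text) >= threshold
```

A record such as the example above would pass this check, while a mixed string like 'नेपाल Nepal' would be flagged for review.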
The data fields are: text, a string containing one text sequence (as in the example above).
| train | test |
|---|---|
| 13141222 | 268189 |
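The split sizes in the table can be turned into proportions, and the records can be read without downloading the full corpus. This is a sketch under assumptions: it supposes the dataset loads through the Hugging Face `datasets` library under the repository id shown at the top of this card and that streaming mode works for it. `stream_texts` and `split_fractions` are hypothetical helper names.

```python
from itertools import islice

# Split sizes as reported in the table above.
SPLIT_SIZES = {"train": 13_141_222, "test": 268_189}


def split_fractions(sizes):
    """Return each split's share of the total number of sequences."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def stream_texts(split="train", limit=3):
    """Yield up to `limit` values of the `text` field, without the trailing newline.

    Requires the `datasets` package and network access (assumed setup).
    """
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    ds = load_dataset(
        "Sakonii/nepalitext-language-model-dataset",
        split=split,
        streaming=True,  # iterate lazily instead of downloading every record
    )
    for record in islice(ds, limit):
        yield record["text"].rstrip("\n")


# Example usage (needs network access):
# for text in stream_texts("test", limit=2):
#     print(text)
```

With the reported sizes, the test split works out to roughly 2% of all sequences.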
Who are the source language producers? [More Information Needed]
The dataset does not contain any additional annotations.
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
Because the text was extracted and scraped from a variety of internet sources, personal and sensitive information might be present. This must be considered before training deep learning models, especially text-generation models.
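One way to act on this before training is to mask obvious identifiers in each sequence. The sketch below is illustrative only and is not a procedure described by the dataset authors: the patterns, the placeholder tokens `<EMAIL>` and `<PHONE>`, and the function name `redact` are all assumptions. It will miss many kinds of personal information, such as names and addresses.

```python
import re

# Rough patterns, assumed for illustration; real PII detection needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# 9 to 16 digits, optionally prefixed by "+" and separated by spaces or hyphens.
# In Python 3, \d also matches Devanagari digits.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d -]{7,14}\d(?!\w)")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)
```

Short numbers such as years are left untouched, since the phone pattern requires at least nine digits.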
Thanks to @Sakonii for adding this dataset.