Dataset:
mlengineer-ai/jomleh
"Jomleh" is a high-quality Farsi language dataset consisting of sentences that have been carefully preprocessed to ensure they contain only Farsi characters, without any contamination from other languages. The data has been sourced from multiple sources and undergone a deduplication process to ensure that each sentence is unique. While the text in the dataset is not original, the focus on quality over quantity ensures that each sentence is useful and informative. Each sample in "Jomleh" is a sentence, making it a valuable resource for natural language processing tasks and language modeling.
This dataset is composed of 227M Farsi sentences, taking 13 GB in compressed files (39 GB decompressed).
This is how you can use this dataset:

    from datasets import load_dataset

    dataset = load_dataset("mlengineer-ai/jomleh", split="train")
    for example in dataset:
        print("id: ", example["id"])
        print("sentence: ", example["text"])
        print("source: ", example["source"])

Since the whole dataset is a single train split, you can carve out a test (or any other) split by slicing it however you like, for example:

    from datasets import load_dataset

    dataset = load_dataset("mlengineer-ai/jomleh", split="train[:95%]")
    for example in dataset:
        print("id: ", example["id"])
        print("sentence: ", example["text"])
        print("source: ", example["source"])

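The example above loads only the first 95% of the samples. A minimal sketch of getting the complementary 5% as a test split is shown below. The helper names `split_specs` and `load_train_and_test` are made up for this illustration; the percent slicing syntax is the one provided by the `datasets` library.

```python
def split_specs(train_percent: int) -> tuple[str, str]:
    """Return two complementary split strings over the single "train" split."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 1 and 99")
    return f"train[:{train_percent}%]", f"train[{train_percent}%:]"


def load_train_and_test(train_percent: int = 95):
    # Imported here so that split_specs works without the library installed.
    from datasets import load_dataset

    train_spec, test_spec = split_specs(train_percent)
    train = load_dataset("mlengineer-ai/jomleh", split=train_spec)
    test = load_dataset("mlengineer-ai/jomleh", split=test_spec)
    return train, test


print(split_specs(95))
```

Because both slices are taken from the same ordered split, they do not overlap.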
The data used to curate Jomleh is taken from the following sources:
Source | Code | Number of samples |
---|---|---|
OSCAR | oscar_2109 | 72,646,870 |
OSCAR | oscar_2201 | 53,583,646 |
OSCAR | oscar_2301 | 72,157,974 |
CommonCrawl | cc | 22,596,629 |
Leipzig | web-2019_1M | 387,098 |
Leipzig | web-2019_10K | 3,597 |
Leipzig | web-2019_30K | 10,790 |
Leipzig | web-2019_100K | 35,833 |
Leipzig | web-2019_300K | 106,932 |
Leipzig | news_2019_10K | 3,542 |
Leipzig | news_2019_30K | 10,256 |
Leipzig | news_2019_100K | 31,967 |
Leipzig | news_2019_300K | 75,117 |
Leipzig | news_2020_10K | 2,609 |
Leipzig | news_2020_30K | 7,714 |
Leipzig | news_2020_100K | 24,815 |
Leipzig | news_2020_300K | 65,336 |
Leipzig | newscrawl_2011_1M | 419,538 |
Leipzig | newscrawl_2015_1M | 419,455 |
Leipzig | newscrawl_2015_10K | 3,569 |
Leipzig | newscrawl_2015_30K | 10,779 |
Leipzig | newscrawl_2015_100K | 35,481 |
Leipzig | newscrawl_2015_300K | 105,316 |
Leipzig | newscrawl_2016_1M | 332,953 |
Leipzig | newscrawl_2016_10K | 2,225 |
Leipzig | newscrawl_2016_30K | 6,396 |
Leipzig | newscrawl_2016_100K | 21,312 |
Leipzig | newscrawl_2016_300K | 61,081 |
Leipzig | newscrawl_2017_1M | 246,362 |
Leipzig | newscrawl_2017_10K | 1,368 |
Leipzig | newscrawl_2017_30K | 4,016 |
Leipzig | newscrawl_2017_100K | 13,334 |
Leipzig | newscrawl_2017_300K | 38,218 |
Leipzig | newscrawl_2019_1M | 298,688 |
Leipzig | newscrawl_2019_10K | 1,954 |
Leipzig | newscrawl_2019_30K | 5,641 |
Leipzig | newscrawl_2019_100K | 18,821 |
Leipzig | newscrawl_2019_300K | 53,830 |
Leipzig | wikipedia_2010_10K | 2,143 |
Leipzig | wikipedia_2010_30K | 6,262 |
Leipzig | wikipedia_2010_100K | 19,379 |
Leipzig | wikipedia_2010_300K | 46,844 |
Leipzig | wikipedia_2012_10K | 1,525 |
Leipzig | wikipedia_2012_30K | 4,517 |
Leipzig | wikipedia_2012_100K | 14,503 |
Leipzig | wikipedia_2012_300K | 38,298 |
Leipzig | wikipedia_2014_1M | 143,336 |
Leipzig | wikipedia_2014_10K | 597 |
Leipzig | wikipedia_2014_30K | 1,931 |
Leipzig | wikipedia_2014_100K | 6,031 |
Leipzig | wikipedia_2014_300K | 16,645 |
VOA Persian | voa | 116,671 |
Persian poems corpus | poems | 1,016,806 |
Web to Corpus | w2c | 1,629,616 |
TEP | tep | 488,558 |
The dataset is composed of 60 JSON-line files. Because the samples are spread across these files randomly (using a uniform distribution), the number of samples per file is not exact, but the files hold roughly equal numbers of samples (about 3.8 million samples per file, given 227,404,724 sentences in total).
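To see why a uniform random assignment gives file sizes that are close but not identical, here is a small simulation. It is only an illustration, not the script used to build Jomleh; the sample count and the seed are arbitrary.

```python
import random
from collections import Counter

NUM_FILES = 60  # number of JSON-line files stated above


def assign_files(num_samples: int, seed: int = 0) -> Counter:
    """Assign each sample to one of NUM_FILES files uniformly at random."""
    rng = random.Random(seed)
    return Counter(rng.randrange(NUM_FILES) for _ in range(num_samples))


counts = assign_files(600_000)
print("smallest file:", min(counts.values()))
print("largest file: ", max(counts.values()))
```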
Each line of a file is a sample formatted in JSON with the following layout:

    {
        "id": <A sequential integer>,
        "text": "<A Farsi sentence>",
        "source": "<One of codes mentioned in the table above>"
    }

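As a quick illustration of this layout, the sketch below parses one JSON line and checks that the three documented fields are present. The record is made up for the example (its `id` value is not taken from the dataset), and `parse_sample` is a hypothetical helper name.

```python
import json

REQUIRED_KEYS = {"id", "text", "source"}


def parse_sample(line: str) -> dict:
    """Parse one JSON-line record and check it has the documented fields."""
    sample = json.loads(line)
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return sample


# Made-up record for illustration only.
line = '{"id": 1, "text": "این درگه ما درگه نومیدی نیست.", "source": "poems"}'
sample = parse_sample(line)
print(sample["id"], sample["source"], sample["text"])
```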
The value of this dataset lies in its preprocessing step. The main difficulty in working with Farsi text is that, for historical reasons, many different encodings have been used to store it. On top of that comes the complexity of dealing with multiple character codes for the same letter. In Farsi, the shape of a character depends on its neighbouring characters. For example, consider the very last letter of the Farsi alphabet, "Yeh":
It has an isolated form:
ﯼ - Unicode: U+FBFC
But when surrounded by other characters, its medial form is used:
ﯿ - Unicode: U+FBFF
The correct way to type the "Yeh" letter is to use its base character code (Unicode U+06CC, i.e. ی). To render it, the correct form is then selected based on its surroundings. This is usually taken care of by the "substitution table", which is a feature of the font. However, some texts do not rely on the font and instead directly use the Unicode code points assigned to specific forms of a letter. To the reader both look identical, but printing the character codes yields different numbers. This complicates text processing in Farsi, since we need to identify each letter by a unique code regardless of its position in the word. A further problem is that Arabic characters are sometimes used to type Farsi text. Since the two languages share visually very similar alphabets, a Farsi text typed with Arabic characters can still be read without difficulty.
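The difference between the form-specific codes and the base letter can be seen with Python's standard `unicodedata` module. The sketch below uses NFKC normalization, which folds the Unicode presentation forms back to the base letter. This is one standard way to do it and is shown only as an illustration; it is not necessarily what the Jomleh preprocessing script does.

```python
import unicodedata

ISOLATED_YEH = "\uFBFC"  # ARABIC LETTER FARSI YEH ISOLATED FORM
MEDIAL_YEH = "\uFBFF"    # ARABIC LETTER FARSI YEH MEDIAL FORM
FARSI_YEH = "\u06CC"     # ARABIC LETTER FARSI YEH (the base letter)

for ch in (ISOLATED_YEH, MEDIAL_YEH):
    # NFKC applies compatibility decompositions, which map presentation
    # forms to the base letter they were derived from.
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(folded):04X}")
```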
To address these problems, the preprocessing used in Jomleh tries its best to map all the different characters that look alike to their Farsi counterpart. This is not an exact science but a best effort. For instance, if a sentence is actually an Arabic sentence, the preprocessing script used here will make things worse. But assuming that all the source text is 100% Farsi, this script should help make it uniform.
The same cleaning process is also applied to digits and punctuation.
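A look-alike mapping of this kind can be sketched with `str.translate`. The table below is hypothetical and deliberately tiny: it covers two Arabic letters and the Arabic-Indic digits, and it assumes the target digits are the Persian (Extended Arabic-Indic) ones. The actual preprocessing script may use a different and much larger mapping.

```python
# Hypothetical, deliberately small mapping for illustration only.
LETTERS = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})
# Arabic-Indic digits U+0660..U+0669 -> Extended Arabic-Indic digits U+06F0..U+06F9
DIGITS = {0x0660 + i: 0x06F0 + i for i in range(10)}
TABLE = {**LETTERS, **DIGITS}


def to_farsi(text: str) -> str:
    """Replace the mapped Arabic code points with their Farsi counterparts."""
    return text.translate(TABLE)


arabic_typed = "\u0643\u064A\u0661"  # Kaf, Yeh, digit one, typed with Arabic codes
print([f"U+{ord(c):04X}" for c in to_farsi(arabic_typed)])
```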
At the end, any character that can be found in the Jomleh dataset is one of the following:
Any other character found in the text is eliminated based on best effort and if the elimination of such characters could harm the integrity of the sentence, then that sentence is removed from the dataset altogether.
The script used for the preprocessing can be found here .
It's also worth mentioning that the preprocessing script converts the text into vertical format, which is expected by the third step (deduplication). Simply put, in vertical format each space is replaced with a line feed, and each sample is surrounded by a <doc> tag. Here's an example of a sample converted into vertical format:

    <doc id="poems_merged.txt">
    این
    درگه
    ما
    درگه
    نومیدی
    نیست.
    </doc>

In this example, the id attribute of the <doc> tag points to the file where the sample is coming from.
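The conversion described above can be sketched in a few lines. This is a simplified illustration based on the description (split on whitespace, one token per line, wrap in a `<doc>` tag); `to_vertical` is a hypothetical helper and not the function used in the preprocessing script.

```python
def to_vertical(sentence: str, doc_id: str) -> str:
    """Put each whitespace-separated token on its own line inside a <doc> tag."""
    tokens = sentence.split()
    return "\n".join([f'<doc id="{doc_id}">', *tokens, "</doc>"])


vertical = to_vertical("این درگه ما درگه نومیدی نیست.", "poems_merged.txt")
print(vertical)
```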
This is the command that executes the preprocessing script:
find 1_prepared -name "*.txt" | parallel 'python ./preprocess.py $(basename {}) < {} > ./2_cleaned_vertical/$(basename {})'
Once the raw source data has been preprocessed, the resulting files are merged into a single large text file. This can easily be accomplished using a single command:
cat ./2_cleaned_vertical/* > ./3_temp/clean_merged.vert
Once all the text is transformed into vertical format and saved into a single text file, the onion program is used to eliminate any duplicate samples. The onion program is available from this website, and it is used here like this:
onion -sm -n 5 -t 0.5 ./3_temp/clean_merged.vert > ./3_temp/deduplicated.vert
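To give an intuition for n-gram based deduplication, here is a simplified sketch. It assumes that `-n 5` sets the n-gram length and `-t 0.5` sets the duplicate-content threshold, and it drops a document when more than that share of its n-grams has already been seen. This is not onion's actual implementation, only an illustration of the idea.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(docs, n=5, threshold=0.5):
    """Keep a document unless more than `threshold` of its n-grams were seen before."""
    seen = set()
    kept = []
    for tokens in docs:
        grams = ngrams(tokens, n)
        # Documents shorter than n tokens have no n-grams and are always kept.
        if grams and len(grams & seen) / len(grams) > threshold:
            continue  # mostly made of already-seen n-grams: treat as a duplicate
        seen |= grams
        kept.append(tokens)
    return kept


docs = [
    "a b c d e f g".split(),
    "a b c d e f g".split(),  # exact repeat: all of its 5-grams were seen
    "a b c d e x y".split(),  # shares only one of its three 5-grams
]
print(len(deduplicate(docs)))
```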
The postprocessing involves:
These steps are run using the following command:
python ./postprocess.py ./3_temp < ./3_temp/deduplicated.vert | parallel "echo '{}' | python ./add_id.py ./3_temp ./jomleh/files"
The generated JSON-line files are compressed using Zstandard (zstd), a real-time data compression algorithm:
find ./jomleh/files/*.jsonl -type f | parallel 'zstd --rm {}'
The checksum file plays a dual role. First, it stores the checksum of each of the 60 files for future verification. Second, it serves as an index so that the script can list and load the files. This is how the checksum file is generated:
ls ./jomleh/files/*.zst | sort -t _ -k 2 -n | xargs sha256sum > ./jomleh/files/checksum.sha256
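The verification side can be sketched with the standard library. The code below reads a file in the `sha256sum` output format (hash, whitespace, path) and recomputes each hash. It assumes that the listed files sit in the same folder as the checksum file, so only the base name of each listed path is used; `verify` and `sha256_of` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(checksum_file: Path) -> list[str]:
    """Return the listed paths whose SHA-256 does not match the recorded one."""
    mismatched = []
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        # Assumption: files live next to the checksum file, so use the base name.
        if sha256_of(checksum_file.parent / Path(name).name) != expected:
            mismatched.append(name)
    return mismatched
```

An empty list returned by `verify` means every listed file matched its recorded checksum.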
After applying all the steps mentioned above, the curated dataset has the following statistics:
Statistics on the collected sentences | |
---|---|
Total number of sentences: | 227,404,724 |
Average number of characters in a sentence: | 101.16 |
Standard deviation of the number of characters in a sentence: | 88.86 |
Average number of words in a sentence: | 19.93 |
Standard deviation of the number of words in a sentence: | 17.54 |
Average number of characters in a word: | 4.12 |
Standard deviation of the number of characters in a word: | 1.99 |
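Statistics of this kind can be computed along the following lines. This is a sketch under two assumptions that are not stated in the card: words are obtained by splitting on whitespace, and the standard deviation is the population one. The two input sentences are only sample inputs for the illustration.

```python
import statistics


def mean_and_std(values):
    # Population standard deviation; this choice is an assumption.
    return statistics.mean(values), statistics.pstdev(values)


def sentence_stats(sentences):
    """Compute mean and standard deviation of sentence and word lengths."""
    words = [w for s in sentences for w in s.split()]
    return {
        "chars_per_sentence": mean_and_std([len(s) for s in sentences]),
        "words_per_sentence": mean_and_std([len(s.split()) for s in sentences]),
        "chars_per_word": mean_and_std([len(w) for w in words]),
    }


stats = sentence_stats(["این درگه ما درگه نومیدی نیست.", "سلام دنیا"])
for name, (mean, std) in stats.items():
    print(f"{name}: mean={mean:.2f} std={std:.2f}")
```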