Dataset Card for "jomleh"

Dataset Summary

"Jomleh" is a high-quality Farsi language dataset consisting of sentences that have been carefully preprocessed to ensure they contain only Farsi characters, without any contamination from other languages. The data has been sourced from multiple sources and undergone a deduplication process to ensure that each sentence is unique. While the text in the dataset is not original, the focus on quality over quantity ensures that each sentence is useful and informative. Each sample in "Jomleh" is a sentence, making it a valuable resource for natural language processing tasks and language modeling.

This dataset is composed of 227M Farsi sentences, taking 13 GB in compressed files (39 GB decompressed).

Sample code to load this dataset

This is how you can use this dataset:

from datasets import load_dataset


dataset = load_dataset("mlengineer-ai/jomleh", split="train")

for example in dataset:
    print("id: ", example["id"])
    print("sentence: ", example["text"])
    print("source: ", example["source"])

The whole dataset is published as a single train split. If you need a test split (or any other), you can slice it any way you like, for example:

from datasets import load_dataset


# Use the first 95% of the samples for training and the last 5% for testing.
train_dataset = load_dataset("mlengineer-ai/jomleh", split="train[:95%]")
test_dataset = load_dataset("mlengineer-ai/jomleh", split="train[95%:]")

for example in train_dataset:
    print("id: ", example["id"])
    print("sentence: ", example["text"])
    print("source: ", example["source"])
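
Because every sample carries a source field, a per-source tally is easy to compute. The following is only a minimal sketch: the helper name count_by_source is hypothetical, and the demo rows are made-up stand-ins with empty text rather than real samples. It accepts any iterable of samples shaped like the ones printed above, so the loaded dataset could be passed in the same way.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_source(samples: Iterable[Dict]) -> Counter:
    """Count how many samples come from each source code."""
    return Counter(sample["source"] for sample in samples)


# Made-up in-memory rows that only mimic the layout of real samples.
demo_samples = [
    {"id": 1, "text": "", "source": "poems"},
    {"id": 2, "text": "", "source": "voa"},
    {"id": 3, "text": "", "source": "poems"},
]

source_counts = count_by_source(demo_samples)
print(source_counts.most_common())
```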

Source Data

The data used to curate Jomleh is taken from the following sources:

Number of samples contributed by each source

Source | Code | Number of samples
OSCAR | oscar_2109 | 72,646,870
OSCAR | oscar_2201 | 53,583,646
OSCAR | oscar_2301 | 72,157,974
CommonCrawl | cc | 22,596,629
Leipzig | web-2019_1M | 387,098
Leipzig | web-2019_10K | 3,597
Leipzig | web-2019_30K | 10,790
Leipzig | web-2019_100K | 35,833
Leipzig | web-2019_300K | 106,932
Leipzig | news_2019_10K | 3,542
Leipzig | news_2019_30K | 10,256
Leipzig | news_2019_100K | 31,967
Leipzig | news_2019_300K | 75,117
Leipzig | news_2020_10K | 2,609
Leipzig | news_2020_30K | 7,714
Leipzig | news_2020_100K | 24,815
Leipzig | news_2020_300K | 65,336
Leipzig | newscrawl_2011_1M | 419,538
Leipzig | newscrawl_2015_1M | 419,455
Leipzig | newscrawl_2015_10K | 3,569
Leipzig | newscrawl_2015_30K | 10,779
Leipzig | newscrawl_2015_100K | 35,481
Leipzig | newscrawl_2015_300K | 105,316
Leipzig | newscrawl_2016_1M | 332,953
Leipzig | newscrawl_2016_10K | 2,225
Leipzig | newscrawl_2016_30K | 6,396
Leipzig | newscrawl_2016_100K | 21,312
Leipzig | newscrawl_2016_300K | 61,081
Leipzig | newscrawl_2017_1M | 246,362
Leipzig | newscrawl_2017_10K | 1,368
Leipzig | newscrawl_2017_30K | 4,016
Leipzig | newscrawl_2017_100K | 13,334
Leipzig | newscrawl_2017_300K | 38,218
Leipzig | newscrawl_2019_1M | 298,688
Leipzig | newscrawl_2019_10K | 1,954
Leipzig | newscrawl_2019_30K | 5,641
Leipzig | newscrawl_2019_100K | 18,821
Leipzig | newscrawl_2019_300K | 53,830
Leipzig | wikipedia_2010_10K | 2,143
Leipzig | wikipedia_2010_30K | 6,262
Leipzig | wikipedia_2010_100K | 19,379
Leipzig | wikipedia_2010_300K | 46,844
Leipzig | wikipedia_2012_10K | 1,525
Leipzig | wikipedia_2012_30K | 4,517
Leipzig | wikipedia_2012_100K | 14,503
Leipzig | wikipedia_2012_300K | 38,298
Leipzig | wikipedia_2014_1M | 143,336
Leipzig | wikipedia_2014_10K | 597
Leipzig | wikipedia_2014_30K | 1,931
Leipzig | wikipedia_2014_100K | 6,031
Leipzig | wikipedia_2014_300K | 16,645
VOA Persian | voa | 116,671
Persian poems corpus | poems | 1,016,806
Web to Corpus | w2c | 1,629,616
TEP | tep | 488,558

Layout and Structure

The dataset is composed of 60 JSON-lines files. Samples are assigned to files uniformly at random, so the number of samples per file is not exact, but the files hold roughly equal numbers of samples (about 3.8 million each, given the 227,404,724 sentences in total).

Each line of a file is a sample formatted in JSON with the following layout:

{
  "id": <A sequential integer>,
  "text": "<A Farsi sentence>",
  "source": "<One of the codes mentioned in the table above>"
}
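
As an illustration of this layout, the sketch below parses one JSON line and checks that the three documented keys are present. The function name parse_sample and the id value 1 are assumptions made for the example; the sentence is the one used later in this card.

```python
import json
from typing import Dict

REQUIRED_KEYS = {"id", "text", "source"}


def parse_sample(line: str) -> Dict:
    """Parse one JSON line and check that it has the documented keys."""
    sample = json.loads(line)
    missing = REQUIRED_KEYS - sample.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return sample


# Illustrative line; the id value is made up.
demo_line = '{"id": 1, "text": "این درگه ما درگه نومیدی نیست.", "source": "poems"}'
demo_sample = parse_sample(demo_line)
print(demo_sample["id"], demo_sample["source"])
```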

Data curation process

1. Preprocessing

The value of this dataset lies in its preprocessing step. The main struggle in working with Farsi text is that, for historical reasons, many different encodings have been used to store it. On top of that comes the complexity of dealing with multiple character codes for the same letter. In Farsi, the shape of a character depends on its neighbouring characters. For example, consider the very last letter of the Farsi alphabet, "Yeh":

It has an isolated form:

ﯼ - Unicode: U+FBFC (&#64508;)

But when surrounded by other characters, its medial form is used:

ﯿ - Unicode: U+FBFF (&#64511;)

The correct way of typing the "Yeh" letter is to use its character code (Unicode U+06CC, a.k.a. &#1740;). At rendering time, the correct form is then selected based on the surrounding characters. This is usually taken care of by the "substitution table", which is a feature of fonts. However, some texts do not rely on fonts and instead directly use the Unicode code points designed for specific forms of the letters. From the reader's point of view both look identical, but printing the character codes yields different numbers. This complicates text processing in Farsi, since we need to identify each character by a unique code regardless of its position in the word. A further problem is that Arabic characters are sometimes used to type Farsi text. Since the two languages share very similar alphabets (visually speaking), one can successfully read a Farsi text even when it has been typed using Arabic characters.

To address these problems, the preprocessing used in Jomleh tries its best to map all the different characters that look alike to their Farsi counterpart. This is not an exact science; it is done on a best-effort basis. For instance, if a sentence is actually Arabic, the preprocessing script used here will make things worse. But assuming that all the source text is 100% Farsi, this script should help make it uniform.
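
To make the idea concrete, here is a minimal sketch of this kind of mapping. It is not the preprocessing script used for Jomleh. It assumes that Unicode NFKC normalization is an acceptable way to fold positional forms (it maps both U+FBFC and U+FBFF to U+06CC), and it uses a small hypothetical look-alike table covering only two Arabic letters.

```python
import unicodedata

# Hypothetical look-alike table: Arabic code points -> Farsi counterparts.
ARABIC_TO_FARSI = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_letters(text: str) -> str:
    """Fold presentation forms, then map Arabic look-alikes to Farsi."""
    # NFKC replaces positional forms such as U+FBFC and U+FBFF with U+06CC.
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(ARABIC_TO_FARSI)


for raw in ("\ufbfc", "\ufbff", "\u064a", "\u0643"):
    print(f"U+{ord(raw):04X} -> U+{ord(normalize_letters(raw)):04X}")
```

NFKC is a blunt tool: it also rewrites other compatibility characters, so a real pipeline would likely need a more selective table.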

The same cleaning process is also applied to digits and punctuation.
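
A sketch of what such a mapping could look like for digits and punctuation follows. The tables are assumptions for illustration: whether the actual script maps ASCII digits and ASCII punctuation this way, or drops them, is not stated here.

```python
# Assumed digit mapping: ASCII and Arabic-Indic digits -> Farsi digits.
DIGIT_TABLE = {}
for i in range(10):
    farsi_digit = chr(0x06F0 + i)          # EXTENDED ARABIC-INDIC DIGIT i
    DIGIT_TABLE[ord(str(i))] = farsi_digit  # ASCII digit i
    DIGIT_TABLE[0x0660 + i] = farsi_digit   # ARABIC-INDIC DIGIT i

# Assumed punctuation mapping: ASCII marks -> Farsi marks.
PUNCTUATION_TABLE = {
    ord("?"): "\u061f",  # ARABIC QUESTION MARK
    ord(","): "\u060c",  # ARABIC COMMA
    ord(";"): "\u061b",  # ARABIC SEMICOLON
}

CLEANING_TABLE = {**DIGIT_TABLE, **PUNCTUATION_TABLE}


def normalize_digits_and_punctuation(text: str) -> str:
    """Map look-alike digits and punctuation to their Farsi code points."""
    return text.translate(CLEANING_TABLE)


print(normalize_digits_and_punctuation("2023?"))
```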

At the end, any character that can be found in the Jomleh dataset is one of the following:

  • a Farsi alphabet letter ( ا to ی )
  • one of the: آ , أ , ؤ , ئ
  • a Farsi digit ( ۹ to ۰ )
  • a zero-width non-joiner ( \u200c )
  • a space
  • one of the Farsi punctuations ( . , ! , ؟ , ، , ؛ )

Any other character found in the text is eliminated based on best effort and if the elimination of such characters could harm the integrity of the sentence, then that sentence is removed from the dataset altogether.
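
The character inventory listed above can be turned into a simple check. The sketch below is an illustration rather than the author's code: the function name is hypothetical, and the alphabet is written out as the 32 standard Farsi letters using escape sequences to avoid look-alike code points.

```python
FARSI_LETTERS = (
    "\u0627\u0628\u067e\u062a\u062b\u062c\u0686\u062d"  # alef .. heh-jimi
    "\u062e\u062f\u0630\u0631\u0632\u0698\u0633\u0634"  # khe .. shin
    "\u0635\u0636\u0637\u0638\u0639\u063a\u0641\u0642"  # sad .. qaf
    "\u06a9\u06af\u0644\u0645\u0646\u0648\u0647\u06cc"  # kaf .. yeh
)
EXTRA_LETTERS = "\u0622\u0623\u0624\u0626"  # alef-madda and hamza carriers
FARSI_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
PUNCTUATION = ".!\u061f\u060c\u061b"  # period, exclamation, Farsi ? , ;
ZWNJ_AND_SPACE = "\u200c "

ALLOWED = set(
    FARSI_LETTERS + EXTRA_LETTERS + FARSI_DIGITS + PUNCTUATION + ZWNJ_AND_SPACE
)


def has_only_allowed_characters(sentence: str) -> bool:
    """Return True if every character belongs to the listed inventory."""
    return all(ch in ALLOWED for ch in sentence)


print(has_only_allowed_characters("\u0633\u0644\u0627\u0645"))  # a Farsi word
print(has_only_allowed_characters("hello!"))                    # Latin letters
```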

The script used for the preprocessing can be found here .

It's also worth mentioning that the preprocessing script converts the text into vertical format, which is expected by the third step (deduplication). Simply put, in vertical format each space is replaced with a line feed, so every word sits on its own line, and each sample is wrapped in a <doc> tag. Here is a sample converted into vertical format:

<doc id="poems_merged.txt">
این
درگه
ما
درگه
نومیدی
نیست.
</doc>

In this example, the id attribute of the <doc> tag points to the file where the sample is coming from.
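
The conversion itself is simple enough to sketch. The two helpers below are hypothetical (they are not taken from preprocess.py) and assume one sample per <doc> block with single spaces between words.

```python
def to_vertical(sentence: str, doc_id: str) -> str:
    """Put each word on its own line and wrap the sample in a <doc> tag."""
    words = sentence.split()
    return "\n".join([f'<doc id="{doc_id}">', *words, "</doc>"])


def from_vertical(block: str) -> str:
    """Rebuild the one-line sentence from a single <doc> block."""
    lines = block.strip().split("\n")
    return " ".join(lines[1:-1])  # drop the opening and closing tags


vertical = to_vertical("این درگه ما درگه نومیدی نیست.", "poems_merged.txt")
print(vertical)
print(from_vertical(vertical))
```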

This is the command that executes the preprocessing script:

find 1_prepared -name "*.txt" | parallel 'python ./preprocess.py $(basename {}) < {} > ./2_cleaned_vertical/$(basename {})'

2. Merging into one text file

Once the raw source data has been preprocessed, the resulting files are merged into a single large text file. This can be accomplished with a single command:

cat ./2_cleaned_vertical/* > ./3_temp/clean_merged.vert

3. Deduplication

Once all the text has been transformed into vertical format and saved into a single text file, the onion program is used to eliminate duplicate samples. The onion program is available from its project website, and it is used here like this:

onion -sm -n 5 -t 0.5 ./3_temp/clean_merged.vert > ./3_temp/deduplicated.vert
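
For intuition only, the sketch below shows a much simplified form of n-gram based duplicate detection. It is not onion's algorithm. It assumes that -n 5 is the n-gram length and -t 0.5 a duplicate-content threshold, and it drops a sample when more than that fraction of its word n-grams has already been seen. All names are hypothetical.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def word_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return the word n-grams; short samples yield one shorter gram."""
    if len(tokens) < n:
        return [tuple(tokens)]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deduplicate(
    samples: Iterable[List[str]], n: int = 5, threshold: float = 0.5
) -> Iterator[List[str]]:
    """Yield samples whose share of already-seen n-grams is <= threshold."""
    seen: Set[Tuple[str, ...]] = set()
    for tokens in samples:
        grams = word_ngrams(tokens, n)
        duplicated = sum(gram in seen for gram in grams)
        if duplicated / len(grams) <= threshold:
            yield tokens
        seen.update(grams)


demo = [
    "a b c d e f".split(),
    "a b c d e f".split(),  # exact repeat, expected to be dropped
    "x y z".split(),
]
kept = list(deduplicate(demo))
print(kept)
```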

4. Postprocessing

The postprocessing involves:

  • Converting back from vertical format into a single line per sample format.
  • Mapping the file names mentioned in the id attribute of the <doc> tag into one of the codes mentioned above.
  • Formatting each sample as a JSON line (one JSON object per line).
  • Distributing and saving the samples randomly across 60 files, aiming for roughly the same number of samples per file.

These steps are run using the following command:

python ./postprocess.py ./3_temp < ./3_temp/deduplicated.vert | parallel "echo '{}' | python ./add_id.py ./3_temp ./jomleh/files"
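
The formatting and distribution steps can be sketched as follows. This is an illustration under stated assumptions, not the contents of postprocess.py or add_id.py: the mapping from file name to source code is hypothetical, and the fixed seed is only there to make the demo repeatable.

```python
import json
import random

NUM_FILES = 60  # number of output files mentioned in this card

# Hypothetical mapping from the <doc> id (a file name) to a source code.
FILE_NAME_TO_CODE = {"poems_merged.txt": "poems"}


def to_json_line(sample_id: int, text: str, file_name: str) -> str:
    """Format one sample as a JSON line, keeping the Farsi text readable."""
    record = {
        "id": sample_id,
        "text": text,
        "source": FILE_NAME_TO_CODE[file_name],
    }
    return json.dumps(record, ensure_ascii=False)


def pick_file_index(rng: random.Random) -> int:
    """Choose one of the output files uniformly at random."""
    return rng.randrange(NUM_FILES)


rng = random.Random(0)  # fixed seed, only for a repeatable demo
json_line = to_json_line(1, "این درگه ما درگه نومیدی نیست.", "poems_merged.txt")
file_index = pick_file_index(rng)
print(file_index, json_line)
```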
    

5. Compressing the files

The generated JSON-lines files are compressed using Zstandard (zstd):

find ./jomleh/files/*.jsonl -type f | parallel 'zstd --rm {}'
    

6. Generating the checksum file

The checksum file plays a dual role. First, it keeps the checksum of each of the 60 files for future verification. Second, it serves as an index so that the loading script can list and load the files. This is how the checksum file is generated:

ls ./jomleh/files/*.zst | sort -t _ -k 2 -n | xargs sha256sum > ./jomleh/files/checksum.sha256
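
For the verification side, a minimal sketch in Python is shown below. It assumes the usual sha256sum line layout (hex digest, whitespace, file path) and that paths are resolved relative to a base directory. The helper names and the demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(checksum_file: Path, base_dir: Path) -> Dict[str, bool]:
    """Check every listed file against its recorded SHA-256 digest."""
    results = {}
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        results[name] = sha256_of(base_dir / name) == expected
    return results


# Self-contained demo on a throwaway file instead of the real archives.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "demo.bin").write_bytes(b"jomleh")
    digest_line = f"{sha256_of(base / 'demo.bin')}  demo.bin\n"
    (base / "checksum.sha256").write_text(digest_line)
    demo_result = verify(base / "checksum.sha256", base)

print(demo_result)
```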
    

Statistics

After applying all the steps mentioned above, the curated dataset has the following statistics:

Statistics on the collected sentences
Total number of sentences: 227,404,724
Average number of characters in a sentence: 101.16
Standard deviation of the number of characters in a sentence: 88.86
Average number of words in a sentence: 19.93
Standard deviation of the number of words in a sentence: 17.54
Average number of characters in a word: 4.12
Standard deviation of the number of characters in a word: 1.99
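
Statistics of this kind could be computed along the following lines. This is a sketch under assumptions that the card does not confirm: sentence length counts every character including spaces, words are split on whitespace, and the population standard deviation is used. It also keeps all counts in memory, which would not scale to the full dataset without a streaming variant.

```python
from statistics import mean, pstdev
from typing import Dict, List


def sentence_statistics(sentences: List[str]) -> Dict[str, float]:
    """Compute length statistics over sentences and their words."""
    chars_per_sentence = [len(s) for s in sentences]
    words_per_sentence = [len(s.split()) for s in sentences]
    chars_per_word = [len(w) for s in sentences for w in s.split()]
    return {
        "sentences": len(sentences),
        "mean_chars_per_sentence": mean(chars_per_sentence),
        "std_chars_per_sentence": pstdev(chars_per_sentence),
        "mean_words_per_sentence": mean(words_per_sentence),
        "std_words_per_sentence": pstdev(words_per_sentence),
        "mean_chars_per_word": mean(chars_per_word),
        "std_chars_per_word": pstdev(chars_per_word),
    }


# Toy input, not real data.
demo_stats = sentence_statistics(["ab cd", "abc"])
print(demo_stats)
```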