Summary: Based on http://panchenko.me/data/russe/librusec_fb2.plain.gz and uploaded here for convenience, with additional cleaning applied.
Script: create_librusec.py
Point of Contact: Ilya Gusev
Languages: Russian.
Prerequisites:
pip install datasets zstandard jsonlines pysimdjson
Dataset iteration:
from datasets import load_dataset

dataset = load_dataset("IlyaGusev/librusec", split="train", streaming=True)
for example in dataset:
    print(example["text"])
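Previewing a few records:
The loop above walks the whole stream. To inspect only the first few records, one option is to cap the iteration with itertools.islice. The sketch below is a minimal illustration, not part of create_librusec.py; the helper name head_texts is hypothetical, and it assumes only that each example is a dict with a "text" field, as in the loop above.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def head_texts(examples: Iterable[Dict[str, str]], n: int) -> Iterator[str]:
    """Yield the "text" field of the first n examples, then stop.

    Works on any iterable of dicts, including a streaming dataset,
    because islice never reads past the n-th item.
    """
    for example in islice(examples, n):
        yield example["text"]


# Usage with the streaming dataset (needs network access, so left commented):
#
#   from datasets import load_dataset
#   dataset = load_dataset("IlyaGusev/librusec", split="train", streaming=True)
#   for text in head_texts(dataset, 3):
#       print(text[:200])  # show only the start of each text
```

Because the dataset is opened with streaming=True, stopping early this way should avoid downloading the rest of the corpus.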