Summary: A subset of Taiga, uploaded here for convenience. Additional cleaning was performed.
Script: create_stihi.py
Point of Contact: Ilya Gusev
Languages: Russian.
Prerequisites:
pip install datasets zstandard jsonlines pysimdjson
Dataset iteration:
from datasets import load_dataset
dataset = load_dataset('IlyaGusev/stihi_ru', split="train", streaming=True)
for example in dataset:
    print(example["text"])
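To preview only a few poems rather than iterating over the whole stream, a minimal sketch along the following lines may help. The helper names `take_texts` and `preview_stihi` are hypothetical and not part of the dataset or its script; the only field assumed is `"text"`, as used in the iteration example.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def take_texts(examples: Iterable[Dict[str, Any]], limit: int) -> List[str]:
    """Return the "text" field of the first `limit` examples (hypothetical helper)."""
    # islice stops after `limit` items, so a streaming dataset is not read to the end.
    return [example["text"] for example in islice(examples, limit)]


def preview_stihi(limit: int = 3) -> List[str]:
    """Fetch the first few texts from the streamed dataset (hypothetical helper)."""
    # Imported here so that take_texts can be used without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("IlyaGusev/stihi_ru", split="train", streaming=True)
    return take_texts(dataset, limit)
```

Calling `preview_stihi(3)` requires network access and the packages listed under Prerequisites; `take_texts` works on any iterable of dictionaries with a `"text"` key.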
The dataset is not anonymized, so it may contain individuals' names. Information about the original authors is included where possible.