Dataset:

IlyaGusev/librusec

Language:

ru

Size:

100K<n<1M

Librusec dataset

Description

Summary: Based on http://panchenko.me/data/russe/librusec_fb2.plain.gz , uploaded here for convenience with additional cleaning applied.

Script: create_librusec.py

Point of Contact: Ilya Gusev

Languages: Russian.

Usage

Prerequisites:

pip install datasets zstandard jsonlines pysimdjson

Dataset iteration:

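from datasets import load_dataset

# streaming=True reads records lazily instead of downloading the whole dataset first
dataset = load_dataset("IlyaGusev/librusec", split="train", streaming=True)
for example in dataset:
    # each record exposes its text under the "text" key
    print(example["text"])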
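
Iterating the full stream prints every record, which is rarely what you want for a first look. The sketch below shows one way to preview only the first few records. The helper names take_texts and preview_librusec, and the n and max_chars parameters, are illustrative assumptions and not part of the dataset or the datasets library; only the "text" field and the load_dataset call come from the example above.

```python
from itertools import islice


def take_texts(records, n=3, max_chars=200):
    """Return the first `max_chars` characters of "text" for the first `n` records.

    `records` can be any iterable of dicts with a "text" key, including a
    streaming dataset. This helper is hypothetical, written for illustration.
    """
    return [record["text"][:max_chars] for record in islice(records, n)]


def preview_librusec(n=3, max_chars=200):
    """Stream the dataset and return short previews of the first `n` records."""
    # Imported lazily so take_texts stays usable without the datasets library.
    from datasets import load_dataset

    dataset = load_dataset("IlyaGusev/librusec", split="train", streaming=True)
    return take_texts(dataset, n=n, max_chars=max_chars)
```

Calling preview_librusec(n=2) should fetch only as much of the stream as is needed for two records, because islice stops pulling from the iterator once n items have been taken.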