Summary: Dataset of posts and comments from pikabu.ru, a Russian website similar to Reddit and 9gag.
Script: convert_pikabu.py
Point of Contact: Ilya Gusev
Languages: Mostly Russian.
Prerequisites:
```
pip install datasets zstandard jsonlines pysimdjson
```
Dataset iteration:
```python
from datasets import load_dataset

dataset = load_dataset("IlyaGusev/pikabu", split="train", streaming=True)
for example in dataset:
    print(example["text_markdown"])
```
A single example looks like this:

```json
{
  "id": 69911642,
  "title": "Что можно купить в Китае за цену нового iPhone 11 Pro",
  "text_markdown": "...",
  "timestamp": 1571221527,
  "author_id": 2900955,
  "username": "chinatoday.ru",
  "rating": -4,
  "pluses": 9,
  "minuses": 13,
  "url": "...",
  "tags": ["Китай", "AliExpress", "Бизнес"],
  "blocks": {"data": ["...", "..."], "type": ["text", "text"]},
  "comments": {
    "id": [152116588, 152116426],
    "text_markdown": ["...", "..."],
    "text_html": ["...", "..."],
    "images": [[], []],
    "rating": [2, 0],
    "pluses": [2, 0],
    "minuses": [0, 0],
    "author_id": [2104711, 2900955],
    "username": ["FlyZombieFly", "chinatoday.ru"]
  }
}
```
Nested sequences such as `comments` are stored in a flattened, column-oriented form (one list per field). You can use this little helper to unflatten them back into a list of records:
```python
def revert_flattening(records):
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records
```
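For instance, applying the helper to a trimmed-down version of the `comments` dict from the example record above turns the per-field lists back into one dict per comment (a minimal sketch; the helper is repeated here so the snippet runs standalone):

```python
def revert_flattening(records):
    # Same helper as above, repeated so this snippet is self-contained.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records

# A subset of the flattened "comments" field from the example record.
comments = {
    "id": [152116588, 152116426],
    "rating": [2, 0],
    "username": ["FlyZombieFly", "chinatoday.ru"],
}

for comment in revert_flattening(comments):
    print(comment["id"], comment["username"])
```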
The dataset is not anonymized: information about the original authors (such as usernames) is included where possible, so individuals' names can be found in the data.