Summary: Dataset of posts and comments from pikabu.ru, a Russian website similar to Reddit and 9gag.
Script: convert_pikabu.py
Point of Contact: Ilya Gusev
Languages: Mostly Russian.
Prerequisites:
```
pip install datasets zstandard jsonlines pysimdjson
```
Dataset iteration:
```python
from datasets import load_dataset

dataset = load_dataset("IlyaGusev/pikabu", split="train", streaming=True)
for example in dataset:
    print(example["text_markdown"])
```
A single example looks like this:

```json
{
  "id": 69911642,
  "title": "Что можно купить в Китае за цену нового iPhone 11 Pro",
  "text_markdown": "...",
  "timestamp": 1571221527,
  "author_id": 2900955,
  "username": "chinatoday.ru",
  "rating": -4,
  "pluses": 9,
  "minuses": 13,
  "url": "...",
  "tags": ["Китай", "AliExpress", "Бизнес"],
  "blocks": {"data": ["...", "..."], "type": ["text", "text"]},
  "comments": {
    "id": [152116588, 152116426],
    "text_markdown": ["...", "..."],
    "text_html": ["...", "..."],
    "images": [[], []],
    "rating": [2, 0],
    "pluses": [2, 0],
    "minuses": [0, 0],
    "author_id": [2104711, 2900955],
    "username": ["FlyZombieFly", "chinatoday.ru"]
  }
}
```
Nested sequences such as `comments` are stored in a flattened, column-oriented form (one list per field). You can use this little helper to unflatten them back into a list of records:
```python
def revert_flattening(records):
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records
```
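For instance, applying the helper to a trimmed-down version of the `comments` dict from the example record above turns the per-field lists back into one dict per comment (a minimal sketch; the helper is repeated here so the snippet runs standalone):

```python
def revert_flattening(records):
    # Same helper as above, repeated so this snippet is self-contained.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records

# A subset of the flattened "comments" field from the example record.
comments = {
    "id": [152116588, 152116426],
    "rating": [2, 0],
    "username": ["FlyZombieFly", "chinatoday.ru"],
}

for comment in revert_flattening(comments):
    print(comment["id"], comment["username"])
```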
The dataset is not anonymized: information about the original authors (such as usernames) is included where possible, so individuals' names can be found in the data.