Summary: Dataset of posts and comments from habr.com, a Russian collaborative blog about IT, computer science and anything related to the Internet.
Script: create_habr.py
Point of Contact: Ilya Gusev
Languages: Russian, English, and some programming code.
Prerequisites:

```bash
pip install datasets zstandard jsonlines pysimdjson
```
Dataset iteration:

```python
from datasets import load_dataset

dataset = load_dataset('IlyaGusev/habr', split="train", streaming=True)
for example in dataset:
    print(example["text_markdown"])
```
{ "id": 12730, "language": "ru", "url": "https://habr.com/ru/post/12730/", "text_markdown": "...", "text_html": "...", "lead_markdown": "...", "lead_html": "...", "type": "article", "labels": [], "original_author": null, "original_url": null, "time_published": 1185962380, "author": "...", "title": "Хочешь в университет — сделай презентацию", "statistics": { "commentsCount": 23, "favoritesCount": 1, "readingCount": 1542, "score": 7, "votesCount": 15, "votesCountPlus": 11, "votesCountMinus": 4 }, "hubs": [ "itcompanies" ], "flows": [ "popsci" ], "tags": [ "PowerPoint", "презентация", "абитуриенты", ], "reading_time": 1, "format": null, "complexity": null, "comments": { "id": [11653537, 11653541], "parent_id": [null, 11653537], "level": [0, 1], "time_published": [1185963192, 1185967886], "score": [-1, 0], "votes": [1, 0], "message_html": ["...", "..."], "author": ["...", "..."], "children": [[11653541], []] } }
You can use this little helper to unflatten sequences:

```python
def revert_flattening(records):
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records
```
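For instance, assuming `example` is a record from the iteration above, the column-oriented `comments` field can be restored to a list of per-comment dictionaries:

```python
# Minimal usage sketch: unflatten the "comments" field of a single record.
comments = revert_flattening(example["comments"])
for comment in comments:
    print(comment["id"], comment["parent_id"], comment["author"])
```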
The original JSONL is already unflattened.
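A minimal sketch for reading the raw dump directly, assuming it is a zstd-compressed JSONL file (as the zstandard and jsonlines prerequisites suggest); the file name here is hypothetical:

```python
import io

import jsonlines
import zstandard

# Hypothetical file name for the raw dump; adjust to the actual file.
path = "habr.jsonl.zst"

with open(path, "rb") as fh:
    stream = zstandard.ZstdDecompressor().stream_reader(fh)
    text = io.TextIOWrapper(stream, encoding="utf-8")
    with jsonlines.Reader(text) as reader:
        for record in reader:
            # Comments are already a list of dicts in the original JSONL.
            print(record["title"], len(record["comments"]))
            break
```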
The dataset is not anonymized, so individuals' names can be found in it. Information about the original authors is included where possible.