Dataset: IlyaGusev/habr

Habr dataset

Description

Summary: Dataset of posts and comments from habr.com, a Russian collaborative blog about IT, computer science, and anything related to the Internet.

Script: create_habr.py

Point of Contact: Ilya Gusev

Languages: Russian, English, some programming code.

Usage

Prerequisites:

pip install datasets zstandard jsonlines pysimdjson

Dataset iteration:

from datasets import load_dataset
dataset = load_dataset('IlyaGusev/habr', split="train", streaming=True)
for example in dataset:
    print(example["text_markdown"])
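Because the dataset is streamed, it is often convenient to stop after a handful of matching posts instead of iterating over everything. The following is a minimal sketch of that idea, not part of the dataset tooling: take_by_language is a hypothetical helper name, and it assumes the "language" field holds short codes such as "ru", as in the example instance in this card.

```python
from itertools import islice


def take_by_language(examples, language, limit):
    """Return up to `limit` examples whose "language" field equals `language`.

    `examples` can be any iterable of dicts, including a streaming dataset.
    Hypothetical helper: the name and signature are illustrative only.
    """
    matching = (ex for ex in examples if ex.get("language") == language)
    # islice stops pulling from the stream once `limit` matches are collected.
    return list(islice(matching, limit))


# Usage with the streaming dataset from the snippet above (needs network access):
# from datasets import load_dataset
# dataset = load_dataset("IlyaGusev/habr", split="train", streaming=True)
# for example in take_by_language(dataset, "ru", 3):
#     print(example["title"])
```

The helper is lazy, so only as many records are downloaded as are needed to find the requested number of matches.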

Data Instances

{
  "id": 12730,
  "language": "ru",
  "url": "https://habr.com/ru/post/12730/",
  "text_markdown": "...",
  "text_html": "...",
  "lead_markdown": "...",
  "lead_html": "...",
  "type": "article",
  "labels": [],
  "original_author": null,
  "original_url": null,
  "time_published": 1185962380,
  "author": "...",
  "title": "Хочешь в университет — сделай презентацию",
  "statistics": {
    "commentsCount": 23,
    "favoritesCount": 1,
    "readingCount": 1542,
    "score": 7,
    "votesCount": 15,
    "votesCountPlus": 11,
    "votesCountMinus": 4
  },
  "hubs": [
    "itcompanies"
  ],
  "flows": [
    "popsci"
  ],
  "tags": [
    "PowerPoint",
    "презентация",
    "абитуриенты"
  ],
  "reading_time": 1,
  "format": null,
  "complexity": null,
  "comments": {
    "id": [11653537, 11653541],
    "parent_id": [null, 11653537],
    "level": [0, 1],
    "time_published": [1185963192, 1185967886],
    "score": [-1, 0],
    "votes": [1, 0],
    "message_html": ["...", "..."],
    "author": ["...", "..."],
    "children": [[11653541], []]
  }
}
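A short sketch of how some of the numeric fields in the instance above might be interpreted. It assumes that "time_published" is a Unix timestamp in seconds and that "votesCount" is the total number of votes cast; both are inferred from the sample values rather than stated by the dataset card. The function names published_at and upvote_ratio are hypothetical.

```python
from datetime import datetime, timezone


def published_at(example):
    """Convert "time_published" to a timezone-aware datetime.

    Assumption: the value is a Unix timestamp in seconds.
    """
    return datetime.fromtimestamp(example["time_published"], tz=timezone.utc)


def upvote_ratio(statistics):
    """Share of positive votes, or None when nobody voted.

    Assumption: "votesCount" is the total number of votes.
    """
    total = statistics["votesCount"]
    if total == 0:
        return None
    return statistics["votesCountPlus"] / total


# Values copied from the example instance above.
example = {
    "time_published": 1185962380,
    "statistics": {"votesCount": 15, "votesCountPlus": 11, "votesCountMinus": 4},
}
print(published_at(example).isoformat())              # 2007-08-01T09:59:40+00:00
print(round(upvote_ratio(example["statistics"]), 2))  # 0.73
```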

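Nested sequences such as "comments" are stored flattened, as a dictionary of parallel lists with one list per key. You can use this small helper to turn them back into a list of records: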

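def revert_flattening(records):
    """Turn a dict of parallel lists into a list of per-item dicts."""
    fixed_records = []
    for key, values in records.items():
        # Allocate one empty dict per item when the first key is seen.
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        # Copy the i-th value of this key into the i-th record.
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records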

The original JSONL is already unflattened.
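As an illustration, here is the helper applied to a trimmed copy of the "comments" field from the example instance, followed by a walk over the reply tree. This is only a sketch: walk_thread is a hypothetical function, the helper is repeated so the block runs on its own, and it assumes that "children" lists the ids of direct replies and that top-level comments have a "parent_id" of null.

```python
def revert_flattening(records):
    # Same helper as above, repeated so this sketch is self-contained.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


def walk_thread(by_id, comment_id, depth=0):
    """Yield (depth, comment) pairs for a comment and all of its replies."""
    comment = by_id[comment_id]
    yield depth, comment
    for child_id in comment["children"]:
        # Skip ids that are not present among the loaded comments.
        if child_id in by_id:
            yield from walk_thread(by_id, child_id, depth + 1)


# Trimmed copy of the "comments" field from the example instance.
comments = {
    "id": [11653537, 11653541],
    "parent_id": [None, 11653537],
    "score": [-1, 0],
    "children": [[11653541], []],
}

records = revert_flattening(comments)
by_id = {record["id"]: record for record in records}
roots = [record["id"] for record in records if record["parent_id"] is None]

lines = []
for root_id in roots:
    for depth, comment in walk_thread(by_id, root_id):
        lines.append("  " * depth + f"{comment['id']} (score {comment['score']})")
print("\n".join(lines))
# 11653537 (score -1)
#   11653541 (score 0)
```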

Source Data

Personal and Sensitive Information

The dataset is not anonymized, so individuals' names can be found in the dataset. Information about the original authors is included in the dataset where possible.