Russian StackOverflow dataset

Description

Summary: A dataset of questions, answers, and comments from ru.stackoverflow.com.

Script: create_stackoverflow.py

Point of Contact: Ilya Gusev

Languages: The dataset is in Russian with some programming code.

Usage

Prerequisites:

pip install datasets zstandard jsonlines pysimdjson

Loading:

from datasets import load_dataset

# Load the training split from the Hugging Face Hub.
dataset = load_dataset("IlyaGusev/ru_stackoverflow", split="train")

# Print the Markdown text of each question.
for example in dataset:
    print(example["text_markdown"])
    print()

Data Instances

{
  "question_id": 11235,
  "answer_count": 1,
  "url": "https://ru.stackoverflow.com/questions/11235",
  "score": 2,
  "tags": ["c++", "сериализация"],
  "title": "Извлечение из файла, запись в файл",
  "views": 1309,
  "author": "...",
  "timestamp": 1303205289,
  "text_html": "...",
  "text_markdown": "...",
  "comments": {
    "text": ["...", "..."],
    "author": ["...", "..."],
    "comment_id": [11236, 11237],
    "score": [0, 0],
    "timestamp": [1303205411, 1303205678]
  },
  "answers": {
    "answer_id": [11243, 11245],
    "timestamp": [1303207791, 1303207792],
    "is_accepted": [1, 0],
    "text_html": ["...", "..."],
    "text_markdown": ["...", "..."],
    "score": [3, 0],
    "author": ["...", "..."],
    "comments": {
      "text": ["...", "..."],
      "author": ["...", "..."],
      "comment_id": [11246, 11249],
      "score": [0, 0],
      "timestamp": [1303207961, 1303207800]
    }
  }
}
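Because answers are stored as parallel lists, picking one answer means indexing every list at the same position. The snippet below is a minimal sketch of that pattern. `accepted_answer` is a hypothetical helper, not part of the dataset tooling, and the sample record uses placeholder values that only mirror the field layout shown above.

```python
def accepted_answer(example):
    """Return the accepted answer as a dict, or None if there is none.

    Hypothetical helper: assumes the parallel-list layout shown above.
    """
    answers = example["answers"]
    for i, flag in enumerate(answers["is_accepted"]):
        if flag:
            return {
                "answer_id": answers["answer_id"][i],
                "score": answers["score"][i],
                "text_markdown": answers["text_markdown"][i],
            }
    return None


# Toy record mirroring the field layout above (values are placeholders).
sample = {
    "answers": {
        "answer_id": [11243, 11245],
        "is_accepted": [1, 0],
        "score": [3, 0],
        "text_markdown": ["first answer", "second answer"],
    }
}

best = accepted_answer(sample)
print(best["answer_id"], best["score"])
```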

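Nested sequences such as comments are stored as a dict of parallel lists. You can use this small helper to convert such a dict back into a list of records: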

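def revert_flattening(records):
    """Turn a dict of parallel lists into a list of per-item dicts."""
    fixed_records = []
    for key, values in records.items():
        # Create one empty record per item the first time through.
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records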

The original JSONL is already unflattened.
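As an illustration, here is the helper applied to a question's `comments` field. This is only a sketch: the helper is repeated so the snippet runs on its own, and the comment texts and author names are placeholders rather than real dataset values.

```python
def revert_flattening(records):
    # Same helper as above, repeated so this snippet is self-contained.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


# Placeholder data in the same shape as example["comments"].
comments = {
    "text": ["first comment", "second comment"],
    "author": ["user_a", "user_b"],
    "comment_id": [11236, 11237],
    "score": [0, 0],
    "timestamp": [1303205411, 1303205678],
}

rows = revert_flattening(comments)
for row in rows:
    print(row["comment_id"], row["author"], row["text"])
```

Each element of `rows` is a plain dict with one value per field, which is usually easier to filter or sort than the parallel lists.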

Source Data

The data source is the Russian StackOverflow website.
Original XMLs: ru.stackoverflow.com.7z.
Processing script: create_stackoverflow.py.

Personal and Sensitive Information

The dataset is not anonymized, so individuals' names can be found in the dataset. Information about the original authors is included in the dataset where possible.

Licensing Information

In accordance with the license of the original data, this dataset is distributed under CC BY-SA 2.5.

Author: IlyaGusev

Dataset size: 639.42 MB