Summary: Dataset of questions, answers, and comments from ru.stackoverflow.com.
Script: create_stackoverflow.py
Point of Contact: Ilya Gusev
Languages: The dataset is in Russian with some programming code.
Prerequisites:
pip install datasets zstandard jsonlines pysimdjson
Loading:
from datasets import load_dataset

dataset = load_dataset('IlyaGusev/ru_stackoverflow', split="train")
for example in dataset:
    print(example["text_markdown"])
    print()
{
  "question_id": 11235,
  "answer_count": 1,
  "url": "https://ru.stackoverflow.com/questions/11235",
  "score": 2,
  "tags": ["c++", "сериализация"],
  "title": "Извлечение из файла, запись в файл",
  "views": 1309,
  "author": "...",
  "timestamp": 1303205289,
  "text_html": "...",
  "text_markdown": "...",
  "comments": {
    "text": ["...", "..."],
    "author": ["...", "..."],
    "comment_id": [11236, 11237],
    "score": [0, 0],
    "timestamp": [1303205411, 1303205678]
  },
  "answers": {
    "answer_id": [11243, 11245],
    "timestamp": [1303207791, 1303207792],
    "is_accepted": [1, 0],
    "text_html": ["...", "..."],
    "text_markdown": ["...", "..."],
    "score": [3, 0],
    "author": ["...", "..."],
    "comments": {
      "text": ["...", "..."],
      "author": ["...", "..."],
      "comment_id": [11246, 11249],
      "score": [0, 0],
      "timestamp": [1303207961, 1303207800]
    }
  }
}
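Because answers are stored as parallel lists, picking out the accepted answer means finding its position in "is_accepted" and reading the other lists at the same index. The following is a minimal sketch under stated assumptions: the record is a hand-written stand-in trimmed from the sample above, accepted_answer_index is a hypothetical helper that is not part of the dataset script, and "timestamp" is assumed to be Unix time in seconds (UTC).

```python
from datetime import datetime, timezone

# Hypothetical record, trimmed to the fields used here; values mirror the sample above.
record = {
    "question_id": 11235,
    "timestamp": 1303205289,
    "answers": {
        "answer_id": [11243, 11245],
        "is_accepted": [1, 0],
        "score": [3, 0],
        "text_markdown": ["...", "..."],
    },
}


def accepted_answer_index(answers):
    """Return the position of the first accepted answer, or None if there is none."""
    for i, flag in enumerate(answers["is_accepted"]):
        if flag:
            return i
    return None


idx = accepted_answer_index(record["answers"])
accepted_id = record["answers"]["answer_id"][idx] if idx is not None else None

# Assumption: timestamps are Unix seconds; interpret them as UTC.
asked_at = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)

print(accepted_id, asked_at.isoformat())
```

The same index can be used to read "text_markdown" or "score" for the accepted answer.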
You can use this little helper to unflatten sequences:
def revert_flattening(records):
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records
The original JSONL is already unflattened.
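As an illustration, the helper can be applied to the "comments" field of a loaded record to turn the dict of lists back into a list of per-comment dicts. This is a small sketch: the function body repeats the helper above so that the snippet runs on its own, and the comments dict is a hand-written stand-in copied from the sample record, not real data.

```python
def revert_flattening(records):
    # Same helper as above: turns {"key": [v0, v1, ...]} into [{"key": v0}, {"key": v1}, ...].
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


# Stand-in for example["comments"], taken from the sample record.
comments = {
    "text": ["...", "..."],
    "author": ["...", "..."],
    "comment_id": [11236, 11237],
    "score": [0, 0],
    "timestamp": [1303205411, 1303205678],
}

unflattened = revert_flattening(comments)
for comment in unflattened:
    print(comment["comment_id"], comment["timestamp"])
```

Each element of unflattened is one comment with its own "text", "author", "comment_id", "score", and "timestamp" keys.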
The dataset is not anonymized, so individuals' names can be found in the dataset. Information about the original authors is included in the dataset where possible.
In accordance with the license of the original data, this dataset is distributed under CC BY-SA 2.5.