Dataset:

opentensor/openvalidators-test

Size:

1M<n<10M

License:

mit

Dataset Card for OpenValidators dataset

Dataset Summary

The OpenValidators dataset, created by the OpenTensor Foundation, is a continuously growing collection of data generated by the OpenValidators project and logged to W&B. It contains hundreds of thousands of records and serves researchers, data scientists, and miners in the Bittensor network. The dataset provides information on network performance, node behaviors, and wandb run details. Researchers can use it to gain insights and detect patterns, data scientists can use it to train models and run analyses, and miners can use the generated data to fine-tune their models and improve their incentives in the network. Continuous updates keep the dataset useful for collaboration and innovation in decentralized computing.

How to use

The datasets library allows you to load and pre-process your dataset in pure Python, at scale.

The OpenValidators dataset lets you extract data at three levels of granularity: by run_id, by a single OpenValidators version, or by multiple OpenValidators versions. The load_dataset function downloads and prepares the data on your local drive in one call.

Downloading by run id

For example, to download the data for a specific run, specify the corresponding OpenValidators version and the wandb run id in the format version/raw_data/run_id.parquet:

from datasets import load_dataset

version = '1.0.4' # OpenValidators version
run_id = '0plco3n0' # WandB run id
run_id_dataset = load_dataset('opentensor/openvalidators-test', data_files=f'{version}/raw_data/{run_id}.parquet')

Please note that only completed run_ids are included in the dataset. Runs that are still in progress will be ingested shortly after they finish.

Downloading by OpenValidators version

You can also use the datasets library to download all the runs for a given OpenValidators version. This is useful for researchers and data enthusiasts who want to analyze the state of a specific OpenValidators version.

from datasets import load_dataset

version = '1.0.4' # OpenValidators version
version_dataset = load_dataset('opentensor/openvalidators-test', data_files=f'{version}/raw_data/*')

Downloading by multiple OpenValidators versions

The datasets library can also download runs from several OpenValidators versions in one call. Combining data from multiple versions supports downstream tasks such as fine-tuning models for mining or large-scale data analysis.

from datasets import load_dataset

versions = ['1.0.0', '1.0.1', '1.0.2', '1.0.4'] # Desired versions for extraction
data_files = [f'{version}/raw_data/*' for version in versions] # Set data files directories
dataset = load_dataset('opentensor/openvalidators-test', data_files={ 'test': data_files })
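
The three download modes above differ only in the data_files patterns passed to load_dataset. As a minimal sketch, the pattern construction can be factored into one helper. The names REPO_ID, raw_data_patterns and load_runs are hypothetical and not part of the dataset or of the datasets library; the path layout is the version/raw_data/run_id.parquet format described above.

```python
from typing import List, Optional, Sequence

REPO_ID = "opentensor/openvalidators-test"


def raw_data_patterns(versions: Sequence[str], run_id: Optional[str] = None) -> List[str]:
    """Build data_files patterns of the form version/raw_data/<run_id>.parquet."""
    # Without a run_id, select every run of each version with a glob.
    name = f"{run_id}.parquet" if run_id else "*"
    return [f"{version}/raw_data/{name}" for version in versions]


def load_runs(versions: Sequence[str], run_id: Optional[str] = None):
    """Download the selected runs (needs the datasets package and network access)."""
    # Imported here so that raw_data_patterns stays usable without datasets installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, data_files=raw_data_patterns(versions, run_id))


# Pattern for a single run, then patterns for all runs of two versions.
print(raw_data_patterns(["1.0.4"], run_id="0plco3n0"))
print(raw_data_patterns(["1.0.0", "1.0.4"]))
```

Calling load_runs(['1.0.4'], run_id='0plco3n0') should then be equivalent to the run id example earlier in this section. Because data_files is passed as a plain list here, the rows end up in the default train split rather than in a split named test.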

Analyzing metadata

The state of the wandb data ingestion can be inspected with pandas and the Hugging Face datasets structure. It holds the metadata of each run, including user information, config information and ingestion state.

import pandas as pd

version = '1.0.4' # OpenValidators version for metadata analysis
df = pd.read_csv(f'hf://datasets/opentensor/openvalidators-test/{version}/metadata.csv')

Dataset Structure

Data Instances

versioned raw_data

The data is provided as is from the wandb logs, without further preprocessing or tokenization. It is located at version/raw_data, where each file is a wandb run.

metadata

This dataset records the current state of the wandb data ingestion for each run id.

Data Fields

Raw data

The versioned raw_data collected from W&B has the following schema:

_runtime : (float64) Runtime of the event
_step : (int64) Step of the event
_timestamp : (float64) Timestamp of the event
answer_completions : (list(string)) Completions of the answer_prompt
answer_prompt : (string) Prompt used to generate the answer
answer_rewards : (list(float64)) Rewards of the answer responses
answer_times : (list(float64)) Elapsed time of answer responses
answer_uids : (list(int32)) UIDs of nodes that answered the answer_prompt
base_prompt : (string) Bootstrap prompt
best_answer : (string) Best answer response
best_followup : (string) Best followup response
block : (float64) Subtensor current block
followup_completions : (list(string)) Completions of the base_prompt
followup_rewards : (list(float64)) Rewards of the followup responses
followup_times : (list(float64)) Elapsed time of followup responses
followup_uids : (list(int64)) UIDs of nodes that answered the base_prompt
gating_loss : (float64) Gating model loss
gating_scorings : (list(float64)) Gating model scores
moving_averaged_scores : (list(float64)) Moving averaged scores at the time of the event
set_weights : (list(list(float64))) Processed weights of nodes by uid
step_length : (float64) Time difference from beginning of forward call to event logging
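
As an illustration of how these list-valued fields can be combined, the sketch below ranks the nodes of a single event by reward. It assumes that answer_uids and answer_rewards are aligned by position, which the schema suggests but does not state. The function name rank_answers and the record values are hypothetical; only the field names come from the schema.

```python
from typing import Any, Dict, List, Tuple


def rank_answers(record: Dict[str, Any]) -> List[Tuple[int, float]]:
    """Return (uid, reward) pairs for one event, highest reward first."""
    uids = record["answer_uids"]
    rewards = record["answer_rewards"]
    # Assumption: the i-th reward belongs to the i-th uid.
    if len(uids) != len(rewards):
        raise ValueError("answer_uids and answer_rewards differ in length")
    return sorted(zip(uids, rewards), key=lambda pair: pair[1], reverse=True)


# Made-up record for illustration; real rows come from the parquet files.
example = {
    "answer_uids": [12, 7, 30],
    "answer_rewards": [0.25, 0.75, 0.5],
}

print(rank_answers(example))  # [(7, 0.75), (30, 0.5), (12, 0.25)]
```

The same pairing should apply to followup_uids and followup_rewards under the same alignment assumption.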

Metadata

run_id : (string) Wandb Run Id
completed : (boolean) Flag indicating if the run_id is completed (finished, crashed or killed)
downloaded : (boolean) Flag indicating if the run_id data has been downloaded
last_checkpoint : (string) Last checkpoint of the run_id
hotkey : (string) Hotkey associated with the run_id
openvalidators_version : (string) Version of OpenValidators associated with the run_id
problematic : (boolean) Flag indicating if problems occurred while ingesting the run_id data
problematic_reason : (string) Reason for the run_id being problematic (Exception message)
wandb_json_config : (string) JSON configuration associated with the run_id in Wandb
wandb_run_name : (string) Name of the Wandb run
wandb_user_info : (string) Username information associated with the Wandb run
wandb_tags : (list) List of tags associated with the Wandb run
wandb_createdAt : (string) Timestamp of the run creation in Wandb
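
The ingestion flags above can be used to pick the runs whose raw data should be available. The sketch below works on plain row dictionaries, for example the result of df.to_dict('records') on the metadata DataFrame, and assumes the three flags were parsed as booleans. The function name usable_run_ids and the sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def usable_run_ids(rows: Iterable[Dict[str, Any]]) -> List[str]:
    """Run ids that are completed, downloaded and not flagged as problematic."""
    return [
        row["run_id"]
        for row in rows
        if row["completed"] and row["downloaded"] and not row["problematic"]
    ]


# Made-up rows for illustration; real rows come from metadata.csv.
sample_rows = [
    {"run_id": "run_a", "completed": True, "downloaded": True, "problematic": False},
    {"run_id": "run_b", "completed": True, "downloaded": True, "problematic": True},
    {"run_id": "run_c", "completed": False, "downloaded": False, "problematic": False},
]

print(usable_run_ids(sample_rows))  # ['run_a']
```

If the flags arrive as strings rather than booleans, they would need to be converted before filtering, since any non-empty string is truthy in Python.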

Dataset Creation

Curation Rationale

This dataset was curated to provide a comprehensive and reliable collection of historical data produced by running different OpenValidators versions in the Bittensor network. The goal is to support researchers, data scientists and developers with data generated in the network, facilitating the discovery of new insights, network analysis, troubleshooting, and data extraction for downstream tasks like mining.

Source Data

Initial Data Collection and Normalization

A dedicated worker periodically extracts data from wandb and ingests it into the Hugging Face datasets structure. The collected data is organized by OpenValidators version and run ID to allow efficient data management and granular access. Each run is collected according to its OpenValidators version tag and grouped into a version-specific folder. Each version folder contains a metadata.csv file that tracks the collection state, and the raw data of each run is saved in .parquet format under a file name matching the run ID (e.g., run_id.parquet). Please note that the code for this data collection process will be released for transparency and reproducibility.

Who are the source language producers?

The language producers for this dataset are all the OpenValidators instances that log their data to wandb, in conjunction with other nodes of the Bittensor network. The main wandb page where the data is sent can be accessed at https://wandb.ai/opentensor-dev/openvalidators/table.

Licensing Information

The dataset is licensed under the MIT License.

Supported Tasks and Leaderboards

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

[More Information Needed]

Author:

opentensor

Dataset size:

657.09 MB