Dataset:

opentensor/openvalidators

License:

mit

Size:

1M<n<10M

Dataset Card for Openvalidators dataset

Dataset Summary

The OpenValidators dataset, created by the OpenTensor Foundation, is a continuously growing collection of data generated by the OpenValidators project and logged to W&B. It contains millions of records and serves researchers, data scientists, and miners in the Bittensor network. The dataset provides information on network performance, node behaviors, and wandb run details. Researchers can use it to gain insights and detect patterns, data scientists can use it for model training and analysis, and miners can use it to fine-tune their models and improve their incentives in the network. Because the dataset is updated continuously, it supports collaboration and innovation in decentralized computing.

Version support and revisions

This dataset is constantly evolving. To make data management easier, each data schema is versioned in its own Hugging Face dataset branch, so legacy data can be easily retrieved.

The main branch (the default revision) always holds the latest version of the dataset, following the latest schema adopted by OpenValidators.

The data is currently organized as follows:

v1.0: All data collected with the first OpenValidators schema, covering versions 1.0.0 to 1.0.8.
main: Current state of the dataset, following the latest schema adopted by OpenValidators (>= 1.1.0).
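
The version ranges above can be turned into a small lookup. This is a minimal sketch, not part of the dataset tooling: the function name revision_for_version is hypothetical, and the boundaries are taken only from the two entries listed here, so they may need updating if new revisions are added.

```python
def revision_for_version(version: str) -> str:
    """Return the dataset revision expected to hold a given OpenValidators version."""
    parsed = tuple(int(part) for part in version.split("."))
    # Versions 1.0.0 through 1.0.8 follow the first schema (revision 'v1.0').
    if (1, 0, 0) <= parsed <= (1, 0, 8):
        return "v1.0"
    # Versions from 1.1.0 onward follow the latest schema (revision 'main').
    if parsed >= (1, 1, 0):
        return "main"
    # Anything else is not covered by the documented ranges.
    raise ValueError(f"No documented revision for version {version!r}")


print(revision_for_version("1.0.4"))  # v1.0
print(revision_for_version("1.1.0"))  # main
```

The returned string can be passed as the revision argument of load_dataset.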

How to use

The datasets library allows you to load and pre-process your dataset in pure Python, at scale.

The OpenValidators dataset lets you extract data at three levels of granularity: by run_id, by OpenValidators version, and by multiple OpenValidators versions. The dataset can be downloaded and prepared on your local drive in one call to the load_dataset function.

Downloading by run id

For example, to download the data for a specific run, specify the corresponding OpenValidators version and the wandb run id in the format version/raw_data/run_id.parquet:

from datasets import load_dataset

version = '1.1.0' # OpenValidators version
run_id = '0drg98iy' # WandB run id
run_id_dataset = load_dataset('opentensor/openvalidators', data_files=f'{version}/raw_data/{run_id}.parquet')

Please note that only completed run_ids are included in the dataset. Runs that are still in progress will be ingested shortly after they finish.

Downloading by OpenValidators version

One can also use the datasets library to download all the runs within a given OpenValidators version. This is useful for researchers and data enthusiasts who want to analyze a specific OpenValidators version.

from datasets import load_dataset

version = '1.1.0' # OpenValidators version
version_dataset = load_dataset('opentensor/openvalidators', data_files=f'{version}/raw_data/*')

Downloading by multiple OpenValidators versions

With the datasets library, users can also download runs from multiple OpenValidators versions in one call. Combining data from several versions supports downstream tasks such as fine-tuning for mining or large-scale data analysis.

from datasets import load_dataset

versions = ['1.1.0', '1.1.1', ...] # Desired versions for extraction
data_files = [f'{version}/raw_data/*' for version in versions] # Set data files directories
dataset = load_dataset('opentensor/openvalidators', data_files={ 'test': data_files })
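
The pattern-building step in the snippet above can be wrapped in a small helper. This is a sketch under assumptions: build_data_files is a hypothetical name, and the default split label 'test' simply mirrors the snippet above. The load_dataset call is left commented out because it needs network access.

```python
def build_data_files(versions, split="test"):
    """Map one split name to the raw_data glob pattern of each requested version."""
    # Drop duplicate versions while keeping the order in which they were given.
    unique_versions = list(dict.fromkeys(versions))
    return {split: [f"{version}/raw_data/*" for version in unique_versions]}


data_files = build_data_files(["1.1.0", "1.1.1", "1.1.0"])
print(data_files)

# To download, pass the mapping to load_dataset:
# from datasets import load_dataset
# dataset = load_dataset("opentensor/openvalidators", data_files=data_files)
```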

Downloading legacy data using revisions

from datasets import load_dataset

version = '1.0.4' # OpenValidators version
run_id = '0plco3n0' # WandB run id
revision = 'v1.0' # Dataset revision
run_id_dataset = load_dataset('opentensor/openvalidators', data_files=f'{version}/raw_data/{run_id}.parquet', revision=revision)

Note: You can interact with legacy data in all the ways mentioned above, as long as your data scope is within the same revision.

Analyzing metadata

All state related to the wandb data ingestion can be accessed easily using pandas and the Hugging Face datasets structure. This data contains the metadata of each run, including user information, config information, and ingestion state.

import pandas as pd

version = '1.1.0' # OpenValidators version for metadata analysis
df = pd.read_csv(f'hf://datasets/opentensor/openvalidators/{version}/metadata.csv')
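
Once the metadata is loaded, a typical first step is to keep only the runs whose data is usable. The sketch below assumes the column names listed under Data Fields (completed, downloaded, problematic) and uses made-up rows instead of the real metadata.csv; usable_runs is a hypothetical helper name.

```python
import pandas as pd


def usable_runs(metadata: pd.DataFrame) -> pd.DataFrame:
    """Keep runs that are completed, downloaded, and not flagged as problematic."""
    # .eq(True) treats missing values as False, so incomplete rows are dropped.
    mask = (
        metadata["completed"].eq(True)
        & metadata["downloaded"].eq(True)
        & ~metadata["problematic"].eq(True)
    )
    return metadata.loc[mask, ["run_id", "openvalidators_version", "hotkey"]]


# Made-up rows for illustration only; real rows come from metadata.csv.
sample = pd.DataFrame(
    {
        "run_id": ["run_a", "run_b", "run_c"],
        "completed": [True, True, False],
        "downloaded": [True, True, False],
        "problematic": [False, True, False],
        "hotkey": ["hk_a", "hk_b", "hk_c"],
        "openvalidators_version": ["1.1.0", "1.1.0", "1.1.0"],
    }
)

print(usable_runs(sample)["run_id"].tolist())  # ['run_a']
```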

Dataset Structure

Data Instances

versioned raw_data

The data is provided as-is from the wandb logs, without further preprocessing or tokenization. It is located at version/raw_data, where each file is a wandb run.

metadata

This dataset defines the current state of the wandb data ingestion by run id.

Data Fields

Raw data

The versioned raw_data collected from W&B has the following schema:

rewards: (float64) Reward vector for given step
completion_times: (float64) List of completion times for a given prompt
completions: (string) List of completions received for a given prompt
_runtime: (float64) Runtime of the event
_timestamp: (float64) Timestamp of the event
name: (string) Prompt type, e.g. 'followup', 'answer', 'augment'
block: (float64) Current block at given step
gating_loss: (float64) Gating model loss for given step
rlhf_reward_model: (float64) Output vector of the rlhf reward model
relevance_filter: (float64) Output vector of the relevance scoring reward model
dahoas_reward_model: (float64) Output vector of the dahoas reward model
blacklist_filter: (float64) Output vector of the blacklist filter
nsfw_filter: (float64) Output vector of the nsfw filter
prompt_reward_model: (float64) Output vector of the prompt reward model
reciprocate_reward_model: (float64) Output vector of the reciprocate reward model
diversity_reward_model: (float64) Output vector of the diversity reward model
set_weights: (float64) Output vector of the set weights
uids: (int64) Queried uids
_step: (int64) Step of the event
prompt: (string) Prompt text string
step_length: (float64) Elapsed time between the beginning of a run step to the end of a run step
best: (string) Best completion for given prompt
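
As an illustration of how these fields might be combined, the sketch below pairs each queried uid with its completion and reward for one step. It assumes that uids, completions, and rewards are aligned lists of equal length, which the schema suggests but does not state; the record is made up and rank_completions is a hypothetical helper.

```python
def rank_completions(record):
    """Pair each queried uid with its completion and reward, highest reward first."""
    uids = record["uids"]
    completions = record["completions"]
    rewards = record["rewards"]
    # The pairing is only meaningful if the three lists are aligned.
    if not (len(uids) == len(completions) == len(rewards)):
        raise ValueError("uids, completions and rewards must have equal length")
    rows = zip(uids, completions, rewards)
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up record for illustration only; real records come from the parquet files.
record = {
    "prompt": "example prompt",
    "uids": [11, 42, 7],
    "completions": ["first answer", "second answer", "third answer"],
    "rewards": [0.10, 0.75, 0.30],
}

top_uid, top_completion, top_reward = rank_completions(record)[0]
print(top_uid, top_completion, top_reward)  # 42 second answer 0.75
```

Whether the top-ranked completion matches the best field is not stated by the schema and should be checked against real data.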

Metadata

run_id: (string) Wandb run id
completed: (boolean) Flag indicating if the run_id is completed (finished, crashed or killed)
downloaded: (boolean) Flag indicating if the run_id data has been downloaded
last_checkpoint: (string) Last checkpoint of the run_id
hotkey: (string) Hotkey associated with the run_id
openvalidators_version: (string) Version of OpenValidators associated with the run_id
problematic: (boolean) Flag indicating if the run_id data had problems during ingestion
problematic_reason: (string) Reason for the run_id being problematic (exception message)
wandb_json_config: (string) JSON configuration associated with the run_id in Wandb
wandb_run_name: (string) Name of the Wandb run
wandb_user_info: (string) Username information associated with the Wandb run
wandb_tags: (list) List of tags associated with the Wandb run
wandb_createdAt: (string) Timestamp of the run creation in Wandb
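
Because wandb_json_config is stored as a string, it has to be decoded before its values can be inspected. The sketch below is one hedged way to do that; parse_config is a hypothetical helper, and the keys in the example string are invented, since the actual config layout is not described here.

```python
import json


def parse_config(raw_config):
    """Decode a wandb_json_config value; return an empty dict if missing or invalid."""
    # Missing values may load as None or NaN rather than as a string.
    if not isinstance(raw_config, str):
        return {}
    try:
        parsed = json.loads(raw_config)
    except json.JSONDecodeError:
        return {}
    # Only a JSON object is useful as a config mapping.
    return parsed if isinstance(parsed, dict) else {}


# Hypothetical config string; real keys depend on the OpenValidators version.
example = '{"neuron": {"epoch_length": 100}}'
print(parse_config(example)["neuron"]["epoch_length"])  # 100
```

With pandas, the helper could be applied to the whole column, for example df["wandb_json_config"].apply(parse_config).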

Dataset Creation

Curation Rationale

This dataset was curated to provide a comprehensive and reliable collection of historical data produced by different OpenValidators running in the Bittensor network. The goal is to support researchers, data scientists, and developers with data generated in the network, facilitating the discovery of new insights, network analysis, troubleshooting, and data extraction for downstream tasks like mining.

Source Data

Initial Data Collection and Normalization

The data is collected on a recurring basis by a specialized worker that extracts data from wandb and ingests it into the Hugging Face datasets structure. The collected data is organized by OpenValidators version and run ID to facilitate efficient data management and granular access. Each run is collected based on its OpenValidators version tag and grouped into a version-specific folder. Each version folder includes a metadata.csv file that tracks the collection state, while the raw data of each run is saved in .parquet format under a file named after the run ID (e.g., run_id.parquet). Please note that the code for this data collection process will be released for transparency and reproducibility.

Who are the source language producers?

The language producers for this dataset are all the OpenValidators that log their data into wandb, in conjunction with other nodes of the Bittensor network. The main wandb page where the data is sent can be accessed at https://wandb.ai/opentensor-dev/openvalidators/table .

Licensing Information

The dataset is licensed under the MIT License.

Supported Tasks and Leaderboards

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

[More Information Needed]

Author:

opentensor

Dataset size:

2.18 GB