Dataset:

opentensor/openvalidators

License:

mit

Size:

1M<n<10M

Dataset Card for OpenValidators dataset

Dataset Summary

The OpenValidators dataset, created by the OpenTensor Foundation, is a continuously growing collection of data generated by the OpenValidators project in W&B. It contains millions of records and serves researchers, data scientists, and miners in the Bittensor network. The dataset covers network performance, node behavior, and wandb run details. Researchers can use it to detect patterns, data scientists can use it for model training and analysis, and miners can use it to fine-tune their models and improve their incentives in the network. Because the dataset is updated continuously, it also supports ongoing collaboration in decentralized computing.

Version support and revisions

This dataset is constantly evolving. To make data management easier, each data schema is versioned in its own Hugging Face dataset branch, so legacy data can be easily retrieved.

The main branch (or default revision) will always be the latest version of the dataset, following the latest schema adopted by OpenValidators.

The data is currently organized as follows:

  • v1.0 : All data collected from the first OpenValidators schema, ranging from version 1.0.0 to 1.0.8.
  • main : Current state of the dataset, following the latest schema adopted by OpenValidators (>= 1.1.0).

How to use

The datasets library allows you to load and pre-process your dataset in pure Python, at scale.

The OpenValidators dataset lets you extract data by run_id, by a single OpenValidators version, or by multiple OpenValidators versions. The dataset can be downloaded to your local drive and prepared in one call using the load_dataset function.

Downloading by run id

For example, to download the data for a specific run, simply specify the corresponding OpenValidators version and the wandb run id in the format version/raw_data/run_id.parquet:

from datasets import load_dataset

version = '1.1.0' # OpenValidators version
run_id = '0drg98iy' # WandB run id
run_id_dataset = load_dataset('opentensor/openvalidators', data_files=f'{version}/raw_data/{run_id}.parquet')

Please note that only completed run_ids are included in the dataset. Runs that are still in progress will be ingested shortly after they finish.

Downloading by OpenValidators version

One can also use the datasets library to download all the runs within a given OpenValidators version. This is useful for researchers and data enthusiasts who want to analyze the data produced by a specific OpenValidators version.

from datasets import load_dataset

version = '1.1.0' # OpenValidators version
version_dataset = load_dataset('opentensor/openvalidators', data_files=f'{version}/raw_data/*')

Downloading by multiple OpenValidators versions

With the datasets library, users can also download runs from multiple OpenValidators versions in one call. This supports downstream tasks such as fine-tuning for mining or large-scale data analysis.

from datasets import load_dataset

versions = ['1.1.0', '1.1.1'] # Desired versions for extraction (extend the list as needed)
data_files = [f'{version}/raw_data/*' for version in versions] # One glob pattern per version folder
dataset = load_dataset('opentensor/openvalidators', data_files={'test': data_files})

Downloading legacy data using revisions

from datasets import load_dataset

version = '1.0.4' # OpenValidators version
run_id = '0plco3n0' # WandB run id
revision = 'v1.0' # Dataset revision
run_id_dataset = load_dataset('opentensor/openvalidators', data_files=f'{version}/raw_data/{run_id}.parquet', revision=revision)

Note: You can interact with legacy data in all the ways mentioned above, as long as your data scope is within the same revision.
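
Because each load_dataset call reads from a single revision, a mixed list of versions has to be split by revision first. The sketch below shows one way to do that. It assumes the layout described earlier in this card (1.0.0 to 1.0.8 on v1.0, 1.1.0 and later on main). The helper names revision_for and data_files_by_revision are illustrative and are not part of the datasets library.

```python
from collections import defaultdict


def revision_for(version: str) -> str:
    """Map an OpenValidators version string to the dataset revision holding it.

    Assumption: 1.0.x data lives on the 'v1.0' branch and everything from
    1.1.0 onward lives on 'main', as described in this card.
    """
    major, minor, _patch = (int(part) for part in version.split("."))
    return "v1.0" if (major, minor) < (1, 1) else "main"


def data_files_by_revision(versions):
    """Group raw_data glob patterns by the revision they must be loaded from."""
    grouped = defaultdict(list)
    for version in versions:
        grouped[revision_for(version)].append(f"{version}/raw_data/*")
    return dict(grouped)


plan = data_files_by_revision(["1.0.4", "1.0.8", "1.1.0"])
# Each (revision, patterns) pair in `plan` can then be loaded on its own, e.g.
# load_dataset('opentensor/openvalidators', data_files=patterns, revision=revision)
```

Keeping one load_dataset call per revision avoids mixing the two schemas in a single dataset object.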

Analyzing metadata

All state related to the wandb data ingestion can be accessed using pandas and the Hugging Face datasets structure. This data contains the metadata of each run, including user information, config information and ingestion state.

import pandas as pd

version = '1.1.0' # OpenValidators version for metadata analysis
df = pd.read_csv(f'hf://datasets/opentensor/openvalidators/{version}/metadata.csv')

Dataset Structure

Data Instances

versioned raw_data

The data is provided as-is from the wandb logs, without further preprocessing or tokenization. It is located at version/raw_data, where each file is a wandb run.

metadata

This dataset defines the current state of the wandb data ingestion by run id.

Data Fields

Raw data

The versioned raw_data collected from W&B follows this schema:

  • rewards : (float64) Reward vector for given step
  • completion_times : (float64) List of completion times for a given prompt
  • completions : (string) List of completions received for a given prompt
  • _runtime : (float64) Runtime of the event
  • _timestamp : (float64) Timestamp of the event
  • name : (string) Prompt type, e.g. 'followup', 'answer', 'augment'
  • block : (float64) Current block at given step
  • gating_loss : (float64) Gating model loss for given step
  • rlhf_reward_model : (float64) Output vector of the rlhf reward model
  • relevance_filter : (float64) Output vector of the relevance scoring reward model
  • dahoas_reward_model : (float64) Output vector of the dahoas reward model
  • blacklist_filter : (float64) Output vector of the blacklist filter
  • nsfw_filter : (float64) Output vector of the nsfw filter
  • prompt_reward_model : (float64) Output vector of the prompt reward model
  • reciprocate_reward_model : (float64) Output vector of the reciprocate reward model
  • diversity_reward_model : (float64) Output vector of the diversity reward model
  • set_weights : (float64) Output vector of the set weights
  • uids : (int64) Queried uids
  • _step : (int64) Step of the event
  • prompt : (string) Prompt text string
  • step_length : (float64) Elapsed time between the beginning of a run step to the end of a run step
  • best : (string) Best completion for given prompt
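
The per-step list fields appear to be aligned by position, so that uids[i], completions[i] and rewards[i] describe the same queried miner. The schema above does not state this alignment, so treat it as an assumption. Under that assumption, a single record can be unpacked and ranked as sketched below. The sample record and the helper name rank_completions are invented for illustration.

```python
def rank_completions(record):
    """Return (uid, reward, completion) tuples sorted by reward, highest first.

    Assumption: uids, rewards and completions are equal-length lists
    aligned by index.
    """
    uids = record["uids"]
    rewards = record["rewards"]
    completions = record["completions"]
    if not (len(uids) == len(rewards) == len(completions)):
        raise ValueError("uids, rewards and completions must have the same length")
    rows = zip(uids, rewards, completions)
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical record shaped like one raw_data row (all values are invented).
sample = {
    "prompt": "What is Bittensor?",
    "uids": [12, 7, 33],
    "rewards": [0.10, 0.85, 0.40],
    "completions": ["answer a", "answer b", "answer c"],
}

ranked = rank_completions(sample)
top_uid, top_reward, top_completion = ranked[0]
```

The length check makes the alignment assumption explicit: a record that violates it raises an error instead of silently pairing the wrong values.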

Metadata

  • run_id : (string) Wandb Run Id
  • completed : (boolean) Flag indicating if the run_id is completed (finished, crashed or killed)
  • downloaded : (boolean) Flag indicating if the run_id data has been downloaded
  • last_checkpoint : (string) Last checkpoint of the run_id
  • hotkey : (string) Hotkey associated with the run_id
  • openvalidators_version : (string) Version of OpenValidators associated with the run_id
  • problematic : (boolean) Flag indicating if the run_id data had problems to be ingested
  • problematic_reason : (string) Reason for the run_id being problematic (Exception message)
  • wandb_json_config : (string) JSON configuration associated with the run_id in Wandb
  • wandb_run_name : (string) Name of the Wandb run
  • wandb_user_info : (string) Username information associated with the Wandb run
  • wandb_tags : (list) List of tags associated with the Wandb run
  • wandb_createdAt : (string) Timestamp of the run creation in Wandb
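
The flags above can be combined to pick the runs that are safe to load. A minimal sketch follows, assuming the boolean columns have been parsed as real booleans and that rows are passed as plain dictionaries, for example from df.to_dict('records') on the metadata DataFrame. The function name usable_run_files and the sample rows are hypothetical.

```python
def usable_run_files(rows):
    """Build raw_data parquet paths for runs that are completed,
    downloaded and not flagged as problematic."""
    files = []
    for row in rows:
        if row["completed"] and row["downloaded"] and not row["problematic"]:
            files.append(
                f"{row['openvalidators_version']}/raw_data/{row['run_id']}.parquet"
            )
    return files


# Invented metadata rows using the column names listed above.
rows = [
    {"run_id": "aaa11111", "completed": True, "downloaded": True,
     "problematic": False, "openvalidators_version": "1.1.0"},
    {"run_id": "bbb22222", "completed": True, "downloaded": False,
     "problematic": True, "openvalidators_version": "1.1.0"},
    {"run_id": "ccc33333", "completed": False, "downloaded": False,
     "problematic": False, "openvalidators_version": "1.1.0"},
]

files = usable_run_files(rows)
```

The resulting list follows the version/raw_data/run_id.parquet layout and can be passed as data_files to load_dataset.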

Dataset Creation

Curation Rationale

This dataset was curated to provide a comprehensive and reliable collection of historical data obtained by the execution of different OpenValidators in the Bittensor network. The goal is to support researchers, data scientists and developers with data generated in the network, facilitating the discovery of new insights, network analysis, troubleshooting, and data extraction for downstream tasks like mining.

Source Data

Initial Data Collection and Normalization

The data is collected on a recurring basis by a specialized worker that extracts data from wandb and ingests it into the Hugging Face datasets structure. The collected data is organized by OpenValidators version and run ID to facilitate efficient data management and granular access. Each run is collected based on its OpenValidators version tag and grouped into version-specific folders. Within each version folder, a metadata.csv file tracks the collection state, while the raw data of each run is saved in .parquet format under a file name matching the run ID (e.g., run_id.parquet). Please note that the code for this data collection process will be released for transparency and reproducibility.
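
Given that layout, a file path inside the repository can be split back into its version and run ID. The sketch below assumes paths of the form version/raw_data/run_id.parquet, as used in the download examples in this card. parse_run_path is a hypothetical helper, not part of any library.

```python
from pathlib import PurePosixPath


def parse_run_path(path):
    """Split 'version/raw_data/run_id.parquet' into (version, run_id).

    Returns None for paths that do not match that layout, such as
    'version/metadata.csv'.
    """
    parts = PurePosixPath(path).parts
    if len(parts) != 3 or parts[1] != "raw_data" or not parts[2].endswith(".parquet"):
        return None
    return parts[0], parts[2][: -len(".parquet")]


parsed = parse_run_path("1.1.0/raw_data/0drg98iy.parquet")
```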

Who are the source language producers?

The language producers for this dataset are all the OpenValidators that log their data to wandb, in conjunction with other nodes of the Bittensor network. The main wandb page where the data is sent can be accessed at https://wandb.ai/opentensor-dev/openvalidators/table.

Licensing Information

The dataset is licensed under the MIT License.

Supported Tasks and Leaderboards

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

[More Information Needed]