The OpenValidators dataset, created by the OpenTensor Foundation, is a continuously growing collection of data generated by the OpenValidators project in W&B. It contains hundreds of thousands of records and serves researchers, data scientists, and miners in the Bittensor network. The dataset provides information on network performance, node behaviors, and wandb run details. Researchers can use it to gain insights and detect patterns, data scientists can use it for model training and analysis, and miners can use it to fine-tune their models and improve their incentives in the network. The dataset's continuous updates support collaboration and innovation in decentralized computing.
The datasets library allows you to load and pre-process your dataset in pure Python, at scale.
The OpenValidators dataset lets you extract data at three levels of granularity: by run_id, by a single OpenValidators version, or across multiple OpenValidators versions. The dataset can be downloaded and prepared on your local drive in a single call to the load_dataset function.
Downloading by run id
For example, to download the data for a specific run, specify the corresponding OpenValidators version and the wandb run id in the format version/raw_data/run_id.parquet:
    from datasets import load_dataset

    version = '1.0.4'  # OpenValidators version
    run_id = '0plco3n0'  # WandB run id
    run_id_dataset = load_dataset('opentensor/openvalidators-test', data_files=f'{version}/raw_data/{run_id}.parquet')
Please note that only completed run_ids are included in the dataset. Runs that are still in progress will be ingested shortly after they finish.
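When more than one run is needed from the same version, the path format above can be applied to a list of run ids. The following is a minimal sketch, not part of the dataset tooling: the helper names run_data_files and load_runs are hypothetical, and it assumes load_dataset is given a list of file paths for data_files.

```python
def run_data_files(version, run_ids):
    """Return one parquet path per run, in the form version/raw_data/run_id.parquet."""
    return [f"{version}/raw_data/{run_id}.parquet" for run_id in run_ids]


def load_runs(version, run_ids, repo="opentensor/openvalidators-test"):
    """Hypothetical helper: download several completed runs of one version."""
    # Imported lazily so the path helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo, data_files=run_data_files(version, run_ids))


# Building the paths needs no network access.
print(run_data_files("1.0.4", ["0plco3n0"]))
```

Calling load_runs with a run id that is still in progress is expected to fail, since only completed runs are present in the repository.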
Downloading by OpenValidators version
You can also use the datasets library to download all the runs for a given OpenValidators version. This is useful for researchers and data enthusiasts who want to analyze the state of a specific OpenValidators version.
    from datasets import load_dataset

    version = '1.0.4'  # OpenValidators version
    version_dataset = load_dataset('opentensor/openvalidators-test', data_files=f'{version}/raw_data/*')
Downloading by multiple OpenValidators versions
With the datasets library, users can download runs from multiple OpenValidators versions in a single call. Data from several versions can then feed downstream tasks such as fine-tuning models for mining or large-scale data analysis.
    from datasets import load_dataset

    versions = ['1.0.0', '1.0.1', '1.0.2', '1.0.4']  # Desired versions for extraction
    data_files = [f'{version}/raw_data/*' for version in versions]  # Set data files directories
    dataset = load_dataset('opentensor/openvalidators-test', data_files={'test': data_files})
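The data_files mapping used above can be built by a small helper. This is a sketch under stated assumptions: version_data_files and load_versions are hypothetical names, and dropping duplicate versions is a convenience added here, not something the dataset requires.

```python
def version_data_files(versions, split="test"):
    """Map one split name to a glob pattern per version (version/raw_data/*)."""
    # dict.fromkeys drops duplicate versions while keeping the given order.
    unique_versions = list(dict.fromkeys(versions))
    return {split: [f"{version}/raw_data/*" for version in unique_versions]}


def load_versions(versions, repo="opentensor/openvalidators-test"):
    """Hypothetical helper: download every run of the requested versions."""
    # Imported lazily so the mapping helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo, data_files=version_data_files(versions))


print(version_data_files(["1.0.0", "1.0.4", "1.0.0"]))
```

Because the files are passed under the 'test' key, the rows of the returned object are found under its 'test' split.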
Analyzing metadata
The state of the wandb data ingestion can be accessed with pandas and the Hugging Face datasets structure. This data contains the metadata of each run, including user information, config information and ingestion state.
    import pandas as pd

    version = '1.0.4'  # OpenValidators version for metadata analysis
    df = pd.read_csv(f'hf://datasets/opentensor/openvalidators-test/{version}/metadata.csv')
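To compare ingestion state across versions, the per-version metadata files can be read and stacked into one frame. The sketch below is illustrative only: metadata_url and load_metadata are hypothetical helpers, and the ov_version column is an assumption added here to record which file each row came from; it is not part of the published metadata.

```python
def metadata_url(version, repo="opentensor/openvalidators-test"):
    """Build the hf:// URL of the metadata.csv file for one version."""
    return f"hf://datasets/{repo}/{version}/metadata.csv"


def load_metadata(versions):
    """Hypothetical helper: read metadata.csv for each version into one DataFrame."""
    # Imported lazily so the URL helper above works without pandas installed.
    import pandas as pd

    frames = [
        # ov_version is an added column (assumption), tagging each row with its source version.
        pd.read_csv(metadata_url(version)).assign(ov_version=version)
        for version in versions
    ]
    return pd.concat(frames, ignore_index=True)


print(metadata_url("1.0.4"))
```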
versioned raw_data
The data is provided as-is from the wandb logs, without further preprocessing or tokenization. It is located at version/raw_data, where each file is a wandb run.
metadata
This dataset records the current state of the wandb data ingestion for each run id.
Raw data
The versioned raw_data collected from W&B has the following schema:
Metadata
This dataset was curated to provide a comprehensive and reliable collection of historical data obtained by the execution of different OpenValidators in the bittensor network. The goal is to support researchers, data scientists and developers with data generated in the network, facilitating the discovery of new insights, network analysis, troubleshooting, and data extraction for downstream tasks like mining.
The initial data collection process for this dataset involves recurrent collection by a specialized worker responsible for extracting data from wandb and ingesting it into the Hugging Face datasets structure. The collected data is organized based on the OpenValidators version and run ID to facilitate efficient data management and granular access. Each run is collected based on its corresponding OpenValidators version tag and grouped into version-specific folders. Within each version folder, a metadata.csv file is included to manage the collection state, while the raw data of each run is saved in the .parquet format with the file name corresponding to the run ID (e.g., run_id.parquet ). Please note that the code for this data collection process will be released for transparency and reproducibility.
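The folder layout described above can be illustrated by a small path classifier. This is a sketch based only on the layout stated in this section (version/raw_data/run_id.parquet and version/metadata.csv); parse_repo_path is a hypothetical helper and is not the collection worker's code.

```python
from pathlib import PurePosixPath


def parse_repo_path(path):
    """Classify a repository path and return (kind, version, run_id).

    kind is 'run' for version/raw_data/run_id.parquet, 'metadata' for
    version/metadata.csv, and 'other' for anything else.
    """
    parts = PurePosixPath(path).parts
    if len(parts) == 3 and parts[1] == "raw_data" and parts[2].endswith(".parquet"):
        # The file name without its extension is the wandb run id.
        return ("run", parts[0], parts[2][: -len(".parquet")])
    if len(parts) == 2 and parts[1] == "metadata.csv":
        return ("metadata", parts[0], None)
    return ("other", None, None)


print(parse_repo_path("1.0.4/raw_data/0plco3n0.parquet"))
print(parse_repo_path("1.0.4/metadata.csv"))
```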
Who are the source language producers?
The language producers for this dataset are all the OpenValidators that log their data to wandb in conjunction with other nodes of the Bittensor network. The main wandb page where the data is sent can be accessed at https://wandb.ai/opentensor-dev/openvalidators/table .
The dataset is licensed under the MIT License.