Monitoring and Metrics
This document covers the monitoring and observability infrastructure used in the NI Compute Subnet. The system employs two primary monitoring solutions: Weights & Biases (WandB) for distributed state management and experiment tracking, and Prometheus for network observability metrics.
For information about the database operations used for local state persistence, see Database Operations. For details about the communication protocols that generate monitored events, see Communication Protocols.
Architecture Overview
The monitoring system operates across three main components: validators, miners, and the resource allocation API. WandB serves as the primary distributed state store, while Prometheus provides real-time metrics collection.
Monitoring Architecture
graph TB
    subgraph "Validator Instances"
        VAL1["Validator Process<br/>neurons/validator.py"]
        VAL2["Validator Process<br/>neurons/validator.py"]
        VAL3["Validator Process<br/>neurons/validator.py"]
    end
    subgraph "Miner Instances"
        MIN1["Miner Process<br/>neurons/miner.py"]
        MIN2["Miner Process<br/>neurons/miner.py"]
        MIN3["Miner Process<br/>neurons/miner.py"]
    end
    subgraph "WandB Cloud Service"
        WANDB_PROJECT["opencompute Project<br/>neuralinternet/opencompute"]
        WANDB_RUNS["Individual Runs<br/>validator-{hotkey}<br/>miner-{hotkey}"]
        WANDB_CONFIG["Run Configurations<br/>Hardware Specs<br/>Allocation State<br/>Performance Stats"]
    end
    subgraph "Prometheus Infrastructure"
        PROM_AXON["ComputeSubnetAxon<br/>Prometheus Endpoints"]
        PROM_METRICS["Network Metrics<br/>Performance Data"]
        PROM_EXTRINSIC["prometheus_extrinsic<br/>Blockchain Integration"]
    end
    subgraph "Local State"
        COMPUTE_DB["ComputeDb<br/>SQLite Database"]
        WANDB_RUNS_TABLE["wandb_runs table<br/>hotkey -> run_id mapping"]
    end
    VAL1 --> WANDB_PROJECT
    VAL2 --> WANDB_PROJECT
    VAL3 --> WANDB_PROJECT
    MIN1 --> WANDB_PROJECT
    MIN2 --> WANDB_PROJECT
    MIN3 --> WANDB_PROJECT
    WANDB_PROJECT --> WANDB_RUNS
    WANDB_RUNS --> WANDB_CONFIG
    VAL1 --> PROM_AXON
    MIN1 --> PROM_AXON
    PROM_AXON --> PROM_METRICS
    PROM_AXON --> PROM_EXTRINSIC
    VAL1 --> COMPUTE_DB
    MIN1 --> COMPUTE_DB
    COMPUTE_DB --> WANDB_RUNS_TABLE
Sources: compute/wandb/wandb.py:1-648, compute/axon.py:152-284
WandB Integration
The ComputeWandb class provides the core monitoring infrastructure, managing distributed state across the network through individual WandB runs for each validator and miner.
WandB System Components
graph TD
    subgraph "ComputeWandb Class"
        INIT["__init__<br/>Run Initialization"]
        CONFIG["update_config<br/>Configuration Management"]
        SIGN["sign_run<br/>Cryptographic Signing"]
        VERIFY["verify_run<br/>Signature Verification"]
    end
    subgraph "Validator Operations"
        UPDATE_STATS["update_stats<br/>Challenge Results"]
        UPDATE_ALLOC["update_allocated_hotkeys<br/>Resource Allocation State"]
        UPDATE_PEN["update_penalized_hotkeys<br/>Blacklist Management"]
        GET_STATS["get_stats_allocated<br/>Cross-Validator Aggregation"]
    end
    subgraph "Miner Operations"
        UPDATE_SPECS["update_specs<br/>Hardware Specifications"]
        UPDATE_MINER_ALLOC["update_allocated<br/>Allocation Status"]
        UPDATE_PORT["update_miner_port_open<br/>Network Accessibility"]
        SYNC_ALLOC["sync_allocated<br/>State Synchronization"]
    end
    subgraph "Data Retrieval"
        GET_ALLOC["get_allocated_hotkeys<br/>Active Allocations"]
        GET_PEN["get_penalized_hotkeys<br/>Blacklisted Keys"]
        GET_SPECS["get_miner_specs<br/>Hardware Information"]
    end
    subgraph "WandB API Layer"
        API_INSTANCE["wandb.Api<br/>API Client"]
        PROJECT_REF["api.project<br/>opencompute"]
        RUNS_QUERY["api.runs<br/>Query Interface"]
    end
    INIT --> CONFIG
    CONFIG --> SIGN
    UPDATE_STATS --> SIGN
    UPDATE_ALLOC --> SIGN
    UPDATE_PEN --> SIGN
    UPDATE_SPECS --> SIGN
    UPDATE_MINER_ALLOC --> SIGN
    UPDATE_PORT --> SIGN
    GET_STATS --> VERIFY
    GET_ALLOC --> VERIFY
    GET_PEN --> VERIFY
    GET_SPECS --> VERIFY
    API_INSTANCE --> PROJECT_REF
    PROJECT_REF --> RUNS_QUERY
    RUNS_QUERY --> GET_STATS
    RUNS_QUERY --> GET_ALLOC
    RUNS_QUERY --> GET_PEN
    RUNS_QUERY --> GET_SPECS
Sources: compute/wandb/wandb.py:19-648
Run Management and Initialization
Each network participant creates a WandB run with a standardized naming convention: {role}-{hotkey}. Run IDs are stored in the local database so that an existing run can be resumed rather than recreated.
Component | Description | Key Methods |
---|---|---|
Run Initialization | Creates or resumes WandB runs | __init__, save_run_id, get_run_id |
Configuration Management | Updates run configuration with network state | update_config |
State Persistence | Stores run IDs in local SQLite database | Database operations in wandb_runs table |
Sources: compute/wandb/wandb.py:52-88, compute/wandb/wandb.py:109-138
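The naming convention and the hotkey-to-run-ID mapping can be illustrated with a minimal sketch. It borrows the names save_run_id and get_run_id from the table above, but their signatures, the column layout of the wandb_runs table, and the run_name helper are assumptions made for illustration, not the code in compute/wandb/wandb.py.

```python
import sqlite3
from typing import Optional


def run_name(role: str, hotkey: str) -> str:
    """Build a run name following the {role}-{hotkey} convention."""
    return f"{role}-{hotkey}"


def init_table(conn: sqlite3.Connection) -> None:
    # Assumed schema: one row per hotkey, pointing at its WandB run ID.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wandb_runs ("
        "hotkey TEXT PRIMARY KEY, run_id TEXT NOT NULL)"
    )


def save_run_id(conn: sqlite3.Connection, hotkey: str, run_id: str) -> None:
    # Overwrite any previous mapping so the latest run ID wins.
    conn.execute(
        "INSERT OR REPLACE INTO wandb_runs (hotkey, run_id) VALUES (?, ?)",
        (hotkey, run_id),
    )
    conn.commit()


def get_run_id(conn: sqlite3.Connection, hotkey: str) -> Optional[str]:
    # Return the stored run ID, or None when no run has been recorded yet.
    row = conn.execute(
        "SELECT run_id FROM wandb_runs WHERE hotkey = ?", (hotkey,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the local database
init_table(conn)
save_run_id(conn, "5ExampleHotkey", "abc123")  # hypothetical hotkey and run ID
```

On restart, a participant would look up its hotkey first and create a new run only when get_run_id returns None.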
Hardware Specifications Tracking
Miners upload hardware specifications to enable validators to make informed allocation decisions. The update_specs method integrates with the performance measurement system to provide encrypted hardware details.
graph LR
    subgraph "Hardware Detection"
        PERF_INFO["get_perf_info<br/>neurons/Validator/script.py"]
        GPU_SPECS["GPU Specifications<br/>Name, Count, Memory"]
        CPU_SPECS["CPU Specifications<br/>Cores, Architecture"]
    end
    subgraph "WandB Upload"
        UPDATE_SPECS["update_specs<br/>compute/wandb/wandb.py"]
        RUN_CONFIG["run.config<br/>specs field"]
        SIGNATURE["sign_run<br/>Cryptographic Verification"]
    end
    subgraph "Validator Consumption"
        GET_MINER_SPECS["get_miner_specs<br/>Query Interface"]
        QUERYABLE_UIDS["queryable_uids<br/>Network Participants"]
        ALLOCATION_LOGIC["Resource Allocation<br/>Decision Making"]
    end
    PERF_INFO --> GPU_SPECS
    PERF_INFO --> CPU_SPECS
    GPU_SPECS --> UPDATE_SPECS
    CPU_SPECS --> UPDATE_SPECS
    UPDATE_SPECS --> RUN_CONFIG
    RUN_CONFIG --> SIGNATURE
    GET_MINER_SPECS --> QUERYABLE_UIDS
    QUERYABLE_UIDS --> ALLOCATION_LOGIC
Sources: compute/wandb/wandb.py:140-159, compute/wandb/wandb.py:540-574
State Management and Synchronization
The WandB system maintains several critical state categories across the network, with built-in aggregation and conflict resolution mechanisms.
Allocation State Management
graph TB
    subgraph "Validator State Updates"
        ALLOC_UPDATE["update_allocated_hotkeys<br/>List of Allocated Keys"]
        STATS_UPDATE["update_stats<br/>Performance Statistics"]
        PEN_UPDATE["update_penalized_hotkeys<br/>Blacklist Management"]
    end
    subgraph "Cross-Validator Aggregation"
        GET_ALLOC_HOTKEYS["get_allocated_hotkeys<br/>Aggregate Across Validators"]
        GET_STATS_ALLOC["get_stats_allocated<br/>Statistics Aggregation"]
        CONFLICT_RESOLUTION["pick_dominant_dict<br/>Score-Based Resolution"]
    end
    subgraph "Database Synchronization"
        RETRIEVE_STATS["retrieve_stats<br/>neurons/Validator/database/pog.py"]
        WRITE_STATS["write_stats<br/>neurons/Validator/database/pog.py"]
        COMPUTE_DB["ComputeDb<br/>Local State Storage"]
    end
    subgraph "Verification Layer"
        VERIFY_RUN["verify_run<br/>Signature Verification"]
        VALID_VALIDATORS["valid_validator_hotkeys<br/>Authorized Validators"]
        SIGNATURE_CHECK["Cryptographic Validation"]
    end
    ALLOC_UPDATE --> GET_ALLOC_HOTKEYS
    STATS_UPDATE --> GET_STATS_ALLOC
    PEN_UPDATE --> GET_ALLOC_HOTKEYS
    GET_STATS_ALLOC --> CONFLICT_RESOLUTION
    CONFLICT_RESOLUTION --> RETRIEVE_STATS
    RETRIEVE_STATS --> WRITE_STATS
    WRITE_STATS --> COMPUTE_DB
    GET_ALLOC_HOTKEYS --> VERIFY_RUN
    GET_STATS_ALLOC --> VERIFY_RUN
    VERIFY_RUN --> VALID_VALIDATORS
    VALID_VALIDATORS --> SIGNATURE_CHECK
Sources: compute/wandb/wandb.py:198-250, compute/wandb/wandb.py:291-333, compute/wandb/wandb.py:334-450
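As a rough illustration of cross-validator aggregation, the sketch below merges the allocated hotkeys reported by authorized validators whose runs pass verification. The run dictionaries, the allocated_hotkeys field name, the injected verify callable, and the exact meaning of flag are all assumptions made for this example. The real get_allocated_hotkeys queries the WandB API and may differ.

```python
from typing import Callable, Iterable, List, Set


def aggregate_allocated_hotkeys(
    runs: Iterable[dict],
    valid_validator_hotkeys: Set[str],
    verify: Callable[[dict], bool],
    flag: bool = True,
) -> List[str]:
    """Union the allocated hotkeys reported by trusted validator runs."""
    allocated: Set[str] = set()
    for run in runs:
        # Ignore runs that do not belong to an authorized validator.
        if run["hotkey"] not in valid_validator_hotkeys:
            continue
        # Assumed semantics: when flag is set, unverified runs are skipped.
        if flag and not verify(run):
            continue
        allocated.update(run["config"].get("allocated_hotkeys", []))
    return sorted(allocated)


# Hypothetical runs; "signature" is a placeholder checked by the fake verifier.
runs = [
    {"hotkey": "val-A", "config": {"allocated_hotkeys": ["m1", "m2"], "signature": "ok"}},
    {"hotkey": "val-B", "config": {"allocated_hotkeys": ["m2", "m3"], "signature": "bad"}},
    {"hotkey": "val-X", "config": {"allocated_hotkeys": ["m9"], "signature": "ok"}},
]


def fake_verify(run: dict) -> bool:
    return run["config"].get("signature") == "ok"


trusted = {"val-A", "val-B"}
strict = aggregate_allocated_hotkeys(runs, trusted, fake_verify, flag=True)
lenient = aggregate_allocated_hotkeys(runs, trusted, fake_verify, flag=False)
```

Here val-X is excluded in both cases because it is not in the authorized set, while val-B contributes only when enforcement is switched off.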
Statistics Aggregation Algorithm
The get_stats_allocated method aggregates statistics across validator runs and resolves conflicts when several validators report on the same miner UID.
Step | Process | Implementation |
---|---|---|
1. Collection | Query all validator runs with stats | WandB API filters |
2. Verification | Validate cryptographic signatures | verify_run method |
3. Filtering | Select entries with own_score=True and allocated=True | Boolean filtering |
4. Aggregation | Group by UID, collect multiple reports | Dictionary aggregation |
5. Resolution | Use pick_dominant_dict for conflict resolution | Counter-based selection |
6. Scoring | Prefer highest score in case of ties | Score comparison |
Sources: compute/wandb/wandb.py:334-450, compute/wandb/wandb.py:391-426
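Steps 3 to 6 of the table can be sketched as follows. This is an illustrative reading of "counter-based selection" with a highest-score tie-break, not the actual pick_dominant_dict. The entry fields other than own_score and allocated, the score key, and the aggregate_stats helper are assumptions.

```python
import json
from collections import Counter, defaultdict
from typing import Dict, List


def pick_dominant_dict(reports: List[dict]) -> dict:
    """Return the most frequently reported dict, preferring higher scores on ties."""
    if not reports:
        raise ValueError("no reports to choose from")
    # Serialize with sorted keys so identical dicts count as the same report.
    counts = Counter(json.dumps(r, sort_keys=True) for r in reports)
    top = max(counts.values())
    tied = [json.loads(key) for key, n in counts.items() if n == top]
    return max(tied, key=lambda r: r.get("score", 0))


def aggregate_stats(validator_stats: List[Dict[str, dict]]) -> Dict[str, dict]:
    """Group verified per-validator stats by UID and resolve conflicts."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for stats in validator_stats:
        for uid, entry in stats.items():
            # Step 3: keep only self-scored entries for allocated miners.
            if entry.get("own_score") and entry.get("allocated"):
                grouped[uid].append(entry)
    # Steps 5 and 6: one dominant report per UID.
    return {uid: pick_dominant_dict(entries) for uid, entries in grouped.items()}


# Hypothetical stats from three validators (signatures assumed already verified).
validator_a = {
    "7": {"score": 80, "own_score": True, "allocated": True},
    "9": {"score": 10, "own_score": False, "allocated": True},
}
validator_b = {"7": {"score": 80, "own_score": True, "allocated": True}}
validator_c = {"7": {"score": 95, "own_score": True, "allocated": True}}

merged = aggregate_stats([validator_a, validator_b, validator_c])
```

In this example UID 7 resolves to the report with score 80 because two validators agree on it, and UID 9 is dropped by the filter.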
Security and Verification
The monitoring system implements cryptographic verification to ensure data integrity and prevent tampering.
Signature Verification Process
graph TD
    subgraph "Signing Process"
        RUN_ID["run.id<br/>WandB Run Identifier"]
        DATA_HASH["SHA-256 Hash<br/>Computed from run_id"]
        WALLET_SIGN["wallet.hotkey.sign<br/>Cryptographic Signature"]
        CONFIG_UPDATE["run.config.update<br/>Store Signature"]
    end
    subgraph "Verification Process"
        EXTRACT_SIG["Extract Signature<br/>From run.config"]
        RECREATE_HASH["Recreate Hash<br/>From run_id"]
        KEYPAIR_VERIFY["bt.Keypair.verify<br/>Signature Validation"]
        RESULT["Verification Result<br/>Boolean"]
    end
    subgraph "Security Enforcement"
        VALID_HOTKEYS["valid_validator_hotkeys<br/>Authorized Keys"]
        ACCESS_CONTROL["Data Access Control<br/>Based on Verification"]
        FLAG_PARAM["flag Parameter<br/>Enforcement Toggle"]
    end
    RUN_ID --> DATA_HASH
    DATA_HASH --> WALLET_SIGN
    WALLET_SIGN --> CONFIG_UPDATE
    EXTRACT_SIG --> RECREATE_HASH
    RECREATE_HASH --> KEYPAIR_VERIFY
    KEYPAIR_VERIFY --> RESULT
    RESULT --> ACCESS_CONTROL
    VALID_HOTKEYS --> ACCESS_CONTROL
    FLAG_PARAM --> ACCESS_CONTROL
Sources: compute/wandb/wandb.py:576-591, compute/wandb/wandb.py:592-616
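The sign-then-verify flow can be sketched without a wallet by injecting the signing and verification functions. In the sketch below an HMAC pair stands in for wallet.hotkey.sign and bt.Keypair.verify purely so the example runs on the standard library. The real system uses asymmetric hotkey signatures, and the hex-digest encoding, the signature config key, and the function signatures are assumptions.

```python
import hashlib
import hmac
from typing import Callable, Tuple


def run_id_digest(run_id: str) -> str:
    """SHA-256 hash of the run ID (hex encoding is an assumption)."""
    return hashlib.sha256(run_id.encode()).hexdigest()


def make_demo_signer(secret: bytes) -> Tuple[Callable[[str], str], Callable[[str, str], bool]]:
    # Stand-in only: HMAC is symmetric, unlike the hotkey signatures used in practice.
    def sign(message: str) -> str:
        return hmac.new(secret, message.encode(), hashlib.sha256).hexdigest()

    def verify(message: str, signature: str) -> bool:
        return hmac.compare_digest(sign(message), signature)

    return sign, verify


def sign_run(run_id: str, config: dict, sign: Callable[[str], str]) -> dict:
    # Hash the run ID, sign the hash, and store the signature in the config.
    config["signature"] = sign(run_id_digest(run_id))
    return config


def verify_run(run_id: str, config: dict, verify: Callable[[str, str], bool]) -> bool:
    # Recreate the hash from the run ID and check the stored signature against it.
    signature = config.get("signature")
    if not signature:
        return False
    return verify(run_id_digest(run_id), signature)


sign, verify = make_demo_signer(b"demo-secret")
config = sign_run("abc123", {"role": "validator"}, sign)
```

Because the signature covers a hash of the run ID, copying a signed config onto a different run fails verification.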
Prometheus Integration
The Prometheus integration provides real-time metrics collection through custom Axon and Subtensor implementations.
Prometheus Architecture
graph LR
    subgraph "Custom Implementations"
        COMPUTE_SUBTENSOR["ComputeSubnetSubtensor<br/>compute/axon.py"]
        SERVE_PROMETHEUS["serve_prometheus<br/>Method"]
        DO_SERVE["do_serve_prometheus<br/>Extrinsic Handler"]
    end
    subgraph "Blockchain Integration"
        PROMETHEUS_EXTRINSIC["prometheus_extrinsic<br/>compute/prometheus.py"]
        SUBSTRATE_CALL["substrate.compose_call<br/>SubtensorModule"]
        SIGNED_EXTRINSIC["create_signed_extrinsic<br/>Blockchain Transaction"]
    end
    subgraph "Network Deployment"
        AXON_SERVE["Custom Axon<br/>ComputeSubnetAxon"]
        METRICS_ENDPOINT["Prometheus Endpoint<br/>Port Configuration"]
        NETWORK_REGISTRATION["Blockchain Registration<br/>Network Visibility"]
    end
    COMPUTE_SUBTENSOR --> SERVE_PROMETHEUS
    SERVE_PROMETHEUS --> DO_SERVE
    DO_SERVE --> PROMETHEUS_EXTRINSIC
    PROMETHEUS_EXTRINSIC --> SUBSTRATE_CALL
    SUBSTRATE_CALL --> SIGNED_EXTRINSIC
    AXON_SERVE --> METRICS_ENDPOINT
    METRICS_ENDPOINT --> NETWORK_REGISTRATION
Sources: compute/axon.py:166-201, compute/axon.py:203-283
Prometheus Extrinsic Submission
The serve_prometheus method submits blockchain extrinsics to register Prometheus endpoints with the network, enabling distributed metrics collection.
Component | Function | Parameters |
---|---|---|
serve_prometheus | Main entry point | wallet, port, netuid, wait_for_inclusion, wait_for_finalization |
do_serve_prometheus | Extrinsic handler | call_params , retry logic with exponential backoff |
prometheus_extrinsic | Blockchain integration | Custom prometheus integration (imported) |
Sources: compute/axon.py:166-201, compute/axon.py:203-283
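The table mentions retry logic with exponential backoff in do_serve_prometheus. A generic sketch of that pattern is shown below. The call_with_backoff helper, its parameters, the (success, error) return shape, and the doubling schedule are assumptions for illustration and are not taken from compute/axon.py.

```python
import time
from typing import Callable, Optional, Tuple


def call_with_backoff(
    submit: Callable[[], Tuple[bool, Optional[str]]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, Optional[str]]:
    """Retry submit(), doubling the wait after each failed attempt."""
    last_error: Optional[str] = None
    for attempt in range(max_attempts):
        success, last_error = submit()
        if success:
            return True, None
        # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False, last_error


# Hypothetical submission that fails twice before succeeding.
calls = {"n": 0}


def flaky_submit() -> Tuple[bool, Optional[str]]:
    calls["n"] += 1
    if calls["n"] >= 3:
        return True, None
    return False, "temporary failure"


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_submit, max_attempts=5, base_delay=0.5, sleep=delays.append)
```

Injecting the sleep function keeps the retry schedule observable in tests without slowing them down.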
Data Flow Integration
The monitoring system integrates with the core subnet operations through several key data flows.
Validator Monitoring Flow
graph TD
    subgraph "Validation Process"
        POG_VALIDATION["Proof-of-GPU<br/>Validation"]
        CHALLENGE_RESULTS["Challenge Results<br/>Success/Failure"]
        SCORING["Score Calculation<br/>Performance Metrics"]
    end
    subgraph "Local Storage"
        COMPUTE_DB_WRITE["ComputeDb Write<br/>Local Statistics"]
        POG_STATS["pog_stats table<br/>GPU Performance"]
        STATS_TABLE["stats table<br/>Miner Scores"]
    end
    subgraph "WandB Logging"
        UPDATE_STATS_CALL["update_stats<br/>Aggregate Results"]
        WANDB_LOG["wandb.log<br/>Time Series Data"]
        CONFIG_UPDATE_STATS["run.config.update<br/>Latest State"]
    end
    subgraph "Cross-Network Sync"
        ALLOCATED_UPDATE["update_allocated_hotkeys<br/>Resource State"]
        PENALIZED_UPDATE["update_penalized_hotkeys<br/>Blacklist State"]
        SIGNATURE_APPLY["sign_run<br/>Cryptographic Proof"]
    end
    POG_VALIDATION --> CHALLENGE_RESULTS
    CHALLENGE_RESULTS --> SCORING
    SCORING --> COMPUTE_DB_WRITE
    COMPUTE_DB_WRITE --> POG_STATS
    COMPUTE_DB_WRITE --> STATS_TABLE
    SCORING --> UPDATE_STATS_CALL
    UPDATE_STATS_CALL --> WANDB_LOG
    WANDB_LOG --> CONFIG_UPDATE_STATS
    CONFIG_UPDATE_STATS --> ALLOCATED_UPDATE
    ALLOCATED_UPDATE --> PENALIZED_UPDATE
    PENALIZED_UPDATE --> SIGNATURE_APPLY
Sources: compute/wandb/wandb.py:186-196, compute/wandb/wandb.py:198-230, compute/wandb/wandb.py:232-249
Configuration and Environment
The monitoring system requires specific configuration for proper operation, including API credentials and network parameters.
Environment Requirements
Requirement | Configuration | Source |
---|---|---|
WandB API Key | WANDB_API_KEY environment variable or .netrc file | Environment setup |
Project Configuration | PUBLIC_WANDB_NAME = "opencompute" | Hard-coded constant |
Entity Configuration | PUBLIC_WANDB_ENTITY = "neuralinternet" | Hard-coded constant |
Database Connection | ComputeDb() instance | Local SQLite database |
Sources: compute/wandb/wandb.py:15-16, compute/wandb/wandb.py:38-45
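A pre-flight check for these requirements might look like the sketch below. The two constants come from the table above. The has_wandb_credentials helper, its parameters, and the assumption that a usable .netrc contains an entry for api.wandb.ai are illustrative and not part of the codebase.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union

PUBLIC_WANDB_NAME = "opencompute"
PUBLIC_WANDB_ENTITY = "neuralinternet"

# Entity/project path as shown in the architecture diagram.
PROJECT_PATH = f"{PUBLIC_WANDB_ENTITY}/{PUBLIC_WANDB_NAME}"


def has_wandb_credentials(
    env: Optional[Mapping[str, str]] = None,
    netrc_path: Optional[Union[str, Path]] = None,
) -> bool:
    """Return True if an API key is available via the environment or .netrc."""
    env = os.environ if env is None else env
    path = Path.home() / ".netrc" if netrc_path is None else Path(netrc_path)
    # The environment variable takes precedence in this sketch.
    if env.get("WANDB_API_KEY"):
        return True
    # Assumption: a .netrc written by the WandB login flow names api.wandb.ai.
    return path.is_file() and "api.wandb.ai" in path.read_text()


# Example check against an explicit environment and a path that does not exist.
ready = has_wandb_credentials(env={"WANDB_API_KEY": "example"}, netrc_path="/nonexistent/.netrc")
```

Running a check like this at startup lets a validator or miner fail fast with a clear message instead of failing later during run initialization.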