Prometheus Metrics
This document explains how the NI Compute system implements and utilizes Prometheus metrics for monitoring and observability. Prometheus is a popular open-source monitoring and alerting toolkit used to collect and query time-series metrics from various systems. In NI Compute, Prometheus metrics provide crucial insights into the performance and behavior of validators and miners in the decentralized GPU compute marketplace.
For information about general monitoring and integration with Weights & Biases, see WandB Integration.
1. Prometheus Metrics Architecture
Prometheus metrics in NI Compute are implemented through a specialized registration system that integrates with the Bittensor blockchain. This integration ensures that monitoring endpoints are discoverable by other network participants.
flowchart TD
subgraph "Prometheus Metrics System"
PM["PrometheusMetrics"]
PE["prometheus_extrinsic()"]
DPS["do_serve_prometheus()"]
end
subgraph "Validator"
VI["init_prometheus()"]
SS["sync_status()"]
VI --> PE
SS --> VI
end
subgraph "Subtensor"
CSS["ComputeSubnetSubtensor"]
SP["serve_prometheus()"]
CSS --> SP
SP --> PE
end
subgraph "Blockchain"
BT["Bittensor Chain"]
PI["PrometheusInfo"]
end
VI --> SP
PE --> DPS
DPS --> BT
BT --> PI
style PM fill:#f9f9f9,stroke:#333,stroke-width:1px
style PE fill:#f9f9f9,stroke:#333,stroke-width:1px
style DPS fill:#f9f9f9,stroke:#333,stroke-width:1px
Sources: compute/axon.py:166-201, compute/axon.py:49
2. Prometheus Registration Process
During registration, the validator publishes its Prometheus metrics endpoint to the Bittensor blockchain, making it discoverable to other nodes and monitoring systems.
sequenceDiagram
participant Validator
participant Subtensor as ComputeSubnetSubtensor
participant Chain as Bittensor Blockchain
Validator->>Validator: init_prometheus()
Validator->>Subtensor: serve_prometheus(wallet, port, netuid)
Subtensor->>Subtensor: prometheus_extrinsic()
Note over Subtensor: Create PrometheusServeCallParams
Subtensor->>Subtensor: Check if neuron needs update
alt Needs Update
Subtensor->>Subtensor: do_serve_prometheus()
Subtensor->>Chain: Submit extrinsic
Chain->>Chain: Update PrometheusInfo
Chain-->>Subtensor: Confirm update
Subtensor-->>Validator: Return success
else Already Updated
Subtensor-->>Validator: Return success (already served)
end
Validator->>Validator: Log success or failure
Sources: compute/axon.py:166-201, compute/axon.py:203-283
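The "needs update" branch in the sequence above amounts to an idempotent registration: submit only when the stored endpoint differs from the desired one. The following is a minimal sketch of that idea, not the code in compute/axon.py. `PrometheusRecord`, `needs_update`, and `register` are hypothetical names, and the chain is replaced by an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class PrometheusRecord:
    """Hypothetical stand-in for the endpoint data stored on chain."""
    version: int
    ip: str
    port: int


def needs_update(on_chain: Optional[PrometheusRecord], desired: PrometheusRecord) -> bool:
    """An update is needed when nothing is registered or any field differs."""
    return on_chain != desired


def register(chain: Dict[str, PrometheusRecord], hotkey: str, desired: PrometheusRecord,
             submit: Callable[[PrometheusRecord], bool]) -> bool:
    """Submit only when the stored record differs; otherwise report success."""
    if not needs_update(chain.get(hotkey), desired):
        return True  # already served, nothing to submit
    if submit(desired):
        chain[hotkey] = desired  # mirror the chain-side update locally
        return True
    return False
```

Calling `register` twice with the same record triggers a single submission, which matches the "already served" path in the diagram.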
3. Prometheus Metrics Implementation
3.1 Registration Functions
The NI Compute system implements several key functions to handle Prometheus metrics registration:
ComputeSubnetSubtensor.serve_prometheus()
The serve_prometheus() method in the ComputeSubnetSubtensor class coordinates Prometheus metrics registration on the blockchain. This method:
- Accepts wallet, port, netuid, and wait parameters
- Calls the prometheus_extrinsic() function to handle the registration process
- Returns a boolean indicating success or failure
- Includes comprehensive error handling and logging
ComputeSubnetSubtensor.do_serve_prometheus()
The do_serve_prometheus() method handles the low-level blockchain interaction:
- Composes a substrate call to SubtensorModule.serve_prometheus
- Creates a signed extrinsic using the wallet hotkey
- Implements retry logic with exponential backoff (3 tries, 2x backoff, max 4s delay)
- Submits the extrinsic and processes the response
- Returns a tuple of (success: bool, error: Optional[dict])
prometheus_extrinsic()
The prometheus_extrinsic() function (imported from compute.prometheus) prepares the registration parameters and delegates to do_serve_prometheus().
Sources: compute/axon.py:166-201, compute/axon.py:203-283, compute/axon.py:258-283
3.2 Versioning and Auto-Update
The system uses version information from __version_as_int__ for Prometheus metrics registration. The version is embedded in the axon info that gets registered on the blockchain.
flowchart TD
subgraph "Version Integration"
VI["__version_as_int__"]
AI["AxonInfo.version"]
CSA["ComputeSubnetAxon.info()"]
VI --> AI
CSA --> AI
end
subgraph "Registration Process"
SP["serve_prometheus()"]
PE["prometheus_extrinsic()"]
DSP["do_serve_prometheus()"]
SP --> PE
PE --> DSP
AI --> PE
end
subgraph "Blockchain"
BC["Bittensor Chain"]
PI["PrometheusInfo"]
DSP --> BC
BC --> PI
end
Sources: compute/axon.py:47, compute/axon.py:376-388
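To illustrate how a single integer can carry version information, here is a minimal sketch. It assumes a `major.minor.patch` version string and the 100/10/1 weighting commonly seen in Bittensor-style projects; this document does not show how NI Compute actually derives the value. `version_as_int` and `is_outdated` are hypothetical helper names.

```python
def version_as_int(version: str) -> int:
    """Collapse 'major.minor.patch' into one integer (assumed 100/10/1 weights)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return 100 * major + 10 * minor + patch


def is_outdated(registered: int, local: int) -> bool:
    """True when the version stored on chain is older than the running code."""
    return registered < local
```

Under this weighting, a minor or patch number of 10 or more collides with other versions: "1.10.0" and "2.0.0" both map to 200. Any comparison built on the integer inherits that limitation.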
4. Substrate Call Structure
The Prometheus registration uses a substrate call to the SubtensorModule.serve_prometheus function. The call parameters structure is defined by the Bittensor substrate interface and includes:
| Parameter | Description | Implementation |
|---|---|---|
| call_module | Target module name | "SubtensorModule" |
| call_function | Target function name | "serve_prometheus" |
| call_params | Registration parameters | PrometheusServeCallParams |
The do_serve_prometheus() method creates a signed extrinsic using the wallet hotkey and submits it through the substrate interface, with configurable wait options for inclusion and finalization.
Sources: compute/axon.py:260-267, compute/axon.py:206
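The call structure in the table can be sketched as plain Python data. The field layout of `PrometheusServeCallParams` below (version, ip, port, ip_type, netuid) is an assumption modeled on Bittensor's type of the same name and should be checked against the Bittensor version in use. `build_call_params` and `describe_call` are hypothetical helpers.

```python
import ipaddress
from typing import TypedDict


class PrometheusServeCallParams(TypedDict):
    """Assumed field layout; verify against the installed Bittensor release."""
    version: int
    ip: int
    port: int
    ip_type: int
    netuid: int


def build_call_params(version: int, ip: str, port: int, netuid: int) -> PrometheusServeCallParams:
    """Encode the endpoint with a numeric IP, as assumed for chain storage."""
    address = ipaddress.ip_address(ip)
    return PrometheusServeCallParams(
        version=version,
        ip=int(address),
        port=port,
        ip_type=address.version,  # 4 for IPv4, 6 for IPv6
        netuid=netuid,
    )


def describe_call(params: PrometheusServeCallParams) -> dict:
    """Mirror the three-part call structure from the table above."""
    return {
        "call_module": "SubtensorModule",
        "call_function": "serve_prometheus",
        "call_params": params,
    }
```

The integer IP can be turned back into dotted form with `str(ipaddress.ip_address(params["ip"]))`, which is useful when reading the registered value back for scraping.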
5. Blockchain Integration
5.1 ComputeSubnetSubtensor Extension
The ComputeSubnetSubtensor class extends Bittensor’s base Subtensor class to add compute subnet-specific functionality:
classDiagram
class Subtensor {
<<Bittensor Core>>
+substrate: SubstrateInterface
+compose_call()
+create_signed_extrinsic()
}
class ComputeSubnetSubtensor {
+serve_prometheus(wallet, port, netuid)
+do_serve_prometheus(wallet, call_params)
-make_substrate_call_with_retry()
}
class prometheus_extrinsic {
<<Function>>
+invoke(wallet, port, netuid)
}
Subtensor <|-- ComputeSubnetSubtensor
ComputeSubnetSubtensor --> prometheus_extrinsic : calls
5.2 Extrinsic Submission Flow
The registration process uses Bittensor’s substrate interface for blockchain interaction:
- Call Composition: Uses substrate.compose_call() to create the blockchain call
- Extrinsic Creation: Creates a signed extrinsic with substrate.create_signed_extrinsic()
- Submission: Submits via substrate.submit_extrinsic() with retry logic
- Response Processing: Processes events and checks for success/failure
Sources: compute/axon.py:152-165, compute/axon.py:260-283
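The four steps can be traced end to end with an in-memory stand-in for the substrate client. This is a sketch under stated assumptions: `FakeSubstrate`, `FakeReceipt`, and `serve` are hypothetical, the hotkey is a plain string rather than a keypair, and the early return when neither wait flag is set is an assumed convention rather than confirmed behavior of compute/axon.py.

```python
from typing import Optional, Tuple


class FakeReceipt:
    """Hypothetical receipt; real receipts come from the substrate client."""
    def __init__(self, is_success: bool, error_message: Optional[dict] = None):
        self.is_success = is_success
        self.error_message = error_message

    def process_events(self) -> None:
        pass  # a real receipt would decode chain events here


class FakeSubstrate:
    """In-memory stand-in exposing the three calls named in the list above."""
    def __init__(self, succeed: bool = True):
        self.succeed = succeed
        self.steps = []

    def compose_call(self, call_module: str, call_function: str, call_params: dict) -> dict:
        self.steps.append("compose")
        return {"module": call_module, "function": call_function, "params": call_params}

    def create_signed_extrinsic(self, call: dict, keypair: str) -> dict:
        self.steps.append("sign")
        return {"call": call, "signer": keypair}

    def submit_extrinsic(self, extrinsic: dict, wait_for_inclusion: bool,
                         wait_for_finalization: bool) -> FakeReceipt:
        self.steps.append("submit")
        error = None if self.succeed else {"name": "SimulatedFailure"}
        return FakeReceipt(self.succeed, error)


def serve(substrate, hotkey: str, call_params: dict, wait_for_inclusion: bool = False,
          wait_for_finalization: bool = True) -> Tuple[bool, Optional[dict]]:
    """Compose, sign, submit, then turn the receipt into (success, error)."""
    call = substrate.compose_call("SubtensorModule", "serve_prometheus", call_params)
    extrinsic = substrate.create_signed_extrinsic(call=call, keypair=hotkey)
    receipt = substrate.submit_extrinsic(
        extrinsic,
        wait_for_inclusion=wait_for_inclusion,
        wait_for_finalization=wait_for_finalization,
    )
    if not wait_for_inclusion and not wait_for_finalization:
        return True, None  # assumed fire-and-forget path: nothing to inspect
    receipt.process_events()
    return (True, None) if receipt.is_success else (False, receipt.error_message)
```

Returning the error as data instead of raising keeps the caller's success check to a single tuple unpack, which fits the (success, error) return shape described in section 3.1.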
6. Using Prometheus Metrics
6.1 Accessing Metrics
Once registered, Prometheus metrics are available at:
http://<node_ip>:<prometheus_port>/metrics
The exact port is determined by the validator’s configuration, typically using Bittensor’s default axon port.
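As a quick way to inspect an endpoint without a Prometheus server, the sketch below builds the URL in the form shown above and parses the Prometheus text exposition format. It assumes samples carry no trailing timestamp. `metrics_url` and `parse_metrics` are hypothetical helpers, and any metric names fed to them in examples are invented, since this document does not list the metrics NI Compute exposes.

```python
from typing import Dict


def metrics_url(node_ip: str, prometheus_port: int) -> str:
    """Build the scrape URL in the form shown above."""
    return f"http://{node_ip}:{prometheus_port}/metrics"


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each sample (name plus any labels) to its value.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Assumes no trailing timestamp after the value.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples
```

The response body can be fetched with the standard library, for example `urllib.request.urlopen(metrics_url(ip, port)).read().decode()`, and passed straight to `parse_metrics`.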
6.2 Common Metrics
While the specific metrics exposed aren’t explicitly documented in the code provided, Prometheus in Bittensor networks typically provides metrics such as:
- Node operational status
- Request counts and latencies
- Resource utilization (CPU, memory, GPU)
- Network activity
- Validation and scoring information
6.3 Integration with Monitoring Systems
To monitor NI Compute nodes using Prometheus:
- Configure a Prometheus server to scrape the registered endpoints
- Set up appropriate recording rules and alerts
- Use visualization tools like Grafana to create dashboards
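One way to feed registered endpoints to a Prometheus server is file-based service discovery, which reads a JSON list of target groups. The sketch below generates such a document. The `hotkey` and `netuid` labels and the helper names are illustrative choices, and how endpoints are read from the chain is outside the scope of the sketch.

```python
import json
from typing import Dict, List


def file_sd_targets(endpoints: Dict[str, str], netuid: int) -> List[dict]:
    """Build one target group per hotkey, sorted for stable output."""
    return [
        # Label values must be strings in the file_sd format.
        {"targets": [address], "labels": {"hotkey": hotkey, "netuid": str(netuid)}}
        for hotkey, address in sorted(endpoints.items())
    ]


def render(endpoints: Dict[str, str], netuid: int) -> str:
    """Serialize to the JSON document a file_sd_configs entry can point at."""
    return json.dumps(file_sd_targets(endpoints, netuid), indent=2)
```

Writing the output of `render` to a file referenced by a `file_sd_configs` entry lets Prometheus pick up endpoint changes without a restart, since it watches those files for modifications.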
7. Monitoring Lifecycle
The NI Compute system ensures continuous monitoring availability through its lifecycle management:
flowchart LR
subgraph "Node Startup"
NS["Node Start"]
IR["Initialize Registration"]
end
subgraph "Periodic Checks"
SS["sync_status()"]
VC["Version Check"]
end
subgraph "Updates"
UP["Update Prometheus"]
RE["Re-register if needed"]
end
NS --> IR
IR --> SS
SS -- "Every sync cycle" --> VC
VC -- "Version mismatch" --> UP
VC -- "Needs refresh" --> RE
style NS fill:#f9f9f9,stroke:#333,stroke-width:1px
style IR fill:#f9f9f9,stroke:#333,stroke-width:1px
style SS fill:#f9f9f9,stroke:#333,stroke-width:1px
style VC fill:#f9f9f9,stroke:#333,stroke-width:1px
Sources: neurons/validator.py:428-446, compute/prometheus.py:80-95
8. Error Handling and Reliability
8.1 Retry Logic Implementation
The do_serve_prometheus() method implements comprehensive retry logic for reliable blockchain communication:
flowchart LR
subgraph "Retry Configuration"
RC["@retry decorator"]
D["delay:1"]
T["tries:3"]
B["backoff:2"]
MD["max_delay:4"]
end
subgraph "Submission Process"
MSC["make_substrate_call_with_retry()"]
CC["compose_call()"]
CSE["create_signed_extrinsic()"]
SE["submit_extrinsic()"]
end
subgraph "Error Handling"
SRE["SubstrateRequestException"]
GE["General Exception"]
EL["Error Logging"]
end
RC --> MSC
MSC --> CC
CC --> CSE
CSE --> SE
SE --> SRE
SE --> GE
SRE --> EL
GE --> EL
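The retry configuration in the diagram (delay 1, tries 3, backoff 2, max_delay 4) can be reproduced with a small decorator. This is a self-contained sketch of the technique, not the decorator used in compute/axon.py. The `sleep` parameter is a hypothetical addition that makes the waits observable without actually pausing.

```python
import functools
import time
from typing import Callable, List


def retry(tries: int = 3, delay: float = 1, backoff: float = 2, max_delay: float = 4,
          sleep: Callable[[float], None] = time.sleep):
    """Re-run the wrapped call on any exception, growing the wait each time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    sleep(min(wait, max_delay))
                    wait *= backoff
        return wrapper
    return decorator


def wait_schedule(tries: int = 3, delay: float = 1, backoff: float = 2,
                  max_delay: float = 4) -> List[float]:
    """Waits between attempts; there is one fewer wait than there are tries."""
    return [min(delay * backoff ** i, max_delay) for i in range(tries - 1)]
```

With three tries the waits are 1 s and 2 s, so the 4 s cap is never reached. It only takes effect if the number of tries is raised to four or more.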
8.2 Exception Management
The system handles multiple types of errors:
- SubstrateRequestException: Caught and logged with formatted error messages
- General Exceptions: Unexpected errors are logged with full stack traces
- Response Processing: Success/failure determined by response event processing
8.3 Comprehensive Logging
Error logging includes:
- Detailed exception information with exc_info=True
- Formatted error messages using format_error_message()
- Debug-level logging for successful operations
Sources: compute/axon.py:258-283, compute/axon.py:244-256, compute/axon.py:197-201
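The logging pattern described in this section can be sketched with the standard logging module. The `SubstrateRequestException` class and the `format_error_message` function below are local stand-ins defined only so the example runs on its own. The real ones come from the substrate client and Bittensor and may behave differently. `safe_submit` is a hypothetical wrapper.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger("prometheus_registration")


class SubstrateRequestException(Exception):
    """Local stand-in for the exception type raised by the substrate client."""


def format_error_message(error: Exception) -> str:
    """Hypothetical formatter; the real helper may produce a different layout."""
    return f"{type(error).__name__}: {error}"


def safe_submit(submit: Callable[[], bool]) -> Tuple[bool, Optional[str]]:
    """Run a submission, mapping each failure class to a log call and a result."""
    try:
        success = submit()
    except SubstrateRequestException as error:
        message = format_error_message(error)
        logger.error("Chain rejected the request: %s", message)
        return False, message
    except Exception as error:
        # exc_info=True attaches the active traceback to the log record.
        logger.error("Unexpected error: %s", error, exc_info=True)
        return False, str(error)
    logger.debug("Prometheus endpoint served")
    return success, None
```

Catching the specific exception first keeps expected chain rejections to a single formatted line, while the broad handler preserves the full stack trace for errors nobody anticipated.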
9. Summary
Prometheus metrics in NI Compute provide essential observability into the decentralized GPU marketplace. The registration system ensures that metrics endpoints are discoverable through the Bittensor blockchain, allowing for comprehensive monitoring of validators and miners. Through version tracking and periodic updates, the system maintains monitoring capabilities even as the software evolves.