Resource Management
Purpose and Scope
The Resource Management system implements the core allocation logic of the Resource Allocation API: resource discovery, allocation strategies, health monitoring, and state synchronization. It runs inside the RegisterAPI class and is responsible for matching compute requirements with available miners, managing the resource lifecycle, and maintaining distributed state.
For information about the API endpoints that expose these capabilities, see API Endpoints. For broader context about the validator system that validates miner capabilities, see Validator System.
Resource Discovery System
The resource discovery system identifies and evaluates candidate miners for allocation requests through a multi-stage process involving database queries, network validation, and scoring algorithms.
Candidate Selection Process
flowchart TD
    DevReq["DeviceRequirement"] --> SelectCandidates["select_allocate_miners_hotkey()"]
    SelectCandidates --> CandidateList["candidates_hotkey[]"]
    CandidateList --> FilterAxons["Filter metagraph.axons"]
    FilterAxons --> AxonCandidates["axon_candidates[]"]
    AxonCandidates --> CheckAvailability["dendrite(Allocate(checking=True))"]
    CheckAvailability --> ValidateResponses["Validate responses"]
    ValidateResponses --> FinalCandidates["final_candidates_hotkey[]"]
    FinalCandidates --> ScoreSort["Score-based sorting"]
    ScoreSort --> SortedCandidates["sorted_hotkeys[]"]
    SortedCandidates --> AttemptAllocation["Attempt allocation"]
Candidate Discovery Flow
Sources: neurons/register_api.py:2741-2805
The system selects candidates in three stages:
- Database Filtering: The select_allocate_miners_hotkey() function queries the local database to find miners matching the hardware requirements
- Network Validation: Available candidates are validated through the Bittensor network using the Allocate synapse with checking=True
- Scoring and Prioritization: Valid candidates are sorted by their network scores to prioritize higher-performing miners
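The stages above amount to a filter followed by a sort. The following is a minimal sketch under stated assumptions, not the code in register_api.py: `rank_candidates`, the `availability` mapping (hotkey to the status returned for the `checking=True` probe), and the `scores` mapping are hypothetical stand-ins for the dendrite responses and the network scores.

```python
from typing import Dict, List, Optional


def rank_candidates(
    db_candidates: List[str],
    availability: Dict[str, Optional[bool]],
    scores: Dict[str, float],
) -> List[str]:
    """Keep candidates that answered the availability probe positively,
    then order them from highest to lowest score."""
    # Hypothetical convention: True means "available", None means "no reply".
    available = [hk for hk in db_candidates if availability.get(hk) is True]
    # sorted() is stable, so candidates with equal scores keep database order.
    return sorted(available, key=lambda hk: scores.get(hk, 0.0), reverse=True)


candidates = ["hk_a", "hk_b", "hk_c", "hk_d"]
probe_results = {"hk_a": True, "hk_b": False, "hk_c": True, "hk_d": None}
miner_scores = {"hk_a": 0.4, "hk_b": 0.9, "hk_c": 0.7}

sorted_hotkeys = rank_candidates(candidates, probe_results, miner_scores)
print(sorted_hotkeys)  # ['hk_c', 'hk_a']
```

Note that `hk_b` has the highest score but is dropped because it did not report itself as available; scoring only orders the miners that survive validation.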
WandB-Based Resource Discovery
flowchart TD
    WandbQuery["get_wandb_running_miners()"] --> FilterRuns["Filter running miner runs"]
    FilterRuns --> CheckPenalized["Check penalized_hotkeys"]
    CheckPenalized --> ValidateAge["miner_is_older_than(48h)"]
    ValidateAge --> ValidatePog["miner_pog_ok(2.5h)"]
    ValidatePog --> CheckActive["Verify in metagraph.axons"]
    CheckActive --> ExtractSpecs["Extract specs from run.config"]
    ExtractSpecs --> SpecsDetails["specs_details{}"]
    SpecsDetails --> ResourceList["resource_list[]"]
WandB Resource Discovery Flow
Sources: neurons/register_api.py:1646-1702, neurons/register_api.py:1881-1884
As an alternative discovery mechanism, the system reads miner information that is shared through WandB runs. Each running miner goes through the following checks before it is listed as a resource:
Validation Check | Function | Purpose |
---|---|---|
Age Verification | miner_is_older_than() | Ensures miners have been active for 48+ hours |
PoG Validation | miner_pog_ok() | Confirms recent Proof-of-GPU completion within 2.5 hours |
Penalty Check | get_penalized_hotkeys_checklist() | Excludes blacklisted or penalized miners |
Network Presence | metagraph.axons lookup | Verifies miner is active on network |
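The checks in the table can be read as a chain of filters. The sketch below is illustrative only: the record fields `first_seen` and `last_pog`, the function `eligible_miners`, and the sample data are assumptions made for this example, and the real helpers (`miner_is_older_than()`, `miner_pog_ok()`) may derive these values differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Set

MIN_MINER_AGE = timedelta(hours=48)  # age verification threshold
MAX_POG_AGE = timedelta(hours=2.5)   # Proof-of-GPU freshness threshold


def eligible_miners(
    runs: Iterable[Dict],
    penalized: Set[str],
    active_hotkeys: Set[str],
    now: datetime,
) -> List[str]:
    """Return the hotkeys that pass all four checks from the table."""
    result = []
    for run in runs:
        hotkey = run["hotkey"]
        if hotkey in penalized:
            continue  # penalty check
        if now - run["first_seen"] < MIN_MINER_AGE:
            continue  # age verification
        if now - run["last_pog"] > MAX_POG_AGE:
            continue  # PoG validation
        if hotkey not in active_hotkeys:
            continue  # network presence
        result.append(hotkey)
    return result


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
old, fresh = now - timedelta(days=5), now - timedelta(hours=1)
runs = [
    {"hotkey": "hk_ok", "first_seen": old, "last_pog": fresh},
    {"hotkey": "hk_new", "first_seen": now - timedelta(hours=3), "last_pog": fresh},
    {"hotkey": "hk_stale", "first_seen": old, "last_pog": now - timedelta(hours=6)},
    {"hotkey": "hk_banned", "first_seen": old, "last_pog": fresh},
    {"hotkey": "hk_gone", "first_seen": old, "last_pog": fresh},
]
active = {"hk_ok", "hk_new", "hk_stale", "hk_banned"}

eligible = eligible_miners(runs, penalized={"hk_banned"}, active_hotkeys=active, now=now)
print(eligible)  # ['hk_ok']
```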
Allocation Strategies
The system implements two primary allocation strategies: specification-based allocation and hotkey-specific allocation, each optimized for different use cases.
Specification-Based Allocation
flowchart TD
    AllocateSpec["_allocate_container()"] --> BuildDevReq["Build device_requirement{}"]
    BuildDevReq --> FindCandidates["select_allocate_miners_hotkey()"]
    FindCandidates --> CheckAvailable["dendrite(checking=True)"]
    CheckAvailable --> ScoreSort["torch.ones_like(metagraph.S)"]
    ScoreSort --> TryAllocation["Loop through sorted_hotkeys"]
    TryAllocation --> SendAllocate["dendrite(Allocate(checking=False))"]
    SendAllocate --> ValidateResponse["Check response.status"]
    ValidateResponse --> Success["Return allocation info"]
    ValidateResponse --> NextCandidate["Try next candidate"]
    NextCandidate --> TryAllocation
Specification-Based Allocation Flow
Sources: neurons/register_api.py:2733-2805
The specification-based strategy processes DeviceRequirement objects containing:
- CPU count requirements
- GPU type and memory specifications
- RAM and storage capacity needs
- Timeline duration for allocation
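Once candidates are sorted, the strategy tries them in order until one accepts the request. The loop below is a hedged sketch of that idea: `allocate_first_available`, the `send_allocate` callable (standing in for the dendrite call with `checking=False`), and the flat `requirement` dictionary keys are hypothetical and do not mirror the exact structures in register_api.py.

```python
from typing import Callable, Dict, List, Optional


def allocate_first_available(
    sorted_hotkeys: List[str],
    device_requirement: Dict,
    send_allocate: Callable[[str, Dict], Optional[Dict]],
) -> Optional[Dict]:
    """Try each candidate in priority order; return the first success."""
    for hotkey in sorted_hotkeys:
        response = send_allocate(hotkey, device_requirement)
        # A missing response (timeout) or a falsy status means "try the next one".
        if response and response.get("status") is True:
            return {"hotkey": hotkey, "info": response.get("info")}
    return None  # no candidate accepted the request


def fake_send(hotkey: str, requirement: Dict) -> Optional[Dict]:
    # Stand-in for the network call: hk_a times out, hk_b refuses, hk_c accepts.
    replies = {
        "hk_a": None,
        "hk_b": {"status": False},
        "hk_c": {"status": True, "info": "ssh-details"},
    }
    return replies.get(hotkey)


requirement = {
    "cpu_count": 4,
    "gpu_type": "example-gpu",  # placeholder value
    "gpu_size": 24,
    "ram": 32,
    "hard_disk": 100,
    "timeline": 60,
}
result = allocate_first_available(["hk_a", "hk_b", "hk_c"], requirement, fake_send)
print(result)  # {'hotkey': 'hk_c', 'info': 'ssh-details'}
```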
Hotkey-Specific Allocation
flowchart TD
    AllocateHotkey["_allocate_container_hotkey()"] --> FindAxon["Locate axon by hotkey"]
    FindAxon --> SetDefaults["Set default device_requirement{}"]
    SetDefaults --> SetBaseImage["docker_requirement.base_image"]
    SetBaseImage --> DirectAllocate["dendrite(Allocate())"]
    DirectAllocate --> ValidateResponse["Check response.status"]
    ValidateResponse --> ReturnInfo["Return allocation + miner_version"]
    ValidateResponse --> ReturnError["Return error message"]
Hotkey-Specific Allocation Flow
Sources: neurons/register_api.py:2807-2889
This strategy targets specific miners by hotkey, bypassing the discovery phase and applying default resource requirements with configurable Docker base images.
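A rough sketch of the preparation step is shown below. Everything here is illustrative: `AxonStub`, `find_axon`, `prepare_hotkey_allocation`, the default values in `DEFAULT_DEVICE_REQUIREMENT`, and the error message are assumptions for the example, not the actual defaults or return shapes used by `_allocate_container_hotkey()`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class AxonStub:
    """Minimal stand-in for an entry of metagraph.axons."""
    hotkey: str
    ip: str
    port: int


# Hypothetical defaults; the real values live in register_api.py.
DEFAULT_DEVICE_REQUIREMENT = {"cpu_count": 1, "ram": 1, "hard_disk": 1, "timeline": 1}


def find_axon(axons: Iterable[AxonStub], hotkey: str) -> Optional[AxonStub]:
    """Return the axon registered under the given hotkey, if any."""
    return next((axon for axon in axons if axon.hotkey == hotkey), None)


def prepare_hotkey_allocation(
    axons: Iterable[AxonStub], hotkey: str, base_image: str
) -> Dict:
    """Build the request for a direct allocation, skipping discovery."""
    axon = find_axon(axons, hotkey)
    if axon is None:
        return {"status": False, "msg": f"hotkey {hotkey} not found"}
    return {
        "status": True,
        "axon": axon,
        "device_requirement": dict(DEFAULT_DEVICE_REQUIREMENT),
        "docker_requirement": {"base_image": base_image},
    }


axons = [AxonStub("hk_a", "10.0.0.1", 8091), AxonStub("hk_b", "10.0.0.2", 8091)]
request = prepare_hotkey_allocation(axons, "hk_b", base_image="ubuntu:22.04")
missing = prepare_hotkey_allocation(axons, "hk_z", base_image="ubuntu:22.04")
print(request["axon"].ip, request["docker_requirement"])
print(missing["status"])
```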
Health Monitoring System
The health monitoring system continuously tracks allocated resources and manages their lifecycle through automated checks and notifications.
Allocation Health Check Process
flowchart TD
    CheckTask["_check_allocation()"] --> QueryDB["SELECT * FROM allocation"]
    QueryDB --> LoopAllocations["For each allocation"]
    LoopAllocations --> ValidateHotkey["Check hotkey in metagraph"]
    ValidateHotkey --> HealthCheck["dendrite_check(Allocate(checking=True))"]
    HealthCheck --> EvaluateResponse["Evaluate response"]
    EvaluateResponse --> Online["response.status == False"]
    EvaluateResponse --> Offline["No response or timeout"]
    Online --> NotifyOnline["_notify_allocation_status(ONLINE)"]
    Online --> RemoveFromCheck["Remove from checking_allocated[]"]
    Offline --> AddToCheck["Add to checking_allocated[]"]
    Offline --> NotifyOffline["_notify_allocation_status(OFFLINE)"]
    Offline --> CountChecks["checking_allocated.count(hotkey)"]
    CountChecks --> MaxReached["count >= ALLOCATE_CHECK_COUNT"]
    MaxReached --> Deallocate["update_allocation_db(False)"]
    Deallocate --> NotifyDealloc["_notify_allocation_status(DEALLOCATION)"]
Health Monitoring Process Flow
Sources: neurons/register_api.py:3002-3101
The health monitoring system operates with the following parameters:
Parameter | Value | Purpose |
---|---|---|
ALLOCATE_CHECK_PERIOD | 180 seconds | Interval between health checks |
ALLOCATE_CHECK_COUNT | 20 | Maximum failed checks before deallocation |
Health Check Timeout | 10 seconds | Maximum wait time for miner response |
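To make the interaction of these parameters concrete, here is a simplified sketch of the failure-counting logic. It assumes that each failed check appends the hotkey to `checking_allocated`, that a successful check clears its entries, and that deallocation triggers once the count reaches `ALLOCATE_CHECK_COUNT`; `record_check` is a hypothetical helper and omits the database update and webhook calls.

```python
from typing import List

ALLOCATE_CHECK_PERIOD = 180  # seconds between health checks
ALLOCATE_CHECK_COUNT = 20    # failed checks tolerated before deallocation


def record_check(checking_allocated: List[str], hotkey: str, responded: bool) -> str:
    """Update the failure list for one health check and return the event name."""
    if responded:
        # Miner is reachable again: drop every recorded failure for this hotkey.
        checking_allocated[:] = [hk for hk in checking_allocated if hk != hotkey]
        return "ONLINE"
    checking_allocated.append(hotkey)
    if checking_allocated.count(hotkey) >= ALLOCATE_CHECK_COUNT:
        return "DEALLOCATION"
    return "OFFLINE"


checking_allocated: List[str] = []
statuses = [
    record_check(checking_allocated, "hk_a", responded=False)
    for _ in range(ALLOCATE_CHECK_COUNT)
]
print(statuses[0], statuses[-1])  # OFFLINE DEALLOCATION

# Approximate time from the last successful check to deallocation.
grace_seconds = ALLOCATE_CHECK_COUNT * ALLOCATE_CHECK_PERIOD
print(grace_seconds)  # 3600
```

With the values in the table, a miner that stops responding right after a successful check is deallocated after roughly 20 × 180 s = 1 hour, not counting the time spent waiting on each check.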
Status Transition Management
The system tracks miner status transitions and maintains allocation state through the checking_allocated list and database updates.
Sources: neurons/register_api.py:3039-3081
State Management
The Resource Management system maintains both local and distributed state through SQLite database operations and WandB synchronization.
Database State Operations
flowchart TD
    AllocDB["allocation table"] --> UpdateLocal["update_allocation_db()"]
    UpdateLocal --> ExtractHotkeys["Extract hotkey_list[]"]
    ExtractHotkeys --> UpdateWandB["_update_allocation_wandb()"]
    UpdateWandB --> WandbSync["wandb.update_allocated_hotkeys()"]
    QueryState["List operations"] --> QueryDB["SELECT hotkey, details FROM allocation"]
    QueryDB --> ParseJSON["json.loads(details)"]
    ParseJSON --> BuildAllocation["Build Allocation objects"]
    HealthCheck["Health monitoring"] --> StateUpdate["Allocation state changes"]
    StateUpdate --> NotifyRetry["notify_retry_table[]"]
    StateUpdate --> UpdateLocal
State Management Architecture
Sources: neurons/register_api.py:2891-2919, neurons/register_api.py:1346-1419
Distributed State Synchronization
The system maintains consistency across validators through WandB-based state sharing:
State Component | Storage | Synchronization Method |
---|---|---|
Active Allocations | SQLite allocation table | _update_allocation_wandb() |
Allocated Hotkeys | WandB runs | wandb.update_allocated_hotkeys() |
Retry Notifications | In-memory notify_retry_table | Periodic retry processing |
Sources: neurons/register_api.py:2915-2919
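The local half of this flow (query the allocation table, parse the JSON details, and derive the hotkey list that is later pushed to WandB) can be sketched with the standard library alone. The two-column schema, the sample rows, and `list_allocations` are assumptions for illustration; the real table may carry more columns, and the WandB call is deliberately left out.

```python
import json
import sqlite3
from typing import Dict, List

# In-memory database with an assumed minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE allocation (hotkey TEXT PRIMARY KEY, details TEXT)")
conn.executemany(
    "INSERT INTO allocation (hotkey, details) VALUES (?, ?)",
    [
        ("hk_a", json.dumps({"resource": "gpu-a", "ssh_port": 4444})),
        ("hk_b", json.dumps({"resource": "gpu-b", "ssh_port": 4445})),
    ],
)
conn.commit()


def list_allocations(connection: sqlite3.Connection) -> List[Dict]:
    """Read every allocation row and merge the parsed details with its hotkey."""
    rows = connection.execute(
        "SELECT hotkey, details FROM allocation ORDER BY hotkey"
    ).fetchall()
    return [{**json.loads(details), "hotkey": hotkey} for hotkey, details in rows]


allocations = list_allocations(conn)
# This list is what would be handed to the WandB synchronization step.
hotkey_list = [entry["hotkey"] for entry in allocations]
print(hotkey_list)  # ['hk_a', 'hk_b']
print(allocations[0]["ssh_port"])  # 4444
```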
Notification System
The notification system provides external webhook integration for allocation lifecycle events and status changes.
Notification Event Types
flowchart TD
    EventTrigger["Allocation Event"] --> EventType{"Event Type"}
    EventType --> Deallocation["DEALLOCATION"]
    EventType --> Online["ONLINE"]
    EventType --> Offline["OFFLINE"]
    Deallocation --> BuildDeallocationMsg["Build deallocation message"]
    Online --> BuildStatusMsg["Build status change message"]
    Offline --> BuildStatusMsg
    BuildDeallocationMsg --> DeallocationURL["deallocation_notify_url"]
    BuildStatusMsg --> StatusURL["status_notify_url"]
    DeallocationURL --> SignAndSend["HMAC signature + POST"]
    StatusURL --> SignAndSend
    SignAndSend --> RetryLogic["Retry up to MAX_NOTIFY_RETRY"]
    RetryLogic --> Success["200/201 response"]
    RetryLogic --> AddToRetryTable["Add to notify_retry_table[]"]
Notification System Flow
Sources: neurons/register_api.py:2940-3000
Notification Configuration
The notification system operates with these key parameters:
Parameter | Value | Purpose |
---|---|---|
MAX_NOTIFY_RETRY | 3 | Maximum notification attempts |
NOTIFY_RETRY_PERIOD | 15 seconds | Delay between retry attempts |
Webhook Signature | HMAC-SHA256 | Request authentication |
SSL Certificates | Required | Secure communication |
Sources: neurons/register_api.py:92-93, neurons/register_api.py:2972-2976
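The signing and retry behaviour described above can be sketched as follows. This is not the project's implementation: the `X-Signature` header name, the canonical JSON encoding, the `post` callable (standing in for the HTTP client), and the shape of the retry table entries are all assumptions chosen to keep the example self-contained.

```python
import hashlib
import hmac
import json
from typing import Callable, Dict, List, Tuple

MAX_NOTIFY_RETRY = 3      # maximum notification attempts
NOTIFY_RETRY_PERIOD = 15  # seconds between attempts


def sign_payload(secret: bytes, payload: Dict) -> Tuple[bytes, str]:
    """Serialize the payload deterministically and compute its HMAC-SHA256."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def notify(
    post: Callable[[bytes, Dict[str, str]], int],
    body: bytes,
    signature: str,
    retry_table: List[bytes],
    sleep: Callable[[float], None] = lambda seconds: None,
) -> bool:
    """POST the signed body; queue it for later if every attempt fails."""
    headers = {"Content-Type": "application/json", "X-Signature": signature}
    for attempt in range(MAX_NOTIFY_RETRY):
        if post(body, headers) in (200, 201):
            return True
        if attempt < MAX_NOTIFY_RETRY - 1:
            sleep(NOTIFY_RETRY_PERIOD)  # no-op by default so the sketch runs instantly
    retry_table.append(body)
    return False


body, signature = sign_payload(b"shared-secret", {"hotkey": "hk_a", "status": "OFFLINE"})

statuses = iter([500, 503, 200])  # two failures, then success
retry_table: List[bytes] = []
delivered = notify(lambda b, h: next(statuses), body, signature, retry_table)
print(delivered, len(retry_table))  # True 0

failed = notify(lambda b, h: 500, body, signature, retry_table)
print(failed, len(retry_table))  # False 1
```

On the receiving side, the webhook would recompute the HMAC over the raw request body with the shared secret and compare it using `hmac.compare_digest`.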
Data Models and Configuration
The Resource Management system uses several key data models and configuration parameters to define resource requirements and allocation responses.
Core Data Models
classDiagram
    class DeviceRequirement {
        +int cpu_count
        +str gpu_type
        +int gpu_size
        +int ram
        +int hard_disk
        +int timeline
    }
    class Allocation {
        +str resource
        +str hotkey
        +str regkey
        +str ssh_ip
        +int ssh_port
        +str ssh_username
        +str ssh_password
        +str ssh_command
        +str ssh_key
        +str uuid_key
        +int miner_version
    }
    class ResourceQuery {
        +str gpu_name
        +int cpu_count_min
        +int cpu_count_max
        +float gpu_capacity_min
        +float gpu_capacity_max
        +float hard_disk_total_min
        +float hard_disk_total_max
        +float ram_total_min
        +float ram_total_max
    }
    class DockerRequirement {
        +str base_image
        +str ssh_key
        +str volume_path
        +str dockerfile
    }
    DeviceRequirement --> Allocation : "Generates"
    ResourceQuery --> Allocation : "Filters"
    DockerRequirement --> Allocation : "Configures"
Resource Management Data Models
Sources: neurons/register_api.py:147-175, neurons/register_api.py:156-167, neurons/register_api.py:204-213
System Configuration Constants
The system behavior is controlled through key configuration constants:
Constant | Value | Purpose |
---|---|---|
DATA_SYNC_PERIOD | 600 seconds | Metagraph refresh interval |
ALLOCATE_CHECK_PERIOD | 180 seconds | Health check frequency |
ALLOCATE_CHECK_COUNT | 20 | Max timeout count before deallocation |
MAX_ALLOCATION_RETRY | 3 | Maximum allocation attempt retries |
VALID_VALIDATOR_HOTKEYS | Array of 19 hotkeys | Authorized validator addresses |
Sources: neurons/register_api.py:85-116
Together, these components give the Resource Management system its compute resource allocation, health monitoring, and state management capabilities.