Resource Management

The Resource Management system implements the core allocation logic within the Resource Allocation API, handling resource discovery, allocation strategies, health monitoring, and state synchronization. This system operates within the RegisterAPI class and provides the intelligence for matching compute requirements with available miners, managing resource lifecycle, and maintaining distributed state.

For information about the API endpoints that expose these capabilities, see API Endpoints. For broader context about the validator system that validates miner capabilities, see Validator System.

The resource discovery system identifies and evaluates candidate miners for allocation requests through a multi-stage process involving database queries, network validation, and scoring algorithms.

flowchart TD
    DevReq["DeviceRequirement"] --> SelectCandidates["select_allocate_miners_hotkey()"]
    SelectCandidates --> CandidateList["candidates_hotkey[]"]
    CandidateList --> FilterAxons["Filter metagraph.axons"]
    FilterAxons --> AxonCandidates["axon_candidates[]"]
    
    AxonCandidates --> CheckAvailability["dendrite(Allocate(checking=True))"]
    CheckAvailability --> ValidateResponses["Validate responses"]
    ValidateResponses --> FinalCandidates["final_candidates_hotkey[]"]
    
    FinalCandidates --> ScoreSort["Score-based sorting"]
    ScoreSort --> SortedCandidates["sorted_hotkeys[]"]
    SortedCandidates --> AttemptAllocation["Attempt allocation"]

Candidate Discovery Flow

Sources: neurons/register_api.py:2741-2805

The system uses a three-stage approach for candidate selection:

  1. Database Filtering: The select_allocate_miners_hotkey() function queries the local database to find miners matching hardware requirements
  2. Network Validation: Available candidates are validated through the Bittensor network using Allocate synapse with checking=True
  3. Scoring and Prioritization: Valid candidates are sorted by their network scores to prioritize higher-performing miners
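The three stages above can be sketched as a small pure function. This is a minimal illustration, not the code in register_api.py: the function name `rank_candidates` and its plain list and dict inputs are assumptions that stand in for the database result, the metagraph axons, the `Allocate(checking=True)` responses, and the network scores.

```python
from typing import Dict, List


def rank_candidates(
    db_candidates: List[str],
    active_hotkeys: List[str],
    availability: Dict[str, bool],
    scores: Dict[str, float],
) -> List[str]:
    """Return available candidate hotkeys, highest score first (hypothetical helper)."""
    active = set(active_hotkeys)

    # Stage 1 result (database filtering) is passed in as db_candidates.
    # Keep only the hotkeys that still have an axon on the network.
    on_network = [hk for hk in db_candidates if hk in active]

    # Stage 2: keep only miners whose availability check came back positive.
    available = [hk for hk in on_network if availability.get(hk, False)]

    # Stage 3: sort by score, highest first. sorted() is stable, so miners
    # with equal scores keep their database order.
    return sorted(available, key=lambda hk: scores.get(hk, 0.0), reverse=True)


ranked = rank_candidates(
    db_candidates=["a", "b", "c", "d"],
    active_hotkeys=["a", "b", "c"],
    availability={"a": True, "b": False, "c": True},
    scores={"a": 0.2, "c": 0.9},
)
print(ranked)
```

Because the sort is stable, assigning every candidate the same score leaves the database ordering unchanged.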

flowchart TD
    WandbQuery["get_wandb_running_miners()"] --> FilterRuns["Filter running miner runs"]
    FilterRuns --> CheckPenalized["Check penalized_hotkeys"]
    CheckPenalized --> ValidateAge["miner_is_older_than(48h)"]
    ValidateAge --> ValidatePog["miner_pog_ok(2.5h)"]
    ValidatePog --> CheckActive["Verify in metagraph.axons"]
    CheckActive --> ExtractSpecs["Extract specs from run.config"]
    ExtractSpecs --> SpecsDetails["specs_details{}"]
    SpecsDetails --> ResourceList["resource_list[]"]

WandB Resource Discovery Flow

Sources: neurons/register_api.py:1646-1702, neurons/register_api.py:1881-1884

The system also implements an alternative discovery mechanism that reads distributed miner information from WandB runs. Each run must pass the following checks:

| Validation Check | Function | Purpose |
|---|---|---|
| Age Verification | miner_is_older_than() | Ensures miners have been active for 48+ hours |
| PoG Validation | miner_pog_ok() | Confirms recent Proof-of-GPU completion within 2.5 hours |
| Penalty Check | get_penalized_hotkeys_checklist() | Excludes blacklisted or penalized miners |
| Network Presence | metagraph.axons lookup | Verifies miner is active on network |
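The checks in the table can be read as a chain of filters applied to each miner run. The sketch below illustrates that chain under stated assumptions: the `MinerRun` record, its `first_seen` and `last_pog` fields, and the `eligible_hotkeys` function are hypothetical stand-ins, not the signatures of miner_is_older_than() or miner_pog_ok().

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Set

MIN_MINER_AGE = timedelta(hours=48)   # age threshold from the table
MAX_POG_AGE = timedelta(hours=2.5)    # Proof-of-GPU freshness from the table


@dataclass
class MinerRun:
    """Hypothetical summary of one WandB miner run."""
    hotkey: str
    first_seen: datetime
    last_pog: datetime


def eligible_hotkeys(
    runs: Iterable[MinerRun],
    penalized: Set[str],
    active: Set[str],
    now: datetime,
) -> List[str]:
    """Apply the four checks in the order shown in the flow diagram."""
    result = []
    for run in runs:
        if run.hotkey in penalized:
            continue  # penalty check
        if now - run.first_seen < MIN_MINER_AGE:
            continue  # miner is too new
        if now - run.last_pog > MAX_POG_AGE:
            continue  # Proof-of-GPU result is stale
        if run.hotkey not in active:
            continue  # no axon on the network
        result.append(run.hotkey)
    return result


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
runs = [
    MinerRun("ok", now - timedelta(hours=72), now - timedelta(hours=1)),
    MinerRun("young", now - timedelta(hours=10), now - timedelta(hours=1)),
    MinerRun("stale", now - timedelta(hours=72), now - timedelta(hours=3)),
]
print(eligible_hotkeys(runs, penalized=set(), active={"ok", "young", "stale"}, now=now))
```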

The system implements two primary allocation strategies: specification-based allocation and hotkey-specific allocation, each optimized for different use cases.

flowchart TD
    AllocateSpec["_allocate_container()"] --> BuildDevReq["Build device_requirement{}"]
    BuildDevReq --> FindCandidates["select_allocate_miners_hotkey()"]
    FindCandidates --> CheckAvailable["dendrite(checking=True)"]
    CheckAvailable --> ScoreSort["torch.ones_like(metagraph.S)"]
    ScoreSort --> TryAllocation["Loop through sorted_hotkeys"]
    TryAllocation --> SendAllocate["dendrite(Allocate(checking=False))"]
    SendAllocate --> ValidateResponse["Check response.status"]
    ValidateResponse --> Success["Return allocation info"]
    ValidateResponse --> NextCandidate["Try next candidate"]
    NextCandidate --> TryAllocation

Specification-Based Allocation Flow

Sources: neurons/register_api.py:2733-2805

The specification-based strategy processes DeviceRequirement objects containing:

  • CPU count requirements
  • GPU type and memory specifications
  • RAM and storage capacity needs
  • Timeline duration for allocation
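Once candidates are sorted, the allocation loop tries each one until a miner accepts. The sketch below shows only that control flow. The `send_allocate` callable is an assumption standing in for the `dendrite(Allocate(checking=False))` call, and the response is modelled as a plain dict with a `status` key.

```python
from typing import Callable, List, Optional


def allocate_first_available(
    sorted_hotkeys: List[str],
    send_allocate: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Try candidates in order and return the first successful allocation."""
    for hotkey in sorted_hotkeys:
        response = send_allocate(hotkey)
        # A missing response (timeout) or a False status means "try the next one".
        if response and response.get("status") is True:
            return {"hotkey": hotkey, "info": response.get("info")}
    return None  # every candidate declined or timed out


# Usage with a fake sender in place of the network call.
fake_responses = {
    "a": {"status": False},
    "b": {"status": True, "info": "ssh details"},
}
result = allocate_first_available(["x", "a", "b"], fake_responses.get)
print(result)
```

Passing the sender as a parameter keeps the loop testable without a running network.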

flowchart TD
    AllocateHotkey["_allocate_container_hotkey()"] --> FindAxon["Locate axon by hotkey"]
    FindAxon --> SetDefaults["Set default device_requirement{}"]
    SetDefaults --> SetBaseImage["docker_requirement.base_image"]
    SetBaseImage --> DirectAllocate["dendrite(Allocate())"]
    DirectAllocate --> ValidateResponse["Check response.status"]
    ValidateResponse --> ReturnInfo["Return allocation + miner_version"]
    ValidateResponse --> ReturnError["Return error message"]

Hotkey-Specific Allocation Flow

Sources: neurons/register_api.py:2807-2889

This strategy targets specific miners by hotkey, bypassing the discovery phase and applying default resource requirements with configurable Docker base images.
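A minimal sketch of the hotkey-specific path, assuming a simplified `Axon` record and a hypothetical `build_hotkey_request` helper. The default requirement values below are placeholders chosen for illustration and are not the defaults used in register_api.py.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Axon:
    """Simplified stand-in for a metagraph axon entry."""
    hotkey: str
    ip: str
    port: int


def build_hotkey_request(
    axons: Iterable[Axon], hotkey: str, base_image: str
) -> Optional[dict]:
    """Locate the axon for a hotkey and attach default requirements."""
    axon = next((a for a in axons if a.hotkey == hotkey), None)
    if axon is None:
        return None  # the hotkey is not present on the network

    return {
        "axon": axon,
        # Placeholder defaults (assumption): discovery is skipped on this path.
        "device_requirement": {"cpu_count": 1, "timeline": 1},
        "docker_requirement": {"base_image": base_image},
    }


axons = [Axon("hk1", "10.0.0.1", 8091), Axon("hk2", "10.0.0.2", 8091)]
request = build_hotkey_request(axons, "hk2", base_image="ubuntu")
print(request["axon"].ip if request else "hotkey not found")
```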

The health monitoring system continuously tracks allocated resources and manages their lifecycle through automated checks and notifications.

flowchart TD
    CheckTask["_check_allocation()"] --> QueryDB["SELECT * FROM allocation"]
    QueryDB --> LoopAllocations["For each allocation"]
    LoopAllocations --> ValidateHotkey["Check hotkey in metagraph"]
    ValidateHotkey --> HealthCheck["dendrite_check(Allocate(checking=True))"]
    HealthCheck --> EvaluateResponse["Evaluate response"]
    
    EvaluateResponse --> Online["response.status == False"]
    EvaluateResponse --> Offline["No response or timeout"]
    
    Online --> NotifyOnline["_notify_allocation_status(ONLINE)"]
    Online --> RemoveFromCheck["Remove from checking_allocated[]"]
    
    Offline --> AddToCheck["Add to checking_allocated[]"]
    Offline --> NotifyOffline["_notify_allocation_status(OFFLINE)"]
    Offline --> CountChecks["checking_allocated.count(hotkey)"]
    
    CountChecks --> MaxReached["count >= ALLOCATE_CHECK_COUNT"]
    MaxReached --> Deallocate["update_allocation_db(False)"]
    Deallocate --> NotifyDealloc["_notify_allocation_status(DEALLOCATION)"]

Health Monitoring Process Flow

Sources: neurons/register_api.py:3002-3101

The health monitoring system operates with the following parameters:

| Parameter | Value | Purpose |
|---|---|---|
| ALLOCATE_CHECK_PERIOD | 180 seconds | Interval between health checks |
| ALLOCATE_CHECK_COUNT | 20 | Maximum failed checks before deallocation |
| Health Check Timeout | 10 seconds | Maximum wait time for miner response |

The system tracks miner status transitions and maintains allocation state through the checking_allocated list and database updates.

Sources: neurons/register_api.py:3039-3081
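The counting logic around checking_allocated can be sketched as follows. The function `record_health_check` is a hypothetical helper written for illustration. It assumes that deallocation triggers once the number of recorded failures reaches ALLOCATE_CHECK_COUNT, and it returns the event name instead of sending a notification.

```python
from typing import List

ALLOCATE_CHECK_COUNT = 20  # failed checks tolerated before deallocation


def record_health_check(
    checking_allocated: List[str], hotkey: str, online: bool
) -> str:
    """Update the failure list in place and return the resulting event."""
    if online:
        # A healthy response clears every recorded failure for this miner.
        checking_allocated[:] = [hk for hk in checking_allocated if hk != hotkey]
        return "ONLINE"

    # Each missed check adds one entry; the count of entries is the streak.
    checking_allocated.append(hotkey)
    if checking_allocated.count(hotkey) >= ALLOCATE_CHECK_COUNT:
        checking_allocated[:] = [hk for hk in checking_allocated if hk != hotkey]
        return "DEALLOCATION"
    return "OFFLINE"


checks: List[str] = []
events = [record_health_check(checks, "hk1", online=False) for _ in range(20)]
print(events[0], events[-1])
```

With a check every 180 seconds and a limit of 20, a miner that stops responding is deallocated after roughly one hour.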

The Resource Management system maintains both local and distributed state through SQLite database operations and WandB synchronization.

flowchart TD
    AllocDB["allocation table"] --> UpdateLocal["update_allocation_db()"]
    UpdateLocal --> ExtractHotkeys["Extract hotkey_list[]"]
    ExtractHotkeys --> UpdateWandB["_update_allocation_wandb()"]
    UpdateWandB --> WandbSync["wandb.update_allocated_hotkeys()"]
    
    QueryState["List operations"] --> QueryDB["SELECT hotkey, details FROM allocation"]
    QueryDB --> ParseJSON["json.loads(details)"]
    ParseJSON --> BuildAllocation["Build Allocation objects"]
    
    HealthCheck["Health monitoring"] --> StateUpdate["Allocation state changes"]
    StateUpdate --> NotifyRetry["notify_retry_table[]"]
    StateUpdate --> UpdateLocal

State Management Architecture

Sources: neurons/register_api.py:2891-2919, neurons/register_api.py:1346-1419

The system maintains consistency across validators through WandB-based state sharing:

| State Component | Storage | Synchronization Method |
|---|---|---|
| Active Allocations | SQLite allocation table | _update_allocation_wandb() |
| Allocated Hotkeys | WandB runs | wandb.update_allocated_hotkeys() |
| Retry Notifications | In-memory notify_retry_table | Periodic retry processing |

Sources: neurons/register_api.py:2915-2919
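The list and sync paths both start from the same query against the allocation table. The sketch below reproduces that read path against an in-memory SQLite database. The two-column schema, the sample row, and the helper names are assumptions made for the example; only the `SELECT hotkey, details FROM allocation` query and the `json.loads(details)` step come from the diagram.

```python
import json
import sqlite3
from typing import List


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with an assumed (hotkey, details) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE allocation (hotkey TEXT PRIMARY KEY, details TEXT)")
    details = json.dumps({"resource": "A100", "ssh_port": 4444})
    conn.execute("INSERT INTO allocation VALUES (?, ?)", ("hk1", details))
    conn.commit()
    return conn


def list_allocations(conn: sqlite3.Connection) -> List[dict]:
    """Parse each row's JSON details into a dict keyed with its hotkey."""
    rows = conn.execute("SELECT hotkey, details FROM allocation").fetchall()
    return [{**json.loads(details), "hotkey": hotkey} for hotkey, details in rows]


def allocated_hotkeys(conn: sqlite3.Connection) -> List[str]:
    """Hotkey list of the kind that would be pushed to the shared state."""
    return [row[0] for row in conn.execute("SELECT hotkey FROM allocation")]


conn = make_demo_db()
print(list_allocations(conn))
print(allocated_hotkeys(conn))
```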

The notification system provides external webhook integration for allocation lifecycle events and status changes.

flowchart TD
    EventTrigger["Allocation Event"] --> EventType{"Event Type"}
    
    EventType --> Deallocation["DEALLOCATION"]
    EventType --> Online["ONLINE"] 
    EventType --> Offline["OFFLINE"]
    
    Deallocation --> BuildDeallocationMsg["Build deallocation message"]
    Online --> BuildStatusMsg["Build status change message"]
    Offline --> BuildStatusMsg
    
    BuildDeallocationMsg --> DeallocationURL["deallocation_notify_url"]
    BuildStatusMsg --> StatusURL["status_notify_url"]
    
    DeallocationURL --> SignAndSend["HMAC signature + POST"]
    StatusURL --> SignAndSend
    
    SignAndSend --> RetryLogic["Retry up to MAX_NOTIFY_RETRY"]
    RetryLogic --> Success["200/201 response"]
    RetryLogic --> AddToRetryTable["Add to notify_retry_table[]"]

Notification System Flow

Sources: neurons/register_api.py:2940-3000

The notification system operates with these key parameters:

| Parameter | Value | Purpose |
|---|---|---|
| MAX_NOTIFY_RETRY | 3 | Maximum notification attempts |
| NOTIFY_RETRY_PERIOD | 15 seconds | Delay between retry attempts |
| Webhook Signature | HMAC-SHA256 | Request authentication |
| SSL Certificates | Required | Secure communication |

Sources: neurons/register_api.py:92-93, neurons/register_api.py:2972-2976
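Signing a webhook body with HMAC-SHA256 and retrying delivery can be sketched with the standard library alone. The payload shape, the shared secret, and the `post` callable are assumptions for illustration. The sketch does not send real HTTP requests and omits the wait of NOTIFY_RETRY_PERIOD between attempts.

```python
import hashlib
import hmac
import json
from typing import Callable, Tuple

MAX_NOTIFY_RETRY = 3


def sign_payload(payload: dict, secret: bytes) -> Tuple[str, str]:
    """Serialize the payload deterministically and return (body, hex signature)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    signature = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(body: str, signature: str, secret: bytes) -> bool:
    """Receiver side: recompute the HMAC and compare in constant time."""
    expected = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


def send_with_retry(
    post: Callable[[str, str], int], body: str, signature: str
) -> bool:
    """Call post() up to MAX_NOTIFY_RETRY times; 200 or 201 counts as success."""
    for _ in range(MAX_NOTIFY_RETRY):
        if post(body, signature) in (200, 201):
            return True
    return False  # the caller would add the event to a retry table here


body, signature = sign_payload({"hotkey": "hk1", "status": "OFFLINE"}, b"demo-secret")
print(verify_signature(body, signature, b"demo-secret"))
```

The signature covers the exact bytes that are sent, so the receiver has to verify against the raw request body rather than a re-serialized copy.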

The Resource Management system uses several key data models and configuration parameters to define resource requirements and allocation responses.

classDiagram
    class DeviceRequirement {
        +int cpu_count
        +str gpu_type
        +int gpu_size
        +int ram
        +int hard_disk
        +int timeline
    }
    
    class Allocation {
        +str resource
        +str hotkey
        +str regkey
        +str ssh_ip
        +int ssh_port
        +str ssh_username
        +str ssh_password
        +str ssh_command
        +str ssh_key
        +str uuid_key
        +int miner_version
    }
    
    class ResourceQuery {
        +str gpu_name
        +int cpu_count_min
        +int cpu_count_max
        +float gpu_capacity_min
        +float gpu_capacity_max
        +float hard_disk_total_min
        +float hard_disk_total_max
        +float ram_total_min
        +float ram_total_max
    }
    
    class DockerRequirement {
        +str base_image
        +str ssh_key
        +str volume_path
        +str dockerfile
    }
    
    DeviceRequirement --> Allocation : "Generates"
    ResourceQuery --> Allocation : "Filters"
    DockerRequirement --> Allocation : "Configures"

Resource Management Data Models

Sources: neurons/register_api.py:147-175, neurons/register_api.py:156-167, neurons/register_api.py:204-213
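To illustrate how a ResourceQuery might filter resources, the sketch below models a subset of its fields as a dataclass and applies them as optional bounds. The use of dataclasses, the `None` defaults, the case-insensitive GPU name match, and the `matches` helper are all assumptions. The source only lists the field names and types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceQuery:
    """Subset of the ResourceQuery fields; None means "no constraint" (assumption)."""
    gpu_name: Optional[str] = None
    cpu_count_min: Optional[int] = None
    cpu_count_max: Optional[int] = None
    ram_total_min: Optional[float] = None
    ram_total_max: Optional[float] = None


def in_range(value: float, low: Optional[float], high: Optional[float]) -> bool:
    """Check a value against optional inclusive bounds."""
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


def matches(query: ResourceQuery, gpu_name: str, cpu_count: int, ram_total: float) -> bool:
    """Return True when a miner's specs satisfy every constraint that is set."""
    if query.gpu_name is not None and query.gpu_name.lower() not in gpu_name.lower():
        return False
    return in_range(cpu_count, query.cpu_count_min, query.cpu_count_max) and in_range(
        ram_total, query.ram_total_min, query.ram_total_max
    )


query = ResourceQuery(gpu_name="a100", cpu_count_min=8, ram_total_max=256.0)
print(matches(query, "NVIDIA A100-SXM4", cpu_count=16, ram_total=128.0))
```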

The system behavior is controlled through key configuration constants:

| Constant | Value | Purpose |
|---|---|---|
| DATA_SYNC_PERIOD | 600 seconds | Metagraph refresh interval |
| ALLOCATE_CHECK_PERIOD | 180 seconds | Health check frequency |
| ALLOCATE_CHECK_COUNT | 20 | Max timeout count before deallocation |
| MAX_ALLOCATION_RETRY | 3 | Maximum allocation attempt retries |
| VALID_VALIDATOR_HOTKEYS | Array of 19 hotkeys | Authorized validator addresses |

Sources: neurons/register_api.py:85-116

Together, these components cover the full allocation lifecycle: candidate discovery, allocation, health monitoring, state synchronization, and webhook notification.