Skip to content

Container Management

Relevant Source Files

Container Management is the core Docker container lifecycle system used by miners to provide isolated compute resources to validators and clients. This system handles the creation, configuration, monitoring, and termination of SSH-enabled containers that serve as the compute environments for resource allocation requests.

For information about how containers integrate with the broader resource allocation workflow, see Resource Allocation. For details about the communication protocols used during container provisioning, see Specs, Allocate, and Challenge Protocols.

The container management system orchestrates Docker containers through a complete lifecycle from base image preparation to final cleanup. Each container is configured with SSH access, GPU capabilities, and custom software environments based on allocation requirements.

stateDiagram-v2
    [*] --> ImageBuilding : "build_sample_container()"
    ImageBuilding --> ImageReady : "ssh-image-base created"
    ImageReady --> ContainerCreation : "run_container()"
    ContainerCreation --> ContainerRunning : "status created"
    ContainerRunning --> ContainerPaused : "pause_container()"
    ContainerPaused --> ContainerRunning : "unpause_container()"
    ContainerRunning --> ContainerStopped : "kill_container()"
    ContainerStopped --> [*] : "cleanup complete"
    ContainerRunning --> KeyExchange : "exchange_key_container()"
    KeyExchange --> ContainerRunning : "SSH keys updated"
    ContainerRunning --> ContainerRestarted : "restart_container()"
    ContainerRestarted --> ContainerRunning : "restart complete"

Sources: neurons/Miner/container.py:57-103 , neurons/Miner/container.py:384-420 , neurons/Miner/container.py:421-473

The system maintains a base container image ssh-image-base that provides the foundation for all allocated containers. This image includes SSH server configuration, Python runtime, and GPU support.

flowchart TD
    A["build_sample_container()"] --> B["Check existing images"]
    B --> C{Base image exists?}
    C -->|Yes| D["Return existing"]
    C -->|No| E["Build from pytorch/pytorch:2.7.0-cuda12.6-cudnn9-runtime"]
    E --> F["Install SSH server"]
    F --> G["Configure SSH settings"]
    G --> H["Install Python packages"]
    H --> I["Tag as ssh-image-base:latest"]
    I --> J["Base image ready"]

The base image includes:

  • PyTorch CUDA runtime environment
  • SSH server with root access enabled
  • Python 3 and pip package manager
  • Essential build tools and libraries
  • Conda environment configuration

Sources: neurons/Miner/container.py:280-368

When a resource allocation request is received, the system creates a customized container based on the base image and specific requirements.

sequenceDiagram
    participant AC as "Allocation Controller"
    participant CM as "Container Manager"
    participant DC as "Docker Client"
    participant FS as "File System"
    
    AC->>CM: "run_container(cpu_usage, ram_usage, gpu_usage, public_key, docker_requirement, testing)"
    CM->>CM: "kill_container()" 
    CM->>CM: "password_generator(10)"
    CM->>CM: "build_sample_container()"
    CM->>FS: "Create Dockerfile with custom requirements"
    CM->>DC: "images.build(path, dockerfile, tag=ssh-image)"
    CM->>DC: "containers.run(image=ssh-image, device_requests=[GPU], ports={22: ssh_port})"
    DC-->>CM: "Container created"
    CM->>CM: "rsa.encrypt_data(public_key, connection_info)"
    CM->>FS: "Write allocation_key file"
    CM-->>AC: "Return encrypted connection info"
ParameterDescriptionExample
cpu_assignmentCPU cores allocated"0-1" for 2 cores
ram_limitMemory limit"5g" for 5GB
hard_disk_capacityStorage limit"100g" for 100GB
gpu_capacityGPU allocation"all" for all GPUs
ssh_portSSH access port4444
shm_sizeShared memory size"7g" (90% of available)

Sources: neurons/Miner/container.py:105-207

The container management system implements a multi-layered security model using RSA encryption and allocation key verification.

graph TB
    subgraph "Client Side"
        CL["Client"] 
        PRV["Private Key"]
        PUB["Public Key"]
    end
    
    subgraph "Miner Container System"
        AK["allocation_key file"]
        CM["Container Manager"]
        SSH["SSH Container"]
    end
    
    subgraph "Authentication Flow"
        ENC["Encrypted Connection Info"]
        DEC["Decrypted Credentials"]
    end
    
    CL --> PUB
    PUB --> CM
    CM --> ENC
    ENC --> DEC
    PRV --> DEC
    DEC --> SSH
    
    CM --> AK
    AK --> CM

The system uses allocation keys to verify container access permissions:

  • Key Storage: Public keys are base64-encoded and stored in allocation_key file
  • Key Verification: All container operations require matching public key
  • Key Rotation: SSH keys can be updated through exchange_key_container()
FunctionPurposeKey Verification
restart_container()Restart existing containerRequired
pause_container()Pause container executionRequired
unpause_container()Resume container executionRequired
exchange_key_container()Update SSH keysRequired

Sources: neurons/Miner/container.py:370-382 , neurons/Miner/container.py:384-420 , neurons/Miner/container.py:475-521

Container management integrates with the resource allocation system through the allocate.py module, which orchestrates container lifecycle during allocation requests.

sequenceDiagram
    participant API as "RegisterAPI"
    participant ALLOC as "Allocation Controller"
    participant CONT as "Container Manager"
    participant SCHED as "Scheduler"
    
    API->>ALLOC: "register_allocation(timeline, device_requirement, public_key, docker_requirement)"
    ALLOC->>CONT: "kill_container()" 
    ALLOC->>CONT: "run_container(cpu_usage, ram_usage, hard_disk_usage, gpu_usage, public_key, docker_requirement, testing)"
    CONT-->>ALLOC: "Container info + encrypted credentials"
    ALLOC->>SCHED: "start(timeline)"
    ALLOC-->>API: "Allocation result"
    
    Note over SCHED: "Timeline expires"
    SCHED->>CONT: "kill_container(deregister=True)"

The system supports two container types:

  • Production Containers (container_name): Long-running containers for actual resource allocation
  • Test Containers (container_name_test): Short-lived containers for validation and health checks

Sources: neurons/Miner/allocate.py:29-62 , neurons/Miner/allocate.py:66-94

The system provides container status monitoring and health checking capabilities used by validators and the allocation system.

flowchart LR
    subgraph "Health Check Operations"
        A["check_container()"] --> B{Container exists?}
        B -->|Yes| C{Status == running?}
        B -->|No| D["Return False"]
        C -->|Yes| E["Return True"]
        C -->|No| D
    end
    
    subgraph "Allocation Status"
        F["check_if_allocated(public_key)"] --> G{allocation_key exists?}
        G -->|Yes| H{Key matches?}
        G -->|No| I["Return False"]
        H -->|Yes| J{Container running?}
        H -->|No| I
        J -->|Yes| K["Return True"]
        J -->|No| I
    end

The container system integrates with validator monitoring through:

  • Miner Checker: Validators use miner_checker.py to test container allocation and SSH access
  • Health Endpoints: API endpoints query container status for resource availability
  • Allocation Tracking: Container state is synchronized with allocation records

Sources: neurons/Miner/container.py:210-222 , neurons/Miner/allocate.py:106-137 , neurons/miner_checker.py:85-151

The system implements comprehensive cleanup procedures to prevent resource leaks and ensure proper container lifecycle management.

graph TD
    A["kill_container(deregister)"] --> B{deregister flag?}
    B -->|True| C["Kill production container"]
    B -->|False| D["Kill test container only"]
    C --> E["Find container_name"]
    D --> F["Find container_name_test"]
    E --> G{Container running?}
    F --> H{Container running?}
    G -->|Yes| I["exec_run('kill -15 1')"]
    G -->|No| J["remove()"]
    H -->|Yes| K["exec_run('kill -15 1')"]
    H -->|No| L["remove()"]
    I --> M["wait()"]
    K --> N["wait()"]
    M --> J
    N --> L
    J --> O["images.prune(dangling:True)"]
    L --> O

The cleanup process includes:

  • Graceful container termination using SIGTERM
  • Container removal from Docker
  • Dangling image cleanup
  • Allocation key file management

Sources: neurons/Miner/container.py:57-103