Troubleshooting
This document describes diagnosis and resolution for common issues.
When doctor, toolkit verify, runtime prepare, or up fails, miner-cli prints a Next steps block. Treat that as the first remediation path.
Host GPU Issues
nvidia-smi: not found
Meaning: The host NVIDIA driver is not installed, or nvidia-smi is not on PATH.
Resolution:
- Install or repair the host NVIDIA driver first, and make sure nvidia-smi is on PATH
- Verify:
nvidia-smi
uv run miner-cli toolkit verify
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver
Meaning: The driver package may exist, but the kernel module or driver state is broken.
Resolution: Repair the driver or module state, confirm nvidia-smi works, then retry.
gpu inventory: no GPUs detected
Meaning: The driver is running, but no GPU is visible to the host.
Resolution: Check PCI visibility, VM passthrough, cloud GPU attachment, or container host configuration.
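The three host-level symptoms above can be told apart with a small script. This is a minimal sketch, not part of miner-cli: it assumes that `nvidia-smi -L` prints one line per device starting with `GPU `, and the function names and return labels are hypothetical.

```python
import shutil
import subprocess


def classify_host_gpu(found: bool, output: str) -> str:
    """Map an nvidia-smi result onto the three host GPU cases in this section."""
    if not found:
        # Matches "nvidia-smi: not found": driver missing or binary not on PATH.
        return "driver-missing"
    if "couldn't communicate with the NVIDIA driver" in output:
        # Driver package exists, but the kernel module or driver state is broken.
        return "driver-broken"
    if not any(line.startswith("GPU ") for line in output.splitlines()):
        # Driver answers, but no device is visible to the host.
        return "no-gpus"
    return "ok"


def probe_host_gpu() -> str:
    """Run `nvidia-smi -L` on this host and classify the outcome."""
    path = shutil.which("nvidia-smi")
    if path is None:
        return classify_host_gpu(False, "")
    result = subprocess.run([path, "-L"], capture_output=True, text=True, check=False)
    # The failure message may land on either stream, so inspect both.
    return classify_host_gpu(True, result.stdout + result.stderr)


if __name__ == "__main__":
    print(probe_host_gpu())
```

Anything other than `ok` means the host needs attention before `uv run miner-cli toolkit verify` is worth retrying.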
Docker GPU Runtime Issues
docker nvidia runtime: not configured
Meaning: Docker is installed, but GPU runtime wiring is incomplete.
Resolution: Install the toolkit, then verify it with a smoke test:
uv run miner-cli toolkit install
uv run miner-cli toolkit verify --smoke-test
GPU container smoke test fails
Meaning: Common errors include "driver version is insufficient" and other CUDA version errors.
Resolution: Upgrade the host driver, pin an older runtime image, or repair Docker GPU runtime wiring.
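To confirm independently whether Docker knows about an NVIDIA runtime, you can inspect the runtime map that `docker info` reports. The sketch below assumes the runtime is registered under the name `nvidia`; it is a separate check and not necessarily what `miner-cli toolkit verify` does internally. The function names are hypothetical.

```python
import json
import subprocess


def has_nvidia_runtime(runtimes_json: str) -> bool:
    """Return True if Docker's runtime map contains an entry named 'nvidia'."""
    try:
        runtimes = json.loads(runtimes_json)
    except json.JSONDecodeError:
        return False
    return isinstance(runtimes, dict) and "nvidia" in runtimes


def docker_runtimes_json() -> str:
    """Ask the Docker daemon for its configured runtimes as a JSON string."""
    try:
        result = subprocess.run(
            ["docker", "info", "--format", "{{json .Runtimes}}"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        # The docker CLI itself is not installed or not on PATH.
        return ""
    return result.stdout.strip()


if __name__ == "__main__":
    print("nvidia runtime configured:", has_nvidia_runtime(docker_runtimes_json()))
```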
Runtime Issues
Image Pull Failed
Meaning: The image tag may not exist, registry access may be broken, or authentication may be missing.
uv run miner-cli runtime prepare --engine vllm -f qwen72b.yaml
Verify the configured image: value and the registry credentials.
Engine Container Smoke Test Fails
Meaning: The image can be pulled, but the engine container cannot start correctly with GPU access.
uv run miner-cli runtime prepare --engine vllm -f qwen72b.yaml --smoke-test
Check CUDA/driver compatibility, entrypoint behavior, and model access permissions.
Container Startup Failed
Meaning: Compose created the deployment, but the workload container failed to start.
uv run miner-cli logs qwen72b -f
uv run miner-cli runtime prepare --engine vllm -f qwen72b.yaml --smoke-test
Readiness Timeout
Meaning: The container is running, but /v1/models did not become healthy within the timeout.
Resolution: Check logs for model download progress, GPU memory, model path errors, or authentication failures.
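While a large model is still downloading or loading, it can help to poll the readiness endpoint yourself with a longer deadline. The following is a rough sketch that assumes "healthy" means an HTTP 200 from `/v1/models`; the base URL, timeout, and interval are placeholders to adapt, and `wait_for_models` is a hypothetical helper, not a miner-cli function.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as not ready.
        return False


def wait_for_models(base_url: str, timeout_s: float = 600.0, interval_s: float = 5.0,
                    check: Callable[[str], bool] = http_ok) -> bool:
    """Poll <base_url>/v1/models until it is healthy or timeout_s elapses."""
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + timeout_s
    while True:
        if check(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Run it alongside `uv run miner-cli logs qwen72b -f` so a stall can be matched against what the container is logging at that moment.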
Agent Issues
/readyz returns 503
Check the response body:
- registered=false: Registration has not succeeded
- verified=false: The control plane has not verified this node
- last_error is set: Inspect the latest registration or heartbeat failure
curl http://127.0.0.1:8080/v1/miner/status
curl http://127.0.0.1:8080/v1/miner/identity
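The fields listed above can be turned into hints automatically. This sketch assumes the `/readyz` body is a JSON object with `registered`, `verified`, and `last_error` keys; that shape is an assumption, so adjust the parsing if the agent returns something different. `explain_readyz` is a hypothetical helper.

```python
import json


def explain_readyz(body: str) -> list[str]:
    """Translate a /readyz response body into human-readable hints."""
    data = json.loads(body)
    hints = []
    if data.get("registered") is False:
        hints.append("registered=false: registration has not succeeded")
    if data.get("verified") is False:
        hints.append("verified=false: the control plane has not verified this node")
    if data.get("last_error"):
        # Surface the raw error so it can be matched against the agent logs.
        hints.append(f"last_error: {data['last_error']}")
    return hints
```

An empty list means none of the three conditions applies, and the 503 has another cause worth looking for in the agent logs.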
Identity changes after restart
Meaning: ${MINER_HOME}/config.json was not persisted.
Resolution: Mount a stable host directory into the agent container:
volumes:
  - /data/minerhome:/root/.miner
environment:
  MINER_HOME: /root/.miner
Keep the host path on the left side outside /root; the container path can remain /root/.miner.
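The path rule above can be checked mechanically before bringing the deployment up. This is an illustrative sketch only: it takes a short-form `host:container` volume string and the `MINER_HOME` value as plain arguments, and `check_miner_home_volume` is a hypothetical helper rather than anything miner-cli provides.

```python
from pathlib import PurePosixPath


def check_miner_home_volume(volume: str, miner_home: str) -> list[str]:
    """Return a list of problems with a 'host:container[:mode]' volume mapping."""
    parts = volume.split(":")
    if len(parts) < 2:
        return ["volume must use the host:container form"]
    host, container = PurePosixPath(parts[0]), PurePosixPath(parts[1])
    root = PurePosixPath("/root")
    problems = []
    if host == root or root in host.parents:
        problems.append(f"host path {host} is under /root; use a path outside it")
    if container != PurePosixPath(miner_home):
        problems.append(f"container path {container} does not match MINER_HOME {miner_home}")
    return problems
```

For the mapping shown above, `check_miner_home_volume("/data/minerhome:/root/.miner", "/root/.miner")` returns an empty list.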
YAML changes do not update miner identity or registration details
Meaning: Editing YAML updates rendered Compose configuration, but it does not replace the persisted ${MINER_HOME}/config.json created on first startup.
Resolution: Verify the active YAML and rendered Compose file, then check agent status and identity:
curl http://127.0.0.1:8080/v1/miner/status
curl http://127.0.0.1:8080/v1/miner/identity
Only replace ${MINER_HOME}/config.json when you intentionally want a new miner identity. Back up the directory first.
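One way to take that backup is to copy the whole directory to a timestamped location first. A minimal sketch, assuming the host directory is readable by the current user; the backup location and the `backup_miner_home` name are hypothetical.

```python
import shutil
import time
from pathlib import Path


def backup_miner_home(miner_home: str, backup_root: str) -> Path:
    """Copy the MINER_HOME directory to <backup_root>/<name>-<timestamp>."""
    source = Path(miner_home)
    if not (source / "config.json").is_file():
        # Refuse to "back up" a directory that holds no persisted identity.
        raise FileNotFoundError(f"no config.json under {source}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree creates the target and any missing parents; it fails if target exists.
    shutil.copytree(source, target)
    return target
```

For the volume layout shown earlier, that would be a call such as `backup_miner_home("/data/minerhome", "/data/backups")` on the host, made while the agent container is stopped.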
Runtime probe errors
Verify that MINER_VLLM_BASE_URL points to the runtime service inside the Compose network. When deployed through miner-cli, the default is:
http://<deployment-name>:<port>
If DCGM is enabled, verify that MINER_DCGM_METRICS_URL points to:
http://dcgm-exporter:9400/metrics
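Both URLs can be compared against the expected values in one pass. The sketch below takes the environment as a plain mapping, for example one parsed from the rendered Compose file; the deployment name and port passed in are whatever your deployment uses, and `check_probe_urls` is a hypothetical helper.

```python
from typing import Mapping

DCGM_METRICS_URL = "http://dcgm-exporter:9400/metrics"


def expected_runtime_url(deployment_name: str, port: int) -> str:
    """Build the in-network runtime URL in the http://<deployment-name>:<port> form."""
    return f"http://{deployment_name}:{port}"


def check_probe_urls(env: Mapping[str, str], deployment_name: str, port: int,
                     dcgm_enabled: bool = False) -> list[str]:
    """Return a list of mismatches between env and the expected probe URLs."""
    problems = []
    want = expected_runtime_url(deployment_name, port)
    got = env.get("MINER_VLLM_BASE_URL", "")
    if got.rstrip("/") != want:
        problems.append(f"MINER_VLLM_BASE_URL is {got!r}, expected {want!r}")
    if dcgm_enabled:
        got_dcgm = env.get("MINER_DCGM_METRICS_URL", "")
        if got_dcgm != DCGM_METRICS_URL:
            problems.append(
                f"MINER_DCGM_METRICS_URL is {got_dcgm!r}, expected {DCGM_METRICS_URL!r}"
            )
    return problems
```

A common mismatch to look for is a `127.0.0.1` or `localhost` address, which points at the agent container itself rather than at the runtime service on the Compose network.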