This guide shows how to set up llama.cpp in router mode to use the AMD Radeon 780M iGPU via Vulkan inside a Proxmox LXC container. It covers building the necessary components, creating a systemd service for continuous operation, and enabling dynamic multi-model serving for integration with OpenWebUI.
Tech Stack
- AMD Radeon 780M iGPU: ~25-30 TOPS with Vulkan
- Proxmox LXC: Lightweight container with GPU passthrough
- OpenWebUI: Clean web UI with automatic model discovery
Features
- Router Mode: Dynamic model loading/unloading, no manual port switching
- Service: systemd service with auto-restart & logging
Prerequisites
The LXC must have GPU passthrough (/dev/dri/renderD128 accessible). Install the build dependencies:
apt update
apt install -y git cmake build-essential pkg-config ccache \
libopenblas-dev libvulkan-dev libshaderc-dev glslc vulkan-tools \
mesa-vulkan-drivers libgl1-mesa-dri libssl-dev radeontop

1. Build llama.cpp with Vulkan
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"

✅ Verify Vulkan:
vulkaninfo --summary # Must show "RADV PHOENIX" as GPU0
./build/bin/llama-cli --list-devices # Vulkan GPUs listed

🔄 2. Update llama.cpp (Anytime)
cd ~/llama/llama.cpp
git pull origin master
rm -rf build
cmake -B build -DGGML_VULKAN=ON -DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
systemctl restart llama-router

Stability: git checkout vB33 (or the latest stable tag) for production.
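The stability note can be scripted. A minimal sketch, assuming llama.cpp's per-release tags are available after `git fetch --tags` (the `latest_tag` helper and its sort key are illustrative, not part of llama.cpp):

```shell
# Illustrative helper: pick the most recently created tag in the current
# repository as a stand-in for "latest stable release".
latest_tag() {
  git tag --sort=-creatordate | head -n 1
}

# Usage on the build host (not run here):
#   cd ~/llama/llama.cpp
#   git fetch --tags origin
#   git checkout "$(latest_tag)"
```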
3. Directory Structure
~/llama/
├── llama.cpp/                      # Source + llama-server binary
│   └── build/bin/llama-server
├── models/                         # All GGUF files here
│   ├── qwen2.5-coder-7b-q4.gguf
│   ├── phi-4-mini-q4.gguf
│   └── llama3.1-8b-q4.gguf
└── llama-router.service            # systemd service file

4. Router Mode – Test Launch
cd ~/llama/llama.cpp
./build/bin/llama-server \
--host "0.0.0.0" --port 8080 \
--models-dir ~/llama/models \
--models-max 2 \
--ctx-size 8192 \
--n-gpu-layers -1 \
--log-disable

Key Router parameters:
- --models-dir: Scans this folder for .gguf files automatically
- --models-max 2: Max 2 models loaded (LRU eviction when full)
- No -m parameter: Router mode instead of single model
- --n-gpu-layers -1: Full GPU offload to Radeon 780M
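With the router running, a model is selected per request through the standard OpenAI `model` field, and the first request for a model triggers its load. A sketch (the `chat_payload` helper is illustrative; the filename is one of the example models above):

```shell
# Illustrative helper: build a minimal OpenAI-style chat request body.
chat_payload() {
  # $1 = model filename as listed in --models-dir, $2 = user message
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$1" "$2"
}

chat_payload "qwen2.5-coder-7b-q4.gguf" "Hello"

# Send it to the router (run on a host that can reach the server):
#   curl -s http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload qwen2.5-coder-7b-q4.gguf Hello)"
```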
5. Production systemd Service
Create ~/llama/llama-router.service:
[Unit]
Description=llama.cpp Router Server (AMD Vulkan)
After=network.target
[Service]
Type=simple
WorkingDirectory=/root/llama/llama.cpp
ExecStart=/root/llama/llama.cpp/build/bin/llama-server \
--host 0.0.0.0 \
--port 8080 \
--models-dir /root/llama/models \
--models-max 2 \
--ctx-size 8192 \
--n-gpu-layers -1 \
--batch-size 512 \
--log-disable
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

Install & start:
cp ~/llama/llama-router.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now llama-router
systemctl status llama-router

6. OpenWebUI Integration (Docs-Compliant)
Follows official OpenWebUI llama.cpp docs:
Admin Panel → Connections → ➕ OpenAI Compatible:
URL: http://lxc-ip:8080/v1
API Key: none (or empty)

Router advantages over docs:
- OpenWebUI auto-discovers all models from /models/
- No port switching needed between models
- Docs single-model (-m) → Router multi-model (--models-dir)
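If OpenWebUI runs in Docker, the connection can also be preconfigured via environment variables instead of the admin panel. A sketch, assuming the official image and the documented `OPENAI_API_BASE_URL` variable (replace `LXC_IP` with the container's address):

```shell
docker run -d --name open-webui -p 3000:8080 \
  -e OPENAI_API_BASE_URL="http://LXC_IP:8080/v1" \
  -e OPENAI_API_KEY="none" \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

OpenWebUI then lists every model the router reports, with no further connection setup.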
Timeout tip (slow model loading):
export AIOHTTP_CLIENT_TIMEOUT_MODEL_LIST=30

7. Router Management API
# List all available models
curl http://localhost:8080/models
# Load specific model manually
curl -X POST http://localhost:8080/models/load \
-H "Content-Type: application/json" \
-d '{"model": "qwen2.5-coder-7b-q4.gguf"}'
# Check currently loaded models (OpenAI endpoint)
curl http://localhost:8080/v1/models

Management & Monitoring
# Live logs
journalctl -u llama-router -f
# Service control
systemctl status llama-router   # also: restart, stop
# GPU monitoring (AMD)
radeontop
# Router health
curl -s http://localhost:8080/health | jq

Troubleshooting
# Vulkan/GPU issues
vulkaninfo | grep RADV
./build/bin/llama-cli --list-devices
# Service failed?
journalctl -u llama-router -n50 --no-pager
# Models not visible?
ls -la ~/llama/models/*.gguf
curl http://localhost:8080/models

Pro Tips
- Start conservative: --models-max 1 for single-model-like behavior
- Model naming: Use descriptive .gguf filenames (shows in OpenWebUI)
- Batch size: --batch-size 512 optimizes Vulkan throughput
- Git stability: git checkout vB33 for production releases
- LXC GPU: Verify /dev/dri/renderD128 is mounted in the container
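For the last tip, passthrough is configured on the Proxmox host, not inside the LXC. A hedged sketch of the classic bind-mount entries in `/etc/pve/lxc/<CTID>.conf` (the container ID is a placeholder; verify the major:minor device numbers with `ls -l /dev/dri` on the host):

```
# /etc/pve/lxc/<CTID>.conf on the Proxmox host
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
```

After restarting the container, `ls -l /dev/dri` inside the LXC should show renderD128.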
Complete Update Workflow
cd ~/llama/llama.cpp && \
git pull origin master && \
rm -rf build && \
cmake -B build -DGGML_VULKAN=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_BUILD_TYPE=Release && \
cmake --build build --config Release -j"$(nproc)" && \
systemctl restart llama-router
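Finally, the /health endpoint from the Management section lends itself to a small watchdog that restarts the service when the router stops answering. A sketch (`HEALTH_URL` and `RESTART_CMD` are parameterized only so the logic is easy to dry-run):

```shell
# Watchdog sketch: restart llama-router when /health stops answering.
HEALTH_URL="${HEALTH_URL:-http://localhost:8080/health}"
RESTART_CMD="${RESTART_CMD:-systemctl restart llama-router}"

router_healthy() {
  # -f: treat HTTP errors as failure; --max-time bounds a hung connection
  curl -sf --max-time 5 "$1" > /dev/null
}

if router_healthy "$HEALTH_URL"; then
  echo "router healthy"
else
  echo "router unhealthy - restarting"
  $RESTART_CMD || echo "restart failed" >&2
fi
```

Run it from cron or a systemd timer, e.g. once per minute.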