llama.cpp Router Mode on AMD iGPU in Proxmox LXC

This guide shows how to set up llama.cpp in router mode to use the AMD Radeon 780M iGPU via Vulkan inside a Proxmox LXC container. It covers building the necessary components, creating a systemd service for continuous operation, and enabling dynamic multi-model serving for integration with OpenWebUI.

Tech Stack

  • AMD Radeon 780M iGPU: ~25-30 TOPS with Vulkan
  • Proxmox LXC: Lightweight container with GPU passthrough
  • OpenWebUI: Clean web UI with automatic model discovery

Features

  • Router Mode: Dynamic model loading/unloading, no manual port switching
  • Service: systemd service with auto-restart & logging

Prerequisites

The LXC container needs GPU passthrough (/dev/dri/renderD128 must be accessible inside the container). Then install the build dependencies:

apt update
apt install -y git cmake build-essential pkg-config ccache \
  libopenblas-dev libvulkan-dev libshaderc-dev glslc vulkan-tools \
  mesa-vulkan-drivers libgl1-mesa-dri libssl-dev radeontop
Bash
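If passthrough is not set up yet, the render node has to be mapped into the container on the Proxmox host. A minimal sketch, assuming container ID 101 (a placeholder) and Proxmox 8+ device passthrough; the gid must match the render group inside the container (often 104 on Debian):

```ini
# /etc/pve/lxc/101.conf  (container ID is a placeholder)
dev0: /dev/dri/renderD128,gid=104
```

Restart the container afterwards and confirm the node with ls -l /dev/dri.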

1. Build llama.cpp with Vulkan

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
Bash

✅ Verify Vulkan:

vulkaninfo --summary  # Must show "RADV PHOENIX" as GPU0
./build/bin/llama-cli --list-devices  # Vulkan GPUs listed
Bash

🔄 2. Update llama.cpp (Anytime)

cd ~/llama/llama.cpp
git pull origin master
rm -rf build
cmake -B build -DGGML_VULKAN=ON -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
systemctl restart llama-router
Bash

Stability: for production, pin a release tag instead of master (llama.cpp tags releases as bNNNN): git checkout <tag>.
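To find the newest release tag from inside the checkout, a sketch (llama.cpp release tags follow the bNNNN pattern):

```shell
# Fetch tags and list the most recent ones, newest first
git fetch --tags 2>/dev/null || true
git tag --sort=-v:refname | head -n 5
# then pin one: git checkout <tag>
```
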

3. Directory Structure

~/llama/
├── llama.cpp/                    # Source + llama-server binary
│   └── build/bin/llama-server
├── models/                       # All GGUF files here
│   ├── qwen2.5-coder-7b-q4.gguf
│   ├── phi-4-mini-q4.gguf
│   └── llama3.1-8b-q4.gguf
└── llama-router.service          # systemd service file
Text

4. Router Mode – Test Launch

cd ~/llama/llama.cpp
./build/bin/llama-server \
  --host "0.0.0.0" --port 8080 \
  --models-dir ~/llama/models \
  --models-max 2 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --log-disable
Bash

Key Router parameters:

  • --models-dir: Scans this folder for .gguf files automatically
  • --models-max 2: Max 2 models loaded (LRU eviction when full)
  • No -m parameter: omitting it starts router mode instead of single-model mode
  • --n-gpu-layers -1: Full GPU offload to Radeon 780M
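In router mode the client selects a model per request through the standard OpenAI model field, and the router loads it on demand. A sketch (the filename is a placeholder; use one of the GGUFs from your models dir):

```shell
# Ask a specific model; the router loads it first if necessary.
# The trailing "|| true" just keeps the example harmless if no server is up.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder-7b-q4.gguf",
       "messages": [{"role": "user", "content": "Hello"}]}' || true
```
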

5. Production systemd Service

Create ~/llama/llama-router.service:

[Unit]
Description=llama.cpp Router Server (AMD Vulkan)
After=network.target

[Service]
Type=simple
WorkingDirectory=/root/llama/llama.cpp
ExecStart=/root/llama/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 \
  --port 8080 \
  --models-dir /root/llama/models \
  --models-max 2 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --batch-size 512 \
  --log-disable
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
Ini

Install & start:

cp ~/llama/llama-router.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now llama-router
systemctl status llama-router
Bash

6. OpenWebUI Integration (Docs-Compliant)

Follows official OpenWebUI llama.cpp docs:

Admin Panel → Connections → ➕ OpenAI Compatible:

URL: http://lxc-ip:8080/v1
API Key: none (or empty)

Router-mode advantages over the docs' single-model setup:

  • OpenWebUI auto-discovers all models from /models/
  • No port switching needed between models
  • Docs single-model (-m) → Router multi-model (--models-dir)

Timeout tip for slow model loading (set in the OpenWebUI environment):

export AIOHTTP_CLIENT_TIMEOUT_MODEL_LIST=30
Bash
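If OpenWebUI runs as a Docker container, the variable has to go into that container's environment, not onto the llama.cpp host. A sketch based on the standard OpenWebUI run command (port mapping and volume name are the usual defaults):

```shell
docker run -d --name open-webui \
  -p 3000:8080 \
  -e AIOHTTP_CLIENT_TIMEOUT_MODEL_LIST=30 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```
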

7. Router Management API

# List all available models
curl http://localhost:8080/models

# Load specific model manually
curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder-7b-q4.gguf"}'

# Check currently loaded models (OpenAI endpoint)
curl http://localhost:8080/v1/models
Bash
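For scripting, the OpenAI-compatible listing can be reduced to bare model ids with jq (assumed installed, as in the monitoring section below; field names follow the OpenAI /v1/models schema):

```shell
# Print just the ids of the models the router currently knows about
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'
```
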

Management & Monitoring

# Live logs
journalctl -u llama-router -f

# Service control
systemctl status llama-router   # also: restart, stop

# GPU monitoring (AMD)
radeontop

# Router health
curl -s http://localhost:8080/health | jq
Bash

Troubleshooting

# Vulkan/GPU issues
vulkaninfo | grep RADV
./build/bin/llama-cli --list-devices

# Service failed?
journalctl -u llama-router -n50 --no-pager

# Models not visible?
ls -la ~/llama/models/*.gguf
curl http://localhost:8080/models
Bash

Pro Tips

  1. Start conservative: --models-max 1 for single-model-like behavior
  2. Model naming: Use descriptive .gguf filenames (shows in OpenWebUI)
  3. Batch size: --batch-size 512 optimizes Vulkan throughput
  4. Git stability: pin a bNNNN release tag for production builds
  5. LXC GPU: Verify /dev/dri/renderD128 mounted in container

Complete Update Workflow

cd ~/llama/llama.cpp && \
git pull origin master && \
rm -rf build && \
cmake -B build -DGGML_VULKAN=ON -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_BUILD_TYPE=Release && \
cmake --build build --config Release -j"$(nproc)" && \
systemctl restart llama-router
Bash

Timon