Ollama + Open WebUI 本地 AI 部署完全指南 2026
Published: 2026-05-05 · Reading time: 14 min · Category: AI Infrastructure
By mid-2026, the landscape has shifted. Enterprises are pulling AI workloads back on-premise. Privacy regulations are tightening. And the models you can run on a single GPU — DeepSeek V4, Qwen 3, Llama 4, Mistral Large — are competitive with cloud APIs from 6 months ago.
This guide walks through a production-grade local AI stack that keeps your data private, your costs predictable, and your latency minimal.
1. Why Local AI in 2026
The calculus changed. Here's what's driving the shift:
- Data privacy: GDPR, CCPA, and China's Personal Information Protection Law (PIPL) now impose real penalties. Sending codebase or customer data to US-based APIs carries legal risk in many jurisdictions.
- Cost: Running a 70B parameter model on a single RTX 4090 costs ~$0.40/hour in electricity. Equivalent cloud API throughput at scale is 5-10x more expensive.
- Latency: Local inference adds zero network hops. First token latency drops from 500-2000ms (cloud) to 30-100ms (local with GPU).
- Offline: Air-gapped environments, development in transit, sites with unreliable internet — local AI works everywhere.
- Model quality: Qwen 3 72B, DeepSeek V4-0514, Llama 4 70B all run on 48GB VRAM. The quality gap with GPT-5.5 is narrowing fast in code and structured tasks.
2. Stack Overview
Three core components, each with a specific role:
| Component | Role | API Compatible With |
|---|---|---|
| Ollama | Model runner & inference engine | OpenAI-compatible chat & embeddings endpoints |
| Open WebUI | Chat interface, RAG, multi-user management | Ollama API (no proxy needed) |
| Continue.dev | VS Code / JetBrains AI assistant | OpenAI-compatible (points at Ollama) |
Optional but recommended: MCP servers (filesystem, GitHub, git, database) give your local AI agent tool access.
3. Docker Compose Deployment
This is the fastest path to a running stack. One command, everything wired.
docker-compose.yml
version: "3.8"
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_KEEP_ALIVE=24h
- OLLAMA_NUM_PARALLEL=2
- OLLAMA_MAX_LOADED_MODELS=2
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
ports:
- "3000:8080"
volumes:
- open_webui_data:/app/backend/data
environment:
- OLLAMA_BASE_URL=http://ollama:11434
- WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY:-changeme}
- WEBUI_NAME=Local AI
- ENABLE_SIGNUP=false
depends_on:
- ollama
volumes:
ollama_data:
open_webui_data:
WEBUI_SECRET_KEY to a strong random value before exposing Open WebUI to any network. Set ENABLE_SIGNUP=false after creating your admin account.
Deploy
docker compose up -d
# Check logs
docker compose logs -f
# Verify Ollama API
curl http://localhost:11434/api/tags
Open WebUI will be at http://localhost:3000. First visit creates the admin account.
4. Ollama Configuration
Pull Models
Start with the best model for your hardware:
# Run inside the container or on host with Ollama CLI
docker exec -it ollama ollama pull qwen3:72b
docker exec -it ollama ollama pull deepseek-v4:latest
docker exec -it ollama ollama pull llama4:70b
docker exec -it ollama ollama pull mistral-large:latest
# Embeddings model (for RAG in Open WebUI)
docker exec -it ollama ollama pull nomic-embed-text:v1.5
Ollama Environment Tuning
| Variable | Default | Recommended | Why |
|---|---|---|---|
OLLAMA_KEEP_ALIVE |
5m | 24h | Keeps model loaded in VRAM. Prevents 30s load times on each request. |
OLLAMA_NUM_PARALLEL |
1 | 2-4 | Process concurrent requests. Higher values increase VRAM usage. |
OLLAMA_MAX_LOADED_MODELS |
1 | 1-2 | More models = less reloading but more VRAM contention. |
OLLAMA_HOST |
127.0.0.1 | 0.0.0.0 | Listen on all interfaces (needed inside Docker). |
Modelfile Customizations
Create custom modelfiles to tune system prompts and parameters per model:
FROM qwen3:72b
# System prompt for coding
SYSTEM """You are an expert software engineer. Write clean, well-documented code.
Prefer practical solutions over theoretical ones.
When unsure, admit it rather than hallucinating."""
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 32768
docker exec -i ollama ollama create my-coder -f - << 'EOF'
FROM qwen3:72b
SYSTEM "You are an expert software engineer..."
PARAMETER temperature 0.3
PARAMETER num_ctx 32768
EOF
5. Open WebUI Setup
Open WebUI provides a ChatGPT-like interface backed by your local models, plus RAG for document querying.
First-Login Configuration
- Open
http://localhost:3000— create admin account - Go to Admin Panel → Settings → Models
- Verify Ollama connection shows your pulled models
- Set default model to your primary (e.g.,
qwen3:72b)
- Go to Settings → Documents
- Upload documents for RAG retrieval
- Select
nomic-embed-text:v1.5as the embedding model
- Go to Settings → Interface
- Disable signups (if not already done via env)
- Configure user roles
Key Features Worth Enabling
- Web Search: Open WebUI supports web search via
searxngor direct API. Enable in Admin Panel to give your local model current information. - Multi-Modal: Upload images to vision-capable models (LLaVA, Qwen-VL). Works out of the box.
- Code Execution: Built-in Pyodide sandbox for running Python code in chat. Enable in Admin Panel.
- Audio Input: Whisper integration for voice input. Maps to a local Whisper model or API.
6. VS Code Integration with Continue.dev
Continue.dev connects VS Code directly to your local Ollama models for inline completions, chat, and agentic coding.
Installation
# In VS Code: Extensions → search "Continue"
# Or via CLI:
code --install-extension continue.continue
Configuration (~/.continue/config.json)
{
"models": [
{
"title": "Local Qwen 3",
"provider": "ollama",
"model": "my-coder",
"apiBase": "http://localhost:11434",
"contextLength": 32768,
"completionOptions": {
"temperature": 0.2,
"topP": 0.9
}
},
{
"title": "DeepSeek V4",
"provider": "ollama",
"model": "deepseek-v4:latest",
"apiBase": "http://localhost:11434",
"contextLength": 65536
}
],
"tabAutocompleteModel": {
"title": "Local Tab Complete",
"provider": "ollama",
"model": "qwen3:7b",
"apiBase": "http://localhost:11434"
},
"embeddingsProvider": {
"provider": "ollama",
"model": "nomic-embed-text:v1.5",
"apiBase": "http://localhost:11434"
}
}
/edit and chat commands where quality matters more than speed.
What You Get
- Tab autocomplete: Code suggestions as you type (like Copilot, but local)
- Chat: Full context-aware chat with code selection
- Inline edit: Select code and tell the model what to change (
Cmd+I) - Agent mode: The model can read files, run terminal commands, and edit multiple files
- Codebase search: Semantic search across your project using local embeddings
7. MCP Server Integration
MCP (Model Context Protocol) servers give your local AI tool access — reading files, querying databases, managing git, and more. Both Continue.dev and Open WebUI support MCP.
MCP with Continue.dev
Add to your ~/.continue/config.json:
{
"experimental": {
"mcpServers": {
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
},
"github": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-github"],
"env": {
"GITHUB_TOKEN": "${GITHUB_TOKEN}"
}
},
"git": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-git", "."]
}
}
}
}
MCP with Open WebUI
Open WebUI (v0.5+) has a built-in MCP client. Go to Admin Panel → Tools → MCP Servers and add:
{
"servers": [
{
"name": "filesystem",
"transport": "stdio",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/data"]
},
{
"name": "web-search",
"transport": "stdio",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-brave-search"],
"env": {
"BRAVE_API_KEY": "${BRAVE_API_KEY}"
}
}
]
}
8. GPU Acceleration
Without a GPU, large models will run on CPU — usable but slow (1-3 tokens/sec for 70B). Here's how to ensure GPU acceleration works.
NVIDIA GPU
# Verify NVIDIA drivers and CUDA
nvidia-smi
# Install NVIDIA container toolkit
sudo apt install nvidia-container-toolkit
sudo systemctl restart docker
# Test GPU in container
docker run --rm --gpus all nvidia/cuda:12.8-base nvidia-smi
If the Docker Compose file above has deploy.resources.reservations.devices with NVIDIA capabilities, Ollama will automatically use the GPU.
AMD ROCm
Use the ROCm build of Ollama:
image: ollama/ollama:rocm
For AMD GPUs with 16GB+ VRAM (RX 7900 XTX, MI250), ROCm support is solid for inference.
Apple Silicon (M-series)
Ollama runs natively on macOS with Metal acceleration:
# Install directly (no Docker needed on macOS)
brew install ollama
ollama serve
# Metal acceleration is automatic on M1/M2/M3/M4
# Unified memory = 64GB models on M-series Max/Ultra
ollama pull qwen3:72b
VRAM Requirements
| Model Size | Quantization | VRAM Needed | Hardware |
|---|---|---|---|
| 7B | Q4_K_M | ~6 GB | RTX 3060, M1, any GPU |
| 14B | Q4_K_M | ~10 GB | RTX 3080, M2 Pro |
| 32B | Q4_K_M | ~20 GB | RTX 3090, M3 Max |
| 70B | Q4_K_M | ~42 GB | RTX 4090 24GB (2x), M3 Ultra, A100 |
| 72B | Q4_K_M | ~43 GB | A6000, 2x RTX 3090, M3 Ultra |
| 120B+ | Q4_K_M | ~72 GB | A100 80GB, 2x A6000 |
9. Production Hardening
Going beyond the dev setup? Here's what to add:
Authentication
Open WebUI has built-in user management with role-based access. For production:
- Put behind a reverse proxy (nginx, Caddy, Traefik)
- Add HTTPS with Let's Encrypt
- Set
ENABLE_SIGNUP=falseafter creating user accounts - Consider OAuth integration (Open WebUI supports Google, GitHub, OIDC)
Rate Limiting & Monitoring
# nginx rate limiting example
limit_req_zone $binary_remote_addr zone=ai_api:10m rate=30r/m;
server {
listen 443 ssl;
server_name ai.yourdomain.com;
location / {
limit_req zone=ai_api burst=10 nodelay;
proxy_pass http://localhost:3000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
Backup Strategy
# Backup Open WebUI data (contains user accounts, chat history, RAG docs)
docker run --rm -v open_webui_data:/data -v $(pwd):/backup \
alpine tar czf /backup/open-webui-backup-$(date +%Y%m%d).tar.gz -C /data .
# Backup Ollama models (large — skip if you can re-pull)
# Model files are in ollama_data volume: /root/.ollama/models/
Resource Limits
services:
ollama:
deploy:
resources:
limits:
memory: 64G
reservations:
memory: 32G
10. Model Selection Guide
| Use Case | Best Model | Runner-up | Min Hardware |
|---|---|---|---|
| General coding | Qwen 3 72B | DeepSeek V4-0514 | 24GB VRAM (Q4) |
| Tab autocomplete | Qwen 3 7B / DeepSeek-Coder 7B | Llama 4 8B | 8GB VRAM |
| RAG & document QA | Mistral Large 2 | Qwen 3 32B | 16GB VRAM |
| Reasoning/math | DeepSeek V4-0514 | Qwen 3 72B | 48GB VRAM |
| Creative writing | Llama 4 70B | Mistral Large 2 | 48GB VRAM |
| Low-hardware (8GB) | Qwen 3 7B | Llama 4 8B | 8GB VRAM |
| Cross-platform M-series | Qwen 3 72B (Metal) | DeepSeek V4 (Metal) | M3 Max 48GB |
11. Troubleshooting
Ollama won't use GPU
# Check if CUDA is available inside container
docker exec -it ollama nvidia-smi
# Verify container toolkit
docker info | grep -i runtime
# Force GPU device
docker compose down
docker compose up -d # (Make sure --gpus all is in compose)
Model loads but generates garbage
- Num_ctx too high for your VRAM → reduce to 8192 or 16384
- Corrupted download → delete and re-pull:
ollama rm <model>thenollama pull - Mixed precision issue → set
OLLAMA_FLASH_ATTENTION=1
Open WebUI can't connect to Ollama
# Verify from inside the webui container
docker exec -it open-webui curl http://ollama:11434/api/tags
# Check if OLLAMA_BASE_URL is set correctly in docker-compose
# Should be http://ollama:11434 (container name, not localhost)
Slow inference on Apple Silicon
- Make sure you're using the native macOS install, not Docker
- Docker on macOS doesn't have GPU passthrough for Metal
- Install with
brew install ollamaand runollama serve
Continue.dev not connecting
# Test Ollama API from host
curl http://localhost:11434/api/generate -d '{"model":"qwen3:72b","prompt":"hi"}'
# Verify apiBase in config.json matches exactly
# For Docker: apiBase is "http://host.docker.internal:11434"
# For native install: "http://localhost:11434"
Tags: Ollama, Open WebUI, Continue.dev, Local AI, MCP, Docker, GPU, LLM