Ollama + Open WebUI 本地 AI 部署完全指南 2026

Published: 2026-05-05 · Reading time: 14 min · Category: AI Infrastructure

By mid-2026, the landscape has shifted. Enterprises are pulling AI workloads back on-premise. Privacy regulations are tightening. And the models you can run on a single GPU — DeepSeek V4, Qwen 3, Llama 4, Mistral Large — are competitive with cloud APIs from 6 months ago.

This guide walks through a production-grade local AI stack that keeps your data private, your costs predictable, and your latency minimal.

Table of Contents

1. Why Local AI in 2026
2. Stack Overview
3. Docker Compose Deployment
4. Ollama Configuration
5. Open WebUI Setup
6. VS Code Integration with Continue.dev
7. MCP Server Integration
8. GPU Acceleration
9. Production Hardening
10. Model Selection Guide
11. Troubleshooting

1. Why Local AI in 2026

The calculus changed. Here's what's driving the shift:

Data privacy: GDPR, CCPA, and China's Personal Information Protection Law (PIPL) now impose real penalties. Sending codebase or customer data to US-based APIs carries legal risk in many jurisdictions.
Cost: Running a 70B parameter model on a single RTX 4090 costs ~$0.40/hour in electricity. Equivalent cloud API throughput at scale is 5-10x more expensive.
Latency: Local inference adds zero network hops. First token latency drops from 500-2000ms (cloud) to 30-100ms (local with GPU).
Offline: Air-gapped environments, development in transit, sites with unreliable internet — local AI works everywhere.
Model quality: Qwen 3 72B, DeepSeek V4-0514, Llama 4 70B all run on 48GB VRAM. The quality gap with GPT-5.5 is narrowing fast in code and structured tasks.

💡 Bottom line: By mid-2026, local AI is no longer a compromise. For many developer workflows, it's the better choice.

2. Stack Overview

Three core components, each with a specific role:

Component	Role	API Compatible With
Ollama	Model runner & inference engine	OpenAI-compatible chat & embeddings endpoints
Open WebUI	Chat interface, RAG, multi-user management	Ollama API (no proxy needed)
Continue.dev	VS Code / JetBrains AI assistant	OpenAI-compatible (points at Ollama)

Optional but recommended: MCP servers (filesystem, GitHub, git, database) give your local AI agent tool access.

3. Docker Compose Deployment

This is the fastest path to a running stack. One command, everything wired.

docker-compose.yml

version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY:-changeme}
      - WEBUI_NAME=Local AI
      - ENABLE_SIGNUP=false
    depends_on:
      - ollama
volumes:
  ollama_data:
  open_webui_data:

⚠️ Security: Change WEBUI_SECRET_KEY to a strong random value before exposing Open WebUI to any network. Set ENABLE_SIGNUP=false after creating your admin account.

Deploy

docker compose up -d
# Check logs
docker compose logs -f
# Verify Ollama API
curl http://localhost:11434/api/tags

Open WebUI will be at http://localhost:3000. First visit creates the admin account.

4. Ollama Configuration

Pull Models

Start with the best model for your hardware:

# Run inside the container or on host with Ollama CLI
docker exec -it ollama ollama pull qwen3:72b
docker exec -it ollama ollama pull deepseek-v4:latest
docker exec -it ollama ollama pull llama4:70b
docker exec -it ollama ollama pull mistral-large:latest
# Embeddings model (for RAG in Open WebUI)
docker exec -it ollama ollama pull nomic-embed-text:v1.5

Ollama Environment Tuning

Variable	Default	Recommended	Why
`OLLAMA_KEEP_ALIVE`	5m	24h	Keeps model loaded in VRAM. Prevents 30s load times on each request.
`OLLAMA_NUM_PARALLEL`	1	2-4	Process concurrent requests. Higher values increase VRAM usage.
`OLLAMA_MAX_LOADED_MODELS`	1	1-2	More models = less reloading but more VRAM contention.
`OLLAMA_HOST`	127.0.0.1	0.0.0.0	Listen on all interfaces (needed inside Docker).

Modelfile Customizations

Create custom modelfiles to tune system prompts and parameters per model:

FROM qwen3:72b
# System prompt for coding
SYSTEM """You are an expert software engineer. Write clean, well-documented code.
Prefer practical solutions over theoretical ones.
When unsure, admit it rather than hallucinating."""
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 32768

docker exec -i ollama ollama create my-coder -f - << 'EOF'
FROM qwen3:72b
SYSTEM "You are an expert software engineer..."
PARAMETER temperature 0.3
PARAMETER num_ctx 32768
EOF

5. Open WebUI Setup

Open WebUI provides a ChatGPT-like interface backed by your local models, plus RAG for document querying.

First-Login Configuration

Open http://localhost:3000 — create admin account
Go to Admin Panel → Settings → Models
- Verify Ollama connection shows your pulled models
- Set default model to your primary (e.g., qwen3:72b)
Go to Settings → Documents
- Upload documents for RAG retrieval
- Select nomic-embed-text:v1.5 as the embedding model
Go to Settings → Interface
- Disable signups (if not already done via env)
- Configure user roles

Key Features Worth Enabling

Web Search: Open WebUI supports web search via searxng or direct API. Enable in Admin Panel to give your local model current information.
Multi-Modal: Upload images to vision-capable models (LLaVA, Qwen-VL). Works out of the box.
Code Execution: Built-in Pyodide sandbox for running Python code in chat. Enable in Admin Panel.
Audio Input: Whisper integration for voice input. Maps to a local Whisper model or API.

6. VS Code Integration with Continue.dev

Continue.dev connects VS Code directly to your local Ollama models for inline completions, chat, and agentic coding.

Installation

# In VS Code: Extensions → search "Continue"
# Or via CLI:
code --install-extension continue.continue

Configuration (~/.continue/config.json)

{
  "models": [
    {
      "title": "Local Qwen 3",
      "provider": "ollama",
      "model": "my-coder",
      "apiBase": "http://localhost:11434",
      "contextLength": 32768,
      "completionOptions": {
        "temperature": 0.2,
        "topP": 0.9
      }
    },
    {
      "title": "DeepSeek V4",
      "provider": "ollama",
      "model": "deepseek-v4:latest",
      "apiBase": "http://localhost:11434",
      "contextLength": 65536
    }
  ],
  "tabAutocompleteModel": {
    "title": "Local Tab Complete",
    "provider": "ollama",
    "model": "qwen3:7b",
    "apiBase": "http://localhost:11434"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text:v1.5",
    "apiBase": "http://localhost:11434"
  }
}

✅ Pro tip: Use a smaller model (7B-14B) for tab autocomplete — it needs to respond in under 200ms. Reserve the 70B+ model for /edit and chat commands where quality matters more than speed.

What You Get

Tab autocomplete: Code suggestions as you type (like Copilot, but local)
Chat: Full context-aware chat with code selection
Inline edit: Select code and tell the model what to change (Cmd+I)
Agent mode: The model can read files, run terminal commands, and edit multiple files
Codebase search: Semantic search across your project using local embeddings

7. MCP Server Integration

MCP (Model Context Protocol) servers give your local AI tool access — reading files, querying databases, managing git, and more. Both Continue.dev and Open WebUI support MCP.

MCP with Continue.dev

Add to your ~/.continue/config.json:

{
  "experimental": {
    "mcpServers": {
      "filesystem": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
      },
      "github": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-github"],
        "env": {
          "GITHUB_TOKEN": "${GITHUB_TOKEN}"
        }
      },
      "git": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-git", "."]
      }
    }
  }
}

MCP with Open WebUI

Open WebUI (v0.5+) has a built-in MCP client. Go to Admin Panel → Tools → MCP Servers and add:

{
  "servers": [
    {
      "name": "filesystem",
      "transport": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/data"]
    },
    {
      "name": "web-search",
      "transport": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-brave-search"],
      "env": {
        "BRAVE_API_KEY": "${BRAVE_API_KEY}"
      }
    }
  ]
}

8. GPU Acceleration

Without a GPU, large models will run on CPU — usable but slow (1-3 tokens/sec for 70B). Here's how to ensure GPU acceleration works.

NVIDIA GPU

# Verify NVIDIA drivers and CUDA
nvidia-smi
# Install NVIDIA container toolkit
sudo apt install nvidia-container-toolkit
sudo systemctl restart docker
# Test GPU in container
docker run --rm --gpus all nvidia/cuda:12.8-base nvidia-smi

If the Docker Compose file above has deploy.resources.reservations.devices with NVIDIA capabilities, Ollama will automatically use the GPU.

AMD ROCm

Use the ROCm build of Ollama:

image: ollama/ollama:rocm

For AMD GPUs with 16GB+ VRAM (RX 7900 XTX, MI250), ROCm support is solid for inference.

Apple Silicon (M-series)

Ollama runs natively on macOS with Metal acceleration:

# Install directly (no Docker needed on macOS)
brew install ollama
ollama serve
# Metal acceleration is automatic on M1/M2/M3/M4
# Unified memory = 64GB models on M-series Max/Ultra
ollama pull qwen3:72b

💡 Apple Silicon advantage: The unified memory architecture lets you run 70B models on 64GB M3 Ultra without quantization. No discrete GPU needed.

VRAM Requirements

Model Size	Quantization	VRAM Needed	Hardware
7B	Q4_K_M	~6 GB	RTX 3060, M1, any GPU
14B	Q4_K_M	~10 GB	RTX 3080, M2 Pro
32B	Q4_K_M	~20 GB	RTX 3090, M3 Max
70B	Q4_K_M	~42 GB	RTX 4090 24GB (2x), M3 Ultra, A100
72B	Q4_K_M	~43 GB	A6000, 2x RTX 3090, M3 Ultra
120B+	Q4_K_M	~72 GB	A100 80GB, 2x A6000

9. Production Hardening

Going beyond the dev setup? Here's what to add:

Authentication

Open WebUI has built-in user management with role-based access. For production:

Put behind a reverse proxy (nginx, Caddy, Traefik)
Add HTTPS with Let's Encrypt
Set ENABLE_SIGNUP=false after creating user accounts
Consider OAuth integration (Open WebUI supports Google, GitHub, OIDC)

Rate Limiting & Monitoring

# nginx rate limiting example
limit_req_zone $binary_remote_addr zone=ai_api:10m rate=30r/m;
server {
    listen 443 ssl;
    server_name ai.yourdomain.com;
    location / {
        limit_req zone=ai_api burst=10 nodelay;
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Backup Strategy

# Backup Open WebUI data (contains user accounts, chat history, RAG docs)
docker run --rm -v open_webui_data:/data -v $(pwd):/backup \
  alpine tar czf /backup/open-webui-backup-$(date +%Y%m%d).tar.gz -C /data .
# Backup Ollama models (large — skip if you can re-pull)
# Model files are in ollama_data volume: /root/.ollama/models/

Resource Limits

services:
  ollama:
    deploy:
      resources:
        limits:
          memory: 64G
        reservations:
          memory: 32G

10. Model Selection Guide

Use Case	Best Model	Runner-up	Min Hardware
General coding	Qwen 3 72B	DeepSeek V4-0514	24GB VRAM (Q4)
Tab autocomplete	Qwen 3 7B / DeepSeek-Coder 7B	Llama 4 8B	8GB VRAM
RAG & document QA	Mistral Large 2	Qwen 3 32B	16GB VRAM
Reasoning/math	DeepSeek V4-0514	Qwen 3 72B	48GB VRAM
Creative writing	Llama 4 70B	Mistral Large 2	48GB VRAM
Low-hardware (8GB)	Qwen 3 7B	Llama 4 8B	8GB VRAM
Cross-platform M-series	Qwen 3 72B (Metal)	DeepSeek V4 (Metal)	M3 Max 48GB

11. Troubleshooting

Ollama won't use GPU

# Check if CUDA is available inside container
docker exec -it ollama nvidia-smi
# Verify container toolkit
docker info | grep -i runtime
# Force GPU device
docker compose down
docker compose up -d  # (Make sure --gpus all is in compose)

Model loads but generates garbage

Num_ctx too high for your VRAM → reduce to 8192 or 16384
Corrupted download → delete and re-pull: ollama rm <model> then ollama pull
Mixed precision issue → set OLLAMA_FLASH_ATTENTION=1

Open WebUI can't connect to Ollama

# Verify from inside the webui container
docker exec -it open-webui curl http://ollama:11434/api/tags
# Check if OLLAMA_BASE_URL is set correctly in docker-compose
# Should be http://ollama:11434 (container name, not localhost)

Slow inference on Apple Silicon

Make sure you're using the native macOS install, not Docker
Docker on macOS doesn't have GPU passthrough for Metal
Install with brew install ollama and run ollama serve

Continue.dev not connecting

# Test Ollama API from host
curl http://localhost:11434/api/generate -d '{"model":"qwen3:72b","prompt":"hi"}'
# Verify apiBase in config.json matches exactly
# For Docker: apiBase is "http://host.docker.internal:11434"
# For native install: "http://localhost:11434"

Done. You now have a fully private, offline-capable AI stack running on your own hardware. Questions or issues? Check the Ollama GitHub or Open WebUI Discord for the latest community troubleshooting.

Tags: Ollama, Open WebUI, Continue.dev, Local AI, MCP, Docker, GPU, LLM