Ollama + Open WebUI 本地 AI 部署完全指南 2026

Published: 2026-05-05 · Reading time: 14 min · Category: AI Infrastructure

By mid-2026, the landscape has shifted. Enterprises are pulling AI workloads back on-premise. Privacy regulations are tightening. And the models you can run on a single GPU — DeepSeek V4, Qwen 3, Llama 4, Mistral Large — are competitive with cloud APIs from 6 months ago.

This guide walks through a production-grade local AI stack that keeps your data private, your costs predictable, and your latency minimal.

1. Why Local AI in 2026

The calculus changed. Here's what's driving the shift:

💡 Bottom line: By mid-2026, local AI is no longer a compromise. For many developer workflows, it's the better choice.

2. Stack Overview

Three core components, each with a specific role:

Component Role API Compatible With
Ollama Model runner & inference engine OpenAI-compatible chat & embeddings endpoints
Open WebUI Chat interface, RAG, multi-user management Ollama API (no proxy needed)
Continue.dev VS Code / JetBrains AI assistant OpenAI-compatible (points at Ollama)

Optional but recommended: MCP servers (filesystem, GitHub, git, database) give your local AI agent tool access.

3. Docker Compose Deployment

This is the fastest path to a running stack. One command, everything wired.

docker-compose.yml

version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY:-changeme}
      - WEBUI_NAME=Local AI
      - ENABLE_SIGNUP=false
    depends_on:
      - ollama
volumes:
  ollama_data:
  open_webui_data:
⚠️ Security: Change WEBUI_SECRET_KEY to a strong random value before exposing Open WebUI to any network. Set ENABLE_SIGNUP=false after creating your admin account.

Deploy

docker compose up -d
# Check logs
docker compose logs -f
# Verify Ollama API
curl http://localhost:11434/api/tags

Open WebUI will be at http://localhost:3000. First visit creates the admin account.

4. Ollama Configuration

Pull Models

Start with the best model for your hardware:

# Run inside the container or on host with Ollama CLI
docker exec -it ollama ollama pull qwen3:72b
docker exec -it ollama ollama pull deepseek-v4:latest
docker exec -it ollama ollama pull llama4:70b
docker exec -it ollama ollama pull mistral-large:latest
# Embeddings model (for RAG in Open WebUI)
docker exec -it ollama ollama pull nomic-embed-text:v1.5

Ollama Environment Tuning

Variable Default Recommended Why
OLLAMA_KEEP_ALIVE 5m 24h Keeps model loaded in VRAM. Prevents 30s load times on each request.
OLLAMA_NUM_PARALLEL 1 2-4 Process concurrent requests. Higher values increase VRAM usage.
OLLAMA_MAX_LOADED_MODELS 1 1-2 More models = less reloading but more VRAM contention.
OLLAMA_HOST 127.0.0.1 0.0.0.0 Listen on all interfaces (needed inside Docker).

Modelfile Customizations

Create custom modelfiles to tune system prompts and parameters per model:

FROM qwen3:72b
# System prompt for coding
SYSTEM """You are an expert software engineer. Write clean, well-documented code.
Prefer practical solutions over theoretical ones.
When unsure, admit it rather than hallucinating."""
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 32768
docker exec -i ollama ollama create my-coder -f - << 'EOF'
FROM qwen3:72b
SYSTEM "You are an expert software engineer..."
PARAMETER temperature 0.3
PARAMETER num_ctx 32768
EOF

5. Open WebUI Setup

Open WebUI provides a ChatGPT-like interface backed by your local models, plus RAG for document querying.

First-Login Configuration

  1. Open http://localhost:3000 — create admin account
  2. Go to Admin Panel → Settings → Models
    • Verify Ollama connection shows your pulled models
    • Set default model to your primary (e.g., qwen3:72b)
  3. Go to Settings → Documents
    • Upload documents for RAG retrieval
    • Select nomic-embed-text:v1.5 as the embedding model
  4. Go to Settings → Interface
    • Disable signups (if not already done via env)
    • Configure user roles

Key Features Worth Enabling

6. VS Code Integration with Continue.dev

Continue.dev connects VS Code directly to your local Ollama models for inline completions, chat, and agentic coding.

Installation

# In VS Code: Extensions → search "Continue"
# Or via CLI:
code --install-extension continue.continue

Configuration (~/.continue/config.json)

{
  "models": [
    {
      "title": "Local Qwen 3",
      "provider": "ollama",
      "model": "my-coder",
      "apiBase": "http://localhost:11434",
      "contextLength": 32768,
      "completionOptions": {
        "temperature": 0.2,
        "topP": 0.9
      }
    },
    {
      "title": "DeepSeek V4",
      "provider": "ollama",
      "model": "deepseek-v4:latest",
      "apiBase": "http://localhost:11434",
      "contextLength": 65536
    }
  ],
  "tabAutocompleteModel": {
    "title": "Local Tab Complete",
    "provider": "ollama",
    "model": "qwen3:7b",
    "apiBase": "http://localhost:11434"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text:v1.5",
    "apiBase": "http://localhost:11434"
  }
}
✅ Pro tip: Use a smaller model (7B-14B) for tab autocomplete — it needs to respond in under 200ms. Reserve the 70B+ model for /edit and chat commands where quality matters more than speed.

What You Get

7. MCP Server Integration

MCP (Model Context Protocol) servers give your local AI tool access — reading files, querying databases, managing git, and more. Both Continue.dev and Open WebUI support MCP.

MCP with Continue.dev

Add to your ~/.continue/config.json:

{
  "experimental": {
    "mcpServers": {
      "filesystem": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
      },
      "github": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-github"],
        "env": {
          "GITHUB_TOKEN": "${GITHUB_TOKEN}"
        }
      },
      "git": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-git", "."]
      }
    }
  }
}

MCP with Open WebUI

Open WebUI (v0.5+) has a built-in MCP client. Go to Admin Panel → Tools → MCP Servers and add:

{
  "servers": [
    {
      "name": "filesystem",
      "transport": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/data"]
    },
    {
      "name": "web-search",
      "transport": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-brave-search"],
      "env": {
        "BRAVE_API_KEY": "${BRAVE_API_KEY}"
      }
    }
  ]
}

8. GPU Acceleration

Without a GPU, large models will run on CPU — usable but slow (1-3 tokens/sec for 70B). Here's how to ensure GPU acceleration works.

NVIDIA GPU

# Verify NVIDIA drivers and CUDA
nvidia-smi
# Install NVIDIA container toolkit
sudo apt install nvidia-container-toolkit
sudo systemctl restart docker
# Test GPU in container
docker run --rm --gpus all nvidia/cuda:12.8-base nvidia-smi

If the Docker Compose file above has deploy.resources.reservations.devices with NVIDIA capabilities, Ollama will automatically use the GPU.

AMD ROCm

Use the ROCm build of Ollama:

image: ollama/ollama:rocm

For AMD GPUs with 16GB+ VRAM (RX 7900 XTX, MI250), ROCm support is solid for inference.

Apple Silicon (M-series)

Ollama runs natively on macOS with Metal acceleration:

# Install directly (no Docker needed on macOS)
brew install ollama
ollama serve
# Metal acceleration is automatic on M1/M2/M3/M4
# Unified memory = 64GB models on M-series Max/Ultra
ollama pull qwen3:72b
💡 Apple Silicon advantage: The unified memory architecture lets you run 70B models on 64GB M3 Ultra without quantization. No discrete GPU needed.

VRAM Requirements

Model Size Quantization VRAM Needed Hardware
7B Q4_K_M ~6 GB RTX 3060, M1, any GPU
14B Q4_K_M ~10 GB RTX 3080, M2 Pro
32B Q4_K_M ~20 GB RTX 3090, M3 Max
70B Q4_K_M ~42 GB RTX 4090 24GB (2x), M3 Ultra, A100
72B Q4_K_M ~43 GB A6000, 2x RTX 3090, M3 Ultra
120B+ Q4_K_M ~72 GB A100 80GB, 2x A6000

9. Production Hardening

Going beyond the dev setup? Here's what to add:

Authentication

Open WebUI has built-in user management with role-based access. For production:

Rate Limiting & Monitoring

# nginx rate limiting example
limit_req_zone $binary_remote_addr zone=ai_api:10m rate=30r/m;
server {
    listen 443 ssl;
    server_name ai.yourdomain.com;
    location / {
        limit_req zone=ai_api burst=10 nodelay;
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Backup Strategy

# Backup Open WebUI data (contains user accounts, chat history, RAG docs)
docker run --rm -v open_webui_data:/data -v $(pwd):/backup \
  alpine tar czf /backup/open-webui-backup-$(date +%Y%m%d).tar.gz -C /data .
# Backup Ollama models (large — skip if you can re-pull)
# Model files are in ollama_data volume: /root/.ollama/models/

Resource Limits

services:
  ollama:
    deploy:
      resources:
        limits:
          memory: 64G
        reservations:
          memory: 32G

10. Model Selection Guide

Use Case Best Model Runner-up Min Hardware
General coding Qwen 3 72B DeepSeek V4-0514 24GB VRAM (Q4)
Tab autocomplete Qwen 3 7B / DeepSeek-Coder 7B Llama 4 8B 8GB VRAM
RAG & document QA Mistral Large 2 Qwen 3 32B 16GB VRAM
Reasoning/math DeepSeek V4-0514 Qwen 3 72B 48GB VRAM
Creative writing Llama 4 70B Mistral Large 2 48GB VRAM
Low-hardware (8GB) Qwen 3 7B Llama 4 8B 8GB VRAM
Cross-platform M-series Qwen 3 72B (Metal) DeepSeek V4 (Metal) M3 Max 48GB

11. Troubleshooting

Ollama won't use GPU

# Check if CUDA is available inside container
docker exec -it ollama nvidia-smi
# Verify container toolkit
docker info | grep -i runtime
# Force GPU device
docker compose down
docker compose up -d  # (Make sure --gpus all is in compose)

Model loads but generates garbage

Open WebUI can't connect to Ollama

# Verify from inside the webui container
docker exec -it open-webui curl http://ollama:11434/api/tags
# Check if OLLAMA_BASE_URL is set correctly in docker-compose
# Should be http://ollama:11434 (container name, not localhost)

Slow inference on Apple Silicon

Continue.dev not connecting

# Test Ollama API from host
curl http://localhost:11434/api/generate -d '{"model":"qwen3:72b","prompt":"hi"}'
# Verify apiBase in config.json matches exactly
# For Docker: apiBase is "http://host.docker.internal:11434"
# For native install: "http://localhost:11434"
Done. You now have a fully private, offline-capable AI stack running on your own hardware. Questions or issues? Check the Ollama GitHub or Open WebUI Discord for the latest community troubleshooting.

Tags: Ollama, Open WebUI, Continue.dev, Local AI, MCP, Docker, GPU, LLM