Antirez DS4 (DwarfStar 4): Redis Creator's Standalone Inference Engine for DeepSeek V4 Flash

What Is DS4 (DwarfStar 4)?

DS4 (DwarfStar 4) is a self-contained native inference engine built by Salvatore Sanfilippo (antirez), the creator of Redis, specifically for DeepSeek V4 Flash. Unlike general-purpose runners such as Ollama and other llama.cpp wrappers, DS4 is intentionally narrow: it targets a single model and is tuned end to end for it.

The engine handles everything needed for production-quality local inference: model loading, prompt rendering, tool calling, KV state handling (both RAM and on-disk), and a full server API — all ready to work with coding agents or the provided CLI interface.

What makes this project particularly notable is that antirez says this is the first time he has used a local model for serious work instead of Claude or GPT. Coming from someone who has followed the AI space closely for years, that is a significant endorsement of DeepSeek V4 Flash's quality for local inference.

The project immediately hit the Hacker News front page, racking up 79+ points in under an hour and over 24k views — a clear signal of the community's hunger for capable local AI inference.

Why DeepSeek V4 Flash Deserves a Dedicated Engine

Most inference engines try to run every model under the sun. DS4 takes the opposite approach. Here's why antirez believes DeepSeek V4 Flash is special enough to warrant its own standalone engine:

Speed: DeepSeek V4 Flash is faster than dense models of similar quality because it activates fewer parameters per token (MoE architecture).
Efficient Thinking: In thinking mode, the model produces a thinking section much shorter than that of other models (often about 1/5 the length), and crucially, the thinking length scales with problem complexity. This keeps thinking mode practical in situations where other models are practically unusable.
1M Token Context: The model features a native 1 million token context window.
Frontier-Level Knowledge: At 284B parameters (sparsely activated), it knows significantly more than smaller dense models — especially on niche topics like Italian culture or politics.
Exceptional KV Compression: The KV cache is incredibly compressed, enabling long-context inference even on local machines and allowing persistent on-disk KV caching.
2-Bit Viability: When quantized asymmetrically (2-bit routed MoE experts, shared components kept at 8-bit or higher), the model performs remarkably well, good enough for reliable tool calling under coding agents.
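
The Speed point rests on sparse activation: a router picks a few experts per token, so only their weights are read. The sketch below shows generic top-k routing with softmax-normalized weights. It is an illustration of the general MoE idea, not DeepSeek V4 Flash's actual router; the function name `top_k_route`, the expert count, and the value of `k` are all assumptions made for the example.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Generic MoE routing sketch (hypothetical helper, not DS4 code).
    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    # Rank expert indices by router score, highest first.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the chosen experts only (shifted by the max for stability).
    m = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - m) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

if __name__ == "__main__":
    logits = [0.1, 2.0, -1.0, 2.0, 0.5]   # made-up scores for 5 experts
    routes = top_k_route(logits, k=2)
    print(routes)
    print(f"experts touched per token: {len(routes)} of {len(logits)}")
```

Only the chosen experts' matrices take part in the token's forward pass, which is why a sparsely activated model can generate faster than a dense model of similar total size.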

Supported Backends: Metal, CUDA, ROCm

DS4 supports three accelerator backends, reflecting a deliberate focus on the hardware that can actually run a 284B parameter model:

Apple Metal (Primary Target)

Metal is the primary development backend. The engine targets MacBooks with 96GB+ of RAM and has been extensively tested on:

MacBook Pro M3 Max (128GB) — Q2 imatrix runs at ~27 tokens/s generation
Mac Studio M3 Ultra (512GB) — Q2 at ~37 tokens/s, Q4 at ~35 tokens/s
MacBook Pro M4 Max — community tested with positive results

Build with: make

NVIDIA CUDA

CUDA support is optimized for the DGX Spark (GB10) — NVIDIA's compact desktop AI supercomputer. It also works on other local CUDA GPUs:

DGX Spark GB10 (128GB unified) — ~14 tokens/s generation for Q2 at 7k context
General NVIDIA GPUs with sufficient VRAM

Build with: make cuda-spark (DGX Spark) or make cuda-generic (other GPUs)

AMD ROCm (Community Branch)

ROCm support lives in a separate branch. Since antirez doesn't have direct AMD hardware access, the community maintains and rebases this branch as needed.

This is a pragmatic approach that keeps the main branch clean while still supporting the AMD ecosystem.

Key Features and Architecture

1. Asymmetric 2/8 Bit Quantization

DS4 uses an asymmetric quantization strategy that challenges the common perception that 2-bit quantization is a joke:

Routed MoE experts, which hold the majority of model parameters, are quantized heavily (up/gate at IQ2_XXS, down at Q2_K)
Shared experts, projections, and routing are kept at higher precision (8-bit or higher)
This selective approach preserves model quality where it matters most while dramatically reducing memory footprint
Imatrix-tuned variants are preferred for better quality
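
To see why this mix lands near the 96GB tier, here is a rough back-of-the-envelope estimate of the average bits per weight. The block sizes are those of the llama.cpp formats IQ2_XXS, Q2_K, and Q8_0. Everything else is an assumption for illustration: that "8-bit" means Q8_0, that the up, gate, and down matrices of each routed expert are equally sized, and above all the 95% share of parameters held by routed experts, which is a made-up figure rather than a measured one.

```python
# Bits per weight = bytes per block * 8 / weights per block (llama.cpp block layouts).
BPW_IQ2_XXS = 66 * 8 / 256   # 2.0625
BPW_Q2_K = 84 * 8 / 256      # 2.625
BPW_Q8_0 = 34 * 8 / 32       # 8.5 (ASSUMPTION: "8-bit" components use Q8_0)

TOTAL_PARAMS = 284e9         # parameter count quoted in the article
ROUTED_FRACTION = 0.95       # ASSUMPTION: hypothetical share held by routed experts

def routed_bpw():
    # ASSUMPTION: up, gate, and down are the same size, so up/gate (IQ2_XXS)
    # make up two thirds of each routed expert and down (Q2_K) one third.
    return (2 * BPW_IQ2_XXS + BPW_Q2_K) / 3

def average_bpw(routed_fraction):
    # Weighted mix of heavily quantized routed experts and 8-bit everything else.
    return routed_fraction * routed_bpw() + (1 - routed_fraction) * BPW_Q8_0

def size_gb(total_params, bpw):
    # Bits -> bytes -> decimal gigabytes.
    return total_params * bpw / 8 / 1e9

if __name__ == "__main__":
    bpw = average_bpw(ROUTED_FRACTION)
    print(f"routed experts: {routed_bpw():.4f} bits/weight")
    print(f"whole model:    {bpw:.4f} bits/weight")
    print(f"approx. size:   {size_gb(TOTAL_PARAMS, bpw):.1f} GB")
```

The estimate is sensitive to `ROUTED_FRACTION`: every percentage point moved from the routed experts to the 8-bit components adds several gigabytes, which is why keeping only the small shared parts at higher precision is affordable.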

2. KV State as a First-Class Disk Citizen

antirez makes an architectural bet: with KV caches as compressed as DeepSeek V4's and fast SSD storage, the assumption that the KV cache must live in RAM no longer holds. DS4 treats on-disk KV state as a first-class citizen alongside RAM, enabling:

Very long context inference on machines with limited RAM
Persistent KV cache across sessions
Efficient memory management for 1M token context windows
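
The idea of persisting KV state across sessions can be sketched as a prefix cache on disk: store the state reached after a given token prefix, and on the next request reuse the longest stored prefix instead of recomputing it. This is a toy under stated assumptions, not DS4's on-disk format; the class `DiskKVCache`, the `.kv` file naming, and the opaque `state` bytes are all hypothetical.

```python
import hashlib
import os
import tempfile

class DiskKVCache:
    """Toy prefix cache mapping a token prefix to opaque KV-state bytes on disk."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, tokens):
        # Hash the token ids so the file name identifies the exact prefix.
        key = ",".join(map(str, tokens)).encode()
        return os.path.join(self.directory, hashlib.sha256(key).hexdigest() + ".kv")

    def save(self, tokens, state):
        # Write to a temp file, then rename, so a crash never leaves a partial entry.
        path = self._path(tokens)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(state)
        os.replace(tmp, path)

    def longest_prefix(self, tokens):
        # Try the full prompt first, then progressively shorter prefixes.
        # (Quadratic in prompt length; fine for a sketch, not for 1M tokens.)
        for end in range(len(tokens), 0, -1):
            path = self._path(tokens[:end])
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return end, f.read()
        return 0, b""

if __name__ == "__main__":
    cache = DiskKVCache(tempfile.mkdtemp())
    cache.save([1, 2, 3], b"state-after-3-tokens")
    reused, state = cache.longest_prefix([1, 2, 3, 4, 5])
    print(f"reuse {reused} cached tokens, prefill only the remaining {5 - reused}")
```

Because entries are plain files, a new process pointed at the same directory finds them again, which is the property that makes a cache persistent across sessions.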

3. Tool Calling Ready

DS4 ships with built-in tool calling support — not as an afterthought, but as a core feature. The GGUF files distributed by antirez are verified to work reliably with tool calling agents. This is critical for practical use cases like coding agents and automated workflows.

4. Server API + CLI

The engine provides two interfaces:

./ds4 — CLI interface for interactive use, testing, and benchmarking
./ds4-server — HTTP API server for integration with coding agents and external tools

Both accept the same model path via the -m flag.

5. GGUF and IMatrix Tooling

DS4 includes a full suite of offline tools for model builders:

GGUF generation and quantization
IMatrix (importance matrix) collection and application
Quality testing against official DeepSeek V4 Flash logits
Speed benchmarking with CSV output and graph generation

Performance Benchmarks

Here are the performance numbers from antirez's testing (--ctx 32768, --nothink, greedy decoding, -n 256; Metal backend for the Mac rows, CUDA for the DGX Spark row). Prefill and generation speeds are in tokens per second (t/s):

Machine	Quant	Prompt	Prefill	Generation
M3 Max 128GB	Q2	Short	58.52 t/s	26.68 t/s
M3 Max 128GB	Q2	11,709 tokens	250.11 t/s	21.47 t/s
M3 Ultra 512GB	Q2	Short	84.43 t/s	36.86 t/s
M3 Ultra 512GB	Q2	11,709 tokens	468.03 t/s	27.39 t/s
M3 Ultra 512GB	Q4	Short	78.95 t/s	35.50 t/s
M3 Ultra 512GB	Q4	12,018 tokens	448.82 t/s	26.62 t/s
DGX Spark 128GB	Q2	7,047 tokens	343.81 t/s	13.75 t/s

Key takeaway: Q2 imatrix delivers roughly 21-37 tokens/s of generation on Apple Silicon in these runs, which is perfectly usable for interactive work, coding agents, and chat. Q4 requires 256GB+ machines but delivers comparable speed with higher quality.
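
To turn the table into wall-clock time, divide the prompt length by the prefill rate and the output length by the generation rate. The sketch below does this for the two long-prompt Q2 rows, using the 256 generated tokens from the benchmark settings. The helper name `response_seconds` is made up for the example, and it assumes the generation rate stays constant over the whole output.

```python
def response_seconds(prompt_tokens, prefill_tps, new_tokens, gen_tps):
    """Estimated seconds to process the prompt and then generate new_tokens."""
    prefill_time = prompt_tokens / prefill_tps   # time spent reading the prompt
    generation_time = new_tokens / gen_tps       # time spent producing output
    return prefill_time + generation_time

# Figures taken from the benchmark table (Q2, 11,709-token prompt, -n 256).
m3_max = response_seconds(11709, 250.11, 256, 21.47)
m3_ultra = response_seconds(11709, 468.03, 256, 27.39)

print(f"M3 Max 128GB:   {m3_max:.1f} s")
print(f"M3 Ultra 512GB: {m3_ultra:.1f} s")
```

On long prompts most of the time goes to prefill rather than generation, so the prefill column matters more for agent workloads than the headline generation speed suggests.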

How to Get Started with DS4

Step 1: Download the Model

The project ships with a convenient download script that fetches pre-quantized GGUF files from Hugging Face:

# For 96-128GB machines (recommended):
./download_model.sh q2-imatrix

# For 256GB+ machines:
./download_model.sh q4-imatrix

# Legacy variants (if you need non-imatrix):
./download_model.sh q2   # 96-128GB
./download_model.sh q4   # 256GB+

The script stores files under ./gguf/, resumes partial downloads with curl -C -, and creates a symlink at ./ds4flash.gguf pointing to the selected model.

Step 2: Build the Engine

# macOS with Metal
make

# Linux with CUDA (DGX Spark)
make cuda-spark

# Linux with CUDA (other GPUs)
make cuda-generic

# CPU-only diagnostics
make cpu

Step 3: Run Inference

# CLI interactive mode
./ds4

# Server mode (HTTP API)
./ds4-server

# Custom model path
./ds4 -m ./gguf/my-model.gguf

# Use with MTP speculative decoding (experimental)
./ds4 --mtp

Pass --help to either binary for the full flag list.

Hardware Requirements

DS4 is designed for high-end personal machines. Here's what you need:

Minimum (Q2 imatrix): 96GB RAM (many users report that it works with a 250k-token context)
Recommended (Q2 imatrix): 128GB RAM for comfortable context windows
Q4 operation: 256GB+ RAM
Top-end tested setups: Mac Studio M3 Ultra 512GB (Metal) or DGX Spark GB10 (CUDA)

Note: CPU-only builds are available for diagnostics and tokenizer testing, but on macOS the CPU path crashes because of a kernel bug in virtual memory. Use Metal or CUDA for actual inference.
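
The RAM tiers map directly onto the arguments of the download script. A minimal sketch of that mapping follows; `pick_variant` is a hypothetical helper written for this article, not part of DS4, and it only encodes the 96GB and 256GB thresholds stated above.

```python
def pick_variant(ram_gb):
    """Map installed RAM in GB to a download_model.sh argument, or None if too small."""
    if ram_gb >= 256:
        return "q4-imatrix"   # Q4 tier: 256GB and above
    if ram_gb >= 96:
        return "q2-imatrix"   # Q2 tier: 96GB up to, but not including, 256GB
    return None               # below the stated 96GB minimum

for ram in (64, 96, 128, 512):
    variant = pick_variant(ram)
    if variant is None:
        print(f"{ram}GB: below the 96GB minimum")
    else:
        print(f"{ram}GB: ./download_model.sh {variant}")
```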

MTP Speculative Decoding (Experimental)

DS4 includes experimental Multi-Token Prediction (MTP) support for speculative decoding. An optional GGUF file can be downloaded via ./download_model.sh mtp. Currently, the feature is correctness-gated and provides at most a slight speedup — not yet a meaningful generation-speed win. It works with all quantization variants.
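
For readers unfamiliar with the technique, here is a minimal sketch of greedy speculative decoding: a cheap draft proposes several tokens, the full model checks them, and only the prefix the full model agrees with is kept. The toy "models" are plain functions over integers, and `speculative_step`, `target_next`, and `draft_next` are hypothetical names. This shows the general accept-and-verify idea only, not how DS4 wires MTP into its engine.

```python
def speculative_step(target_next, draft_next, context, k):
    """One greedy speculative step: draft k tokens, keep the verified prefix."""
    # 1. Draft k tokens with the cheap predictor.
    drafted, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    # 2. Verify against the target model. A real engine scores all drafted
    #    positions in one batched pass; here we call the target per position.
    accepted, ctx = [], list(context)
    for token in drafted:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)   # first mismatch: take the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every draft matched, so the target's next token comes for free.
    accepted.append(target_next(ctx))
    return accepted

def target_next(ctx):
    # Toy "full model": count upward modulo 10.
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    # Toy "draft": agrees with the target except right after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

if __name__ == "__main__":
    print(speculative_step(target_next, draft_next, [1], k=4))
    print(speculative_step(target_next, draft_next, [5], k=3))
```

The output always equals what the target alone would have produced, so correctness never depends on the draft. The speedup depends entirely on how often drafts are accepted, which is consistent with the modest gains reported for the current MTP support.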

Quality and Validation Philosophy

One of DS4's most distinctive features is its commitment to official-vector validation. Every GGUF file distributed for this project is tested against logits obtained from the official DeepSeek V4 Flash implementation. The test suite includes:

Continuation vectors at different context sizes
Long-context tests (thousands of tokens)
Regression checks against known outputs
Agent integration tests for tool calling

This is a significant departure from the typical local inference project, where "it runs" is often the only quality bar. antirez's goal is to make one model feel finished end-to-end, not just runnable.
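
Comparing a quantized build against reference logits usually comes down to a few simple metrics. The sketch below computes three common ones for a single token position: whether the top-1 token matches, the largest absolute logit difference, and the KL divergence between the two softmax distributions. The choice of metrics and the name `compare_logits` are assumptions for illustration; the article does not say which measures or thresholds DS4's test suite uses.

```python
import math

def softmax(logits):
    # Shift by the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def compare_logits(reference, candidate):
    """Compare candidate logits to reference logits for one token position."""
    p = softmax(reference)
    q = softmax(candidate)
    # KL(p || q): how much probability mass the candidate misplaces.
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return {
        "top1_match": reference.index(max(reference)) == candidate.index(max(candidate)),
        "max_abs_diff": max(abs(a - b) for a, b in zip(reference, candidate)),
        "kl": kl,
    }

if __name__ == "__main__":
    reference = [2.0, 1.0, 0.1]   # made-up logits standing in for the official model
    candidate = [1.9, 1.1, 0.1]   # made-up logits standing in for a quantized build
    print(compare_logits(reference, candidate))
```

A top-1 match alone is a weak check, since two distributions can agree on the best token while differing a lot elsewhere. That is why a divergence measure is typically tracked alongside it.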

Relationship with llama.cpp and GGML

DS4 is self-contained: ds4.c does not link against GGML. However, the project acknowledges a deep debt to llama.cpp and Georgi Gerganov's work, and it adapts GGUF quant layouts, CPU quant/dot logic, and certain kernels under the MIT license. It is a derivative in the best sense: it stands on the shoulders of a foundational project while building something deliberately specialized.

What This Means for Local AI

DS4 represents a significant shift in philosophy for local AI inference. Instead of trying to run everything, antirez argues for deep optimization of one important model. The same approach could apply to future frontier models: the engine may change targets, but the constraint remains credible local inference on high-end personal machines.

The fact that the creator of Redis — someone with decades of systems programming expertise — is choosing a local model over Claude and GPT for serious work is a powerful signal. If DeepSeek V4 Flash on a MacBook with DS4 can replace cloud API calls for a systems programmer of antirez's caliber, the local AI landscape just got a lot more interesting.

Follow the project on GitHub: github.com/antirez/ds4

This article was written on May 15, 2026. DS4 is alpha-quality software — expect rapid changes, improvements, and bug fixes. Always check the latest README and open issues before deploying.