Interfaze: A New Model Architecture That Merges DNN/CNN Precision with Transformer Flexibility
TL;DR
Interfaze is a new model architecture that combines task-specific DNN/CNN encoders with an omni-transformer, aiming for deterministic accuracy at flash-tier pricing. Across 9 head-to-head benchmarks covering OCR, vision, speech-to-text, structured output, and general reasoning, it leads or ties Gemini-3-Flash, Claude-Sonnet-4.6, GPT-5.4-Mini, and Grok-4.3 (the one tie is with Claude-Sonnet-4.6 on GPQA Diamond). Priced at $1.50/M input and $3.50/M output tokens, it's positioned as a drop-in replacement for general-purpose flash models on deterministic tasks.
The Problem: Wrong Models for Wrong Tasks
For years, developers have been reaching for general-purpose LLMs (transformers) to solve deterministic tasks like OCR, document extraction, and GUI detection. The result is predictable: transformer models hallucinate on precise extraction, miss bounding boxes, add cost, and introduce latency — all because they're designed for creative generation, not deterministic accuracy.
On the other hand, specialized DNN/CNN models (think CRNN-CTC, ResNet) are up to 100x more accurate at their specific task and produce useful metadata like bounding boxes and confidence scores. But they're rigid: only as good as their training data, expensive to retrain for new tasks, and can't handle the nuance and decision-making transformers excel at.
Interfaze's insight is simple: why not both?
Architecture: How Interfaze Works
Instead of forcing a pure transformer to do OCR or forcing a CNN to do reasoning, Interfaze builds a hybrid pipeline that runs task-specific DNN/CNN encoders alongside an omni-transformer backbone in a shared vector space:
- Input Modalities: Text, Images, Audio, Files (1M token context window, 32K max output)
- Task-Specific Encoders: Dedicated DNN/CNN layers for OCR text-line detection, object bounding boxes, GUI element recognition, and audio feature extraction
- Transformer Backbone: Omni-transformer for reasoning, translation, decision-making, and structured output
- Shared Latent Space: Encoder outputs and transformer representations share the same vector space — enabling the model to combine "what it sees" with "what it knows"
- Optional Reasoning: Reasoning mode available (disabled by default for speed)
The result is a model that can both extract a date-of-birth from a passport with pixel-perfect bounding boxes and calculate the person's age from the extracted data — all in one pass.
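To make the passport example concrete, here is a minimal TypeScript sketch of what "extract, then reason" could look like on the consuming side. The `ExtractedField` shape, the bounding-box layout, and the sample values are all assumptions for illustration; the source does not document Interfaze's actual response format.

```typescript
// Hypothetical shape for one extracted field. Interfaze's real response
// format is not described in this article and may differ.
interface ExtractedField {
  name: string;
  value: string; // assumed to be an ISO 8601 date (YYYY-MM-DD) for date fields
  bbox: [number, number, number, number]; // assumed [x, y, width, height] in pixels
  confidence: number; // assumed range 0..1
}

// Split an ISO date string into numeric parts without relying on time zones.
function parseIsoDate(iso: string): { year: number; month: number; day: number } {
  return {
    year: Number(iso.slice(0, 4)),
    month: Number(iso.slice(5, 7)),
    day: Number(iso.slice(8, 10)),
  };
}

// Age in whole years on a given day, both dates given as YYYY-MM-DD.
function ageOn(dateOfBirth: string, today: string): number {
  const born = parseIsoDate(dateOfBirth);
  const now = parseIsoDate(today);
  const hadBirthdayThisYear =
    now.month > born.month || (now.month === born.month && now.day >= born.day);
  return now.year - born.year - (hadBirthdayThisYear ? 0 : 1);
}

// Made-up sample values, standing in for what an encoder might return.
const dob: ExtractedField = {
  name: "date_of_birth",
  value: "1990-08-15",
  bbox: [412, 236, 148, 22],
  confidence: 0.98,
};

console.log(ageOn(dob.value, "2025-06-01")); // 34 (birthday not yet reached in 2025)
```

In the architecture described above, both steps would happen inside the model in a single pass; the sketch only shows how the two kinds of output (located text and a derived value) relate.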
Benchmark Results: Head-to-Head Against Flash Models
Interfaze is benchmarked against models in a similar pricing tier: Gemini-3-Flash, Claude-Sonnet-4.6, GPT-5.4-Mini, and Grok-4.3. It has the best score on every benchmark listed, tying Claude-Sonnet-4.6 on GPQA Diamond:
| Benchmark | Interfaze | Gemini-3-Flash | Claude-Sonnet-4.6 | GPT-5.4-Mini | Grok-4.3 |
|---|---|---|---|---|---|
| OCRBench V2 | 70.7% | 55.8% | 54.7% | 52.7% | 54.7% |
| olmOCR | 85.7% | 75.3% | 73.9% | 80.1% | 81.9% |
| RefCOCO (Object Detection) | 82.1% | 75.2% | 75.5% | 67.0% | 25.0% |
| VoxPopuli (WER) ↓ | 2.4% | 4.0% | — | — | — |
| Spider 2.0-Lite | 52.9% | 45.2% | 49.6% | 26.7% | 45.9% |
| GPQA Diamond | 89.9% | 88.5% | 89.9% | 82.8% | 73.6% |
| MMMLU | 90.9% | 88.7% | 84.9% | 75.3% | 89.7% |
| MMMU-Pro | 71.1% | 67.6% | 46.3% | 40.4% | 68.7% |
| SOB Value Acc | 79.5% | 77.3% | 77.9% | 75.1% | 78.4% |
↓ = lower is better (Word Error Rate). "—" = model has no native audio input.
The margin is particularly striking on OCRBench V2 (+14.9 percentage points over the next best), RefCOCO (+6.6 points), and olmOCR (+3.8 points). On VoxPopuli, Interfaze's word error rate is 40% lower than Gemini-3-Flash's (2.4% vs. 4.0%). On general reasoning benchmarks like GPQA Diamond and MMMLU, Interfaze matches or slightly exceeds the best flash models, suggesting the hybrid architecture doesn't compromise general intelligence.
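The margins can be recomputed directly from the table. The sketch below copies three rows from it and reports the lead over the next-best competitor in percentage points, plus the relative word-error-rate reduction on VoxPopuli; the type and function names are illustrative.

```typescript
// Scores copied from the benchmark table, in percent.
type Row = { benchmark: string; interfaze: number; others: number[] };

const rows: Row[] = [
  { benchmark: "OCRBench V2", interfaze: 70.7, others: [55.8, 54.7, 52.7, 54.7] },
  { benchmark: "olmOCR", interfaze: 85.7, others: [75.3, 73.9, 80.1, 81.9] },
  { benchmark: "RefCOCO", interfaze: 82.1, others: [75.2, 75.5, 67.0, 25.0] },
];

// Lead over the best competing score, in percentage points, rounded to 0.1.
function marginOverNextBest(row: Row): number {
  const nextBest = Math.max(...row.others);
  return Math.round((row.interfaze - nextBest) * 10) / 10;
}

// Relative reduction of an error rate, as a whole-number percentage.
function relativeReduction(baseline: number, improved: number): number {
  return Math.round(((baseline - improved) / baseline) * 100);
}

for (const row of rows) {
  console.log(`${row.benchmark}: +${marginOverNextBest(row)} points`);
}
// OCRBench V2: +14.9 points
// olmOCR: +3.8 points
// RefCOCO: +6.6 points

console.log(`VoxPopuli WER reduction vs Gemini-3-Flash: ${relativeReduction(4.0, 2.4)}%`);
// VoxPopuli WER reduction vs Gemini-3-Flash: 40%
```

Note that these are differences in percentage points, not relative percentage gains, and that the next-best RefCOCO score is Claude-Sonnet-4.6's 75.5%.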
Structured Output: The SOB Benchmark
Alongside the model, the Interfaze team released the Structured Output Benchmark (SOB). Most LLMs follow JSON schemas well but can still fill them with hallucinated values. SOB tests a narrower question: given the correct answer in context, can the model output it accurately?
Interfaze leads SOB at 79.5% Value Accuracy across text, image, and audio modalities, which per the table above is 1.1 points ahead of Grok-4.3, 1.6 ahead of Claude-Sonnet-4.6, and 2.2 ahead of Gemini-3-Flash. For developers building deterministic pipelines (document processing, form filling, data extraction), this matters more than raw MMLU scores.
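As a rough illustration of what a value-accuracy metric measures, the sketch below scores a flat JSON object field by field using exact match. This is an assumption made for illustration: the article does not describe SOB's actual scoring rules, and the function and field names here are hypothetical.

```typescript
// A flat record of scalar values, e.g. fields extracted from an invoice.
type FlatRecord = Record<string, string | number | boolean | null>;

// Fraction of expected fields whose value in `actual` matches exactly.
// Missing fields count as wrong; extra fields in `actual` are ignored.
function valueAccuracy(expected: FlatRecord, actual: FlatRecord): number {
  const keys = Object.keys(expected);
  if (keys.length === 0) return 1;
  const correct = keys.filter((key) => key in actual && actual[key] === expected[key]).length;
  return correct / keys.length;
}

// Made-up example: the schema is followed perfectly, but one value is wrong.
const expected: FlatRecord = { invoice_number: "INV-1042", total: 129.5, currency: "EUR", paid: false };
const actual: FlatRecord = { invoice_number: "INV-1042", total: 129.5, currency: "USD", paid: false };

console.log(valueAccuracy(expected, actual)); // 0.75
```

The example output is schema-valid in every respect, which is exactly the failure mode a schema-compliance check would miss.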
Speech-to-Text: Competitive with Dedicated ASR
On VoxPopuli-Cleaned-AA, Interfaze achieves a 2.4% Word Error Rate — second only to dedicated ASR providers and significantly better than Gemini-3-Flash (4.0%). More importantly, its inference speed is 209 seconds of audio per second of compute:
- ~1.5× faster than Deepgram Nova-3
- ~8× faster than Scribe v2
- ~11× faster than Gemini-3-Flash
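To put the throughput figure in practical terms, this small sketch converts it into compute time for a recording. The throughput of 209 and the speedup ratios come from the text above; the implied competitor throughput is derived from those approximate ratios, so treat it as a rough estimate rather than a measured value.

```typescript
// Throughput reported in the text: seconds of audio per second of compute.
const INTERFAZE_AUDIO_SECONDS_PER_SECOND = 209;

// Compute time needed for a recording of the given length at a given throughput.
function secondsToTranscribe(audioSeconds: number, throughput: number): number {
  return audioSeconds / throughput;
}

// Throughput implied for a system that Interfaze is `speedup` times faster than.
function impliedThroughput(speedup: number): number {
  return INTERFAZE_AUDIO_SECONDS_PER_SECOND / speedup;
}

const oneHourOfAudio = 3600;

// One hour of audio at 209x real time.
console.log(secondsToTranscribe(oneHourOfAudio, INTERFAZE_AUDIO_SECONDS_PER_SECOND).toFixed(1)); // "17.2"

// The same hour on a system roughly 11x slower (implied throughput of 19x real time).
console.log(secondsToTranscribe(oneHourOfAudio, impliedThroughput(11)).toFixed(1)); // "189.5"
```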
Pricing and API Compatibility
Interfaze is priced at the flash tier:
- Input: $1.50 per million tokens
- Output: $3.50 per million tokens
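A quick way to budget against these rates is a small cost estimator. The prices below are the ones listed above; the function name and the example token counts are illustrative.

```typescript
// Listed prices in USD per million tokens.
const INPUT_USD_PER_MILLION = 1.5;
const OUTPUT_USD_PER_MILLION = 3.5;

// Estimated cost in USD for a workload with the given token counts.
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  const inputCost = (inputTokens / 1_000_000) * INPUT_USD_PER_MILLION;
  const outputCost = (outputTokens / 1_000_000) * OUTPUT_USD_PER_MILLION;
  return inputCost + outputCost;
}

// Example workload: 2M input tokens and 1M output tokens.
console.log(estimateCostUsd(2_000_000, 1_000_000).toFixed(2)); // "6.50"
```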
It speaks the Chat Completions API standard — compatible with OpenAI SDK, Vercel AI SDK, LangChain, and any tool that speaks OpenAI-compatible APIs:
```typescript
import OpenAI from "openai";

// Point the standard OpenAI client at the Interfaze endpoint.
const interfaze = new OpenAI({
  baseURL: "https://api.interfaze.ai/v1",
  apiKey: "<your-api-key>",
});

// Use it like any OpenAI model
const response = await interfaze.chat.completions.create({
  model: "interfaze-v1",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Extract all text from this document with bounding boxes" },
        { type: "image_url", image_url: { url: "https://..." } },
      ],
    },
  ],
});
```
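For extraction pipelines, you would typically also want the reply constrained to a schema. The sketch below builds a request body that uses the Chat Completions `response_format` field with a JSON Schema. Whether Interfaze honors `response_format` is an assumption here, since the article only states Chat Completions compatibility; the schema fields and the `buildExtractionRequest` helper are hypothetical.

```typescript
// JSON Schema for the fields we want back. Field names are illustrative.
const invoiceSchema = {
  type: "object",
  properties: {
    invoice_number: { type: "string" },
    total: { type: "number" },
    currency: { type: "string" },
  },
  required: ["invoice_number", "total", "currency"],
  additionalProperties: false,
};

// Build a Chat Completions request body for a single image.
// No network call happens here; the result is a plain object.
function buildExtractionRequest(imageUrl: string) {
  return {
    model: "interfaze-v1",
    messages: [
      {
        role: "user" as const,
        content: [
          { type: "text" as const, text: "Extract the invoice fields from this image." },
          { type: "image_url" as const, image_url: { url: imageUrl } },
        ],
      },
    ],
    // Assumption: the endpoint supports OpenAI-style structured outputs.
    response_format: {
      type: "json_schema" as const,
      json_schema: { name: "invoice", strict: true, schema: invoiceSchema },
    },
  };
}

const request = buildExtractionRequest("https://example.com/invoice.png");
console.log(request.response_format.json_schema.name); // "invoice"
```

The resulting object is meant to be passed to `chat.completions.create(...)` on a client configured as in the example above. Keeping the builder free of network calls makes it easy to unit-test the request shape separately.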
Use Cases
- OCR at scale: Complex PDFs, multi-column layouts, handwritten forms — with bounding box metadata
- Document processing pipelines: Extract + validate + transform in one pass
- GUI/UI automation: Detect buttons, inputs, and UI elements with pixel coordinates
- Speech-to-text: High-accuracy transcription with speaker diarization
- Structured data extraction: Forms, invoices, receipts — with reliable JSON output
- Translation: Multilingual with object-aware context (knows what the text refers to)
The Bigger Picture: A Third Category of AI Model
Interfaze represents an emerging third category of AI model — between general-purpose frontier models (Claude Opus, GPT-5.5) and specialized single-task models (Mistral OCR, Whisper). Its hybrid architecture opens up a design space where:
- Deterministic tasks get DNN-level precision
- Flexible reasoning gets transformer-level understanding
- Cost stays at flash-tier pricing
This isn't an LLM replacement — it's an answer to the question: "Why am I paying a creative writer to do data entry?"
For developers building document-heavy, automation-heavy, or extraction-heavy pipelines, Interfaze is worth a serious look. Because it speaks the Chat Completions API, migration should mostly come down to swapping the base URL, API key, and model name.