Run AI locally.
No compromises.
The most complete local AI engine for Mac — 224× faster than LM Studio at 100K context. The only MLX engine where VL models work with the full 5-layer caching stack. Speculative decoding, Mamba/SSM support, 50+ auto-detected architectures, 14 tool parsers, 4 reasoning parsers, 20+ agentic tools, voice, vision. Nothing else comes close.
The most complete MLX engine.
Vision. Mamba. Five caching layers.
Nothing else
comes close.
The only MLX inference engine where vision-language models work with the full caching stack — prefix cache + paged KV cache + KV quantization (q4/q8) + continuous batching + persistent disk cache. Plus speculative decoding, Mamba/SSM hybrid support, 50+ auto-detected architectures, 14 tool call parsers, 4 reasoning parsers, and 20+ built-in agentic tools. No other MLX app has even two of these.
5-Layer Caching Stack — Works with VL Models
The only MLX engine that combines prefix caching, paged KV cache, KV cache quantization (q4/q8), continuous batching, and persistent disk cache — and the only one where vision-language models (Qwen VL, LLaVA) work with all five layers. 9.7× faster TTFT. Multi-context caching survives conversation switches and app restarts. Competitors offer one layer at best; none support VL + caching at all.
Paged KV Cache
vLLM-style paged attention on Apple Silicon. Configurable block size, with up to 1000 blocks. Multiple conversations stay cached simultaneously — switch contexts without eviction. LM Studio uses a single cache slot; Ollama has no KV cache at all.
KV Cache Quantization
Storage-boundary quantization: full precision during generation, compressed to q8 (~2× savings) or
q4 (~4×) only when stored in the prefix cache. Zero quality loss during inference. Run 100K+ context
on a 16GB Mac. Reports cached_tokens in the OpenAI-compatible API response. Not available in
LM Studio, Ollama, or any other MLX app.
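The "100K+ context on a 16GB Mac" claim can be sanity-checked with back-of-envelope arithmetic. A sketch, assuming Llama-3.2-3B-class dimensions (28 layers, 8 KV heads, head dim 128); read your model's actual values from its config.json, and note vMLX's internal layout may differ:

```python
# KV cache size = 2 (keys + values) x layers x kv_heads x head_dim
# x sequence length x bytes per element.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 8, 128

def kv_cache_gb(seq_len: int, bytes_per_elem: float) -> float:
    """Size in GB of the cached K/V tensors for one sequence."""
    elems = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * seq_len
    return elems * bytes_per_elem / 1e9

for label, nbytes in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
    print(f"{label}: {kv_cache_gb(100_000, nbytes):.1f} GB at 100K tokens")
# fp16: 11.5 GB, q8: 5.7 GB, q4: 2.9 GB at 100K tokens. q4 is what
# lets the cache fit next to the weights on a 16GB machine.
```

Because quantization happens only at the storage boundary, the generation path still computes against full-precision K/V tensors.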
Continuous Batching
256 concurrent inference sequences with intelligent batch scheduling. Serve multiple clients from one Mac — team-scale local inference. LM Studio and Ollama max out at 1.
Persistent Disk Cache
Prompt cache writes to disk and survives restarts. Launch vMLX and get instant warm TTFT on yesterday's conversations. Configurable size (GB) and directory. No other local MLX app persists cache to disk.
Agentic Coding Tools
No other local AI app has this. 20+ built-in tools: read, write, edit, copy, move, and delete files. Search codebases with grep and glob. Execute shell commands. Run git status, diff, log, and show. Search the web via DuckDuckGo or Brave. Fetch any URL. Access the clipboard. Query the current date and time. All running locally with a configurable working directory.
OpenAI-Compatible API & Remote Endpoints
Serve 7 API endpoints locally (chat, responses, completions, embeddings, MCP, audio, cancel). Or flip to Remote Endpoint mode and connect to OpenAI, Anthropic, or any API — use vMLX's agentic tools with cloud models. One app for local and remote inference.
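As a sketch of the wire format, this is the request body any OpenAI-compatible client sends to the chat endpoint. The localhost:8000 address is an assumption based on the default local setup; the port is configurable:

```python
import json

URL = "http://localhost:8000/v1/chat/completions"  # default local address

payload = {
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize MLX in one sentence."},
    ],
    "stream": True,       # tokens arrive as server-sent 'data:' chunks
    "temperature": 0.7,
}

body = json.dumps(payload).encode()
# POST `body` to URL with any HTTP client (urllib, requests, curl).
print(json.dumps(payload, indent=2))
```

The same payload shape works in Remote Endpoint mode, where vMLX forwards to a cloud provider instead of the local engine.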
Voice Chat
Built-in text-to-speech on every assistant message. Click to listen to any response — hands-free AI interaction. No extra setup, no external services.
Vision & Multimodal
Attach images in chat for vision-capable models like Qwen VL and LLaVA. Paste or drag-and-drop images directly. Click to zoom. Full multimodal conversation support.
Reasoning & Thinking Blocks
Collapsible reasoning display for thinking models — DeepSeek R1, Qwen 3, GLM-4.7. See the model's chain-of-thought in a clean expandable block, separate from the final answer.
Inline Tool Calls & Live Execution
Tool calls render as expandable pills inline with the model's response — click to reveal arguments and results. Real-time status indicators show when tools are executing, generating, or complete. Git status, diff, log, and show are built in alongside file I/O, shell, search, clipboard, and date/time tools.
Auto Model Detection
Reads model architecture from config.json and auto-selects from 14 tool call parsers and 4 reasoning parsers. Recognizes 50+ model architectures including Llama, Qwen, DeepSeek, Mistral, Gemma, Phi, GLM, Mamba, and more. No manual setup — load any model and vMLX picks the right configuration.
Mamba & SSM Hybrids
First-class support for Mamba and state-space model hybrids with dedicated BatchMambaCache. Proper batch filtering, merging, and KV quantization safety across Mamba layers. No other MLX app supports SSM architectures with batched inference.
Speculative Decoding
Use a small draft model to propose tokens and a large model to verify them — faster generation with the same output quality. Configure any MLX model as the draft, set the number of speculative tokens (default 3), and watch throughput increase. Especially effective when pairing a 2B draft with a 30B+ target. No other local MLX app supports speculative decoding.
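The speedup intuition can be sketched with a toy model: assume each draft token is accepted independently with probability p. This is a simplification for illustration, not vMLX's actual acceptance logic:

```python
def tokens_per_target_pass(k: int, p: float) -> float:
    """Expected tokens emitted per (expensive) target-model forward
    pass when the draft proposes k tokens: 1 + p + p^2 + ... + p^k.
    Each pass keeps the accepted prefix plus one verified token."""
    return sum(p**i for i in range(k + 1))

# With the default of 3 draft tokens and an 80% acceptance rate, one
# target pass yields ~2.95 tokens instead of 1, with identical output.
print(round(tokens_per_target_pass(3, 0.8), 2))   # 2.95
print(round(tokens_per_target_pass(3, 0.5), 2))   # 1.88
```

This is why pairing a small draft with a 30B+ target helps most: the draft pass is nearly free, while every skipped target pass is expensive.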
Generation Defaults
Set default temperature and top-p per session with intuitive sliders. Values persist across restarts and apply to every request unless overridden by the API caller. Fine-tune creativity vs determinism at the session level.
Embedding Endpoint
Serve a dedicated embedding model alongside your chat model. The /v1/embeddings API works
with any MLX embedding model — generate vectors for RAG, semantic search, or clustering without
switching sessions.
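A sketch of how the returned vectors are typically consumed for RAG or semantic search. The model name in the request is a hypothetical placeholder, and the cosine scoring is plain stdlib code on the client side, not something vMLX ships:

```python
import math

# Request body for /v1/embeddings (OpenAI wire format). The server
# responds with {"data": [{"embedding": [...]}, ...], ...}.
request = {
    "model": "my-mlx-embedding-model",  # placeholder name
    "input": ["paged KV cache", "prefix caching"],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, the usual ranking score for RAG/search."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(round(cosine([1.0, 0.0], [1.0, 1.0]), 3))  # 0.707
```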
Auto-Update Checker
vMLX checks GitHub for new releases on startup and shows a dismissible banner when an update is available. One click to download. No forced updates, no background downloads — you stay in control.
Your model writes code,
runs git, browses the web
Built-in coding tools let local models do what previously only cloud AI could — read, write, and edit files, execute shell commands, run git operations, search the web, access the clipboard, query the current date/time, and fetch URLs. No other local AI app has this.
Download models directly.
Or connect to any API.
Browse and download MLX models from HuggingFace in one click — including our recommended in-house models. Or connect to OpenAI, Anthropic, or any OpenAI-compatible remote endpoint and use vMLX's agentic tools with cloud models.
More than just
chat completions
Most local MLX apps only expose a single endpoint. vMLX delivers a full OpenAI-compatible API surface with capabilities no other local app offers.
| Capability | vMLX | Other MLX Apps |
|---|---|---|
| API Endpoints | | |
| /v1/chat/completions | ✓ | ✓ |
| /v1/responses | ✓ | ✗ |
| /v1/completions (text) | ✓ | ✗ |
| /v1/embeddings | ✓ | ✗ |
| /v1/mcp/tools (MCP) | ✓ | ✗ |
| /v1/audio/* (TTS/STT) | ✓ | ✗ |
| Cancel endpoint | ✓ | ✗ |
| Security & Reasoning | | |
| API key authentication | ✓ | ✗ |
| enable_thinking | Reasoning parser separates delta.reasoning | Accepted but no separation |
| reasoning_effort | Sent to server | Not supported |
| Caching & Memory | | |
| KV cache quantization (q4/q8) | ✓ | ✗ |
| Persistent disk cache | ✓ | ✗ |
| Paged multi-context KV cache | ✓ | ✗ |
| Prefix caching | ✓ | Partial |
| Agentic & Tools | | |
| Built-in coding tools (file I/O, shell, search) | ✓ | ✗ |
| Git tools (status, diff, log, show) | ✓ | ✗ |
| Web search (DuckDuckGo / Brave) | ✓ | ✗ |
| URL fetch | ✓ | ✗ |
| Clipboard access (read/write) | ✓ | ✗ |
| Date/time tool | ✓ | ✗ |
| Inline tool call UI with live status | ✓ | ✗ |
| Tool calls / function calling | ✓ | ✓ |
| Chat & Multimodal | | |
| Voice chat (TTS playback) | ✓ | ✗ |
| Vision / image input | ✓ | Partial |
| Reasoning blocks (collapsible thinking) | ✓ | ✗ |
| Auto model detection & config | ✓ | ✗ |
| Engine Capabilities | | |
| VL models + full caching stack | ✓ (5 layers) | ✗ |
| Mamba / SSM hybrid support | ✓ | ✗ |
| Tool call parsers | 14 parsers | 1–2 |
| Reasoning parsers | 4 parsers | ✗ |
| Auto architecture detection | 50+ architectures | ✗ |
| cached_tokens in API response | ✓ | ✗ |
| Storage-boundary KV quantization | q4 / q8 | ✗ |
| Speculative decoding (draft model) | ✓ | ✗ |
| Separate embedding model | ✓ | ✗ |
| Default generation params (temp, top-p) | ✓ | ✗ |
| Served model name alias | ✓ | ✗ |
| Auto-update checker | ✓ | ✗ |
| Model Management | | |
| HuggingFace model download | ✓ | ✓ |
| Remote API endpoint (OpenAI, etc.) | ✓ | ✗ |
| Multi-model listing | ✓ | ✓ |
vMLX is the only MLX inference engine where vision-language models work with a full 5-layer caching stack (prefix + paged KV + KV quantization + continuous batching + disk cache). It supports speculative decoding, Mamba/SSM hybrids, 14 tool call parsers, 4 reasoning parsers, 50+ auto-detected architectures, separate embedding models, and reports cached_tokens in the API. No other local MLX app has any of these engine capabilities.
Built for Apple Silicon
Optimized for unified memory architecture. Run Llama, DeepSeek, Qwen, Gemma, and Mistral locally with maximum throughput.
vMLX vs LM Studio
Real benchmarks on Apple M3 Ultra (256 GB) with Llama 3.2 3B Instruct 4-bit.
vMLX
Flags: --continuous-batching --enable-prefix-cache --use-paged-cache
Cache: Paged KV cache, multi-context, optional q4/q8 quantization
API: OpenAI /v1/chat/completions (streaming)

LM Studio
Flags: Default settings (auto prefix caching)
Cache: Single-slot (1 active context)
API: OpenAI /v1/chat/completions (streaming)
| Metric | vMLX | LM Studio MLX |
|---|---|---|
| ~2.5K Token Context | | |
| Cold TTFT | 0.50s | — |
| Warm TTFT (cached) | 0.05s | — |
| Cache Speedup | 9.7× | — |
| ~10K Token Context | | |
| Cold TTFT | 0.12s | 6.12s |
| Warm TTFT (cached) | 0.08s | 0.29s |
| Cache Speedup | 1.6× | 21× |
| ~50K Token Context | | |
| Cold TTFT | 0.30s | — |
| Warm TTFT (cached) | 0.22s | — |
| Cache Speedup | 1.4× | — |
| ~100K Token Context | | |
| Cold TTFT | 0.65s | 131.06s |
| Warm TTFT (cached) | 0.45s | 1.14s |
| Cold PP/s | 154,121 | 686 |
| Warm PP/s | 222,462 | 78,635 |
| Architecture | | |
| Cache type | Paged multi-context | Single-slot |
| Multi-conversation | ✓ concurrent caching | ✗ evicts on switch |
| Concurrent sequences | Up to 256 | 1 |
All measurements: TTFT via streaming OpenAI-compatible API. Cold = first request, no cache. Warm = same
prefix cached.
vMLX flags: --continuous-batching --enable-prefix-cache --use-paged-cache.
LM Studio: default MLX engine settings.
Model: mlx-community/Llama-3.2-3B-Instruct-4bit.
Hardware: Apple M3 Ultra, 256 GB unified memory. Feb 2026.
Up to 18.6× faster at 50K tokens.
8-turn coding conversation with a 12K-token system prompt. After turn 1, 99%+ of tokens are served from cache.
Up and running in seconds
Download vMLX, install the MLX inference backend with one click, pick any model from HuggingFace — including our own MLX-optimized models — and start generating. No cloud, no API keys, no Docker.
- ✓ One-click vMLX Engine installer
- ✓ Download any MLX-compatible model
- ✓ Auto-detects model architecture & configures parsers
- ✓ Voice chat, vision, reasoning blocks built in
- ✓ 20+ agentic tools (file, shell, git, search, clipboard)
- ✓ OpenAI-compatible API on localhost
- ✓ Code-signed — no Gatekeeper warnings
Every parameter
at your fingertips
Fine-tune the MLX inference pipeline. KV cache quantization (q4/q8), persistent disk cache, paged cache blocks, prefill batch sizes, speculative decoding, generation defaults — vMLX exposes 30+ configuration flags across 8 settings panels.
MLX models quantized
and tested by us
We quantize and optimize models specifically for MLX on Apple Silicon. Every model is tested in vMLX before release to ensure compatibility, accuracy, and performance.
Qwen3.5-VL-9B CRACK (8-bit)
Abliterated Qwen 3.5 Vision-Language 9B in 8-bit. Uncensored multimodal — analyze images and generate text freely.
Qwen3.5-397B-A17B REAP
Massive 397B MoE with only 17B active. REAP-pruned and 4-bit quantized — frontier-class intelligence on a Mac Studio.
Qwen3.5-VL-9B CRACK (4-bit)
4-bit abliterated Qwen 3.5 VL 9B. Fits in 16GB RAM — uncensored vision model for any Mac.
Qwen3.5-VL-35B-A3B CRACK
Abliterated VL MoE — 35B total, 3B active. High-quality vision with minimal memory footprint.
Qwen3.5-VL-397B-A17B REAP
The largest VL model on MLX. 397B MoE with vision — REAP-pruned to 4-bit for high-memory Macs.
Qwen3.5-VL-2B CRACK
Tiny abliterated VL model — runs on 8GB Macs. Great for speculative decoding or quick vision tasks.
All models are MLX-native, quantized in-house, and tested for compatibility with vMLX. Download directly from Hugging Face and use in vMLX instantly.
Questions & answers
What is the best app to run AI locally on a Mac?+
vMLX is the most complete MLX inference engine for Mac. Unlike LM Studio or Ollama, vMLX provides a 5-layer caching stack (prefix + paged KV + q4/q8 quantization + continuous batching + disk cache), VL model support with full caching, speculative decoding, Mamba/SSM hybrids, 50+ auto-detected architectures, 14 tool call parsers, and built-in agentic coding tools with MCP integration. Free, no cloud connection, works on any M1+ Mac.
How does vMLX compare to LM Studio and Ollama?+
At 100K token context, vMLX achieves 154,121 prompt tokens/sec (cold) vs LM Studio's 686 tok/s. vMLX uses paged multi-context KV caching (concurrent conversations stay cached), while LM Studio uses single-slot caching that evicts on switch. vMLX supports up to 256 concurrent sequences vs 1 for LM Studio. All three offer OpenAI-compatible APIs, but only vMLX exposes all 23 inference parameters.
Can I run DeepSeek, Llama, Qwen, or Gemma locally?+
Yes. vMLX supports any MLX-compatible model from HuggingFace including DeepSeek V3, Llama 3/4, Qwen 2.5/3, Gemma 3, Mistral, Phi, and hundreds more. We also publish our own abliterated and REAP-optimized MLX models at huggingface.co/dealignai — including Qwen 3.5 VL CRACK (uncensored vision models) and Qwen 3.5 397B REAP (pruned MoE) in 4-bit and 8-bit. Models run entirely on your Mac's Apple Silicon GPU. A 16GB Mac handles up to ~20B parameters, while 64GB+ handles 70B+ models.
What is prefix caching and why does it matter?+
Prefix caching stores computed KV states from previous prompt processing. When you send a new message that shares the same system prompt or history, cached tokens are reused instantly. In benchmarks, this reduces TTFT by up to 9.7x on 2.5K context. Critical for multi-turn conversations and agentic workflows. Combined with KV cache quantization (q4/q8), you can cache even longer contexts in less memory.
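A sketch of reading the cache statistics back out of a response. The nesting below follows OpenAI's usage-reporting convention; vMLX reports cached_tokens in its OpenAI-compatible responses, but inspect your server's actual JSON for the exact field path:

```python
import json

# Mock response body illustrating the OpenAI-style usage block.
response = json.loads("""{
  "usage": {
    "prompt_tokens": 12000,
    "prompt_tokens_details": {"cached_tokens": 11900},
    "completion_tokens": 250
  }
}""")

usage = response["usage"]
cached = usage["prompt_tokens_details"]["cached_tokens"]
hit_rate = cached / usage["prompt_tokens"]
print(f"cache hit rate: {hit_rate:.1%}")  # cache hit rate: 99.2%
```

A hit rate near 100% on later turns of a long conversation means only the newest tokens are being processed, which is exactly where the warm-TTFT numbers in the benchmarks come from.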
Do I need internet or API keys?+
No. vMLX runs entirely on your Mac with zero cloud dependency. No API keys, no subscriptions, no rate limits. Your conversations and model weights stay 100% local and private. Internet is only needed to download models initially.
What Mac hardware do I need?+
Any Mac with Apple Silicon (M1, M2, M3, M4, M5 or later). More unified memory = larger models: 8GB handles ~3-7B, 16GB up to ~20B, 32-64GB handles 30-70B, and 128-512GB runs the largest open models at full precision.
How do I use vMLX as a ChatGPT alternative on Mac?+
Download vMLX, pick a model like Llama 3, Qwen 3, or DeepSeek V3, and use the built-in chat interface. Unlike ChatGPT, everything runs locally on your Mac — no subscription, no usage limits, no data sent to any server. You get the same chat experience with complete privacy and zero cost.
What is agentic AI and does vMLX support it?+
Agentic AI lets language models call external tools autonomously. vMLX has native MCP (Model Context Protocol) support with built-in tools that let your model read, write, and edit files, execute shell commands, run browser automation, search the web, and perform multi-step coding tasks — all running locally on your Mac. Configure tool iterations, tool-choice modes, and working directories. Combined with OpenAI-compatible function calling, vMLX is a complete local agentic AI platform.
Can I use vMLX with Cursor, Continue, or other AI coding tools?+
Yes. vMLX exposes an OpenAI-compatible API at localhost:8000. Point any tool that supports custom OpenAI endpoints — Cursor, Continue, Aider, Open Interpreter, LangChain, or custom scripts — to your local vMLX server. All inference stays on your machine with zero latency and no API costs.
Is vMLX better than Ollama for Mac?+
vMLX is purpose-built for Apple Silicon using the MLX framework, while Ollama uses llama.cpp. vMLX provides a 5-layer caching stack (prefix + paged KV + q4/q8 quantization + continuous batching + disk cache), speculative decoding, Mamba/SSM support, 50+ auto-detected architectures, 14 tool call parsers, and a native macOS GUI — features Ollama lacks. For Mac-native performance and developer experience, vMLX is the superior choice.
Does vMLX support voice chat and text-to-speech?+
Yes. Every assistant message has a built-in text-to-speech button. Click it to listen to any response hands-free. No external services or API keys required — it uses your Mac's native speech synthesis.
Can I use vision models and send images in vMLX?+
Yes. vMLX supports multimodal models like Qwen VL and LLaVA. Paste or drag-and-drop images directly into the chat. Images are displayed inline with click-to-zoom, and the model can analyze and respond to visual content. All processing stays on your Mac.
What are reasoning blocks and which models support them?+
Reasoning blocks show the model's chain-of-thought in a collapsible section, separate from the final answer. Supported by thinking models like DeepSeek R1, Qwen 3, and GLM-4.7 Flash. vMLX auto-detects reasoning-capable models and configures the right parser automatically — no manual setup needed.
Why is vMLX the best MLX inference engine?+
vMLX is the only MLX engine that combines vision-language model support with a full 5-layer caching stack
(prefix cache, paged KV cache, KV cache quantization, continuous batching, persistent disk cache). It also
supports speculative decoding, Mamba/SSM hybrid architectures, auto-detects 50+ model architectures, has
14 tool call parsers and 4 reasoning parsers, reports cached_tokens in the
OpenAI-compatible API, and uses storage-boundary quantization (full precision during generation,
compressed only in storage). No other MLX app — LM Studio, Ollama, mlx-community tools, or any
other — offers even a fraction of these engine capabilities.
Does vMLX support Mamba and state-space models?+
Yes. vMLX has first-class Mamba and SSM hybrid support with a dedicated BatchMambaCache that handles batch filtering, merging, and KV quantization safety across Mamba layers. This means Mamba-based models work with continuous batching and the full caching stack. No other MLX inference app supports SSM architectures with batched inference.
What is speculative decoding and does vMLX support it?+
Speculative decoding uses a small, fast draft model to propose candidate tokens, which the larger target model then verifies in parallel. This can significantly speed up generation without sacrificing output quality. In vMLX, configure any MLX model as the draft (e.g., a 2B model for a 30B+ target), set the number of draft tokens (default 3), and get faster responses. No other local MLX app supports this.
Does vMLX auto-update?+
vMLX checks GitHub for new releases on startup and shows a dismissible banner if an update is available. Click the download link to get the latest version. There are no forced updates, no background downloads, and no telemetry — you control when and whether to update.
Can I serve embeddings alongside my chat model?+
Yes. vMLX lets you configure a separate embedding model in the session settings. The
/v1/embeddings endpoint uses this dedicated model, so you can generate embeddings for RAG
pipelines or semantic search without stopping your chat model. No other local MLX app supports this.
What built-in tools does vMLX include for agentic AI?+
vMLX includes 20+ built-in tools across 7 categories: File I/O (read, write, edit, copy, move, delete, list), Code Search (grep, glob), Shell (execute commands), Web Search (DuckDuckGo and Brave), URL Fetch, Git (status, diff, log, show), and Utilities (clipboard read/write, current date/time). All tools run locally with a configurable working directory and iteration limits.
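The built-in tools are registered server-side, but a client can declare additional tools using the standard OpenAI function-calling schema. A sketch; the tool name and parameters here are illustrative, not vMLX's internal definitions:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",   # illustrative; built-ins live server-side
        "description": "Read a file inside the working directory.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Relative path."},
            },
            "required": ["path"],
        },
    },
}]

payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "What does README.md say?"}],
    "tools": tools,
    "tool_choice": "auto",   # let the model decide when to call a tool
}
print(payload["tools"][0]["function"]["name"])  # read_file
```

When the model decides to call a tool, the response carries a tool_calls entry with the function name and JSON arguments, which vMLX renders as the inline pills described above.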
The most complete MLX engine. Free.
224× faster than LM Studio. VL + full caching stack. Speculative decoding. Mamba. 50+ architectures.
14 tool parsers. 20+ agentic tools. Voice. Vision. Embeddings.
No cloud. No API keys. No rate limits. No competition.