Run AI locally.
No compromises.
The most complete local AI engine for Mac — 224× faster than LM Studio at 100K context. The only MLX engine where VL models work with the full 5-layer caching stack. Speculative decoding, Mamba/SSM support, 50+ auto-detected architectures, 14 tool parsers, 4 reasoning parsers, 20+ agentic tools, voice, vision. Nothing else comes close.
The most complete MLX engine.
Vision. Mamba. Five caching layers.
Nothing else
comes close.
The only MLX inference engine where vision-language models work with the full caching stack — prefix cache + paged KV cache + KV quantization (q4/q8) + continuous batching + persistent disk cache. Plus speculative decoding, Mamba/SSM hybrid support, 50+ auto-detected architectures, 14 tool call parsers, 4 reasoning parsers, and 20+ built-in agentic tools. No other MLX app has even two of these.
5-Layer Caching Stack — Works with VL Models
The only MLX engine that combines prefix caching, paged KV cache, KV cache quantization (q4/q8), continuous batching, and persistent disk cache — and the only one where vision-language models (Qwen VL, LLaVA) work with all five layers. 9.7× faster TTFT. Multi-context caching survives conversation switches and app restarts. Competitors offer one layer at best; none support VL + caching at all.
Paged KV Cache
vLLM-style paged attention on Apple Silicon. Configurable block size, with up to 1000 blocks. Multiple conversations stay cached simultaneously — switch contexts without eviction. LM Studio uses a single cache slot; Ollama has no KV cache at all.
KV Cache Quantization
Storage-boundary quantization: full precision during generation, compressed to q8 (~2× savings) or
q4 (~4×) only when stored in the prefix cache. Zero quality loss during inference. Run 100K+ context
on a 16GB Mac. Reports cached_tokens in the OpenAI-compatible API response. Not available in
LM Studio, Ollama, or any other MLX app.
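The "100K+ context on a 16GB Mac" claim can be sanity-checked with back-of-envelope arithmetic. A sketch, assuming Llama-3.2-3B-class dimensions (28 layers, 8 KV heads, head dim 128); read your model's actual values from its config.json, and note vMLX's internal layout may differ:

```python
# KV cache size = 2 (keys + values) x layers x kv_heads x head_dim
# x sequence length x bytes per element.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 8, 128

def kv_cache_gb(seq_len: int, bytes_per_elem: float) -> float:
    """Size in GB of the cached K/V tensors for one sequence."""
    elems = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * seq_len
    return elems * bytes_per_elem / 1e9

for label, nbytes in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
    print(f"{label}: {kv_cache_gb(100_000, nbytes):.1f} GB at 100K tokens")
# fp16: 11.5 GB, q8: 5.7 GB, q4: 2.9 GB at 100K tokens. q4 is what
# lets the cache fit next to the weights on a 16GB machine.
```

Because quantization happens only at the storage boundary, the generation path still computes against full-precision K/V tensors.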
Continuous Batching
256 concurrent inference sequences with intelligent batch scheduling. Serve multiple clients from one Mac — team-scale local inference. LM Studio and Ollama max out at 1.
Persistent Disk Cache
Prompt cache writes to disk and survives restarts. Launch vMLX and get instant warm TTFT on yesterday's conversations. Configurable size (GB) and directory. No other local MLX app persists cache to disk.
Agentic Coding Tools
No other local AI app has this. 20+ built-in tools: read, write, edit, copy, move, and delete files. Search codebases with grep and glob. Execute shell commands. Run git status, diff, log, and show. Search the web via DuckDuckGo or Brave. Fetch any URL. Access the clipboard. Query the current date and time. All running locally with a configurable working directory.
OpenAI-Compatible API & Remote Endpoints
Serve 7 API endpoints locally (chat, responses, completions, embeddings, MCP, audio, cancel). Or flip to Remote Endpoint mode and connect to OpenAI, Anthropic, or any API — use vMLX's agentic tools with cloud models. One app for local and remote inference.
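As a sketch of the wire format, this is the request body any OpenAI-compatible client sends to the chat endpoint. The localhost:8000 address is an assumption based on the default local setup; the port is configurable:

```python
import json

URL = "http://localhost:8000/v1/chat/completions"  # default local address

payload = {
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize MLX in one sentence."},
    ],
    "stream": True,       # tokens arrive as server-sent 'data:' chunks
    "temperature": 0.7,
}

body = json.dumps(payload).encode()
# POST `body` to URL with any HTTP client (urllib, requests, curl).
print(json.dumps(payload, indent=2))
```

The same payload shape works in Remote Endpoint mode, where vMLX forwards to a cloud provider instead of the local engine.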
Voice Chat
Built-in text-to-speech on every assistant message. Click to listen to any response — hands-free AI interaction. No extra setup, no external services.
Vision & Multimodal
Attach images in chat for vision-capable models like Qwen VL and LLaVA. Paste or drag-and-drop images directly. Click to zoom. Full multimodal conversation support.
Reasoning & Thinking Blocks
Collapsible reasoning display for thinking models — DeepSeek R1, Qwen 3, GLM-4.7. See the model's chain-of-thought in a clean expandable block, separate from the final answer.
Inline Tool Calls & Live Execution
Tool calls render as expandable pills inline with the model's response — click to reveal arguments and results. Real-time status indicators show when tools are executing, generating, or complete. Git status, diff, log, and show are built in alongside file I/O, shell, search, clipboard, and date/time tools.
Auto Model Detection
Reads model architecture from config.json and auto-selects from 14 tool call parsers and 4 reasoning parsers. Recognizes 50+ model architectures including Llama, Qwen, DeepSeek, Mistral, Gemma, Phi, GLM, Mamba, and more. No manual setup — load any model and vMLX picks the right configuration.
Mamba & SSM Hybrids
First-class support for Mamba and state-space model hybrids with dedicated BatchMambaCache. Proper batch filtering, merging, and KV quantization safety across Mamba layers. No other MLX app supports SSM architectures with batched inference.
Speculative Decoding
Use a small draft model to propose tokens and a large model to verify them — faster generation with the same output quality. Configure any MLX model as the draft, set the number of speculative tokens (default 3), and watch throughput increase. Especially effective when pairing a 2B draft with a 30B+ target. No other local MLX app supports speculative decoding.
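The speedup intuition can be sketched with a toy model: assume each draft token is accepted independently with probability p. This is a simplification for illustration, not vMLX's actual acceptance logic:

```python
def tokens_per_target_pass(k: int, p: float) -> float:
    """Expected tokens emitted per (expensive) target-model forward
    pass when the draft proposes k tokens: 1 + p + p^2 + ... + p^k.
    Each pass keeps the accepted prefix plus one verified token."""
    return sum(p**i for i in range(k + 1))

# With the default of 3 draft tokens and an 80% acceptance rate, one
# target pass yields ~2.95 tokens instead of 1, with identical output.
print(round(tokens_per_target_pass(3, 0.8), 2))   # 2.95
print(round(tokens_per_target_pass(3, 0.5), 2))   # 1.88
```

This is why pairing a small draft with a 30B+ target helps most: the draft pass is nearly free, while every skipped target pass is expensive.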
Generation Defaults
Set default temperature and top-p per session with intuitive sliders. Values persist across restarts and apply to every request unless overridden by the API caller. Fine-tune creativity vs determinism at the session level.
Embedding Endpoint
Serve a dedicated embedding model alongside your chat model. The /v1/embeddings API works
with any MLX embedding model — generate vectors for RAG, semantic search, or clustering without
switching sessions.
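A sketch of how the returned vectors are typically consumed for RAG or semantic search. The model name in the request is a hypothetical placeholder, and the cosine scoring is plain stdlib code on the client side, not something vMLX ships:

```python
import math

# Request body for /v1/embeddings (OpenAI wire format). The server
# responds with {"data": [{"embedding": [...]}, ...], ...}.
request = {
    "model": "my-mlx-embedding-model",  # placeholder name
    "input": ["paged KV cache", "prefix caching"],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, the usual ranking score for RAG/search."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(round(cosine([1.0, 0.0], [1.0, 1.0]), 3))  # 0.707
```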
Auto-Update Checker
vMLX checks GitHub for new releases on startup and shows a dismissible banner when an update is available. One click to download. No forced updates, no background downloads — you stay in control.
Your model writes code,
runs git, browses the web
Built-in coding tools let local models do what previously only cloud AI could — read, write, and edit files, execute shell commands, run git operations, search the web, access the clipboard, query the current date/time, and fetch URLs. No other local AI app has this.
Download models directly.
Or connect to any API.
Browse and download MLX models from HuggingFace in one click — including our recommended in-house models. Or connect to OpenAI, Anthropic, or any OpenAI-compatible remote endpoint and use vMLX's agentic tools with cloud models.
More than just
chat completions
Most local MLX apps only expose a single endpoint. vMLX delivers a full OpenAI-compatible API surface with capabilities no other local app offers.
| Capability | vMLX | Other MLX Apps |
|---|---|---|
| API Endpoints | | |
| /v1/chat/completions | ✓ | ✓ |
| /v1/responses | ✓ | ✗ |
| /v1/completions (text) | ✓ | ✗ |
| /v1/embeddings | ✓ | ✗ |
| /v1/mcp/tools (MCP) | ✓ | ✗ |
| /v1/audio/* (TTS/STT) | ✓ | ✗ |
| Cancel endpoint | ✓ | ✗ |
| Security & Reasoning | | |
| API key authentication | ✓ | ✗ |
| enable_thinking | Reasoning parser separates delta.reasoning | Accepted but no separation |
| reasoning_effort | Sent to server | Not supported |
| Caching & Memory | | |
| KV cache quantization (q4/q8) | ✓ | ✗ |
| Persistent disk cache | ✓ | ✗ |
| Paged multi-context KV cache | ✓ | ✗ |
| Prefix caching | ✓ | Partial |
| Agentic & Tools | | |
| Built-in coding tools (file I/O, shell, search) | ✓ | ✗ |
| Git tools (status, diff, log, show) | ✓ | ✗ |
| Web search (DuckDuckGo / Brave) | ✓ | ✗ |
| URL fetch | ✓ | ✗ |
| Clipboard access (read/write) | ✓ | ✗ |
| Date/time tool | ✓ | ✗ |
| Inline tool call UI with live status | ✓ | ✗ |
| Tool calls / function calling | ✓ | ✓ |
| Chat & Multimodal | | |
| Voice chat (TTS playback) | ✓ | ✗ |
| Vision / image input | ✓ | Partial |
| Reasoning blocks (collapsible thinking) | ✓ | ✗ |
| Auto model detection & config | ✓ | ✗ |
| Engine Capabilities | | |
| VL models + full caching stack | ✓ (5 layers) | ✗ |
| Mamba / SSM hybrid support | ✓ | ✗ |
| Tool call parsers | 14 parsers | 1–2 |
| Reasoning parsers | 4 parsers | ✗ |
| Auto architecture detection | 50+ architectures | ✗ |
| cached_tokens in API response | ✓ | ✗ |
| Storage-boundary KV quantization | q4 / q8 | ✗ |
| Speculative decoding (draft model) | ✓ | ✗ |
| Separate embedding model | ✓ | ✗ |
| Default generation params (temp, top-p) | ✓ | ✗ |
| Served model name alias | ✓ | ✗ |
| Auto-update checker | ✓ | ✗ |
| Model Management | | |
| HuggingFace model download | ✓ | ✓ |
| Remote API endpoint (OpenAI, etc.) | ✓ | ✗ |
| Multi-model listing | ✓ | ✓ |
vMLX is the only MLX inference engine where vision-language models work with a full 5-layer caching stack (prefix + paged KV + KV quantization + continuous batching + disk cache). It supports speculative decoding, Mamba/SSM hybrids, 14 tool call parsers, 4 reasoning parsers, 50+ auto-detected architectures, separate embedding models, and reports cached_tokens in the API. No other local MLX app has any of these engine capabilities.
Built for Apple Silicon
Optimized for unified memory architecture. Run Llama, DeepSeek, Qwen, Gemma, and Mistral locally with maximum throughput.
vMLX vs LM Studio
Real benchmarks on Apple M3 Ultra (256 GB) with Llama 3.2 3B Instruct 4-bit.
vMLX
Flags: --continuous-batching --enable-prefix-cache --use-paged-cache
Cache: Paged KV cache, multi-context, optional q4/q8 quantization
API: OpenAI /v1/chat/completions (streaming)

LM Studio
Flags: Default settings (auto prefix caching)
Cache: Single-slot (1 active context)
API: OpenAI /v1/chat/completions (streaming)
| Metric | vMLX | LM Studio MLX |
|---|---|---|
| ~2.5K Token Context | | |
| Cold TTFT | 0.50s | — |
| Warm TTFT (cached) | 0.05s | — |
| Cache Speedup | 9.7× | — |
| ~10K Token Context | | |
| Cold TTFT | 0.12s | 6.12s |
| Warm TTFT (cached) | 0.08s | 0.29s |
| Cache Speedup | 1.6× | 21× |
| ~50K Token Context | | |
| Cold TTFT | 0.30s | — |
| Warm TTFT (cached) | 0.22s | — |
| Cache Speedup | 1.4× | — |
| ~100K Token Context | | |
| Cold TTFT | 0.65s | 131.06s |
| Warm TTFT (cached) | 0.45s | 1.14s |
| Cold PP/s | 154,121 | 686 |
| Warm PP/s | 222,462 | 78,635 |
| Architecture | | |
| Cache type | Paged multi-context | Single-slot |
| Multi-conversation | ✓ concurrent caching | ✗ evicts on switch |
| Concurrent sequences | Up to 256 | 1 |
All measurements: TTFT via streaming OpenAI-compatible API. Cold = first request, no cache. Warm = same
prefix cached.
vMLX flags: --continuous-batching --enable-prefix-cache --use-paged-cache.
LM Studio: default MLX engine settings.
Model: mlx-community/Llama-3.2-3B-Instruct-4bit.
Hardware: Apple M3 Ultra, 256 GB unified memory. Feb 2026.
Up to 18.6× faster at 50K tokens.
8-turn coding conversation with a 12K-token system prompt. After turn 1, 99%+ of tokens are served from cache.
Up and running in seconds
Download vMLX, install the MLX inference backend with one click, pick any model from HuggingFace — including our own MLX-optimized models — and start generating. No cloud, no API keys, no Docker.
- ✓ One-click vMLX Engine installer
- ✓ Download any MLX-compatible model
- ✓ Auto-detects model architecture & configures parsers
- ✓ Voice chat, vision, reasoning blocks built in
- ✓ 20+ agentic tools (file, shell, git, search, clipboard)
- ✓ OpenAI-compatible API on localhost
- ✓ Code-signed — no Gatekeeper warnings
Every parameter
at your fingertips
Fine-tune the MLX inference pipeline. KV cache quantization (q4/q8), persistent disk cache, paged cache blocks, prefill batch sizes, speculative decoding, generation defaults — vMLX exposes 30+ configuration flags across 8 settings panels.
MLX models quantized
and tested by us
We quantize and optimize models specifically for MLX on Apple Silicon. Every model is tested in vMLX before release to ensure compatibility, accuracy, and performance.
Qwen3.5-VL-9B CRACK (8-bit)
Abliterated Qwen 3.5 Vision-Language 9B in 8-bit. Uncensored multimodal — analyze images and generate text freely.
Qwen3.5-397B-A17B REAP
Massive 397B MoE with only 17B active. REAP-pruned and 4-bit quantized — frontier-class intelligence on a Mac Studio.
Qwen3.5-VL-9B CRACK (4-bit)
4-bit abliterated Qwen 3.5 VL 9B. Fits in 16GB RAM — uncensored vision model for any Mac.
Qwen3.5-VL-35B-A3B CRACK
Abliterated VL MoE — 35B total, 3B active. High-quality vision with minimal memory footprint.
Qwen3.5-VL-397B-A17B REAP
The largest VL model on MLX. 397B MoE with vision — REAP-pruned to 4-bit for high-memory Macs.
Qwen3.5-VL-2B CRACK
Tiny abliterated VL model — runs on 8GB Macs. Great for speculative decoding or quick vision tasks.
All models are MLX-native, quantized in-house, and tested for compatibility with vMLX. Download directly from Hugging Face and use in vMLX instantly.
Questions & answers
What is the best app to run AI locally on a Mac?+
vMLX is the most complete MLX inference engine for Mac. Unlike LM Studio or Ollama, vMLX provides a 5-layer caching stack (prefix + paged KV + q4/q8 quantization + continuous batching + disk cache), VL model support with full caching, speculative decoding, Mamba/SSM hybrids, 50+ auto-detected architectures, 14 tool call parsers, and built-in agentic coding tools with MCP integration. Free, no cloud connection, works on any M1+ Mac.
How does vMLX compare to LM Studio and Ollama?+
At 100K token context, vMLX achieves 154,121 prompt tokens/sec (cold) vs LM Studio's 686 tok/s. vMLX uses paged multi-context KV caching (concurrent conversations stay cached), while LM Studio uses single-slot caching that evicts on switch. vMLX supports up to 256 concurrent sequences vs 1 for LM Studio. All three offer OpenAI-compatible APIs, but only vMLX exposes all 23 inference parameters.
Can I run DeepSeek, Llama, Qwen, or Gemma locally?+
Yes. vMLX supports any MLX-compatible model from HuggingFace including DeepSeek V3, Llama 3/4, Qwen 2.5/3, Gemma 3, Mistral, Phi, and hundreds more. We also publish our own abliterated and REAP-optimized MLX models at huggingface.co/dealignai — including Qwen 3.5 VL CRACK (uncensored vision models) and Qwen 3.5 397B REAP (pruned MoE) in 4-bit and 8-bit. Models run entirely on your Mac's Apple Silicon GPU. A 16GB Mac handles up to ~20B parameters, while 64GB+ handles 70B+ models.
What is prefix caching and why does it matter?+
Prefix caching stores computed KV states from previous prompt processing. When you send a new message that shares the same system prompt or history, cached tokens are reused instantly. In benchmarks, this reduces TTFT by up to 9.7x on 2.5K context. Critical for multi-turn conversations and agentic workflows. Combined with KV cache quantization (q4/q8), you can cache even longer contexts in less memory.
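A sketch of reading the cache statistics back out of a response. The nesting below follows OpenAI's usage-reporting convention; vMLX reports cached_tokens in its OpenAI-compatible responses, but inspect your server's actual JSON for the exact field path:

```python
import json

# Mock response body illustrating the OpenAI-style usage block.
response = json.loads("""{
  "usage": {
    "prompt_tokens": 12000,
    "prompt_tokens_details": {"cached_tokens": 11900},
    "completion_tokens": 250
  }
}""")

usage = response["usage"]
cached = usage["prompt_tokens_details"]["cached_tokens"]
hit_rate = cached / usage["prompt_tokens"]
print(f"cache hit rate: {hit_rate:.1%}")  # cache hit rate: 99.2%
```

A hit rate near 100% on later turns of a long conversation means only the newest tokens are being processed, which is exactly where the warm-TTFT numbers in the benchmarks come from.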
Do I need internet or API keys?+
No. vMLX runs entirely on your Mac with zero cloud dependency. No API keys, no subscriptions, no rate limits. Your conversations and model weights stay 100% local and private. Internet is only needed to download models initially.
What Mac hardware do I need?+
Any Mac with Apple Silicon (M1, M2, M3, M4, M5 or later). More unified memory = larger models: 8GB handles ~3-7B, 16GB up to ~20B, 32-64GB handles 30-70B, and 128-512GB runs the largest open models at full precision.
How do I use vMLX as a ChatGPT alternative on Mac?+
Download vMLX, pick a model like Llama 3, Qwen 3, or DeepSeek V3, and use the built-in chat interface. Unlike ChatGPT, everything runs locally on your Mac — no subscription, no usage limits, no data sent to any server. You get the same chat experience with complete privacy and zero cost.
What is agentic AI and does vMLX support it?+
Agentic AI lets language models call external tools autonomously. vMLX has native MCP (Model Context Protocol) support with built-in tools that let your model read, write, and edit files, execute shell commands, run browser automation, search the web, and perform multi-step coding tasks — all running locally on your Mac. Configure tool iterations, tool-choice modes, and working directories. Combined with OpenAI-compatible function calling, vMLX is a complete local agentic AI platform.
Can I use vMLX with Cursor, Continue, or other AI coding tools?+
Yes. vMLX exposes an OpenAI-compatible API at localhost:8000. Point any tool that supports custom OpenAI endpoints — Cursor, Continue, Aider, Open Interpreter, LangChain, or custom scripts — to your local vMLX server. All inference stays on your machine with zero latency and no API costs.
Is vMLX better than Ollama for Mac?+
vMLX is purpose-built for Apple Silicon using the MLX framework, while Ollama uses llama.cpp. vMLX provides a 5-layer caching stack (prefix + paged KV + q4/q8 quantization + continuous batching + disk cache), speculative decoding, Mamba/SSM support, 50+ auto-detected architectures, 14 tool call parsers, and a native macOS GUI — features Ollama lacks. For Mac-native performance and developer experience, vMLX is the superior choice.
Does vMLX support voice chat and text-to-speech?+
Yes. Every assistant message has a built-in text-to-speech button. Click it to listen to any response hands-free. No external services or API keys required — it uses your Mac's native speech synthesis.
Can I use vision models and send images in vMLX?+
Yes. vMLX supports multimodal models like Qwen VL and LLaVA. Paste or drag-and-drop images directly into the chat. Images are displayed inline with click-to-zoom, and the model can analyze and respond to visual content. All processing stays on your Mac.
What are reasoning blocks and which models support them?+
Reasoning blocks show the model's chain-of-thought in a collapsible section, separate from the final answer. Supported by thinking models like DeepSeek R1, Qwen 3, and GLM-4.7 Flash. vMLX auto-detects reasoning-capable models and configures the right parser automatically — no manual setup needed.
Why is vMLX the best MLX inference engine?+
vMLX is the only MLX engine that combines vision-language model support with a full 5-layer caching stack
(prefix cache, paged KV cache, KV cache quantization, continuous batching, persistent disk cache). It also
supports speculative decoding, Mamba/SSM hybrid architectures, auto-detects 50+ model architectures, has
14 tool call parsers and 4 reasoning parsers, reports cached_tokens in the
OpenAI-compatible API, and uses storage-boundary quantization (full precision during generation,
compressed only in storage). No other MLX app — LM Studio, Ollama, mlx-community tools, or any
other — offers even a fraction of these engine capabilities.
Does vMLX support Mamba and state-space models?+
Yes. vMLX has first-class Mamba and SSM hybrid support with a dedicated BatchMambaCache that handles batch filtering, merging, and KV quantization safety across Mamba layers. This means Mamba-based models work with continuous batching and the full caching stack. No other MLX inference app supports SSM architectures with batched inference.
What is speculative decoding and does vMLX support it?+
Speculative decoding uses a small, fast draft model to propose candidate tokens, which the larger target model then verifies in parallel. This can significantly speed up generation without sacrificing output quality. In vMLX, configure any MLX model as the draft (e.g., a 2B model for a 30B+ target), set the number of draft tokens (default 3), and get faster responses. No other local MLX app supports this.
Does vMLX auto-update?+
vMLX checks GitHub for new releases on startup and shows a dismissible banner if an update is available. Click the download link to get the latest version. There are no forced updates, no background downloads, and no telemetry — you control when and whether to update.
Can I serve embeddings alongside my chat model?+
Yes. vMLX lets you configure a separate embedding model in the session settings. The
/v1/embeddings endpoint uses this dedicated model, so you can generate embeddings for RAG
pipelines or semantic search without stopping your chat model. No other local MLX app supports this.
What built-in tools does vMLX include for agentic AI?+
vMLX includes 20+ built-in tools across 7 categories: File I/O (read, write, edit, copy, move, delete, list), Code Search (grep, glob), Shell (execute commands), Web Search (DuckDuckGo and Brave), URL Fetch, Git (status, diff, log, show), and Utilities (clipboard read/write, current date/time). All tools run locally with a configurable working directory and iteration limits.
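The built-in tools are registered server-side, but a client can declare additional tools using the standard OpenAI function-calling schema. A sketch; the tool name and parameters here are illustrative, not vMLX's internal definitions:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",   # illustrative; built-ins live server-side
        "description": "Read a file inside the working directory.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Relative path."},
            },
            "required": ["path"],
        },
    },
}]

payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "What does README.md say?"}],
    "tools": tools,
    "tool_choice": "auto",   # let the model decide when to call a tool
}
print(payload["tools"][0]["function"]["name"])  # read_file
```

When the model decides to call a tool, the response carries a tool_calls entry with the function name and JSON arguments, which vMLX renders as the inline pills described above.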
The most complete MLX engine. Free.
224× faster than LM Studio. VL + full caching stack. Speculative decoding. Mamba. 50+ architectures.
14 tool parsers. 20+ agentic tools. Voice. Vision. Embeddings.
No cloud. No API keys. No rate limits. No competition.