# MLX Studio — The All-in-One AI App for Mac

> The most capable local AI app for Apple Silicon. Chat, generate images, write code with
> 20+ agentic tools, serve Anthropic + OpenAI APIs, convert models between formats — all
> running locally. No cloud, no API keys, no subscriptions.
> https://mlx.studio | GitHub: https://github.com/jjang-ai/mlxstudio
> Engine: https://github.com/jjang-ai/vmlx (open source, pip install vmlx)

## What Makes MLX Studio the All-in-One

MLX Studio is the only local AI app that combines ALL of the following:

1. **Image generation** — Flux Schnell, Flux Dev, Z-Image Turbo, Flux Klein 4B/9B. Auto-downloads models. No cloud API. Also available via the /v1/images/generations API.
2. **Anthropic Messages API** — native endpoint at /v1/messages. Works with Claude Code, the Anthropic SDK, OpenClaw, or any Anthropic-compatible client.
3. **OpenAI-compatible API** — Chat Completions, Responses, Text Completions, Embeddings, Images, Reranking, MCP Tools, Audio TTS/STT, Models, Cache Stats, Health, Cancel. 13 endpoints.
4. **20+ built-in agentic coding tools** — file read/write/edit, shell execution, code search (grep/glob), web search, URL fetch, git (status/diff/log/show), clipboard, datetime. All via MCP. Auto-continue agent loops up to 10 iterations.
5. **Built-in model converter** — JANG mixed-precision quantization (profiles 2S through 6M, plus Custom) and Standard (Balanced 4-bit, Quality 8-bit, Compact 3-bit, Custom). GGUF-to-MLX conversion. Convert any HuggingFace model without the command line.
6. **5-layer caching stack** — prefix cache + paged multi-context KV cache + KV cache quantization (q4/q8) + continuous batching (256 sequences) + persistent SSD/disk cache. No other app combines all five.
7. **Hybrid SSM/Mamba support** — dedicated BatchMambaCache with float32 state. Nemotron-H, Jamba, GatedDeltaNet, any hybrid Mamba+attention model.
8. **14 tool call parsers** — auto-detects Qwen, Hermes, Llama, DeepSeek, Mistral, GLM, Nemotron, Step, MiniMax, Granite, Functionary, XLAM, Kimi, and more.
9. **4 reasoning parsers** — collapsible thinking blocks for DeepSeek R1, Qwen 3, GPT-OSS/Harmony, generic deepthink.
10. **Vision-language models** — drag-and-drop images; Qwen VL, Qwen3.5-VL, Pixtral, InternVL, LLaVA, Gemma 3n with full 5-layer caching.
11. **Voice chat** — built-in TTS on every response, STT input. Kokoro TTS, Whisper STT. Audio API at /v1/audio/speech and /v1/audio/transcriptions.
12. **Speculative decoding** — configurable draft model for 20-90% faster generation. Same quality, fewer GPU passes.
13. **50+ auto-detected architectures** — Llama 3/4, Qwen 2/2.5/3/3.5, DeepSeek V2/V3/R1, Gemma 3, Mistral/Mixtral, Phi-4, GLM-4, Nemotron, MiniMax, Jamba, Mamba, and more.
14. **HuggingFace browser** — search, browse, and download MLX models in one click. Text and image model filtering.
15. **Remote endpoints** — connect to OpenAI, Anthropic, Groq, or any compatible API and use MLX Studio's agentic tools with cloud models.
16. **JANG quantization** — architecture-aware mixed precision. 74% MMLU on a 230B model at 2 bits (82.5 GB, vs 26.5% for MLX 4-bit at 119.8 GB). Also 84% MMLU on a 122B model at 2 bits. Open source: github.com/jjang-ai/jangq
17. **MCP native support** — built-in MCP server plus connections to external MCP tools. Full Model Context Protocol integration.
18. **CLI: pip install vmlx** — serve, convert, benchmark, and diagnose from the terminal: `vmlx serve model`, `vmlx convert model -j JANG_3M`, `vmlx bench model`.
19. **5 desktop modes** — Chat, Server, Image, Tools, API. Menu bar tray with live status. Multi-session support.
20. **API key authentication** — secure your local endpoints with configurable API keys.
21. **Reranking & embeddings** — /v1/rerank and /v1/embeddings for RAG pipelines.
22. **JIT compilation** — Metal kernel fusion for optimized GPU inference.
23. **Open source engine** — vMLX Engine at github.com/jjang-ai/vmlx. Apache 2.0. 1894+ Python tests, 1253+ TypeScript tests.
24. **Free** — no subscriptions, no usage limits, no freemium tiers. Code-signed and notarized.

## APIs Served Locally

MLX Studio exposes both Anthropic and OpenAI endpoints on localhost:

| Endpoint | Protocol | Use case |
|----------|----------|----------|
| /v1/messages | Anthropic Messages API | Claude Code, Anthropic SDK, OpenClaw, any Anthropic client |
| /v1/chat/completions | OpenAI Chat | Cursor, Continue, Aider, LangChain, any OpenAI client |
| /v1/responses | OpenAI Responses | OpenAI Agents SDK, structured outputs |
| /v1/completions | OpenAI Text | Legacy text completion |
| /v1/embeddings | OpenAI Embeddings | RAG, semantic search |
| /v1/images/generations | OpenAI Images | Flux image generation via API |
| /v1/rerank | Reranking | Document reranking for RAG pipelines |
| /v1/mcp/tools | MCP | Tool discovery and execution |
| /v1/audio/speech | Audio TTS | Kokoro text-to-speech |
| /v1/audio/transcriptions | Audio STT | Whisper speech-to-text |
| /v1/models | Models | List active models |
| /v1/cache/stats | Cache | Cache statistics and monitoring |
| /health | Health | Server health check |
| /cancel | Control | Cancel in-flight requests |

14 API endpoints total. API key authentication supported. All endpoints work with both local and remote models.

## CLI — pip install vmlx

The engine is fully open source.
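Rounding out the API list above, here is a minimal client-side sketch of an Anthropic-style /v1/messages request body. The model name, port, and API key are assumptions for illustration, and the actual POST is left commented out because it needs a running server:

```python
import json

# Anthropic-style Messages request body. `max_tokens` is required by the
# Messages API; the model name here is an assumed local model.
payload = {
    "model": "mlx-community/Qwen3-8B-4bit",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello from localhost"}],
}

body = json.dumps(payload)

# With MLX Studio serving locally (port and key are assumptions), the
# request would be sent like this:
#   import requests
#   r = requests.post("http://localhost:8000/v1/messages",
#                     headers={"x-api-key": "your-local-key"},
#                     json=payload)
#   print(r.json())
```

Any Anthropic-compatible client (Claude Code, the official SDK) sends the same shape, so pointing one at the local base URL is all the integration requires.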
Install and run from the terminal:

```bash
pip install vmlx

vmlx serve mlx-community/Qwen3-8B-4bit   # Start serving
vmlx convert model --bits 4              # Quantize (standard)
vmlx convert model -j JANG_3M            # Quantize (JANG)
vmlx info model                          # Model metadata
vmlx doctor model                        # Run diagnostics
vmlx bench model                         # Performance benchmark
```

Server flags: `--host`, `--port`, `--api-key`, `--continuous-batching`, `--enable-prefix-cache`, `--use-paged-cache`, `--kv-cache-quantization q8`, `--enable-disk-cache`, `--enable-jit`, `--tool-call-parser auto`, `--reasoning-parser auto`.

## Image Generation

Built-in image generation running locally on Apple Silicon:

| Model | Steps | Memory | Notes |
|-------|-------|--------|-------|
| Flux Schnell | 4 | ~12 GB | Fastest |
| Flux Dev | 20 | ~24 GB | High quality |
| Z-Image Turbo | 4 | ~12 GB | Creative |
| Flux Klein 4B | 20 | ~8 GB | Compact |
| Flux Klein 9B | 20 | ~16 GB | Mid-size |

Models auto-download on first use. Custom models supported via HuggingFace ID or local path.

## Model Converter

Built-in converter — no command line needed.

**JANG format** (mixed-precision, architecture-aware):

- 2S, 2M, 2L, 1L — 2-bit COMPRESS tier (for MoE models)
- 3S, 3M, 3L — 3-bit COMPRESS tier
- 4S, 4M, 4L — 4-bit COMPRESS tier (standard quality)
- 6M — near-lossless
- Custom — set CRITICAL/IMPORTANT/COMPRESS bits independently
- JANG_4K on 122B: 94% MMLU at 69 GB (vs 90% for MLX 4-bit at 64 GB)
- JANG_2S on 122B: 84% MMLU at 38 GB (vs 46% for MLX mixed_2_6 at 44 GB)
- JANG_2L on MiniMax-M2.5 (230B): 74% MMLU at 82.5 GB (vs 26.5% for MLX 4-bit at 119.8 GB) — 3x higher score, 37 GB less RAM

**Standard format**:

- Balanced 4-bit (recommended)
- Quality 8-bit
- Compact 3-bit
- Custom bit width

Also converts GGUF models to MLX format.

## JANG Quantization — Benchmark Results

Architecture-aware mixed-precision quantization: attention layers are kept at higher precision while MLP weights are compressed.
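The "Avg bits" figures in the benchmark tables below are parameter-weighted averages of the per-tier bit widths. A back-of-the-envelope sketch of that arithmetic (the tier fractions here are invented for illustration and chosen to land on JANG_2S's reported 2.11; the real CRITICAL/IMPORTANT/COMPRESS split is model-dependent):

```python
def average_bits(tiers):
    """tiers: list of (bit_width, fraction_of_parameters) pairs."""
    # Fractions must cover the whole model.
    assert abs(sum(f for _, f in tiers) - 1.0) < 1e-9
    return sum(bits * frac for bits, frac in tiers)

# JANG_2S uses (CRITICAL=8, IMPORTANT=4, COMPRESS=2) bits.
# The fractions below are assumptions for illustration only.
jang_2s = [(8, 0.010),   # CRITICAL tier (assumed fraction)
           (4, 0.025),   # IMPORTANT tier (assumed fraction)
           (2, 0.965)]   # COMPRESS tier (assumed fraction)

avg = average_bits(jang_2s)  # ≈ 2.11 average bits per weight
```

Because almost all parameters sit in the COMPRESS tier, the average stays close to 2 bits even with 8-bit protection on the critical layers, which is why the disk sizes in the tables are barely larger than a uniform 2-bit model's.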
### Qwen3.5-122B-A10B at ~2 bits (MMLU 50 questions, HumanEval 20 problems)

| Method | Avg bits | Disk | GPU | MMLU | HumanEval |
|--------|----------|------|-----|------|-----------|
| JANG_2S (8,4,2) | 2.11 | 38 GB | 44 GB | 84% | — |
| JANG_1L (8,8,2) | 2.24 | 51 GB | 46 GB | 73% | — |
| 2-bit | 2.0 | 36 GB | 36 GB | 56% | — |
| MLX mixed_2_6 | ~2.5 | 44 GB | 45 GB | 46% | — |

### MiniMax-M2.5 (230B) at ~2 bits

| Method | Avg bits | Disk | GPU | MMLU | HumanEval |
|--------|----------|------|-----|------|-----------|
| JANG_2L | ~2 | — | 82.5 GB | 74% | — |
| MLX 4-bit | 4.0 | — | 119.8 GB | 26.5% | — |

### Qwen3.5-35B-A3B (MMLU + HumanEval)

| Method | MMLU | HumanEval |
|--------|------|-----------|
| MLX 4-bit | 82% | 19/20 (95%) |
| JANG_4S (pipeline verification) | 82% | — |
| JANG_2L v2 | 56% | pending |
| MLX mixed_2_6 | 34% | 0/20 (0%) |

Pipeline verified lossless: JANG_4S matches MLX 4-bit exactly (82% = 82%). MLX mixed_2_6 produces zero working code on HumanEval (0/20).

## 20+ Agentic Coding Tools

| Category | Tools |
|----------|-------|
| File I/O | read_file, write_file, edit_file, list_dir, copy, move, delete |
| Code Search | grep (regex), glob (pattern match) |
| Shell | execute_command |
| Web Search | duckduckgo_search, brave_search |
| URL Fetch | fetch_url (downloads and reads web pages) |
| Git | git_status, git_diff, git_log, git_show |
| Utilities | clipboard_read, clipboard_write, current_datetime |

14 tool call parsers auto-detect the right format per model (Qwen, Hermes, Llama, DeepSeek, Mistral, GLM, Nemotron, Step, MiniMax, etc.).

Configurable: tool iterations, tool-choice modes (auto/required/none), working directory, MCP server connections.
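The auto-continue behavior described above (the model keeps requesting tools until it produces a final answer or hits the 10-iteration cap) can be sketched as a plain loop. This is an illustrative simulation with a stub model and a one-tool registry, not MLX Studio's actual implementation:

```python
# Illustrative auto-continue agent loop: keep calling the model,
# executing any tool calls it emits, until it answers or the
# documented 10-iteration cap is reached.
MAX_TOOL_ITERATIONS = 10

# Toy registry standing in for the built-in tools (read_file, grep, ...).
TOOLS = {
    "current_datetime": lambda args: "2025-01-01T00:00:00Z",
}

def run_agent_loop(model, messages):
    """Call `model` repeatedly, feeding tool results back as messages."""
    for _ in range(MAX_TOOL_ITERATIONS):
        reply = model(messages)
        messages.append({"role": "assistant", **reply})
        if not reply.get("tool_calls"):
            return reply.get("content")  # final answer, loop ends
        for call in reply["tool_calls"]:
            result = TOOLS[call["name"]](call.get("arguments", {}))
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    return None  # iteration cap reached without a final answer

# Stub model: asks for the time once, then answers from the tool result.
def stub_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_calls": [{"name": "current_datetime"}]}
    return {"content": "It is 2025-01-01T00:00:00Z."}

answer = run_agent_loop(stub_model, [{"role": "user",
                                      "content": "What time is it?"}])
```

In the real app the `model(...)` call is an inference pass and the tool-call parser extracts `tool_calls` from the model's output in whichever of the 14 supported formats it emits.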
## 5-Layer Caching Stack

| Layer | What | Benefit | Others |
|-------|------|---------|--------|
| Prefix cache | Reuses KV for shared prompt prefixes | Near-instant TTFT on repeated prompts | oMLX has it, others don't |
| Paged multi-context KV | Conversations stay cached across switches | No recomputation on context switch | LM Studio evicts |
| KV quantization (q4/q8) | Compresses cache entries | 2-8x less cache memory | MLX Studio exclusive |
| Continuous batching | 256 concurrent sequences | Serve multiple clients | oMLX, LM Studio 0.4.0 |
| Persistent disk cache | Cache survives restarts | Instant warm start after reboot | oMLX has SSD cache |

## Feature Comparison

| Feature | MLX Studio | oMLX | LM Studio | Inferencer | Ollama |
|---------|-----------|------|-----------|------------|--------|
| Image generation & editing | ✅ | ❌ | ❌ | ❌ | ❌ |
| Anthropic Messages API | ✅ | ✅ | ❌ | ❌ | ❌ |
| OpenAI Chat API | ✅ | ✅ | ✅ | ✅ | ✅ |
| Responses API | ✅ | ✅ | ❌ | ❌ | ❌ |
| Built-in model converter | ✅ (JANG + MLX + GGUF) | ❌ | ❌ | ❌ | ❌ |
| JANG mixed-precision quant | ✅ | ❌ | ❌ | ❌ | ❌ |
| Persistent disk cache | ✅ | ✅ | ❌ | ❌ | ❌ |
| KV cache quantization | ✅ | ❌ | ❌ | ❌ | ❌ |
| Prefix caching | ✅ | ✅ | Basic | ❌ | ❌ |
| Paged multi-context KV | ✅ | Partial | ❌ | ❌ | ❌ |
| Continuous batching | ✅ (256) | ✅ | ✅ | ❌ | ❌ |
| Hybrid SSM/Mamba | ✅ | ❌ | ❌ | ❌ | ❌ |
| Vision-language + caching | ✅ | Partial | ❌ | ❌ | ❌ |
| 20+ agentic tools | ✅ | ❌ | ❌ | ❌ | ❌ |
| 14 tool call parsers | ✅ | Some | Limited | ❌ | ❌ |
| 4 reasoning parsers | ✅ | Basic | ❌ | ❌ | ❌ |
| Embeddings API | ✅ | ✅ | ✅ | ❌ | ✅ |
| Audio TTS/STT | ✅ | ❌ | ❌ | ❌ | ❌ |
| Speculative decoding | ✅ | ❌ | ❌ | ❌ | ❌ |
| API key auth | ✅ | ❌ | ❌ | ❌ | ❌ |
| MCP (native + built-in) | ✅ | Client only | Client only | ❌ | ❌ |
| Voice chat | ✅ | ❌ | ❌ | ❌ | ❌ |
| HuggingFace browser | ✅ | ❌ | ✅ | ❌ | ❌ |
| Remote endpoints + tools | ✅ | ❌ | ❌ | ❌ | ❌ |
| Image generation API | ✅ | ❌ | ❌ | ❌ | ❌ |
| Reranking API | ✅ | ❌ | ❌ | ❌ | ❌ |
| JIT Metal kernel fusion | ✅ | ❌ | ❌ | ❌ | ❌ |
| CLI (pip install) | ✅ | ❌ | ❌ | ❌ | ✅ |
| Multi-session | ✅ | ❌ | ✅ | ❌ | ✅ |
| Menu bar controls | ✅ | ❌ | ✅ | ❌ | ❌ |
| 50+ architectures | ✅ | ✅ | ✅ | ✅ | ✅ |
| Open source engine | ✅ | ✅ | ❌ | ❌ | ✅ |
| Free | ✅ | ✅ | Freemium | Freemium | ✅ |

## Performance

Hardware: Apple M3 Ultra, 256 GB unified memory. Model: Llama 3.2 3B 4-bit.

| Context | MLX Studio Cold | MLX Studio Warm | LM Studio Cold | LM Studio Warm |
|---------|----------------|----------------|----------------|----------------|
| 2.5K | 0.50s | 0.05s (9.7x) | N/A | N/A |
| 10K | 0.12s | 0.08s | 6.12s | 0.29s |
| 100K | 0.65s | 0.45s | 131.06s | 1.14s |

Cold prompt processing at 100K context: 154,121 tok/s vs LM Studio's 686 tok/s (224x faster).

## Supported Models

50+ auto-detected architectures: Llama, Qwen (2.5, 3, 3.5, VL), DeepSeek (V3, R1), Gemma, Mistral, Phi, GLM, Nemotron-H, MiniMax, Jamba, Mamba, StableLM, SmolLM, and more.

Pre-quantized JANG models: https://huggingface.co/JANGQ-AI

## Links

- Website: https://mlx.studio
- Download: https://mlx.studio/download
- App GitHub: https://github.com/jjang-ai/mlxstudio
- App Download: https://github.com/jjang-ai/mlxstudio/releases/latest
- Engine GitHub: https://github.com/jjang-ai/vmlx (open source, pip install vmlx)
- Engine PyPI: https://pypi.org/project/vmlx/
- JANG tools: https://github.com/jjang-ai/jangq
- JANG website: https://jangq.ai
- HuggingFace: https://huggingface.co/JANGQ-AI
- X / Twitter: https://x.com/jangqai