# MLX Studio — The All-in-One AI App for Mac

> The most capable local AI app for Apple Silicon. Chat, generate images, write code with
> 20+ agentic tools, serve Anthropic + OpenAI APIs, convert models between formats — all
> running locally. No cloud, no API keys, no subscriptions.
> https://mlx.studio | GitHub: https://github.com/jjang-ai/mlxstudio
> Engine: https://github.com/jjang-ai/vmlx (open source, pip install vmlx)

## What Makes MLX Studio the All-in-One

MLX Studio is the only local AI app that combines ALL of the following:

1. **Image generation** — Flux Schnell, Flux Dev, Z-Image Turbo, Flux Klein 4B/9B. Auto-downloads models. No cloud API. Also available via the /v1/images/generations API.
2. **Anthropic Messages API** — native endpoint at /v1/messages. Works with Claude Code, the Anthropic SDK, OpenClaw, or any Anthropic-compatible client.
3. **OpenAI-compatible API** — Chat Completions, Responses, Text Completions, Embeddings, Images, Reranking, MCP Tools, Audio TTS/STT, Models, Cache Stats, Health, Cancel. 13 endpoints.
4. **20+ built-in agentic coding tools** — file read/write/edit, shell execution, code search (grep/glob), web search, URL fetch, git (status/diff/log/show), clipboard, datetime. All via MCP. Auto-continue agent loops up to 10 iterations.
5. **Built-in model converter** — JANG mixed-precision quantization (profiles 2S through 6M, plus Custom) and Standard (Balanced 4-bit, Quality 8-bit, Compact 3-bit, Custom). GGUF-to-MLX conversion. Convert any HuggingFace model without the command line.
6. **5-layer caching stack** — prefix cache + paged multi-context KV cache + KV cache quantization (q4/q8) + continuous batching (256 sequences) + persistent SSD/disk cache. No other app combines all five.
7. **Hybrid SSM/Mamba support** — dedicated BatchMambaCache with float32 state. Nemotron-H, Jamba, GatedDeltaNet, any hybrid Mamba+attention model.
8. **14 tool call parsers** — auto-detects Qwen, Hermes, Llama, DeepSeek, Mistral, GLM, Nemotron, Step, MiniMax, Granite, Functionary, XLAM, Kimi, and more.
9. **4 reasoning parsers** — collapsible thinking blocks for DeepSeek R1, Qwen 3, GPT-OSS/Harmony, generic deepthink.
10. **Vision-language models** — drag-and-drop images; Qwen VL, Qwen3.5-VL, Pixtral, InternVL, LLaVA, Gemma 3n with full 5-layer caching.
11. **Voice chat** — built-in TTS on every response, STT input. Kokoro TTS, Whisper STT. Audio API at /v1/audio/speech and /v1/audio/transcriptions.
12. **Speculative decoding** — configurable draft model for 20-90% faster generation. Same quality, fewer GPU passes.
13. **50+ auto-detected architectures** — Llama 3/4, Qwen 2/2.5/3/3.5, DeepSeek V2/V3/R1, Gemma 3, Mistral/Mixtral, Phi-4, GLM-4, Nemotron, MiniMax, Jamba, Mamba, and more.
14. **HuggingFace browser** — search, browse, and download MLX models in one click. Text and image model filtering.
15. **Remote endpoints** — connect to OpenAI, Anthropic, Groq, or any compatible API and use MLX Studio's agentic tools with cloud models.
16. **JANG quantization** — architecture-aware mixed precision. 74% MMLU on a 230B model at 2 bits (82.5 GB, vs 26.5% for MLX 4-bit at 119.8 GB). Also 84% MMLU on a 122B model at 2 bits. Open source: github.com/jjang-ai/jangq
17. **MCP native support** — built-in MCP server plus connections to external MCP tools. Full Model Context Protocol integration.
18. **CLI: pip install vmlx** — serve, convert, benchmark, and diagnose from the terminal: `vmlx serve model`, `vmlx convert model -j JANG_3M`, `vmlx bench model`.
19. **5 desktop modes** — Chat, Server, Image, Tools, API. Menu bar tray with live status. Multi-session support.
20. **API key authentication** — secure your local endpoints with configurable API keys.
21. **Reranking & embeddings** — /v1/rerank and /v1/embeddings for RAG pipelines.
22. **JIT compilation** — Metal kernel fusion for optimized GPU inference.
23. **Open source engine** — vMLX Engine at github.com/jjang-ai/vmlx. Apache 2.0. 1894+ Python tests, 1253+ TypeScript tests.
24. **Free** — no subscriptions, no usage limits, no freemium tiers. Code-signed and notarized.

## APIs Served Locally

MLX Studio exposes both Anthropic and OpenAI endpoints on localhost:

| Endpoint | Protocol | Use case |
|----------|----------|----------|
| /v1/messages | Anthropic Messages API | Claude Code, Anthropic SDK, OpenClaw, any Anthropic client |
| /v1/chat/completions | OpenAI Chat | Cursor, Continue, Aider, LangChain, any OpenAI client |
| /v1/responses | OpenAI Responses | OpenAI Agents SDK, structured outputs |
| /v1/completions | OpenAI Text | Legacy text completion |
| /v1/embeddings | OpenAI Embeddings | RAG, semantic search |
| /v1/images/generations | OpenAI Images | Flux image generation via API |
| /v1/rerank | Reranking | Document reranking for RAG pipelines |
| /v1/mcp/tools | MCP | Tool discovery and execution |
| /v1/audio/speech | Audio TTS | Kokoro text-to-speech |
| /v1/audio/transcriptions | Audio STT | Whisper speech-to-text |
| /v1/models | Models | List active models |
| /v1/cache/stats | Cache | Cache statistics and monitoring |
| /health | Health | Server health check |
| /cancel | Control | Cancel in-flight requests |

14 API endpoints total. API key authentication supported. All endpoints work with both local and remote models.

## CLI — pip install vmlx

The engine is fully open source.
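Rounding out the API list above, here is a minimal client-side sketch of an Anthropic-style /v1/messages request body. The model name, port, and API key are assumptions for illustration, and the actual POST is left commented out because it needs a running server:

```python
import json

# Anthropic-style Messages request body. `max_tokens` is required by the
# Messages API; the model name here is an assumed local model.
payload = {
    "model": "mlx-community/Qwen3-8B-4bit",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello from localhost"}],
}

body = json.dumps(payload)

# With MLX Studio serving locally (port and key are assumptions), the
# request would be sent like this:
#   import requests
#   r = requests.post("http://localhost:8000/v1/messages",
#                     headers={"x-api-key": "your-local-key"},
#                     json=payload)
#   print(r.json())
```

Any Anthropic-compatible client (Claude Code, the official SDK) sends the same shape, so pointing one at the local base URL is all the integration requires.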
Install and run from the terminal:

```bash
pip install vmlx

vmlx serve mlx-community/Qwen3-8B-4bit   # Start serving
vmlx convert model --bits 4              # Quantize (standard)
vmlx convert model -j JANG_3M            # Quantize (JANG)
vmlx info model                          # Model metadata
vmlx doctor model                        # Run diagnostics
vmlx bench model                         # Performance benchmark
```

Server flags: `--host`, `--port`, `--api-key`, `--continuous-batching`, `--enable-prefix-cache`, `--use-paged-cache`, `--kv-cache-quantization q8`, `--enable-disk-cache`, `--enable-jit`, `--tool-call-parser auto`, `--reasoning-parser auto`.

## Image Generation

Built-in image generation running locally on Apple Silicon:

| Model | Steps | Memory | Notes |
|-------|-------|--------|-------|
| Flux Schnell | 4 | ~12 GB | Fastest |
| Flux Dev | 20 | ~24 GB | High quality |
| Z-Image Turbo | 4 | ~12 GB | Creative |
| Flux Klein 4B | 20 | ~8 GB | Compact |
| Flux Klein 9B | 20 | ~16 GB | Mid-size |

Models auto-download on first use. Custom models supported via HuggingFace ID or local path.

## Model Converter

Built-in converter — no command line needed.

**JANG format** (mixed-precision, architecture-aware):

- 2S, 2M, 2L, 1L — 2-bit COMPRESS tier (for MoE models)
- 3S, 3M, 3L — 3-bit COMPRESS tier
- 4S, 4M, 4L — 4-bit COMPRESS tier (standard quality)
- 6M — near-lossless
- Custom — set CRITICAL/IMPORTANT/COMPRESS bits independently
- JANG_4K on 122B: 94% MMLU at 69 GB (vs 90% for MLX 4-bit at 64 GB)
- JANG_2S on 122B: 84% MMLU at 38 GB (vs 46% for MLX mixed_2_6 at 44 GB)
- JANG_2L on MiniMax-M2.5 (230B): 74% MMLU at 82.5 GB (vs 26.5% for MLX 4-bit at 119.8 GB) — 3x higher score, 37 GB less RAM

**Standard format**:

- Balanced 4-bit (recommended)
- Quality 8-bit
- Compact 3-bit
- Custom bit width

Also converts GGUF models to MLX format.

## JANG Quantization — Benchmark Results

Architecture-aware mixed-precision quantization: attention layers are kept at higher precision while MLP weights are compressed.
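The "Avg bits" figures in the benchmark tables below are parameter-weighted averages of the per-tier bit widths. A back-of-the-envelope sketch of that arithmetic (the tier fractions here are invented for illustration and chosen to land on JANG_2S's reported 2.11; the real CRITICAL/IMPORTANT/COMPRESS split is model-dependent):

```python
def average_bits(tiers):
    """tiers: list of (bit_width, fraction_of_parameters) pairs."""
    # Fractions must cover the whole model.
    assert abs(sum(f for _, f in tiers) - 1.0) < 1e-9
    return sum(bits * frac for bits, frac in tiers)

# JANG_2S uses (CRITICAL=8, IMPORTANT=4, COMPRESS=2) bits.
# The fractions below are assumptions for illustration only.
jang_2s = [(8, 0.010),   # CRITICAL tier (assumed fraction)
           (4, 0.025),   # IMPORTANT tier (assumed fraction)
           (2, 0.965)]   # COMPRESS tier (assumed fraction)

avg = average_bits(jang_2s)  # ≈ 2.11 average bits per weight
```

Because almost all parameters sit in the COMPRESS tier, the average stays close to 2 bits even with 8-bit protection on the critical layers, which is why the disk sizes in the tables are barely larger than a uniform 2-bit model's.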
### Qwen3.5-122B-A10B at ~2 bits (MMLU 50 questions, HumanEval 20 problems)

| Method | Avg bits | Disk | GPU | MMLU | HumanEval |
|--------|----------|------|-----|------|-----------|
| JANG_2S (8,4,2) | 2.11 | 38 GB | 44 GB | 84% | — |
| JANG_1L (8,8,2) | 2.24 | 51 GB | 46 GB | 73% | — |
| 2-bit | 2.0 | 36 GB | 36 GB | 56% | — |
| MLX mixed_2_6 | ~2.5 | 44 GB | 45 GB | 46% | — |

### MiniMax-M2.5 (230B) at ~2 bits

| Method | Avg bits | Disk | GPU | MMLU | HumanEval |
|--------|----------|------|-----|------|-----------|
| JANG_2L | ~2 | — | 82.5 GB | 74% | — |
| MLX 4-bit | 4.0 | — | 119.8 GB | 26.5% | — |

### Qwen3.5-35B-A3B (MMLU + HumanEval)

| Method | MMLU | HumanEval |
|--------|------|-----------|
| MLX 4-bit | 82% | 19/20 (95%) |
| JANG_4S (pipeline verification) | 82% | — |
| JANG_2L v2 | 56% | pending |
| MLX mixed_2_6 | 34% | 0/20 (0%) |

Pipeline verified lossless: JANG_4S matches MLX 4-bit exactly (82% = 82%). MLX mixed_2_6 produces zero working code on HumanEval (0/20).

## 20+ Agentic Coding Tools

| Category | Tools |
|----------|-------|
| File I/O | read_file, write_file, edit_file, list_dir, copy, move, delete |
| Code Search | grep (regex), glob (pattern match) |
| Shell | execute_command |
| Web Search | duckduckgo_search, brave_search |
| URL Fetch | fetch_url (downloads and reads web pages) |
| Git | git_status, git_diff, git_log, git_show |
| Utilities | clipboard_read, clipboard_write, current_datetime |

14 tool call parsers auto-detect the right format per model (Qwen, Hermes, Llama, DeepSeek, Mistral, GLM, Nemotron, Step, MiniMax, etc.).

Configurable: tool iterations, tool-choice modes (auto/required/none), working directory, MCP server connections.
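The auto-continue behavior described above (the model keeps requesting tools until it produces a final answer or hits the 10-iteration cap) can be sketched as a plain loop. This is an illustrative simulation with a stub model and a one-tool registry, not MLX Studio's actual implementation:

```python
# Illustrative auto-continue agent loop: keep calling the model,
# executing any tool calls it emits, until it answers or the
# documented 10-iteration cap is reached.
MAX_TOOL_ITERATIONS = 10

# Toy registry standing in for the built-in tools (read_file, grep, ...).
TOOLS = {
    "current_datetime": lambda args: "2025-01-01T00:00:00Z",
}

def run_agent_loop(model, messages):
    """Call `model` repeatedly, feeding tool results back as messages."""
    for _ in range(MAX_TOOL_ITERATIONS):
        reply = model(messages)
        messages.append({"role": "assistant", **reply})
        if not reply.get("tool_calls"):
            return reply.get("content")  # final answer, loop ends
        for call in reply["tool_calls"]:
            result = TOOLS[call["name"]](call.get("arguments", {}))
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    return None  # iteration cap reached without a final answer

# Stub model: asks for the time once, then answers from the tool result.
def stub_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_calls": [{"name": "current_datetime"}]}
    return {"content": "It is 2025-01-01T00:00:00Z."}

answer = run_agent_loop(stub_model, [{"role": "user",
                                      "content": "What time is it?"}])
```

In the real app the `model(...)` call is an inference pass and the tool-call parser extracts `tool_calls` from the model's output in whichever of the 14 supported formats it emits.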
## 5-Layer Caching Stack

| Layer | What | Benefit | Others |
|-------|------|---------|--------|
| Prefix cache | Reuses KV for shared prompt prefixes | Near-instant TTFT on repeated prompts | oMLX has it, others don't |
| Paged multi-context KV | Conversations stay cached across switches | No recomputation on context switch | LM Studio evicts |
| KV quantization (q4/q8) | Compresses cache entries | 2-8x less cache memory | MLX Studio exclusive |
| Continuous batching | 256 concurrent sequences | Serve multiple clients | oMLX, LM Studio 0.4.0 |
| Persistent disk cache | Cache survives restarts | Instant warm start after reboot | oMLX has SSD cache |

## Feature Comparison

| Feature | MLX Studio | oMLX | LM Studio | Inferencer | Ollama |
|---------|-----------|------|-----------|------------|--------|
| Image generation & editing | ✅ | ❌ | ❌ | ❌ | ❌ |
| Anthropic Messages API | ✅ | ✅ | ❌ | ❌ | ❌ |
| OpenAI Chat API | ✅ | ✅ | ✅ | ✅ | ✅ |
| Responses API | ✅ | ✅ | ❌ | ❌ | ❌ |
| Built-in model converter | ✅ (JANG + MLX + GGUF) | ❌ | ❌ | ❌ | ❌ |
| JANG mixed-precision quant | ✅ | ❌ | ❌ | ❌ | ❌ |
| Persistent disk cache | ✅ | ✅ | ❌ | ❌ | ❌ |
| KV cache quantization | ✅ | ❌ | ❌ | ❌ | ❌ |
| Prefix caching | ✅ | ✅ | Basic | ❌ | ❌ |
| Paged multi-context KV | ✅ | Partial | ❌ | ❌ | ❌ |
| Continuous batching | ✅ (256) | ✅ | ✅ | ❌ | ❌ |
| Hybrid SSM/Mamba | ✅ | ❌ | ❌ | ❌ | ❌ |
| Vision-language + caching | ✅ | Partial | ❌ | ❌ | ❌ |
| 20+ agentic tools | ✅ | ❌ | ❌ | ❌ | ❌ |
| 14 tool call parsers | ✅ | Some | Limited | ❌ | ❌ |
| 4 reasoning parsers | ✅ | Basic | ❌ | ❌ | ❌ |
| Embeddings API | ✅ | ✅ | ✅ | ❌ | ✅ |
| Audio TTS/STT | ✅ | ❌ | ❌ | ❌ | ❌ |
| Speculative decoding | ✅ | ❌ | ❌ | ❌ | ❌ |
| API key auth | ✅ | ❌ | ❌ | ❌ | ❌ |
| MCP (native + built-in) | ✅ | Client only | Client only | ❌ | ❌ |
| Voice chat | ✅ | ❌ | ❌ | ❌ | ❌ |
| HuggingFace browser | ✅ | ❌ | ✅ | ❌ | ❌ |
| Remote endpoints + tools | ✅ | ❌ | ❌ | ❌ | ❌ |
| Image generation API | ✅ | ❌ | ❌ | ❌ | ❌ |
| Reranking API | ✅ | ❌ | ❌ | ❌ | ❌ |
| JIT Metal kernel fusion | ✅ | ❌ | ❌ | ❌ | ❌ |
| CLI (pip install) | ✅ | ❌ | ❌ | ❌ | ✅ |
| Multi-session | ✅ | ❌ | ✅ | ❌ | ✅ |
| Menu bar controls | ✅ | ❌ | ✅ | ❌ | ❌ |
| 50+ architectures | ✅ | ✅ | ✅ | ✅ | ✅ |
| Open source engine | ✅ | ✅ | ❌ | ❌ | ✅ |
| Free | ✅ | ✅ | Freemium | Freemium | ✅ |

## Performance

Hardware: Apple M3 Ultra, 256 GB unified memory. Model: Llama 3.2 3B 4-bit.

| Context | MLX Studio Cold | MLX Studio Warm | LM Studio Cold | LM Studio Warm |
|---------|----------------|----------------|----------------|----------------|
| 2.5K | 0.50s | 0.05s (9.7x) | N/A | N/A |
| 10K | 0.12s | 0.08s | 6.12s | 0.29s |
| 100K | 0.65s | 0.45s | 131.06s | 1.14s |

Cold prompt processing at 100K context: 154,121 tok/s vs LM Studio's 686 tok/s (224x faster).

## Supported Models

50+ auto-detected architectures: Llama, Qwen (2.5, 3, 3.5, VL), DeepSeek (V3, R1), Gemma, Mistral, Phi, GLM, Nemotron-H, MiniMax, Jamba, Mamba, StableLM, SmolLM, and more.

Pre-quantized JANG models: https://huggingface.co/JANGQ-AI

## Links

- Website: https://mlx.studio
- Download: https://mlx.studio/download
- App GitHub: https://github.com/jjang-ai/mlxstudio
- App Download: https://github.com/jjang-ai/mlxstudio/releases/latest
- Engine GitHub: https://github.com/jjang-ai/vmlx (open source, pip install vmlx)
- Engine PyPI: https://pypi.org/project/vmlx/
- JANG tools: https://github.com/jjang-ai/jangq
- JANG website: https://jangq.ai
- HuggingFace: https://huggingface.co/JANGQ-AI
- X / Twitter: https://x.com/jangqai