Double-click vMLX-1.2.1-arm64.dmg to mount it. The app is code-signed with a Developer ID and notarized by Apple, so Gatekeeper shows no warnings.
Drag the vMLX icon to the Applications folder. Close the DMG window and eject the disk image.
Open vMLX from Applications or Spotlight. On first launch, it will offer to install the MLX inference engine with one click.
Search for and download any MLX model from Hugging Face directly in the app, or point it at models you already have. We publish optimized models at huggingface.co/dealignai.
Create a session, hit Start, and you're running a local AI inference server with prefix caching, a paged KV cache, voice, vision, and 20+ agentic tools. An OpenAI-compatible API is available at http://127.0.0.1:8000.
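Once the server is up, any OpenAI client can talk to it. Here's a minimal sketch in Python, assuming the standard /v1 path and a placeholder model name (substitute the identifier your session actually serves):

```python
# pip install openai
from openai import OpenAI

# Point the stock OpenAI client at the local vMLX server.
# The /v1 suffix and "local-model" are assumptions; use the model
# identifier shown in your vMLX session.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)
```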
More unified memory = larger models. 16 GB handles up to ~20B parameters, 32 GB handles ~35B, 64 GB handles ~70B, and 192 GB handles 400B+ MoE models. vMLX's KV cache quantization (q4/q8) lets you push these limits further.
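As a rough sanity check on those numbers (a back-of-the-envelope heuristic, not vMLX's exact accounting): quantized weights take about params × bits / 8 bytes, and the rest of the budget goes to the KV cache, activations, and macOS itself.

```python
# Rough weight-memory estimate for a quantized model.
# Heuristic only; real usage adds KV cache, activations, and OS
# overhead on top of the weights.
def weight_gb(params_billions: float, bits: int = 4) -> float:
    return params_billions * 1e9 * bits / 8 / 1e9

for params in (20, 35, 70):
    print(f"{params}B @ 4-bit ≈ {weight_gb(params):.1f} GB of weights")
# 20B ≈ 10.0 GB -> fits a 16 GB machine with headroom
# 35B ≈ 17.5 GB -> fits 32 GB
# 70B ≈ 35.0 GB -> fits 64 GB
```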
vMLX is a single self-contained app. The DMG includes everything you need: no Python, pip, Docker, or command-line setup required. On first launch, it installs the MLX inference engine into a bundled environment with one click.
Features:
- Multi-model sessions and streaming chat
- 5-layer caching: prefix + paged KV + q4/q8 quantization + batching + disk
- Mamba/SSM support with 50+ auto-detected architectures
- 14 tool-call parsers and 4 reasoning parsers
- 20+ agentic tools (file, shell, git, web search, browser)
- Voice chat, vision/multimodal, and embeddings
- Benchmarks and an OpenAI-compatible API
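Streaming chat works through the same OpenAI-compatible endpoint. A sketch, with the same placeholder model name as above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[{"role": "user", "content": "Explain prefix caching in one sentence."}],
    stream=True,  # tokens arrive as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```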