The key difference is that MLX's array model assumes unified memory from the ground up. llama.cpp's Metal backend works fine but carries abstractions from the discrete GPU world — explicit buffer synchronization, command buffer boundaries — that are unnecessary when CPU and GPU share the same address space. You'll notice the gap most at large context lengths where KV cache pressure is highest.
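To make that concrete, here is roughly what the unified-memory model looks like from the Python side of MLX: no explicit device placement, host/device copies, or buffer synchronization in user code. This is just an illustrative sketch (the matrix sizes are arbitrary), not a claim about how either runtime schedules its kernels internally.

    import numpy as np
    import mlx.core as mx

    # No .to(device) step and no explicit staging buffers: the arrays live
    # in memory that both CPU and GPU can address.
    a = mx.array(np.random.rand(4096, 4096).astype(np.float32))
    b = mx.array(np.random.rand(4096, 4096).astype(np.float32))

    # Operations build a lazy graph; nothing has been dispatched yet.
    c = a @ b

    # Evaluation runs on the default device (typically the GPU on Apple
    # Silicon), and there is no separate "copy result back to host" call.
    mx.eval(c)

    # Reading the values from the CPU side just works, since CPU and GPU
    # share the same address space.
    print(np.array(c)[:2, :2])

The contrast with a discrete-GPU design is that none of the synchronization points you'd normally manage around uploads and readbacks appear in the programming model at all.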
That tracks with what I've noticed in practice. Short prompts feel basically the same between llama.cpp Metal and what I'd expect from native MLX, but once the context gets longer the overhead starts showing up. Would be interesting to see whether Ollama's MLX path actually handles the KV cache differently under the hood, or whether it just skips the buffer sync layer.
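For a sense of scale, here's a back-of-the-envelope on why long contexts are where any per-buffer overhead starts to matter. The model shape below is a hypothetical 7B-class config with grouped-query attention, not a measurement of Ollama or llama.cpp:

    # Rough KV cache size: 2x (keys + values), one entry per layer,
    # per KV head, per position, at 2 bytes per element for fp16.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

    for seq_len in (512, 4096, 32768):
        gib = kv_cache_bytes(32, 8, 128, seq_len) / 2**30
        print(f"{seq_len:>6} tokens -> {gib:.2f} GiB of KV cache (fp16)")

That goes from roughly 0.06 GiB at 512 tokens to about 4 GiB at 32k, so any per-buffer bookkeeping the backend does gets multiplied right where memory pressure is already highest.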
If it's mostly a matter of skipping some buffer synchronization, that's something llama.cpp's own Metal backend could adopt as well, at least on Apple Silicon.