
How about on a MacBook Pro M2 Max with 64GB RAM? Any recommendations for local models for coding on that?

I tried to run some of the differently sized DeepSeek R1 models locally when they first came out, but couldn't get any of them running at the time, and I had to download a lot of data just to try. So if you know a specific DeepSeek R1 size that will work with 64GB RAM on a MacBook Pro M2 Max, or another great local LLM for coding on that machine, that would be super appreciated.



I imagine either of these in quantized form would fit pretty well and be decent: the DeepSeek R1 Qwen 32B distill[1] or Qwen 3 32B[2].

Specifically, the `Q6_K` quant looks solid at ~27gb. That leaves enough headroom on your 64gb MacBook that you can actually load a decent amount of context (it takes extra VRAM for every token of context you need).

Rough math based on this calculator[0] puts it at around 10gb per 32k tokens of context. And that doesn't seem to change with quant size -- you just have to have enough headroom.

So with 64gb:

- ~27gb for the Q6 quant

- 10-20gb for context of 32-64k

That leaves you around 20gb for application memory and _probably_ enough context to actually be useful for larger coding tasks! (It just might be slow, but you can use a smaller quant to get more speed.)
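
For what it's worth, the budget above is easy to sanity-check in a few lines of Python. The numbers are just the rough estimates from this thread (~27gb for the Q6_K weights, ~10gb of KV cache per 32k tokens), not measurements:

    # Rough memory budget for a 32B model at Q6_K on a 64GB Mac.
    # These figures are estimates from the thread, not measurements.
    total_ram_gb = 64
    weights_gb = 27            # Q6_K quant of a 32B model
    kv_gb_per_32k_tokens = 10  # rough figure from the VRAM calculator

    for context_tokens in (32_000, 64_000):
        kv_gb = kv_gb_per_32k_tokens * context_tokens / 32_000
        leftover_gb = total_ram_gb - weights_gb - kv_gb
        print(f"{context_tokens} ctx: {weights_gb} (weights) + {kv_gb:.0f} (KV) "
              f"-> {leftover_gb:.0f} GB left for the OS and apps")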

I hope that helps!

0: https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...

1: https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-32...

2: https://huggingface.co/Qwen/Qwen3-32B-GGUF


I really like Mistral Small 3.1 (I have a 64GB M2 as well). Qwen 3 is worth trying in different sizes too.

I don't know if they'll be good enough for general coding tasks though - I've been spoiled by API access to Claude 3.7 Sonnet and o4-mini and Gemini 2.5 Pro.


How do you determine peak memory usage? Just look at activity monitor?

I've yet to find a good overview of how much memory each model needs at different context lengths (other than the back-of-the-envelope #weights * bits). LM Studio warns you if a model will likely not fit, but it's not very exact.


MLX reports peak memory usage at the end of the response. Otherwise I'll use Activity Monitor.
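
For example, with mlx-lm's Python API a verbose generate call prints those stats once the response finishes (the model repo below is just an example; the exact output format depends on your mlx-lm version):

    from mlx_lm import load, generate

    # Example model repo -- substitute whatever quantized model you're testing.
    model, tokenizer = load("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")

    # verbose=True prints generation stats (tokens/sec and, in the versions
    # I've used, a peak-memory line) after the response.
    generate(model, tokenizer, prompt="Write a binary search in Python.",
             max_tokens=256, verbose=True)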


I'm also trusting `get_peak_memory` + some small buffer for now.

That said, while it reports accurate peak memory usage for tensors living on the GPU, it seems to miss some of the non-Metal overhead, however small (https://github.com/aukejw/mlx_transformers_benchmark/issues/...).
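
Concretely, something like this, with the caveat that the function names moved between MLX versions (newer releases expose them at the top level, older ones under `mx.metal`), and the process-level number is only a rough cross-check:

    import resource
    import mlx.core as mx

    # ... run your model / generation here ...

    # Peak memory seen by Metal. Assumption: newer MLX exposes this at the
    # top level; older versions have mx.metal.get_peak_memory() instead.
    metal_peak_gb = mx.get_peak_memory() / 1024**3

    # Peak resident size of the whole process, which also includes the
    # non-Metal overhead. Note: ru_maxrss is bytes on macOS, kilobytes on Linux.
    rss_peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**3

    print(f"Metal peak:   {metal_peak_gb:.2f} GB")
    print(f"Process peak: {rss_peak_gb:.2f} GB")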


There are plenty of smaller (quantized) models that fit well on your machine! On an M4 with 24GB it's already possible to comfortably run 8B quantized models.

I'm benchmarking runtime and memory usage for a few of them: https://aukejw.github.io/mlx_transformers_benchmark/
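
For intuition, the weights of an 8B model at ~4-bit are only a few gb, leaving most of a 24GB machine for KV cache and everything else (a back-of-the-envelope sketch, ignoring per-file overhead):

    # Back-of-the-envelope weight memory for an 8B model at ~4-bit quantization.
    params = 8e9
    bits_per_weight = 4.5  # 4-bit quants average a bit above 4 due to scales
    weights_gb = params * bits_per_weight / 8 / 1024**3
    print(f"~{weights_gb:.1f} GB of weights")  # roughly 4 GB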



