
> I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.

That seems kinda low. Are you using Metal GPU acceleration with llama.cpp? I don't have a MacBook, but I've seen llama.cpp benchmarks suggesting it can reach close to 30 tok/s with GPU acceleration.
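If you want to sanity-check Metal outside of a GUI, here's a minimal sketch using llama-cpp-python (which builds against Metal by default on Apple Silicon). The model path is a placeholder for whatever GGUF quant you have locally:

    # Load a GGUF quant with all layers offloaded to the GPU via Metal.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mistral-7b-instruct.Q4_0.gguf",  # placeholder path
        n_gpu_layers=-1,  # -1 = offload every layer to the GPU
        n_ctx=2048,
    )

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(out["choices"][0]["text"])

If n_gpu_layers is left at its default of 0, everything runs on the CPU, which could explain numbers in the ~12 tok/s range.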



Thanks for the tip. I'm on the M2 Air with 16 GB of RAM.

If anyone is getting faster than 12 tok/s on an Air, let me know.

I'm using the LM Studio GUI on top of llama.cpp with the "Apple Metal GPU" option enabled. Without Metal, increasing CPU threads seemingly does nothing either.

RAM usage hovers at 5.5 GB with a q5_k_m quant of Mistral.
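For what it's worth, here's a rough way to check whether Metal offload (rather than thread count) is actually what moves the needle, assuming llama-cpp-python and a local q5_k_m GGUF (the filename is a placeholder):

    import time
    from llama_cpp import Llama

    def tok_per_sec(**kwargs):
        # Fresh model load per configuration; placeholder model path.
        llm = Llama(model_path="./mistral-7b.Q5_K_M.gguf", verbose=False, **kwargs)
        start = time.perf_counter()
        out = llm("Summarize the plot of Hamlet.", max_tokens=96)
        dt = time.perf_counter() - start
        return out["usage"]["completion_tokens"] / dt

    print("CPU, 4 threads :", tok_per_sec(n_gpu_layers=0, n_threads=4))
    print("CPU, 8 threads :", tok_per_sec(n_gpu_layers=0, n_threads=8))
    print("Metal offload  :", tok_per_sec(n_gpu_layers=-1))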


Try different quantizations; I got vastly different speeds depending on which one I chose. I believe q4_0 worked very well for me, although for a 7B model q8_0 runs just fine too, with better quality.
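One way to compare quants head-to-head is to time the same prompt across each file; a sketch assuming llama-cpp-python and that the GGUF files are already downloaded (filenames are placeholders):

    import time
    from llama_cpp import Llama

    # Hypothetical local quant files of the same 7B model.
    quants = [
        "./mistral-7b.Q4_0.gguf",
        "./mistral-7b.Q5_K_M.gguf",
        "./mistral-7b.Q8_0.gguf",
    ]

    prompt = "Explain speculative decoding in one paragraph."

    for path in quants:
        llm = Llama(model_path=path, n_gpu_layers=-1, verbose=False)
        start = time.perf_counter()
        out = llm(prompt, max_tokens=128)
        dt = time.perf_counter() - start
        tok = out["usage"]["completion_tokens"]
        print(f"{path}: {tok / dt:.1f} tok/s")
        del llm  # release this quant before loading the next one

Loading one model at a time (and dropping it before the next load) matters on a 16 GB machine, since two 7B quants resident at once can push you into swap.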


LlamaFile typically outperforms LM Studio and even Ollama.



