Are you getting anything besides gibberish out of it? I tried their recommended command line and it's dog slow, even though I built their llama.cpp fork with AVX2 enabled. This is what I get:
$ ./build/bin/llama-cli -hf prism-ml/Bonsai-8B-gguf -p "Explain quantum computing in simple terms." -n 256 --temp 0.5 --top-p 0.85 --top-k 20 -ngl 99
> Explain quantum computing in simple terms.
\( ,
None ( no for the. (,./. all.2... the ..... by/
EDIT: It runs fine in their Colab notebook. Looking at that, you have to run git checkout prism in the llama.cpp repo before you build. That instruction is missing if you go straight to their fork of llama.cpp. Works fine now.
UPDATE: I was using the llama.cpp CPU backend and was still getting gibberish; on Google Colab they're running with CUDA. I turned Claude loose on it and it found a bug in the llama.cpp CPU backend where a float was being converted to an int and basically truncated to 0. Now it runs fine locally with the CPU backend.
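For the curious, the failure mode is easy to reproduce in isolation. This is a minimal sketch of the bug class, not the actual code from the fork (the function names and the 0.0038 scale are made up): a per-block scale below 1.0 gets assigned into an int, truncates to 0, and every dequantized weight in the block comes out as 0, so the model emits noise.

// Hypothetical sketch of the bug class, not the fork's actual code.
// A sub-1.0 per-block scale truncates to 0 when converted to int,
// zeroing out every dequantized weight in the block.
#include <stdio.h>

float dequant_buggy(signed char q, float scale) {
    int s = scale;     // BUG: float -> int truncation, 0.0038 -> 0
    return q * s;      // always 0 whenever |scale| < 1
}

float dequant_fixed(signed char q, float scale) {
    return q * scale;  // keep the scale as a float
}

int main(void) {
    printf("buggy: %f\n", dequant_buggy(3, 0.0038f));  // 0.000000
    printf("fixed: %f\n", dequant_fixed(3, 0.0038f));  // 0.011400
    return 0;
}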
Then I found out they didn't implement AVX2 for their Q1_0_g128 CPU kernel. I added that and I'm getting ~12 t/s, which isn't shabby for this old machine.
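For anyone wanting to do the same, here's roughly the shape such a kernel takes. This is my own sketch under assumed details, not the fork's actual Q1_0_g128 code: I'm guessing each 128-weight group is 16 bytes of packed sign bits (bit set = +1, clear = -1) plus one fp32 scale, with int8 activations; the real block layout almost certainly differs. Compile with -mavx2.

// Rough AVX2 sketch of a 1-bit, group-of-128 dot product.
// Layout and names are assumptions, not the fork's actual code.
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Expand 32 packed sign bits into 32 bytes of +1 / -1 (LSB-first).
static inline __m256i signs_from_bits(uint32_t bits) {
    const __m256i shuf = _mm256_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);
    const __m256i sel = _mm256_set1_epi64x(0x8040201008040201ULL);
    __m256i b = _mm256_shuffle_epi8(_mm256_set1_epi32((int)bits), shuf);
    __m256i m = _mm256_cmpeq_epi8(_mm256_and_si256(b, sel), sel);
    // m is 0xFF where the bit is set; map to +1 / -1
    return _mm256_sub_epi8(_mm256_and_si256(m, _mm256_set1_epi8(2)),
                           _mm256_set1_epi8(1));
}

// dot(weights, y) for n weights, n a multiple of 128.
// Assumes activations stay in [-127, 127] so sign_epi8 can't overflow.
float dot_q1_g128_avx2(const uint8_t *packed, const float *scales,
                       const int8_t *y, int n) {
    const __m256i ones8  = _mm256_set1_epi8(1);
    const __m256i ones16 = _mm256_set1_epi16(1);
    float total = 0.0f;
    for (int g = 0; g < n / 128; g++) {
        __m256i acc = _mm256_setzero_si256();
        for (int i = 0; i < 4; i++) {          // 4 x 32 = 128 weights
            uint32_t bits;
            memcpy(&bits, packed + g * 16 + i * 4, 4);
            __m256i w  = signs_from_bits(bits);
            __m256i yv = _mm256_loadu_si256(
                (const __m256i *)(y + g * 128 + i * 32));
            // p = +/- y per byte, then widen and accumulate to int32
            __m256i p   = _mm256_sign_epi8(yv, w);
            __m256i s16 = _mm256_maddubs_epi16(ones8, p);
            acc = _mm256_add_epi32(acc, _mm256_madd_epi16(s16, ones16));
        }
        // horizontal sum of the 8 int32 lanes, then apply group scale
        __m128i lo = _mm256_castsi256_si128(acc);
        __m128i hi = _mm256_extracti128_si256(acc, 1);
        __m128i s  = _mm_add_epi32(lo, hi);
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1));
        total += scales[g] * (float)_mm_cvtsi128_si32(s);
    }
    return total;
}

The sign_epi8 + maddubs_epi16 pairing is the standard AVX2 idiom llama.cpp uses elsewhere for signed int8 dot products, so a new kernel along these lines fits the codebase's existing style.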
Cool model.