
looks like they're too busy being awesome. i need a fake video to understand this!

What memory will this need? I guess it won't run on my 12GB of VRAM.

"moe": {"num_experts_per_tok": 2, "num_experts": 8}

I bet many people will re-discover BitTorrent tonight.



Looks like it will squeeze into 24GB once the llama.cpp-family runtimes work out support for it.

It's also a good candidate for splitting across small GPUs, maybe.

One architecture I can envision is hosting prompt ingestion and the "host" model on the GPU, and the downstream expert weights on the CPU/iGPU. This is actually pretty efficient, as the CPU/iGPU is really bad at prompt ingestion but reasonably fast at token generation for a ~14B-parameter active set.
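
A minimal PyTorch sketch of that split, with toy dimensions so it runs anywhere; the module names and shapes are illustrative, not from llama.cpp or MLC:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  # Toy dimensions so the sketch runs anywhere; a real 8x7B-style model
  # would be roughly HIDDEN=4096, FFN=14336 (assumed, not official).
  HIDDEN, FFN, NUM_EXPERTS, TOP_K = 64, 256, 8, 2
  gpu = "cuda" if torch.cuda.is_available() else "cpu"

  # The cheap router lives on the GPU (as would attention in a real model).
  router = nn.Linear(HIDDEN, NUM_EXPERTS, bias=False).to(gpu)

  class ExpertMLP(nn.Module):
      # One gated-MLP expert; its weights stay in CPU RAM.
      def __init__(self):
          super().__init__()
          self.w1 = nn.Linear(HIDDEN, FFN, bias=False)
          self.w2 = nn.Linear(FFN, HIDDEN, bias=False)
          self.w3 = nn.Linear(HIDDEN, FFN, bias=False)
      def forward(self, x):
          return self.w2(F.silu(self.w1(x)) * self.w3(x))

  experts = [ExpertMLP() for _ in range(NUM_EXPERTS)]  # all on CPU

  def moe_forward(hidden):  # hidden: [tokens, HIDDEN], on the GPU
      weights, idx = torch.topk(F.softmax(router(hidden), dim=-1), TOP_K)
      weights = weights / weights.sum(-1, keepdim=True)  # renormalize top-k
      out = torch.zeros_like(hidden)
      for e in range(NUM_EXPERTS):
          tok, slot = (idx == e).nonzero(as_tuple=True)
          if tok.numel() == 0:
              continue
          # Only the routed tokens' activations cross the bus; the big
          # expert weights never leave CPU memory.
          y = experts[e](hidden[tok].cpu()).to(gpu)
          out[tok] += weights[tok, slot].unsqueeze(-1) * y
      return out

  print(moe_forward(torch.randn(4, HIDDEN, device=gpu)).shape)  # [4, 64]

The design point is that per-token activation traffic is tiny compared to the expert weights, so keeping the experts resident on the CPU side and shipping activations is the cheaper direction.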

llama.cpp already all but does this, and I'm sure MLC will implement it as well.


BitTorrent was all the rage when the original LLaMA weights leaked as a torrent. Then Facebook started taking down all the Hugging Face repos, and a bunch of people temporarily moved to torrent releases. Llama 2 changed all this, but it was a fun time.



