The more VRAM the better if you'd like to run larger LLMs. Old Nvidia P40 (Pascal, 24GB) cards are easily available for $200 or less and would be an easy/cheap way to play around. Here's a recent writeup on the LLM performance you can expect for inferencing (training speeds I assume would be similar): https://www.reddit.com/r/LocalLLaMA/comments/13n8bqh/my_resu...
This repo lists very specific VRAM usage for various LLaMA models (with group size, and accounting for the context window, which is often missing) - these are all 4-bit GPTQ quantized models: https://github.com/turboderp/exllama
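For a rough sanity check before downloading anything, here's a back-of-envelope estimate in plain Python. The model dimensions below are assumptions (roughly LLaMA-13B at 2048 context); the exllama tables are the real reference since they account for group-size scales and actual measured usage:

    # Back-of-envelope VRAM estimate for a 4-bit quantized LLaMA model.
    # Dimensions are assumptions (LLaMA-13B-ish); real usage also includes
    # group-size scales, activations, and framework overhead.
    n_params = 13e9    # parameter count
    n_layers = 40      # transformer layers
    d_model  = 5120    # hidden size
    n_ctx    = 2048    # context window

    weights_gb  = n_params * 0.5 / 1e9                        # ~0.5 bytes/param at 4-bit
    kv_cache_gb = 2 * n_layers * n_ctx * d_model * 2 / 1e9    # K+V cache in fp16
    print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_cache_gb:.1f} GB")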
Note that the latest versions of llama.cpp now have decent GPU support, include a memory tester, and let you load partial models (n layers) into your GPU. It inferences about 2X slower than exllama in my testing on an RTX 4090, but still about 6X faster than my CPU (Ryzen 5950X).
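If you'd rather drive the partial offload from Python than the CLI, recent llama-cpp-python bindings expose it as n_gpu_layers. The model path and layer count below are placeholders; raise n_gpu_layers until you run out of VRAM:

    # Partial GPU offload via the llama-cpp-python bindings.
    # Model path and layer count are placeholders for illustration.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-13b.ggml.q4_0.bin",  # hypothetical path
        n_ctx=2048,
        n_gpu_layers=40,  # how many layers to keep on the GPU
    )
    out = llm("Q: What is the capital of France? A:", max_tokens=16)
    print(out["choices"][0]["text"])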
Again, this is all inferencing. For training, keep an eye on 4-bit bitsandbytes, coming soon: https://twitter.com/Tim_Dettmers/status/1657010039679512576
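For reference, here's a sketch of what 4-bit loading for fine-tuning looks like in the transformers + bitsandbytes integration. The option names follow the announced 4-bit/QLoRA work and may differ from what actually ships; the model id is just an example:

    # Sketch: loading a model in 4-bit via bitsandbytes, assuming the
    # BitsAndBytesConfig API from the announced 4-bit/QLoRA integration.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # normal-float 4-bit
        bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    )
    model = AutoModelForCausalLM.from_pretrained(
        "huggyllama/llama-7b",                  # example model id
        quantization_config=bnb_config,
        device_map="auto",
    )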