Actually is purely bandwidth-bound, the major bottleneck of the whole process, for me in this case, is the B450 mobo I got that's only capable of pcie3 and 1x8 in the pcie lanes for gpu instead of 1x16; so I'm capped until I get an X570 maybe. I should get around twice or triple the tok speed with that upgrade alone
I updated the documentation to provide more info for the patching process, I added the patches themselves too and provided some risk info about the patches
I did it, but with different quantization compressions, It ran into quality issues, I will try to rerun with the same quants if that fixes the issue, but the most that looks unused, its being used by rotating layers that are being swapped by the cpu from the ram itself, that manages to keep layers warm, ready to use while inferencing and discarding already used ones
The idea was basically to run a llm on a ps2, then I ran into some problems as the 32mb ram cap with 4mb vram cap; so I had to figure out a way to stream layers on the forward pass. Given that ps2 manages to give instructions directly to the vram that's capable of 32bit addresses, it gave an insane amount of tok/s, then I wondered if I could do the same on my puter
nvmlib/ssd-gpu-dma and BaM (based on the same code base) are pretty cool as they allow you to initiate disk reads/writes directly from a CUDA kernel (so not only reading/writing directly to GPU memory but also allowing the GPU to initiate IO on its own). Sometimes called GPU-initiated I/O or accelerator-initiated I/O.