I'm totally new to AI. If I take, for example, Llama 3.1 (the small 8B size), what's the rough budget to fine-tune it on, say, 1GB of extra text data using any cloud GPU service? (If compute time is not a problem, I can wait.)
Let's assume that the average token in your 1GB file is 4 characters long (which is roughly what the OpenAI tokenizer gets; I assume the Llama tokenizer is similar). 4 characters is 4 bytes, assuming UTF-8 text that's mostly ASCII, so your training data is about 250MM tokens.
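For what it's worth, here's that arithmetic as a quick Python sketch. The file name is a placeholder, the 4-bytes-per-token ratio is just the rule of thumb above, and the optional measurement assumes you have access to the gated Llama 3.1 repo on Hugging Face:

```python
import os

def estimate_tokens_from_size(path: str, bytes_per_token: float = 4.0) -> int:
    """Rough estimate: file size divided by an assumed bytes-per-token ratio."""
    return int(os.path.getsize(path) / bytes_per_token)

def measure_ratio(path: str, sample_bytes: int = 1_000_000) -> float:
    """Tokenize the first ~1MB with the Llama tokenizer and return actual bytes per token."""
    from transformers import AutoTokenizer  # gated repo; needs approved access to Llama 3.1
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
    with open(path, "rb") as f:
        sample = f.read(sample_bytes).decode("utf-8", errors="ignore")
    return len(sample.encode("utf-8")) / len(tok(sample)["input_ids"])

# "corpus.txt" stands in for your 1GB text file.
print(estimate_tokens_from_size("corpus.txt"))  # ~250MM tokens at 4 bytes/token
```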
Let's assume you're doing a single-epoch LoRA training run. A single H100 is enough to train Llama 3.1 8B, and it should crank through ~250MM tokens in a couple of hours, IMO. Since you're not doing multi-GPU training, a PCIe H100 is fine (you don't need the slightly pricier SXM variant), and the PCIe versions go for about $2.50/hr on Runpod.
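If it helps to see the shape of it, a single-GPU LoRA run with plain Hugging Face transformers + peft looks roughly like the sketch below. This isn't a tuned recipe: the hyperparameters, sequence length, and file name are illustrative guesses, and something like Unsloth would be faster. A proper run would also pack the corpus into fixed-length blocks rather than truncating per line.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Meta-Llama-3.1-8B"  # gated repo; requires approved access
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto")
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))

# Raw-text dataset: each line of "corpus.txt" becomes one example, truncated to 2048 tokens.
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama31-lora",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=1,          # single epoch, as above
        learning_rate=2e-4,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama31-lora-adapter")
```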
So, about $5 for a custom model that's probably the best in the world at whatever your task is! (Even if it might be a little dumber at other tasks.) Insanely cheap when you think about it.
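As a back-of-envelope check on that figure (the throughput number below is an assumption for illustration, not a benchmark; plug in whatever your actual run achieves):

```python
def lora_cost(tokens: float, tokens_per_sec: float = 30_000, usd_per_hour: float = 2.50):
    """Wall-clock hours and on-demand cost for a training run at a given throughput."""
    hours = tokens / tokens_per_sec / 3600
    return hours, hours * usd_per_hour

hours, dollars = lora_cost(250e6)
print(f"{hours:.1f} h, ${dollars:.2f}")  # ~2.3 h, ~$5.79 under these assumptions
```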
TPUs won't beat H100s on price for on-demand personal use cases, but for reserved capacity (i.e. businesses) they're slightly cheaper.
You can dump in 1GB of data (Unsloth supports raw-text training), but whether you'd get good results or a useless model is a different issue. I doubt you'd get a good result unless you combine it with question/answer training as well, assuming raw-text training is even useful for your scenario in the first place.
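To make "question/answer training" concrete, here's a small sketch of rendering Q/A pairs into training strings via the chat template. The example pair and field names are made up, and I'm using the Instruct checkpoint's tokenizer so a chat template is available:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

qa_pairs = [
    {"question": "What does the Foo pipeline ingest?",  # hypothetical domain Q/A pair
     "answer": "Raw server logs, which it deduplicates and indexes nightly."},
]

def to_sft_example(pair: dict) -> str:
    messages = [
        {"role": "user", "content": pair["question"]},
        {"role": "assistant", "content": pair["answer"]},
    ]
    # Renders the pair into Llama 3.1's chat format as a single training string.
    return tok.apply_chat_template(messages, tokenize=False)

print(to_sft_example(qa_pairs[0]))
```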