A separate comment about conclusions about why they are worse than OpenAI GPT2 -...

gpjt · 2025-12-09T15:30:18 1765294218

OP here -- thanks! I'm in the process of doing some trains using the same code plus DDP on big Lambda Labs machines, and (within the bounds of what I can afford) will hopefully have some interesting results about all of those shortly.

gpjt · 2025-12-09T19:37:54 1765309074

OK, early indicators support both you and Gemini quite strongly re: batch size. On my (somewhat ad-hoc) test dataset, I get losses like this:

  * OpenAI medium weights: 3.231
  * OpenAI small weights: 3.500
  * My locally trained model, FineWeb Chinchilla, batch size 6: 3.944
  * My locally trained model, FineWeb-Edu Chinchilla, batch size 6: 4.167
  * My locally trained model, FineWeb-Edu double Chinchilla, batch size 6: 4.135
  * My cloud trained model, FineWeb Chinchilla, batch size 13 \* 8 = 104: 3.674

That last one was trained on an 8x A100 machine with 40 GiB per GPU, with the same code as before, just converted to DDP. It certainly looks like the much larger batch size has improved the model significantly.

I'll be trying on larger machines. No gradient accumulation yet, but it's certainly looking like a valuable lever to pull for local training runs (and, I suspect, might also be useful on "small" cloud machines like the one I used -- will have to see what things look like with the bigger mini-batches I can squeeze onto 80 GiB and 160 GiB GPUs).

spi · 2025-12-10T09:31:28 1765359088

Thanks, very nice to see these results! Certainly using GPUs with more RAM makes things simpler to scale. Gradient accumulation is as easy as adding a counter for number of steps and an "if counter % gradient_accumulation_steps:` around `optimizer.step()`, so that can also be tried simply on a single GPU / cheaper GPUs. But if you can just use 8xA100 and your pipeline parallizes well, you also get results (almost) 8 times faster, which is certainly nicer to experiment of course!

gpjt · 2025-12-10T13:41:22 1765374082

Exactly! If I can get it down to an hour or two (seems very plausible on an 8x H200 with 160 GiB VRAM per GPU, though those are almost never available on Lambda Labs), I'll do the experiments with dropout and the other possible causes of issues, then see if I can bake that all into a new train on the RTX 3090 and confirm it repros there. Looks like I'll definitely need gradient accumulation there.

I assume the zero_grad would need to go in the same if block?

gpjt · 2025-12-11T01:57:42 1765418262

Hmm, interesting. With a batch size of 512 (8x B200s with 160 GiB each) I get worse results! Maybe there's a sweet spot somewhere in between.

spi · 2025-12-12T13:43:00 1765546980

Sorry came a bit late to this reply. Interesting, well, nobody says it's a monotonic function :-) in the limit of _very_ large batches you of course are worse off, because you take a very large amount of computation before taking a single step, so if you stop after a fixed amount of time your model just didn't have the time to learn properly. So certainly there is a sweet spot somewhere.

I suppose, the real "function" is a bit more complicated because (1) If you put 2x more data through the same GPU with large enough memory, it will take less than 2x the time to compute (but certainly not 1x). (2) At some point, empirically, increasing batch size makes it _worse_ even if you ignore the additional runtime cost (i.e. stop after n gradient update steps, and not x seconds). To my knowledge, the accepted reason for that fact is that a bit of noise helps in regularizing learning, because overly smooth learning curves end up stagnating in local loss minima more easily. In truth, I think nobody exactly understand how deep learning models work :-)

And to your other question - sorry again for the late answer. Yes, `optimizer.zero_grad()` should always be called directly after `optimizer.step()`, therefore with gradient accumulation once every `n` steps (otherwise, you'd be zeroing out the gradients, so just throwing away all the compute you did in previous steps).

gpjt · 2025-12-12T14:51:31 1765551091

Thanks re: gradient accumulation, I'm glad to hear my intuition was right!

As part of the upcoming post I'm running the DDP train on A100s with 40 GiB and 80 GiB, H100s with 80 GiB, and B200s with 160 GiB, so I'll have at least three loss vs. batch size points to plot. So that might be interesting.

I guess a full test would be to train at various batch sizes on the 160 GiB machine and plot the resulting loss. That would be very expensive as a hobby project (the bs=64 train cost a bit more than $40 excluding overhead) so I won't do it.

But perhaps a shorter train would still be of value? That is, train for 300M tokens for a tenth of the cost and see where the loss landed? The problem with that would be if the impact of batch sizes varied with the length of the train, eg. if batch size 64 was better than 512 for short trains but weaker at longer ones.

spi · 2025-12-13T20:40:30 1765658430

Yes exactly, I fear that shortening the training time would skew the results. In the very short term, smaller batch size is typically better just because you need a certain amount of gradient updates to move away from the original random, hence pretty terrible, weight distribution. Larger batch size gives a steadier, but slower, convergence, so it's hard to say for sure what is better for a given compute budget.

I'm definitely _not_ encouraging you on spending more money on a side topic just for the sake of optimizing this one parameter, there will always be another parameter after that that you'll feel an urge to optimize :-) I'd say it's already a pretty neat result to have come to a very close score to the original GPT2 training starting from scratch!

P.S. If you want to push it a bit further, rather than optimizing parameters for this model, last week at EurIPS I heard that a current "very good" modern repo to start from in order to train a good LLM is this: https://github.com/Niccolo-Ajroldi/plainLM. I haven't investigated this exactly (I'm not working on LLM), but it might be interesting to you for a sample run. The (N)EurIPS paper that was discussed at the conference claimed that the only important change to do was to modify the hyperparameters of the Adam optimizer, setting beta1=beta2=0.95 for example (the default values are beta1=0.9 and beta2=0.999 which are apparently outdated).

whimsicalism · 2025-12-09T17:38:17 1765301897

> Nowadays models are trained with effective batch size of millions of tokens in total. Of course, this won't fit into memory, one uses gradient accumulations to that purpose, again as mentioned by Gemini.

I would be surprised if there is much/any gradient acc in modern large-scale pretraining runs. You can always just recruit more GPUs with DP/PP/TP rather than training for longer.

alansaber · 2025-12-09T15:33:02 1765294382

To caveat, smaller batch sizes are generally better for model stability, but we go bigger because it substantially speeds up training

spi · 2025-12-10T09:44:42 1765359882

Mmh not really. As OP shows, speed increases with larger batch size, but only initially, until the GPU has high enough utilization; then speed improvements flatten out (although you might get OOM before that and not "really" see the flat part). Using smaller batch size increases _noise_, so quite literally decreases stability. That might be good sometimes: in the limit case, if the batch is as large as your training set, you'll end up in local minima and not be able to get out of it. But this is true for toy datasets like MNIST, here it's an entirely different beast.

With such large corpora as the ones used here, and very noisy ones at that, gradient updates are very noisy and that can harm quality. Or anyway, common lore is that one needs pretty large batch size to have the language model improve steadily.

alansaber · 2025-12-10T16:40:32 1765384832

Are you sure about the top-cap on batch size for speed? See https://arxiv.org/pdf/1904.00962

spi · 2025-12-12T13:37:23 1765546643

Sorry I just opened that file now, and browsed through it very quickly, but my eye fell on the excerpt: ``` However, we did not observe any speedup by increasing the batch size from 65536 to 131072 for the first stage, thus, we restrict the batch size to 65536 for this stage. ``` which I think is more or less my point: increasing batch size essentially always helps, but the speedup reduces the more you push the batch size. Provided that your dataset is large enough, more batch size will always make you run a bit faster without sacrificing accuracy, but the speedup will be less and less as you increase the batch size, until you are anyway maxing out the power of your GPU and you can't see any measurable speedup anymore.