You don't train the next model by starting with the previous one. A company's ML...

threeducks · 2025-12-03T13:52:01 1764769921

I cannot say how big ML companies do it, but from personal experience training vision models, you can absolutely reuse the weights of barely related architectures (adding more layers, switching between different normalization layers, switching between separable and full convolutions, changing activation functions, etc.). Even if the shapes of the weights do not match, just do what you have to do to make them fit (repeat or crop). Of course the models will not work right away, but training will go much faster. I usually get over 10 times faster convergence that way.
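A rough illustration of the "repeat or crop" idea, in case it helps. This is a minimal sketch, not the code actually used: it assumes weights are plain nested Python lists, and it assumes "repeat" means cycling through the old entries along each axis. The names `fit_axis` and `fit_to_shape` are made up for this example. In a real framework you would do the same thing on tensors.

```python
def fit_axis(items, target_len):
    """Resize one axis: crop if too long, cycle through the entries if too short."""
    if len(items) == 0:
        raise ValueError("cannot resize an empty axis")
    return [items[i % len(items)] for i in range(target_len)]


def fit_to_shape(weights, target_shape):
    """Force a nested-list weight array into target_shape, axis by axis.

    Assumption: cyclic repetition is an acceptable way to fill new entries.
    The result is only an initialization and still needs training.
    """
    if not target_shape:
        # No axes left, so `weights` is a single scalar value.
        return weights
    resized = fit_axis(weights, target_shape[0])
    return [fit_to_shape(sub, target_shape[1:]) for sub in resized]


# Hypothetical 2x2 weight matrix from the old model.
old = [[1, 2],
       [3, 4]]

grown = fit_to_shape(old, (3, 3))    # repeat rows and columns
shrunk = fit_to_shape(old, (1, 2))   # crop to the first row
```

Cycling is only one choice. Zero-padding or small random values for the new entries would be equally reasonable starting points.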

sota_pop · 2025-12-04T15:54:56 1764863696

It’s possible the model architecture influences how effective reusing pretrained weights is. For example, CNNs might be a good fit for this, since the first portion is the feature extractor: you could keep that part, scrap the decoder, and simply retrain it.
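A sketch of what "keep the feature extractor, retrain the decoder" could look like. It assumes weights live in a flat dict keyed by layer name and that feature-extractor layers share a common name prefix. The `encoder.` prefix, the layer names, and the function `transfer_backbone` are all hypothetical, and list length stands in for tensor shape.

```python
def transfer_backbone(pretrained, fresh, prefix="encoder."):
    """Copy pretrained weights into a freshly initialized model.

    Only layers whose name starts with `prefix`, that exist in the new
    model, and whose size matches are copied. Everything else keeps its
    fresh initialization and is trained from scratch.
    """
    merged = dict(fresh)
    reused = []
    for name, value in pretrained.items():
        if not name.startswith(prefix):
            continue  # decoder/head layers are deliberately not reused
        if name not in fresh:
            continue  # layer no longer exists in the new architecture
        if len(value) != len(fresh[name]):
            continue  # size mismatch: leave the fresh initialization
        merged[name] = list(value)
        reused.append(name)
    return merged, reused


# Hypothetical weights for an old and a new model.
old_model = {"encoder.conv1": [0.5, -0.2], "decoder.out": [0.9]}
new_model = {"encoder.conv1": [0.0, 0.0], "decoder.out": [0.0]}

merged, reused = transfer_backbone(old_model, new_model)
```

Here mismatched layers are simply skipped. Cropping or repeating them to fit is the other option mentioned upthread.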

Can’t say whether the same would work with a Transformer architecture, but I would guess some portions could be reused, since there is still an encoder/feature-extraction portion.

If you’re reusing weights from an existing model, then it seems it becomes more of a “fine-tuning” exercise as opposed to training a novel foundational model.

MikeTheGreat · 2025-12-03T01:29:04 1764725344

Huh - I did not know that, and that makes a lot of sense.

I guess "Start software Vnext off the current version (or something pretty close)" is such a baseline assumption of mine that it didn't occur to me that they'd be basically starting over each time.

Thanks for posting this!