You don't train the next model by starting with the previous one.
A company's ML researchers are constantly improving model architecture. When it's time to train the next model, the "best" architecture is totally different from the last one. So you have to train from scratch (mostly... you can keep some small stuff like the embeddings).
The implication here is that they screwed up bigly on the model architecture, and the end result was significantly worse than the mid-2024 model, so they didn't deploy it.
I can not say how big ML companies do it, but from personal experience of training vision models, you can absolutely reuse the weights of barely related architectures (add more layers, switch between different normalization layers, switch between separable/full convolution, change activation functions, etc.). Even if the shapes of the weights do not match, just do what you have to do to make them fit (repeat or crop). Of course the models will not work right away, but training will go much faster. I usually get over 10 times faster convergence that way.
It’s possible the model architecture influences the effectiveness of utilizing pretrained weights. i.e. cnns might be a good fit for this since the first portion is the feature extractor, but you might scrap the decoder and simply retrain that.
Can’t say whether the same would work with Transformer architecture, but I would guess there are some portions that could potentially be reused? (there still exists an encoder/feature extraction portion)
If you’re reusing weights from an existing model, then it seems it becomes more of a “fine-tuning” exercise as opposed to training a novel foundational model.
Huh - I did not know that, and that makes a lot of sense.
I guess "Start software Vnext off the current version (or something pretty close)" is such a baseline assumption of mine that it didn't occur to me that they'd be basically starting over each time.
A company's ML researchers are constantly improving model architecture. When it's time to train the next model, the "best" architecture is totally different from the last one. So you have to train from scratch (mostly... you can keep some small stuff like the embeddings).
The implication here is that they screwed up bigly on the model architecture, and the end result was significantly worse than the mid-2024 model, so they didn't deploy it.