
It depends on how the parallelism is implemented, e.g. distributed data parallel (DDP), which synchronizes gradients across replicas: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

It's a rabbit hole I stay away from for pragmatic reasons.
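For anyone curious, here's a minimal sketch of the DDP pattern from that tutorial: spawn one process per replica, wrap the model in DDP, and backward() all-reduces the gradients so every rank steps identically. The model, port, and world size are placeholder choices, not anything prescribed by the tutorial.

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP

    def worker(rank, world_size):
        # Each process joins the same process group; "gloo" works on CPU.
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"  # arbitrary free port
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        # Toy model. DDP broadcasts the initial weights from rank 0 and
        # all-reduces gradients during backward(), keeping replicas in sync.
        model = DDP(torch.nn.Linear(10, 1))
        opt = torch.optim.SGD(model.parameters(), lr=0.01)

        for _ in range(3):
            x = torch.randn(32, 10)   # in practice, each rank gets its own shard
            loss = model(x).pow(2).mean()
            opt.zero_grad()
            loss.backward()           # gradients averaged across ranks here
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 2
        mp.spawn(worker, args=(world_size,), nprocs=world_size)

The key point is that the synchronization is hidden inside backward(), which is exactly why it can become a rabbit hole once you need to reason about what gets reduced and when.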


