Except for embarrassingly parallel problems, the trade-off of generating more parallelism is usually needing finer-grained communication. Canonical examples in the literature are matrix multiplication, triangulation, and refinement.
Large shared state is basically always the answer. You can cop-out and say use a database or Redis if that’s fast enough but that’s just making someone else use many threads with shared memory.