I have spent a lot of time thinking about this and believe that the most straightforward solution for long-running (or even forever-running) workflows is to allow hot upgrades.
A hot upgrade would only succeed if you can exactly replay the existing side-effect log history with the new code. Basically, you do a catch-up with the new code and just keep running once you've caught up. If the new code diverges, the hot upgrade fails and reverts to the old one. In that case, a human would need to step in and check what went wrong.
There are other approaches, but I feel like this is the simplest one to understand and use in practice. During development you can already test whether your code diverges by replaying existing logs.
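Something like this, roughly; ReplayContext, LoggedCall and can_hot_upgrade are names I'm making up just to illustrate, not anything from a real engine:

```python
from dataclasses import dataclass

@dataclass
class LoggedCall:
    name: str       # which side effect was recorded (e.g. "charge_card")
    args: tuple     # the arguments it was called with
    result: object  # the result that was persisted in the log

class DivergenceError(Exception):
    """New code asked for a side effect the old history doesn't contain."""

class CaughtUp(Exception):
    """New code consumed the whole history; it is safe to go live."""

class ReplayContext:
    """Feeds recorded results back to the workflow instead of re-running I/O."""
    def __init__(self, log):
        self.log = log
        self.pos = 0

    def call(self, name, *args):
        if self.pos >= len(self.log):
            # History exhausted: the new code has caught up and would now
            # start doing real I/O, so end the dry run here.
            raise CaughtUp()
        recorded = self.log[self.pos]
        if (recorded.name, recorded.args) != (name, args):
            raise DivergenceError(
                f"expected {recorded.name}{recorded.args}, got {name}{args}")
        self.pos += 1
        return recorded.result

def can_hot_upgrade(new_workflow_fn, log):
    """True if the new code exactly replays the existing side-effect history."""
    ctx = ReplayContext(log)
    try:
        new_workflow_fn(ctx)
    except CaughtUp:
        return True
    except DivergenceError:
        return False
    # Workflow finished before the log ran out: that also counts as divergence.
    return ctx.pos == len(log)
```

If can_hot_upgrade comes back false, you keep the old version running and flag it for a human, exactly as above.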
Is this the problem where you have workflows live and in progress, yet you want to update the workflow and have things not break? Would you want to have multiple versions at once, or rather some way to migrate the in-progress ones to the latest workflow definition?
For short lived workflows, you may not care about updating; just let it finish.
For longer jobs, you want some way to replace the current logic and either resume from where the job left off, or restart it idempotently. Especially if your workflow spans months or years (which at least some of these systems are designed for).
The challenge is that these systems shine when you manage the job state in memory, but they don't "store" the data in a traditional sense. They just re-run your logic and replay the original I/O results. So if your logic changes, the replay breaks and your state goes bye-bye.
(I think of it similarly to React's "rules of hooks": you can't do anything that makes the function call key APIs in a different order than previous executions)
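A toy example of how a seemingly harmless edit trips this up (the ctx.call API here is just a stand-in for whatever side-effect primitive the engine exposes):

```python
# v1 ran for a while; the log records reserve_inventory, then charge_card.
def order_workflow_v1(ctx):
    ctx.call("reserve_inventory", "sku-123")
    ctx.call("charge_card", "cust-42")
    ctx.call("ship", "sku-123")

# v2 swaps the first two steps. On replay the engine expects the first
# recorded call to be reserve_inventory, but the code now asks for
# charge_card first, so the in-flight run can't be reconstructed.
def order_workflow_v2(ctx):
    ctx.call("charge_card", "cust-42")
    ctx.call("reserve_inventory", "sku-123")
    ctx.call("ship", "sku-123")
```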
So you either accept that you can never update an in-flight job (in a meaningful way, at least), or you track job state in some other system and throw away the distinguishing feature of these systems.
I'm curious how people normally handle this. When I worked with Azure Durable Functions I couldn't find a way around this.
I think the problem works both ways too, because for an in-progress (but long-running) workflow, you may not want to retroactively apply your new business logic to the part that's already run, because that would be unexpected. But for the logic that hasn't run yet, you would certainly want the latest and greatest.
I wonder if there could be an approach where you have both versions live simultaneously, and introduce some sort of "checkpoint" into the old version that would act like a DB migration. When re-computing a workflow you could then start from the latest checkpoint, but any workflows that were created with the old version and haven't reached a checkpoint yet would continue to run the old code until they do.
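Roughly something like this, maybe (Run, CHECKPOINT_VERSION, and the old_code/new_code callables are all made up, just to show the shape of the idea):

```python
from dataclasses import dataclass, field

CHECKPOINT_VERSION = 2  # version introduced by the latest deploy

@dataclass
class Run:
    id: str
    last_checkpoint_version: int  # highest checkpoint this run has crossed
    checkpoint_state: dict        # state snapshot taken at that checkpoint
    history: list = field(default_factory=list)  # full side-effect log

def continue_run(run, old_code, new_code):
    """Decide which code version an in-flight run should keep executing."""
    if run.last_checkpoint_version >= CHECKPOINT_VERSION:
        # Past the migration point: resume from the snapshot with new logic,
        # much like running code against a freshly migrated database.
        return new_code(run.checkpoint_state)
    # Not there yet: keep replaying the old code until the run reaches a
    # checkpoint, at which point it becomes eligible to switch versions.
    return old_code(run.history)
```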