> Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.
There also seems to have been insufficient testing before deployment, with some very junior-level mistakes.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
Where was the testing for this one? If ANY exception happens during rule checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".
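For what it's worth, a fail-closed canary gate for this kind of thing isn't complicated. Here's a rough sketch (Python, purely illustrative; `evaluate_rule`, `promote`, and `rollback` are hypothetical stand-ins, not Cloudflare's actual pipeline):

```python
# Illustrative sketch only -- the functions passed in are hypothetical,
# not Cloudflare's real deployment machinery.

def canary_check(rules, sample_requests, evaluate_rule):
    """Run every rule against sample traffic; any exception fails the canary."""
    for request in sample_requests:
        for rule in rules:
            try:
                evaluate_rule(rule, request)
            except Exception as exc:
                # Fail closed: a single exception anywhere means the build is bad.
                return False, f"rule {rule!r} raised {exc!r}"
    return True, "all rules evaluated cleanly"


def deploy(new_rules, sample_requests, evaluate_rule, promote, rollback):
    ok, detail = canary_check(new_rules, sample_requests, evaluate_rule)
    if ok:
        promote(new_rules)   # push to the rest of the network
    else:
        rollback()           # revert to the last known-good rules
        raise RuntimeError(f"deployment aborted: {detail}")
```

The point is just that the rollout decision keys off any exception raised while evaluating the new rules, rather than letting the exception propagate to production traffic.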
I guess those at Cloudflare are not learning anything from the previous disaster.
> Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.
Doesn't Cloudflare rigorously test their changes before deployment to make sure this does not happen again? This had better not be a cover for the fact that they are using AI to fix issues like this one.
There had better not be any vibe coders or AI agents touching such critical pieces of infrastructure at all; I expected Cloudflare to learn from the previous outage very quickly.
This is becoming quite a pattern; at this point their unreliability might belong in the same category as GitHub's (which goes down every week).