flaminHotSpeedo's comments

Are you thinking of the Missouri department of education's teacher directory website?

https://krebsonsecurity.com/2022/02/report-missouri-governor...

Luckily someone eventually talked sense into the governor, even though he initially ignored the FBI when they told him it wasn't a hack


Actually, it's really important to me to have a network of atomic clocks available to verify the times I clock in and out, I want to make sure I get paid for an accurate duration of time down to the nanosecond


> 103 drivers (41.9%) overall tested positive for THC, with yearly rates ranging from 25.7% to 48.9%.

The statistics for this seem suspect at best; I'll believe it once it's peer reviewed


> Researchers analyzed coroner records from Montgomery County in Ohio from January 2019 to September 2024, focusing on 246 deceased drivers who were tested for THC following a fatal crash.

This paper would need to go into way more detail to be at all useful.

40% is a staggering number, which makes me suspect that all it measures is Montgomery County police's pretty good track record for deciding when to test someone for THC during an autopsy


Containers are never a security boundary. If you configure them correctly, avoid all the footguns, and pray that there are no container escape vulnerabilities affecting "correctly" configured containers, then they can be a crude approximation of a security boundary that may be enough for your use case, but they aren't a suitable substitute for hardware-backed virtualization.

The only serious company I'm aware of that doesn't understand this is Microsoft, and I know that because they've been embarrassed again and again by vulnerabilities that only exist because they run multitenant systems with only containers for isolation


Virtual machines are never a security boundary. If you configure them correctly, avoid all the footguns, and pray that there are no VM escape vulnerabilities affecting "correctly" configured VMs, then they can be a crude approximation of a security boundary that may be enough for your use case, but they aren't a suitable substitute for entirely separate hardware.

It's all turtles, all the way down.


Yeah, in some (rare) situations physical isolation is a more appropriate level of security. Or if you want to land somewhere in between, you can use VMs with single-tenant NUMA nodes.

But for a typical case, VMs are the bare minimum to say you have a _secure_ isolation boundary, because the attack surface is way smaller.


Yeah, so secure.

https://support.broadcom.com/web/ecx/support-content-notific...

https://nvd.nist.gov/vuln/detail/CVE-2019-5183

https://nvd.nist.gov/vuln/detail/CVE-2018-12130

https://nvd.nist.gov/vuln/detail/CVE-2018-2698

https://nvd.nist.gov/vuln/detail/CVE-2017-4936

In the end you need to configure it properly and pray there are no escape vulnerabilities. That's the same standard you applied to containers to say they're definitely never a security boundary. Seems like you're drawing some pretty arbitrary lines here.


They kinda buried the lede there: a 28% failure rate for 100% of customers isn't the same as a 100% failure rate for 28% of customers


What's the culture like at Cloudflare re: ops/deployment safety?

They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?

Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Pure speculation, but to me it sounds like there's more to the story; this is the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place


One thing to keep in mind when judging what's 'appropriate' is that Cloudflare was effectively responding to an ongoing security incident outside of their control (the React Server RCE vulnerability). Part of Cloudflare's value proposition is being quick to react to such threats. That changes the equation a bit: every hour you wait to deploy, your customers are actively getting hacked through a known high-severity vulnerability.

In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.

That isn't to say it worked out well this time, just that the calculation is a bit different.


To clarify, I'm not trying to imply that I definitely wouldn't have made the same decision, or that cowboy decisions aren't ever the right call.

However, this preliminary report doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage. Deployment safety should have been the focus of this report, not the technical details. The question I want answered isn't "are there bugs in Cloudflare's systems", it's "has Cloudflare learned from its recent mistakes to respond appropriately to events"


> doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage

There’s no other deployment system available. There’s a single system for config deployment, and it’s all that was available, as they haven’t finished implementing progressive rollout yet.


> There’s no other deployment system available.

Hindsight is always 20/20, but I don't know how that sort of oversight could happen in an organization whose business model rides on reliability. Small shops understand the importance of safeguards such as progressive deployments or one-box-style deployments with a baking period, so why not the likes of Cloudflare? Don't they have anyone on their payroll who warns about the risks of global deployments without safeguards?
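For readers who haven't seen the pattern, here's a rough sketch of what a percentage-gated (progressive) rollout looks like. This is purely illustrative Lua, not Cloudflare's actual mechanism; the bucketing scheme and function names are made up:

  -- Illustrative only: hash each location name into a stable bucket and
  -- enable the new config only for buckets below the current threshold.
  local function bucket(name)
    local h = 0
    for i = 1, #name do
      h = (h * 31 + name:byte(i)) % 100
    end
    return h
  end

  local function enabled_for(location, rollout_percent)
    return bucket(location) < rollout_percent
  end

  -- ramp 1% -> 10% -> 50% -> 100%, pausing or dialing back if error rates rise
  for _, colo in ipairs({ "colo-a", "colo-b", "colo-c" }) do
    print(colo, enabled_for(colo, 10))
  end

The useful property is that the gate is deterministic per location, so a bad change only ever reaches the current percentage and can be pulled back by lowering the threshold.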


There was another deployment system available. The progressive one used to roll out the initial change, which presumably rolls back sanely too.


OK, sure. But shouldn't they have some beta/staging/test area they could deploy to, run tests for an hour, and then do the global blast?


Config changes are distinctly harder to set that up for, and as the blog says, they’re working on it. They just don’t have it ready yet and are pausing any more config changes until it’s set up. They did this one to try to mitigate an ongoing security vulnerability and missed the mark.

I’m happy to see they’re changing their systems to fail open which is one of the things I mentioned in the conversation about their last outage.


The 11/18 outage was 2.5 weeks ago. Any learnings & changes they made as a result of that probably haven't made their way to production yet.

Particularly if we're asking them to be careful & deliberate about deployments, it's hard to also ask them to fast-track this.


The CVE isn't a zero-day, though; how come Cloudflare weren't at the table for early disclosure?


Do you have a public source about an embargo period for this one? I wasn't able to find one


https://react.dev/blog/2025/12/03/critical-security-vulnerab...

Privately disclosed: Nov 29
Fix pushed: Dec 1
Publicly disclosed: Dec 3


Then even in the worst-case scenario, they were addressing this issue two days after it was publicly disclosed. So this wasn't a "rush to fix the zero day ASAP" scenario, which makes it harder to justify ignoring errors that started occurring in a small-scale rollout.


Considering there were patched libraries at the time of disclosure, those libraries' authors must have been informed ahead of time.


Cloudflare did have early access, and had mitigation in place from the start. The changes that were being rolled out were in response to ongoing attempts to bypass those.

Disclosure: I work at Cloudflare, but not on the WAF


Cloudflare had already decided this was a rule that could be rolled out using their gradual deployment system. They did not view it as being so urgent that it required immediate global roll out.


[flagged]


Indeed, but it is what it is. Cloudflare comes out of my budget, and even with downtime, it's better than not paying them. Do I want to deal with what Cloudflare offers? I do not, I have higher value work to focus on. I want to pay someone else to deal with this, and just like when cloud providers are down, it'll be back up eventually. Grab a coffee or beer and hang; we aren't saving lives, we're just building websites. This is not laziness or nihilism, but simply being rational and pragmatic.


> Do I want to deal with what Cloudflare offers? I do not, I have higher value work to focus on. I want to pay someone else to deal with this, and just like when cloud providers are down, it'll be back up eventually.

This is specious reasoning. How come I had to endure a total outage due to the rollout of a mitigation of a Nextjs vulnerability when my organization doesn't even own any React app, let alone a Nextjs one?

Also, specious reasoning #2: not wanting to maintain a service does not justify blindly rolling out config changes globally without any safeguards.


If you are a customer of Cloudflare, and not happy, I encourage you to evaluate other providers more to your liking. Perhaps you'll find someone more fitting to your use case and operational preferences, but perhaps not. My day job org pays Cloudflare hundreds of thousands of dollars a year, and I am satisfied with how they operate. Everyone has a choice; exercise it if you choose. I'm sure your account exec would be happy to take the feedback. Feedback, including yours, is valuable and important to attempt to improve the product and customer experience (imho; I of course do not speak for Cloudflare, only myself).

As a recovering devops/infra person from a lifetime ago (who has, much to my heartbreak, broken prod more than once), perhaps that is where my grace in this regard comes from. Systems and their components break, systems and processes are imperfect, and urgency can lead to unexpected failure. Sometimes it's Cloudflare, other times it's Azure, GCP, GitHub, etc. You can always use something else, but most of us continue to pick the happy path of "it works most of the time, and sometimes it does not." Hopefully the post mortem has action items to improve the safeguards you mention. If there are no process and technical improvements from the outage, certainly, that is where the failure lies (imho).

China-nexus cyber threat groups rapidly exploit React2Shell vulnerability (CVE-2025-55182) - https://aws.amazon.com/blogs/security/china-nexus-cyber-thre... - December 4th, 2025

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


> If you are a customer of Cloudflare, and not happy, I encourage you to evaluate other providers more to your liking.

I think your take is terribly simplistic. In a professional setting, virtually all engineers have no say on whether the company switches platforms or providers. Their responsibility is to maintain and develop services that support the business. The decision to switch providers is ultimately a business and strategic call, and a subject with extremely high inertia. You hired people specialized in certain technologies, and now you're just dumping all that investment? Not to mention contracts. Think about the problem this creates.

Some of you sound like amateurs toying with pet projects, where today it's framework A on cloud provider X whereas tomorrow it's framework B on cloud provider Y. Come the next day, rinse and repeat. This is unthinkable in any remotely professional setting.


> Some of you sound like amateurs toying with pet projects, where today it's framework A on cloud provider X whereas tomorrow it's framework B on cloud provider Y. Come the next day, rinse and repeat. This is unthinkable in any remotely professional setting.

Vendor contracts have 1-3 year terms. We (a financial services firm) re-evaluate tech vendors every year for potential replacement and technologists have direct input into these processes. I understand others may operate under a different vendor strategy. As a vendor customer, your choices are to remain a customer or to leave and find another vendor. These are not feelings, these are facts. If you are unhappy but choose not to leave a vendor, that is a choice, but it is your choice to make, and unless you are a large enough customer that you have leverage over the vendor, these are your only options.


Rollback is a reliable strategy when the rollback process is well understood. If a rollback process is not well known and well exercised, then it is a risk in itself.

I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.


Rollback carries with it the contextual understanding of complete atomicity; otherwise it's slightly better than a yeet. It's similar to backups that are untested.


Complete atomicity carries with it the idea that the world is frozen, and any data only needs to change when you allow it to.

That's to say, it's an incredibly good idea when you can physically implement it. It's not something that everybody can do.


No, complete atomicity doesn't require a frozen state; it requires common sense and fail-proof, fool-proof guarantees derived from testing.

There is another name for rolling forward: it's called tripping up.


Global rollout of security code on a timeframe of seconds is part of Cloudflare's value proposition.

In this case they got unlucky with an incident before they finished work on planned changes from the last incident.


That's entirely incorrect. For starters, they didn't get unlucky. They made a choice to use the same system they knew was sketchy (which they almost certainly knew was sketchy even before 11/18)

And on top of that, Cloudflare's value proposition is "we're smart enough to know that instantaneous global deployments are a bad idea, so trust us to manage services for you so you don't have to rely on in house folks who might not know better"


> They saw errors related to a deployment, and because it was related to a security issue instead of rolling it back they decided to make another deployment with global blast radius instead?

Note that the two deployments were of different components.

Basically, imagine the following scenario: a patch for a critical vulnerability gets released; during rollout you get a few reports of it causing the screensaver to show a corrupt video buffer; you roll out a GPO to use a blank screensaver instead of the intended corporate branding; and a crash in a script parsing the GPOs, triggered by the new value, prevents users from logging in.

There's no direct technical link between the two issues. A mitigation of the first one merely exposed a latent bug in the second one. In hindsight it is easy to say that the right approach is obviously to roll back, but in practice a roll forward is often the better choice - both from an ops perspective and from a safety perspective.

Given the above scenario, how many people are genuinely willing to do a full rollback, file a ticket with Microsoft, and hope they'll get around to fixing it some time soon? I think in practice the vast majority of us will just look for a suitable temporary workaround instead.


Roll back is not always the right answer. I can’t speak to its appropriateness in this particular situation of course, but sometimes “roll forward” is the better solution.


Like the other poster said, roll back should be the right answer the vast majority of the time. But it's also important to recognize that roll forward should be a replacement for the deployment you decided not to roll back, not a parallel deployment through another system.

I won't say never, but a situation where the right answer, in order to avoid a rollback (one that it sounds like was technically fine to do, just undesirable from a security/business perspective), is a parallel deployment through a radioactive, global-blast-radius, near-instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit


Is a roll back even possible at Cloudflare's size?

With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough, you've got enough developers that half a dozen PRs will have been merged between the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issues?

Realistically the best you're going to get is merging a revert of the problematic changeset - but with the intervening merges that's still going to bring the system into a novel state. You're rolling forwards, not backwards.


Disclosure: Former Cloudflare SRE.

The short answer is "yes" due to the way the configuration management works. Other infrastructure changes or service upgrades might get undone, but it's possible. Or otherwise revert the commit that introduced the package bump with the new code and force that to roll out everywhere rather than waiting for progressive rollout.

There shouldn't be much chance of bringing the system to a novel state because configuration management will largely put things into the correct state. (Where that doesn't work is if CM previously created files, it won't delete them unless explicitly told to do so.)


> service upgrades might get undone, but it's possible.

But who knows what issues reverting other teams' stuff might bring?


If companies like Cloudflare haven't figured out how to do reliable rollbacks, there seems little hope for any of us.


I'd presume they have the ability to deploy a previous artifact vs only tip-of-master.


That will depend on how you structure your deployments. At some large tech companies, thousands of little changes are made every hour, and deployments are made in n-day cycles. A cut-off point in time is made where the first 'green' commit after that is picked for the current deployment, and if that fails in an unexpected way you just deploy the last binary back, fix (and test) whatever broke, and either try again or just abandon the release if the next cut is already close by.


You want to build a world where roll back is 95% the right thing to do. So that it almost always works and you don't even have to think about it.

During an incident, the incident lead should be able to say to your team's on-call: "can you roll back? If so, roll back", and the on-call engineer should know if it's okay. By default it should be, if you're writing code mindfully.

Certain well-understood migrations are the only cases where roll back might not be acceptable.

Always keep your services in a "rollback-able", "graceful fail", "fail open" state.

This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.

Never make code changes you can't roll back from without reason and without informing the team. Service calls, data write formats, etc.

I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.


"Fail open" state would have been improper here, as the system being impacted was a security-critical system: firewall rules.

It is absolutely the wrong approach to "fail open" when you can't run security-critical operations.


Cloudflare is supposed to protect me from occasional ddos, not take my business offline entirely.

This can be architected in such a way that if one rules engine crashes, other systems are not impacted and other rules, cached rules, heuristics, global policies, etc. continue to function and provide shielding.
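To make that concrete, here's a minimal Lua sketch of the containment idea. This is not Cloudflare's code; `evaluate_rule` and `run_rules` are made-up stand-ins:

  local function evaluate_rule(rule, request)
    -- stand-in for real matching logic
    if rule.broken then error("simulated bug in rule " .. rule.id) end
    return rule.action
  end

  local function run_rules(rules, request)
    local actions = {}
    for _, rule in ipairs(rules) do
      local ok, result = pcall(evaluate_rule, rule, request)
      if ok then
        actions[#actions + 1] = result
      else
        -- contain the failure: log it and apply a per-rule default instead of
        -- crashing the whole engine (whether that default is "block" or
        -- "allow" is the fail-closed vs. fail-open policy debated upthread)
        print("rule " .. rule.id .. " failed: " .. tostring(result))
        actions[#actions + 1] = "block"
      end
    end
    return actions
  end

  print(table.concat(run_rules({
    { id = 1, action = "allow" },
    { id = 2, broken = true, action = "block" },
  }, { path = "/" }), ", "))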

You can't ask for Cloudflare to turn on a dime and implement this in this manner. Their infra is probably very sensibly architected by great engineers. But there are always holes, especially when moving fast, migrating systems, etc. And there's probably room for more resiliency.


The question is perhaps what the shape and status of their tech stack is. Obviously, they are running at massive scale, and they have grown extremely aggressively over the years. What's more, especially over the last few years, they have been adding new product after new product. How much tech debt have they accumulated with that "move fast" approach that is now starting to rear its head?


I think this is probably a bigger root cause and is going to show up in different ways in the future. The mere act of adding new products to an existing architecture/system is bound to create knowledge silos around operations and tech debt. There is a good reason why big companies keep smart people on their payroll to just change a couple of lines after a week of debate.


> this sounds like the sort of cowboy decision

Ouch. Harsh, given Cloudflare's over-honesty (down to the internal tool they disabled) and the outage's relatively limited impact (time-wise & number-of-customers-wise). It was just an unfortunate latent bug: Nov 18 was Rust's unwrap, Dec 5 it's Lua's turn with its dynamic typing.

Now, the real cowboy decision I want to see is Cloudflare [0] running a company-wide Rust/Lua code-review with Codex / Claude...

cf TFA:

  if rule_result.action == "execute" then
    rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
  end

  This code expects that, if the ruleset has action="execute", the "rule_result.execute" object will exist ... error in the [Lua] code, which had existed undetected for many years ... prevented by languages with strong type systems. In our replacement [FL2 proxy] ... code written in Rust ... the error did not occur.
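As a hedged illustration (the field names come from the quoted snippet, but the sample data here is made up), the defensive version of that code is just a nil check:

  local ruleset_results = { { matched = true } }
  local rule_result = { action = "execute" }  -- no `execute` table, as in the incident

  if rule_result.action == "execute" then
    local exec = rule_result.execute
    if exec and exec.results_index then
      exec.results = ruleset_results[tonumber(exec.results_index)]
    else
      -- the case the original code never handled: log and skip instead of throwing
      print("execute action without execute payload; skipping")
    end
  end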
[0] https://news.ycombinator.com/item?id=44159166


From the post:

“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.

“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”


Where I work, all teams were notified about the React CVE.

Cloudflare made it less of an expedite.


> Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Also, there seems to have been insufficient testing before deployment, with very junior-level mistakes.

> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

Where was the testing for this one? If ANY exception happened during rules checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".
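Even a tiny pre-deploy fixture check would have caught it. Here's a hypothetical Lua sketch (not Cloudflare's tooling; `apply_rule_result` is a made-up stand-in for the rules-module code quoted elsewhere in the thread) of the kind of smoke test that fails the rollout if anything throws:

  -- stand-in for the FL1 rules-module logic described in the post mortem
  local function apply_rule_result(rule_result, ruleset_results)
    if rule_result.action == "execute" then
      rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
    end
  end

  local fixtures = {
    { action = "log" },
    { action = "execute", execute = { results_index = "1" } },
    { action = "execute" },  -- the shape that triggered the outage
  }

  for i, fixture in ipairs(fixtures) do
    local ok, err = pcall(apply_rule_result, fixture, { { matched = true } })
    if not ok then
      error(("fixture %d raised: %s -- abort the rollout"):format(i, tostring(err)))
    end
  end
  print("all fixtures passed; safe to proceed")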

I guess those at Cloudflare are not learning anything from the previous disaster.


As usual, Cloudflare is the man in the arena.


There are other men in the arena who aren't tripping on their own feet


Like who? Which large tech company doesn't have outages?


It's not about outages. It's about the why. Hardware can fail. Bugs can happen. But continuing a rollout despite warning signs and without understanding the cause and impact is on another level. Especially if it is related to the same problem as last time.


[dead]


It is healthy for tech companies to have outages, as they will build experience in resolving them. Success breeds complacency.


You don't need outages to build experience in resolving them, if you identify conditions that increase the risk of outages. Airlines can develop a lot of experience resolving issues that would lead to plane crashes, without actually crashing any planes.


Google does pretty good.


Google Docs was down for almost the whole day just a couple of weeks ago.


"tripping on their own feet" == "not rolling back"


> more to the story

From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)


Ooh ... I want to be on a cowboy decision making team!!!


My understanding of the contemporary arguments against publicly traded companies, though I'm not completely convinced by them personally, is that fiduciary obligations inevitably drive those bad behaviors, and/or that shareholders often demand short-term returns at the expense of long-term value.

As far as "fixing" the problem, I think it would be important to expand voters' influence over the company in addition to voting changes like you described. I don't know how to make it feasible, but IMO voters should be able to influence or directly decide much lower level business decisions than they currently do


Out of curiosity, what do you mean by Jeep riding on their reputation?

Based on everything I've seen and heard, Jeep's reputation is for unreliable vehicles that are increasingly difficult to repair. This seems pretty on-brand for that reputation.


Recent reputation, yes. But their old reputation was very positive. They made cars that would survive in any condition (which is why they were popular for military uses).

These days, you're in one of two camps: Either you still believe (because you're ignorant or value the Jeep brand more than you value a reliable vehicle) or you've read the recent reviews and steer clear.

Jeep has been duking it out for the bottom of Consumer Reports ratings for a while now, yet they still seem to sell cars. As they continue to betray their loyal customer base though, I imagine this will change. I wish American car companies were better!


I think you’re conflating a few things. Jeeps, as manufactured during World War II, were produced by Ford and Willys. The Jeeps of today, manufactured by Stellantis, carry on the name (and arguably the general shape) but are completely different vehicles.

They “seem” to sell cars? Well, yes. The Wrangler and the Grand Cherokee are consistently near the top of list of most popular SUVs, year after year.


The point of buying the brand is to conflate reputations.


It's slightly different here. They did seem to have bought a lot of the manufacturing - or at least they're still manufactured in the US? Maybe ex Chrysler factories?

China buying the MG brand was entirely just for reputation - no connection at all.


The older I get, the less I care to believe the memes that float around when everyone online memes about how horrible some product or brand supposedly is. In fact, the more prevalent the memeing is, the more I assume it's either manufactured or has just reached that critical level of viral meme where everyone repeats something simply because everyone else says it.

What percentage of people shitting on some brand actually have owned that brand for many years? And also owned other brands for many years, to be able to compare reliability and have any sort of informed opinion on the topic?

Things like Consumer Reports are just small surveys of the opinions of random members of the population, i.e. what they think about the brand; there's no connection to any objective reality about how reliable the vehicles actually are.

In the past I've tried to find a single study that actually compares objective reliability of brands. It does not exist. If you Google for it, everything you will find will eventually, at the bottom of it all, link back to the same Consumer Reports study.

I've owned a 2018 Wrangler for 6 years now, I've put 75k miles on it, many thousands of miles in the most remote places in the country, where if it had issues it'd be a 30 mile hike to safety. It's never once let me down in any way. Never once had a major problem. That's all I care about.


Don't forget the third camp who just really like OLD jeeps!

Somewhere in the ballpark of a week ago there was a car show near where I walk my dog (some charity event). Overall not that interesting - there were a lot of flashy lowriders with the crazy hydraulics and stuff - but there was also this really cool Jeep truck-thing from sometime in the 1950s, a Jeep Forward Control[0]. They had pics of it from when they first got it, an absolute rusty mess! But goddamn, I'm not even a car guy and I was impressed. Labor of love.

Then my cousin has a more modern Jeep and lemme tell you: not great. I wonder what happened to that company? Garden variety enshittification, or is there an interesting story there?

[0] https://en.wikipedia.org/wiki/Jeep_Forward_Control


Depends how old you are. Up through the 80s, Jeep still had a reputation as a rock-solid, durable brand. (The reality probably changed sometime in the 1970s, but it takes time for word to get out.) A lot of people's mental model is set somewhere in their 20s/30s, and they never really update it. So a lot of baby boomers still think of Jeep as a reliable car.


That's what I was getting at, though I wasn't sure if my perceptions matched the general consensus (which it seems they do).

If a manufacturer has been broadly considered unreliable for the past 20-40 years (JKs came out in 2007, and I still heard some people talking about Jeeps as being reliable in the TJ era, though I'd personally disagree), I think it's fair to say they have a reputation for being unreliable.


Among my dad's friends (lawyers) in the 80s, none of them would buy a Jeep because they consistently died between 60k and 80k miles. One of them had one, but he expected to only put 30k on it on his property. We had to pull it with a tractor on multiple hunts because the 4WD system wouldn't work.


It's interesting to me that all the screenshots besides the HN one appear to be from mobile devices?

