Hacker News | mozdeco's comments

[working for Mozilla]

That's because there were none. All bugs came with verifiable testcases (crash tests) that crashed the browser or the JS shell.

For the JS shell, similar to fuzzing, a small fraction of these bugs were bugs in the shell itself (i.e. testing only) - but according to our fuzzing guidelines, these are not false positives and they will also be fixed.


> For the JS shell, similar to fuzzing, a small fraction of these bugs were bugs in the shell itself (i.e. testing only)

There's some nuance here. I fixed a couple of shell-only Anthropic issues. At least mine were cases where the shell-only testing functions created situations that are impossible to create in the browser. Or at least, after spending several days trying, I managed to prove to myself that it was just barely impossible. (And it had been possible until recently.)

We do still consider those bugs and fix them one way or the other -- if the bug really is unreachable, then the testing function can be weakened (and assertions added to make sure it doesn't become reachable in the future). For the actual cases here, it was easier and better to fix the bug and leave the testing function in place.

We love fuzz bugs, so we try to structure things to make invalid states as brittle as possible so the fuzzers can find them. Assertions are good for this, as are testing functions that expose complex or "dangerous" configurations that would otherwise be hard to set up just by spewing out bizarre JS code or whatever. It causes some level of false positives, but it greatly helps the fuzzers find not only the bugs that are there, but also the ones that will be there in the future.
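A toy sketch of that idea (all names invented; the real code is C++, this is just Python for brevity): aggressive invariant assertions plus a testing-only hook that exposes a "dangerous" configuration directly, so a fuzzer can reach it without having to stumble into it via generated JS.

```python
class Arena:
    """Hypothetical allocator-ish object that keeps itself brittle on purpose."""

    def __init__(self):
        self.objects = []
        self.freed = set()

    def _check_invariants(self):
        # Brittle on purpose: any live slot that is also marked freed is a bug,
        # and we want the fuzzer to crash here immediately, not later.
        assert all(i not in self.freed for i in range(len(self.objects))), \
            "invariant violated: live list references a freed slot"

    def alloc(self, value):
        self.objects.append(value)
        self._check_invariants()
        return len(self.objects) - 1

    # Testing-only hook (shell-only, never shipped to the browser): lets a
    # fuzzer force a dangerous state directly instead of hoping random input
    # ever reaches it.
    def testing_force_free(self, index):
        self.freed.add(index)
        self._check_invariants()  # fails fast, so the fuzzer reports a crash
```

The testing hook is exactly the kind of thing that can produce shell-only findings: it may create states the browser itself can never reach, which is why such hooks sometimes get weakened rather than the "bug" behind them fixed.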

(Apologies for amusing myself with the "not only X, but also Y" writing pattern.)


Sounds good.

Did you also test on old source code, to see if it could find the vulnerabilities that were already discovered by humans?


Isn’t that covered by this, from the (Anthropic) article:

“Our first step was to use Claude to find previously identified CVEs in older versions of the Firefox codebase. We were surprised that Opus 4.6 could reproduce a high percentage of these historical CVEs”

https://www.anthropic.com/news/mozilla-firefox-security


Anthropic mention that they did this beforehand, and it was the good performance there that led them to look for new bugs (since they couldn't be sure it wasn't just memorising vulnerabilities that had already been published).

I really like this as a suggestion, but getting open-source code that isn't in the LLMs' training data is a challenge.

Then, with each model having a different training cutoff, you end up with no useful comparison to decide whether new models are improving the situation. I don't doubt they are; I'm just not sure this is a way to show it.


Yes, but perhaps the effect of having been trained on a codebase on the ability to find bugs in that codebase is not so large. You could do a bunch of experiments to find out, and that would be interesting in itself.

I guess it is good when bugs are fixed, but are these real bugs or contrived ones? Is anyone doing quality assessment of the bugs here?

I think it was curl that closed its bug bounty program due to AI spam.


The bugs are at least of the same quality as our internal fuzzing bugs. They are either crashes or assertion failures, and both of these are considered bugs by us. But they of course have varying value. Not every single assertion failure is ultimately a high-impact bug; some of these don't have an impact on the user at all. The same applies to fuzzing bugs though; there is really no difference here. And ultimately we want to fix all of these, because assertions have the potential to find very complex bugs, but only if you keep your software "clean" with respect to assertion failures.

The curl situation was completely different because as far as I know, these bugs were not filed with actual testcases. They were purely static bugs and those kinds of reports eat up a lot of valuable resources in order to validate.


The bugs that were issued CVEs (the Anthropic blog post says there were 22) were all real security bugs.

The level of AI spam for Firefox security submissions is a lot lower than what the curl people have described. I'm not sure why that is. Maybe the size of the code base and the higher bar to submitting issues play a role.


The issue is what each project considers a viable bug. If you count every localized assertion failure as a possible bug, that's different from "give me something that practically affects users".

Further, browsers have a much larger surface area for even minor fuzzing bugs. Curl's much smaller surface area is already well fuzzed and tested.

Chrome has better fuzzing and tests too. Firefox has had fewer resources compared to Google, of course, so that's understandable.

Of course I'm not saying it wasn't good. But given the LLM costs, I find it hard to believe it was worth it compared to just better and more innovative fuzzing, which would possibly scale better.


Any particular reason why the number of vulnerabilities fixed in Feb. was so high? Even subtracting the count of Anthropic's submissions, from the graph in their blog post, that month still looks like an outlier.

[work at Mozilla]

I agree that LLMs are sometimes wrong, which is why this new method here is so valuable - it provides us with easily verifiable testcases rather than just some kind of analysis that could be right or wrong. Purely triaging through vulnerability reports that are static (i.e. no actual PoC) is very time consuming and false-positive prone (same issue with pure static analysis).

I can't really confirm the part about "local" bugs anymore though, but that might also be a model thing. When I did experiments longer ago, this was certainly true, esp. for the "one shot" approaches where you basically prompt it once with source code and want some analysis back. But this actually changed with agentic SDKs where more context can be pulled together automatically.


My point is that "verifiable testcases" work great for proving "this is vulnerable", but LLMs are still risky if you believe "this is safe", which you can't easily prove. You need to be very skeptical when they decide that something isn't vulnerable.

I completely agree that LLMs are great when instructed to provide provable, repeatable exploits. I have done this multiple times and uncovered some neat bugs.

> I can't really confirm the part about "local" bugs anymore though, but that might also be a model thing.

I don't think it's a model thing, it's just a sort of basic limitation of the technology. We shouldn't expect LLMs to perform novel tasks so we shouldn't expect LLMs to find novel vulnerabilities.

Agents help, human in the loop is critical for "injecting novelty" as I put it. The LLM becomes great at producing POCs to test out.


Please, implement "name window" natively in Firefox.

I have to use Chrome because of the lack of it.



Sort of. It won't be saved between machines, for example, as Chrome's implementation does. If Firefox crashes, most of the time it is lost. It is also not as clean as Chrome's native implementation. I have tried it.

This has been requested since 2022: https://connect.mozilla.org/t5/ideas/user-defined-name-for-e...



How telling that I get a

> Just a moment...

and

> Enable JavaScript and cookies to continue

just to look at an official blog post. Sigh.

https://web.archive.org/web/20260306133059/https://blog.mozi... for those of us who prefer not having to open our browsers to additional risk just to learn about how our browsers are supposed to be getting less risky.


The infinite busy loop in this case was not in the tab (whether visible or invisible). The loop was directly in the network stack, as stated in the post, not in the caller.


> code that can end up blocking forever should have a timeout and recover from that timeout happening.

There was no way for the calling code to do this. This was literally an infinite loop inside the network stack. Imagine the network stack itself going `while(1) {}` on you, without checking if the request was canceled.

Even if you detect that this happens, there is nothing you can do as the caller. You can't even properly stop the thread, as it is not cooperating. So recovering from this type of failure is hard.
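A minimal sketch of the distinction (names invented, Python standing in for the actual C++ threading code): a loop that never checks for cancellation cannot be stopped by the caller, while a cooperative one can.

```python
import threading

def uncooperative_retry(work):
    """Equivalent of the network stack going `while(1) {}` on you:
    the caller has no handle to break this loop."""
    while True:
        if work():
            return True

def cooperative_retry(work, cancelled: threading.Event):
    """Same loop, but it checks a cancellation flag each iteration,
    so the caller can actually abort the request."""
    while not cancelled.is_set():
        if work():
            return True
    return False  # cancelled before the work succeeded
```

With the uncooperative version, even a caller that detects the hang can only abandon the thread, not stop it, which is exactly why recovery from this failure mode is hard.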


> There was no way for the calling code to do this

Like what happened in a comment that I called out yesterday, you're silently inserting extra qualifiers that aren't in the original; the person you're responding to didn't say anything about calling code.

If the network stack can end up doing the equivalent of `while(1) { /* ... */ }`, then that's the bug, no matter what's in the elided part. There's not "no way" to deal with this. (In the specific case of `while(1)`—which I recognize is a metaphor and not a case study, so onlookers should please spare us the sophomoric retort—it's as simple as changing to `while(i < MAX_TRIES)` with some failover checks.) In some industries, this sort of thing is mandatory.
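The `while(i < MAX_TRIES)` shape above, sketched in Python (the bound and names are illustrative, not from the Firefox code):

```python
MAX_TRIES = 5  # illustrative bound; real code would tune or configure this

def bounded_retry(attempt):
    """Bounded version of an unbounded retry loop: gives up after
    MAX_TRIES attempts and reports failure instead of spinning forever."""
    for _ in range(MAX_TRIES):
        if attempt():
            return True
    return False  # failover path: the caller decides what happens next
```

The point is not that the bound fixes the underlying bug, but that the failure becomes observable and recoverable instead of an infinite busy loop.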


It's a bug. Are you saying there's some magical way of eliminating all possible infinite loops from code? Please write a paper on this amazing technique; I'm pretty sure that's equivalent to solving the halting problem and the computer science community would love to see a proven unsolvable problem being solved.


You write good comments usually, so IMHO this comment is worth replying to:

There is no algorithm that will determine the "halting status" of an arbitrary (program, input) pair, but that does not prevent a team of programmers from working in a subset of the set of all programs in which every program halts. Restricting themselves to that subset might make the team less productive (i.e., raise the cost of implementing things), but it probably does not materially limit what the team can accomplish (i.e., what functionality the team can implement) provided they're not developing a "language processor" (a program that takes another program as input).


Your desire for your insolence to be noted is granted, but to answer the non-strawman form of your question: yes, there is a way to prevent infinite loops from making their way into software in the field. It means providing proofs that your loops terminate. (If you can't show this, your code has to be rewritten into something that you can come up with a proof for.) As I already said, this is mandatory in some industries. The philosophy is also not far off from the rationale for Rust's language design re memory management. And although it might seem like it requires it, there's no need for magic. This is something covered in any ("every"?) decent software engineering program.
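A tiny illustration of the proof-of-termination idea: you exhibit a "loop variant", a quantity that strictly decreases each iteration and is bounded below, so the loop cannot run forever. Here it is made executable as an assertion (Euclid's algorithm as the example; not from the thread, just the textbook case):

```python
def gcd(a, b):
    """Greatest common divisor. Termination argument: b strictly
    decreases each iteration (b -> a % b < b) and is bounded below
    by 0, so the loop must exit -- a classic loop variant."""
    while b:
        prev = b
        a, b = b, a % b
        assert 0 <= b < prev  # the variant check, made executable
    return a
```

Tools used in safety-critical industries mechanize exactly this kind of argument for every loop in the program.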


I went and looked at the code (it's linked in the article). You absolutely can put a timeout around a case/switch statement. There's like 5 different ways to do it. And the code calling network syscalls can also have timeouts, obviously; otherwise nobody would ever be able to time out any blocked network operation. This is all network programming 101.
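The network-programming-101 version of that, sketched in Python (hostnames and the helper are invented; a plain socket timeout guarantees a blocking receive cannot hang forever, regardless of what the peer does):

```python
import socket

def recv_with_timeout(host, port, timeout_s=2.0):
    """Connect, then read at most once, never blocking longer than timeout_s.
    Returns the received bytes, or None if the operation timed out."""
    with socket.create_connection((host, port), timeout=timeout_s) as s:
        s.settimeout(timeout_s)
        try:
            return s.recv(4096)
        except socket.timeout:
            return None  # timed out: caller can retry, fail over, or report
```

Whether this maps cleanly onto the actual Necko state machine is a separate question, but the primitive itself is standard.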


If it's that easy, I'm sure they'd accept your pull request.


All requests go through one socket thread, no matter which HTTP version. I am not a Necko engineer, but since requests can be upgraded, an HTTP/1 request could switch to HTTP/2 and if there was a separation by protocol, the request would have to be "moved" to a different thread. So I'm not sure that would work easily.


> the fix has to be in the code that communicates back, it should fail gracefully.

The bug that caused the hang was in the network stack itself. There was no way the calling code could have prevented this in any way. You can see this by taking a look at the linked HTTP/3 code. It's not that the higher-level code kept retrying over and over, causing the hang; that was not the problem here.

Under "Lessons learned" you can also read "investigating action points both to make the browser more resilient towards such problems". I agree that this is broadly worded, but it covers ideas that would have made this technically recoverable (e.g., could network requests be compartmentalized so they don't all block on a single network thread?).


> There was no way the calling code could have prevented this in any way.

It could have prevented it by not making the call in the first place.


As explained in the article, this problem was not specific to Telemetry:

“This is why users who disabled Telemetry would see this problem resolved even though the problem is not related to Telemetry functionality itself and could have been triggered otherwise.”

Since a browser's job is to make HTTP requests, a bug in the network stack would almost certainly have been hit in other places. This was highly-visible so it was quickly noticed but it's quite possible that a less frequent trigger could have plagued Firefox users for a much longer period of time as HTTP/3 adoption increases.


The article specifically states that normal web requests went through a different code path that did not trigger the bug. That the bug was not technically in the telemetry code is irrelevant - it happened without user interaction because of telemetry and it did not happen (at least as often) with telemetry disabled. Saying that there was no way to prevent it assumes that telemetry could not have been disabled/removed, which is false.


The article provides the correct logic: Telemetry was the first to use that combination of new code but there's no reason to believe that nothing else would ever have used the stack they've been transitioning towards. Had this bug not been found in Telemetry it would have shown up somewhere else, possibly harder to diagnose.


At this point, the code relied on the Content-Length header being present because the higher-level API was supposed to add it. The field that is supposed to be populated by Content-Length (mRequestBodyLenRemaining) is pre-initialized to 0.
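A hedged sketch of that failure shape (the real code is C++ in Necko and `mRequestBodyLenRemaining` is the field named above; the Python class and its names are invented purely to illustrate the hazard of a length field defaulting to 0):

```python
class RequestBody:
    """Toy model of a request body sender whose remaining-length field
    defaults to 0 when the Content-Length header is missing."""

    def __init__(self, data, content_length=None):
        self.data = data
        # Mirrors mRequestBodyLenRemaining being pre-initialized to 0: if
        # the higher-level API forgot to add Content-Length, we silently
        # believe there is nothing left to send.
        self.remaining = content_length if content_length is not None else 0

    def next_chunk(self, size=4):
        if self.remaining == 0:
            return b""  # looks like "done" even though data was never sent
        chunk = self.data[:size]
        self.data = self.data[size:]
        self.remaining -= len(chunk)
        return chunk
```

With the header present the body streams out normally; without it, the sender's bookkeeping and the actual data disagree from the very first byte.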


Firefox generally does not block if a remote connection does not work. As explained in the post, the infinite loop was a bug in the network stack itself.

So yes, you can use Firefox in any offline environment.


This is absolutely true, and hence we run not only our tests but also fuzzing under TSan, to explore even more corner cases.

On the static vs. dynamic side, I would always opt for the dynamic when it can guarantee me no false positives, even if the results are incomplete. It is pretty much impossible to deploy a tool that produces lots of false positives because developers usually will reject it at some point and question every result.

