jonasn's comments

jonasn · 2026-02-25T22:54:31 1772060071

Every GC algorithm in HotSpot is designed with a specific set of trade-offs in mind.

ZGC and G1 are fantastic engineering achievements for applications that require low latency and high responsiveness. However, if you are running a pure batch data pipeline where pause times simply don't matter, Parallel GC remains an incredibly powerful tool and probably the one I would pick for that scenario. By accepting the pauses, you get the benefit of zero concurrent overhead, dedicating 100% of the CPU to your application threads while they are running.

cogman10 · 2026-02-25T22:58:35 1772060315

Gotta be honest, I have a hard time arguing for G1 over ZGC. It seems to me like any situation you'd want G1 you probably want ZGC instead. That default 200ms target latency is already pretty long. If you've made that tradeoff for G1 because you wanted lower latency, you probably are going to be happier with ZGC.

I also find that the parallel collector is often better than G1, particularly for small heaps. With modern CPUs, parallel is really fast. Those 200ms pauses are pretty easy to achieve if you have something like a 4gb heap and 4 cores.

The other benefit of the parallel collector is that its off-heap memory allocation is quite low. It was a nasty surprise to us with G1 how much off-heap memory was required (with Java 11; I know that's gotten a lot better).

spockz · 2026-02-26T12:41:09 1772109669

We have many apps whose business logic runs just fine on <1 core, and they run on K8S. If we then use a parallel or concurrent garbage collector, it eats through the app's CPU limit in a blink, causing the process not to be scheduled for several ticks. This introduces more latency than the GC cycles themselves would with a serial GC that runs frequently enough.
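
A quick way to check what the JVM actually sees inside such a container is to print the visible CPU count and the registered collectors. This is only a minimal sketch; `GcInfo` is a made-up class name, and the collector is assumed to have been picked at startup with a standard flag such as `-XX:+UseSerialGC` or `-XX:+UseParallelGC`.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcInfo {
    // Names of the collectors the running JVM registered, e.g. after
    // starting with -XX:+UseSerialGC or -XX:+UseParallelGC.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // On container-aware JVMs this usually reflects the CPU limit, not the host.
        System.out.println("CPUs visible to the JVM: "
                + Runtime.getRuntime().availableProcessors());
        System.out.println("Collectors: " + collectorNames());
    }
}
```

The exact bean names differ between collectors and JDK versions, so treat the output as informational rather than something to match on.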

jonasn · 2026-02-25T22:18:04 1772057884

Great question! I actually just touched on this in another thread that went up right around the same time you asked this. It is clearly the next big frontier!

The short answer is: It's something I'm actively thinking about, but instrumenting micro-level events (like ZGC's load barriers or G1's write barriers) directly inside application threads without destroying throughput (or creating observer effects invalidating the measurements) is incredibly difficult.

magicalhippo · 2026-02-26T16:37:42 1772123862

> instrumenting micro-level events (like ZGC's load barriers or G1's write barriers) directly inside application threads without destroying throughput (or creating observer effects invalidating the measurements) is incredibly difficult

I've used a sampling profiler with success to find lock contention in heavily multithreaded code, but I guess there are some details that make it not viable for this?

firefly2000 · 2026-02-26T00:57:53 1772067473

Do you think it can be done by adjusting GC aggressiveness (or even disabling it for short periods of time) and correlating it with execution time?

jonasn · 2026-02-26T11:29:53 1772105393

That is spot on. Effectively disabling GC to establish a baseline is exactly the methodology used in the Blackburn & Hosking paper [1] I referenced.

In general, for a production JVM like HotSpot, the implicit cost comes largely from the barriers (instructions baked directly into the application code). So even if we disable GC cycles, those barriers are still executing.

If we were to remove barriers during execution, maintaining correctness becomes the bottleneck. We would need a way to ensure we don't mark a live (reachable) object as dead the moment we re-enable the collector.

[1] https://dl.acm.org/doi/pdf/10.1145/1029873.1029891

babol · 2026-02-26T11:49:33 1772106573

Would running an application with a chosen GC, subtracting the GC time reported by the methods you introduced, and then comparing with an Epsilon-based run be a good estimate of barrier overhead?

Thank you for the well written article!

jonasn · 2026-02-26T12:03:15 1772107395

That is a creative idea, but unfortunately, Epsilon changes the execution profile too much to act as a clean baseline for barrier costs.

One huge issue is spatial locality. Epsilon never reclaims, whereas other GCs reclaim and reuse memory blocks. This means their L2/L3 cache hit rates will be fundamentally different.

If you compare them, the delta wouldn't just be the barrier overhead; it would be the barrier overhead mixed with completely different CPU cache behaviors, memory layout etc. The GC is a complex feedback loop, so results from Epsilon are rarely directly transferable to a "real" system.

jonasn · 2026-02-24T14:23:55 1771943035

Hi HN, I'm the author of this post and a JVM engineer working on OpenJDK.

I've spent the last few years researching GC for my PhD and realized that the ecosystem lacked standard tools to quantify GC CPU overhead—especially with modern concurrent collectors where pause times don't tell the whole story.

To fix this blind spot, I built a new telemetry framework into OpenJDK 26. This post walks through the CPU-memory trade-off and shows how to use the new API to measure exactly what your GC is costing you.

I'll be around and am happy to answer any questions about the post or the implementation!

spockz · 2026-02-25T21:54:29 1772056469

Thank you for this interface! It will definitely help in tracking down GC related performance issues or in selecting optimal settings.

One thing I still struggle with is seeing how much of a penalty our application threads suffer from other work, say GC. In the blog you mention that GC's impact is not only the CPU doing work like traversing and moving (old/live) objects, but also the cost of thread pauses and barriers.

How can we detect these? Is there a way we can share the data in some way like with OpenTelemetry?

Currently I do it by running a load on an application and retaining its memory resources until the point where its CPU skyrockets because of the strongly increasing GC cycles, and then comparing the CPU utilisation and the ratio between CPU used and work done.

Edit: it would be interesting to have the GC time spent added to a span. Even though that time is shared across multiple units of work, at least you can use it as a datapoint that the work was (significantly?) delayed by the GC occurring, or waiting for the required memory to be freed.
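
A rough sketch of that idea using only the long-standing `GarbageCollectorMXBean` API: sample the accumulated collection time before and after a unit of work and attach the delta to the span. `GcSpan` and `gcMillisDuring` are hypothetical names, and how the value gets attached to a real tracing span is left out. Note that this measures JVM-wide elapsed collection time, not CPU time and not time attributable to this thread alone, and some collectors register several beans, so the sum is only an indicator.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcSpan {
    // Sum of accumulated collection time (ms) over all registered collectors.
    // getCollectionTime() returns -1 when a collector does not report it.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) total += t;
        }
        return total;
    }

    // Runs the work and returns the JVM-wide GC time accumulated meanwhile.
    static long gcMillisDuring(Runnable work) {
        long before = totalGcMillis();
        work.run();
        return totalGcMillis() - before;
    }

    public static void main(String[] args) {
        Object[] sink = new Object[1];
        long gcMs = gcMillisDuring(() -> {
            // Hypothetical unit of work that allocates enough to trigger GC.
            for (int i = 0; i < 2_000_000; i++) {
                sink[0] = new byte[1024];
            }
        });
        System.out.println("GC ms overlapping this span: " + gcMs);
    }
}
```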

jonasn · 2026-02-25T22:09:24 1772057364

Thanks for reading! Your current method, pushing the load until the GC spirals and then comparing the CPU utilization, is exactly the painful, trial-and-error approach I'm hoping this new API helps alleviate.

You've hit on the exact next frontier of GC observability. The API in JDK 26 tracks the explicit GC cost (the work done by the actual GC threads). Tracking the implicit costs, like the overhead of ZGC's load barriers or G1's write barriers executing directly inside your application threads, along with the cache eviction penalties, is essentially the holy grail of GC telemetry.

I have spent a lot of time thinking about how to isolate those costs as part of my research. The challenge is that instrumenting those barrier events in a production VM without destroying application throughput (and creating observer effects) is incredibly difficult. It is absolutely an area of future research I am actively thinking about, but there isn't a silver bullet for it in standard HotSpot just yet.

For thread pauses, one thing you could look at is time to safepoint; there is some existing support for analyzing it.

Regarding OpenTelemetry. MemoryMXBean.getTotalGcCpuTime() is exposed via the standard Java Management API, so it should be able to hook into this.
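
As a hedged sketch of what reading it could look like: the snippet below looks the method up reflectively so that it also compiles and runs on JDKs older than 26, where it simply reports -1. `GcCpuProbe` is a made-up name, and the return type and units are assumptions here; check the JDK 26 Javadoc for `MemoryMXBean` before relying on them.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.reflect.Method;

class GcCpuProbe {
    // Returns the value of MemoryMXBean.getTotalGcCpuTime() if the running
    // JDK has it, or -1 if the method is missing or cannot be invoked.
    static long totalGcCpuTime() {
        try {
            Method m = MemoryMXBean.class.getMethod("getTotalGcCpuTime");
            Object value = m.invoke(ManagementFactory.getMemoryMXBean());
            return ((Number) value).longValue();
        } catch (ReflectiveOperationException | RuntimeException e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        System.out.println("Total GC CPU time: " + totalGcCpuTime());
    }
}
```

On JDK 26 and later the reflection is unnecessary and the method can be called directly; the value can then be sampled periodically and handed to whatever metrics exporter is already in use.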

spockz · 2026-02-26T12:30:28 1772109028

After writing my previous post I was wondering, do we actually need to instrument the barrier events and other code tied to a GC? Currently we benchmark our application with different GCs at different settings and resource constraints, and then we pick one sizing and settings combination that we like (read: most work/total CPU that still fits within the allocation constraints of our clusters). What ultimately matters for production is how the app behaves in production.

This will not help directly when developing new GCs (or new versions of them). On the other hand, if we can have a noop GC that also omits any of the barriers etc. required for a GC to function, we can create a baseline for apps, provided we have enough total memory to run the benchmark in.

Edit: I guess we can then also use perf to compare cache misses between runs with different GC implementations and settings. Not sure how this works out in real life as it will be very CPU, kernel, and other loads dependent.

yvdriess · 2026-02-26T20:48:21 1772138901

The problem is that there is no baseline for measuring GC overhead. You cannot turn it off, you can only replace and compare with different strategies. For example sbrk is technically a noop GC, but that also has overhead and impact because it will not compact objects and give you bad cache behavior. (It illustrates the OP's point that it is not enough to measure pauses, sbrk has no pauses but gets outperformed easily.)

You could stop collecting performance counters around GC phases, but even if you are not measuring, the CPU still runs through its instructions, causing the second-order effects. And as you mentioned, too-short-to-measure barriers and other bookkeeping overheads (updating ref counters etc.), or simply the fact that some tag bits or object slots are reserved, all impact performance.

There is a good write-up of the problem and a way to estimate the cost based on different GC strategies, as you suggested, here: https://arxiv.org/abs/2112.07880

The way I found to measure a no-GC baseline is to compare them in an accurate workload performance simulator. Mark all GC and allocator related code regions and have the simulator skip all those instructions. Critically, that needs to be a simulator that does not deal with the functional simulation, but gets its instructions from a functional simulator, emulator or PIN tool that does execute everything. It's laborious, not very fast and impractical for production work. But it's the only way I found to answer a question like "What is the absolute overhead of memory management in Python?". (Answer: lower bound walltime sits around +25% avg, heavily depending on the pyperformance benchmark)

abbeyj · 2026-02-27T04:19:08 1772165948

I'm a bit confused about the colors used in the CPU graphs. In the first graphs it looks like green means that the application is running and red means that the GC is running. But once we get to Figure 4 then red means the GC is running (on the GC threads) or nothing is running (on the Main thread)? If red always means that GC work is being done on that thread then this is inconsistent with the text that says "By distributing reclamation work across both cores..." since we would have three threads running at once. Once you move to the concurrent GC figures you definitely have three things running at once. Unless you're assuming SMT with each core running two threads?

In Figure 3 you somehow have 101% wall time. :)

jonasn · 2026-02-27T10:40:56 1772188856

Thanks for the detailed read and the great questions!

Regarding the colors and thread counts in Figure 4: the key piece of context here is that the application thread (the Main thread) is completely paused during this phase. It isn't actually running anything at all. Because the application is halted, only the GC threads are doing active work. Therefore, rather than three threads running at once, we strictly have two things running concurrently. This is a helpful piece of feedback and I'll make sure to make this clearer in future writings.

Good eye on the 101% wall time. That was due to a minor bug in my plotting script that specifically affected the GC plots with no concurrent time. I have corrected this and updated the post. The fixed plot should be visible on the site in a future near you just as soon as the edge caches invalidate.

yunnpp · 2026-02-26T02:22:15 1772072535

Hey, noob question, but does OpenJDK look at variable scope and avoid allocating on the heap to begin with if a variable is known to not escape the function's stack frame?

Not strictly related to this post, but I figured it'd be helpful to get an authoritative answer from you on this.

pgrulich · 2026-02-26T06:16:28 1772086588

Yes, HotSpot performs Escape Analysis to avoid heap allocation. This is a nice article: https://shipilev.net/jvm/anatomy-quarks/18-scalar-replacemen...
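
To make that concrete, here is a small illustrative case (the names are invented for the example). The `Point` never leaves `distanceSquared`, so once the method is JIT-compiled the compiler may replace the object with two plain ints. Whether it actually does depends on inlining and the JIT tier, so this is a candidate for the optimization, not a guarantee.

```java
class EscapeDemo {
    // Small value holder; instances created in distanceSquared never leave it.
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // The Point is only read locally and never stored or returned, so the
    // JIT may replace it with two scalars instead of a heap allocation.
    static long distanceSquared(int x, int y) {
        Point p = new Point(x, y);
        return (long) p.x * p.x + (long) p.y * p.y;
    }

    public static void main(String[] args) {
        long sum = 0;
        // Loop enough times to give the JIT a chance to compile the method.
        for (int i = 0; i < 1_000_000; i++) {
            sum += distanceSquared(i, i + 1);
        }
        System.out.println(sum);
    }
}
```

One way to see the effect is to compare allocation rates between a normal run and one with `-XX:-DoEscapeAnalysis`.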

latchkey · 2026-02-26T02:15:32 1772072132

I built this 15 years ago and it got fairly popular, but is long dead now...

https://github.com/jmxtrans/jmxtrans

Kind of amazing how people are still building telemetry into Java. Great post and great work. Keep it up.

sitta · 2026-02-26T03:04:41 1772075081

Great article!

Will the new metric be exposed in JFR recordings as well?

jonasn · 2026-02-26T11:17:37 1772104657

Thanks!

It is not currently exposed in JFR for JDK 26, but I agree that it would be the logical next step. Now that the underlying telemetry framework (cpuTimeUsage.hpp) is in place within HotSpot, wiring it up to JFR events would be a natural extension.

exabrial · 2026-02-26T02:20:49 1772072449

I just want to say this is an incredibly detailed, well written, and beautifully illustrated article. Solid work.

jonasn · 2026-02-26T11:15:09 1772104509

Thanks! I really appreciate that. I spent a lot of time trying to nail the illustrations so I'm really glad it landed well. :-)

jonasn · 2026-01-14T08:48:32 1768380512

Author of the OpenJDK patch here.

Thanks for the write-up Jaromir :) For those interested, I explored memory overhead when reading /proc—including eBPF profiling and the history behind the poorly documented user-space ABI.

Full details in my write-up: https://norlinder.nu/posts/User-CPU-Time-JVM/

jerrinot · 2026-01-14T09:12:10 1768381930

Hi Jonas, thanks for the work on OpenJDK and the post! I swear I hadn't seen your blog :) I finished my draft around Christmas and it’s been in the queue since. Great minds think alike, I guess.

edit: I just read your blog in full and I have to say I like it more than mine. You put a lot more rigor into it. I’m just peeking into things.

edit2: I linked your article from my post.

jonasn · 2026-01-14T19:03:26 1768417406

Thanks for the kind words and the link :).

kstrauser · 2026-01-14T17:21:15 1768411275

Why do you suppose it was originally written the way it was? To my eyes, that seems like a horrible approach. Doing file IO and parsing strings in every call? What?! And yet I assume the original author was a smart person who had a reason why this made sense to them, and my inability to guess why is my own limitation and not theirs.

So, why do you reckon they did that?

jonasn · 2026-01-14T19:02:55 1768417375

You are spot on that the original author had a valid reason: at the time, it was literally the only way to do it.

The method in question (Java 1.5) was released in September 2004. While the POSIX standard existed, it only provided a way to get total CPU time, not the specific user time that Java needed. You can read about it more in the history section here: https://norlinder.nu/posts/User-CPU-Time-JVM/#a-walk-through....
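
For context, this is roughly what the Java side looks like, assuming the method under discussion is `ThreadMXBean.getCurrentThreadUserTime()` (the thread does not name it, so that is an inference). `ThreadCpu` is a made-up wrapper; both calls have existed since Java 1.5 and report nanoseconds.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class ThreadCpu {
    // Returns {totalCpuNanos, userNanos} for the calling thread, or
    // {-1, -1} if the JVM does not support or has disabled the measurement.
    static long[] currentThreadTimes() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return new long[] { -1L, -1L };
        }
        return new long[] {
            threads.getCurrentThreadCpuTime(),   // user + system time
            threads.getCurrentThreadUserTime()   // user-mode time only
        };
    }

    public static void main(String[] args) {
        long[] t = currentThreadTimes();
        System.out.println("total ns: " + t[0] + ", user ns: " + t[1]);
    }
}
```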

But it's worth noting that while this specific case can be "fixed" with a function call, parsing /proc is still the standard way to get data in Linux.

Even today, a vast amount of kernel telemetry is only exposed via the filesystem. If you look at the source code for tools like htop, they are still busy parsing text files from /proc to get memory stats (/proc/meminfo), network I/O, or per-process limits. See here https://github.com/hishamhm/htop/blob/master/linux/LinuxProc....
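
To give a feel for what that parsing involves, here is a minimal sketch in Java rather than htop's C. `MemInfo` is an invented name, and the parser only handles the simple `Name:   value unit` shape that `/proc/meminfo` lines normally have.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MemInfo {
    // Parses lines such as "MemTotal:       16384256 kB" into name -> value.
    // Values are kept in whatever unit the line carries (usually kB).
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String[] parts = line.substring(colon + 1).trim().split("\\s+");
            try {
                out.put(line.substring(0, colon).trim(), Long.parseLong(parts[0]));
            } catch (NumberFormatException e) {
                // Skip lines whose value is not a plain integer.
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path p = Path.of("/proc/meminfo");
        if (Files.isReadable(p)) {
            System.out.println("MemTotal: " + parse(Files.readAllLines(p)).get("MemTotal"));
        } else {
            System.out.println("/proc/meminfo not available on this platform");
        }
    }
}
```

Every call pays for opening the file, reading it, and converting text to numbers, which is the cost being discussed for the per-thread case.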

kstrauser · 2026-01-14T19:28:27 1768418907

That sounds like a pretty good reason!

I knew about using proc for all that other information. I just wouldn’t have imagined using it for a critical performance path. Unless, that is, that’s the way you have to get the information.