
My only complaint is that Eigen with OpenMP can be slower with SMT than without SMT. They even suggest telling OpenMP to use half as many threads as you have "cores" when the "core" count is doubled by SMT.

Otherwise, we've seen an 8-10x performance increase in SolveSpace (CAD) in some situations after switching from home-grown matrix operations to Eigen.
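For the "half as many threads" workaround, here is a minimal sketch of how the count could be derived. The helper names `physical_core_guess` and `suggested_thread_count` are hypothetical (they are not part of Eigen or OpenMP), and the sketch assumes every core exposes exactly two hardware threads, which is not true with SMT disabled or on hybrid parts.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper (not part of Eigen or OpenMP): guess the number of
// physical cores from the logical CPU count, ASSUMING every core exposes
// `smt_ways` hardware threads (2 on typical SMT/Hyper-Threading parts).
inline int physical_core_guess(unsigned logical_cpus, unsigned smt_ways = 2) {
    assert(smt_ways >= 1);
    // std::thread::hardware_concurrency() may return 0 when the count is
    // unknown; fall back to a single thread in that case.
    if (logical_cpus == 0) return 1;
    return static_cast<int>(std::max(1u, logical_cpus / smt_ways));
}

// Thread count to hand to the math library on this machine.
inline int suggested_thread_count(unsigned smt_ways = 2) {
    return physical_core_guess(std::thread::hardware_concurrency(), smt_ways);
}
```

In an OpenMP build the result could then be passed to `Eigen::setNbThreads(n)` or `omp_set_num_threads(n)`, or the same number could be exported as `OMP_NUM_THREADS` before launching the program.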



This has been true for years: Intel CPUs can't efficiently perform math operations when hyperthreading is involved. Generally there is only a single shared FPU/AVX/SSE unit doing the math for both hyperthreads. Since the Eigen implementation can often keep that unit 100% busy, it makes no sense to try to run two threads at full tilt through it.

I tested all this very heavily before Eigen had AVX-512 support. In that environment there might be some differences and I would suggest you benchmark both configurations.
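If you want to benchmark both configurations, a rough harness along these lines might do. It is only a sketch: `best_of`, `Comparison` and `compare_thread_counts` are hypothetical names, "half" is assumed to mean logical CPUs divided by two, and the callback is expected to set the thread count (for example via `Eigen::setNbThreads`) and then run your real workload.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Run `work` `repeats` times and return the fastest wall-clock time in
// seconds. Taking the minimum filters out some scheduling noise.
template <class F>
double best_of(int repeats, F&& work) {
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Timings for the two configurations discussed in the thread.
struct Comparison {
    double half_seconds;  // one thread per assumed physical core
    double full_seconds;  // one thread per logical CPU
};

// `run_with_threads(n)` is supplied by the caller: it should configure the
// library to use n threads and then execute the workload once.
template <class F>
Comparison compare_thread_counts(int logical_cpus, int repeats,
                                 F&& run_with_threads) {
    const int half = std::max(1, logical_cpus / 2);  // assumes 2-way SMT
    Comparison c{};
    c.half_seconds = best_of(repeats, [&] { run_with_threads(half); });
    c.full_seconds = best_of(repeats, [&] { run_with_threads(logical_cpus); });
    return c;
}
```

Use matrix sizes that match your real problem; small matrices and large ones can behave very differently with respect to the shared cache.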


> Generally there is only a single shared FPU/AVX/SSE unit doing the math over two hyperthreads.

Hyperthreading does in general share units (both ALU and others); that's what hyperthreading is.

Apart from that, it really depends on what operations you're doing; e.g., modern Intel CPUs have three ports that can each issue a 256-bit FMA every cycle.


> Hyperthreading does in general share units (both ALU and others); that's what hyperthreading is.

Yes, I think the issue with Eigen is cache related. They apparently have optimizations that are aware of the cache architecture, and running two threads that share the same cache will screw that up, resulting in more misses. If this is the case, I'd prefer algorithms that are cache-line-size agnostic. It is still much faster than the simple hand-written code we had before!


I think this is generally true if your workload is SIMD/AVX heavy: these types of “heavy” instructions from both hyperthreads cannot execute on a single core simultaneously.


If you have Eigen-like code that rarely hits branch mispredicts or loads the prefetcher can't figure out, and you also have enough calculations to use the full width of the core on a single thread, then there really isn't any potential throughput gain from SMT, but you still suffer from cache contention from having two threads. It's really not Eigen's fault; it's the nature of SMT that it doesn't help in all cases.



