
OK, bear with me as we get slightly contrived here... (and deviate a little from my earlier, somewhat less thought-out example).

Say your workload inherently requires random access (throughout its entire execution) to a dataset of size N. If you run it on a single processor with a cache of size N/2, you'll see a lot of cache misses that end up getting serviced from the next level in the storage hierarchy, slowing down execution a lot. If you add another processor with another N/2 units of cache, both will still see about the same cache miss rate, but those misses no longer necessarily have to be satisfied from the next level down: at least some of the time they can be serviced from the other processor's cache, which is likely to be significantly faster. This holds whether you're talking about CPU caches relative to DRAM in an SMP system, or memory relative to disk between two separate compute nodes over a network link.
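
To make that concrete, here's a toy simulation of exactly this setup (my own sketch, not from any paper): uniform random access to N keys, comparing one node with an N/2-entry LRU cache against two nodes that each have an N/2-entry cache and can answer each other's misses. The relative latencies (1 for a local hit, 10 for a peer-cache hit, 100 for the backing store) are made-up illustrative numbers:

    import random
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()

        def __contains__(self, key):
            return key in self.entries

        def insert(self, key):
            if key in self.entries:
                self.entries.move_to_end(key)     # refresh recency
            elif len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[key] = True

    N, ACCESSES = 10_000, 200_000
    LOCAL, PEER, BACKING = 1, 10, 100  # assumed relative access latencies

    # One node with a cache half the dataset's size: every miss goes
    # all the way to the backing store.
    single = LRUCache(N // 2)
    cost_one = 0
    for _ in range(ACCESSES):
        k = random.randrange(N)
        cost_one += LOCAL if k in single else BACKING
        single.insert(k)

    # Two nodes, requests split randomly between them: a local miss
    # can still be a (cheaper) hit in the peer's cache.
    a, b = LRUCache(N // 2), LRUCache(N // 2)
    cost_two = 0
    for _ in range(ACCESSES):
        local, peer = (a, b) if random.random() < 0.5 else (b, a)
        k = random.randrange(N)
        if k in local:
            cost_two += LOCAL
        elif k in peer:
            cost_two += PEER
        else:
            cost_two += BACKING
        local.insert(k)

    print(f"avg cost per access, 1 node : {cost_one / ACCESSES:.1f}")
    print(f"avg cost per access, 2 nodes: {cost_two / ACCESSES:.1f}")

With those (assumed) latencies, the two-node average cost per access drops well below the single-node cost, and since each of the two nodes is only serving half the request stream, total throughput improves by more than 2x with 2x the hardware: superlinear speedup from the aggregate cache alone.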

For a more concrete example somewhat related (though not entirely congruent) to the above scenario, see http://www.ece.cmu.edu/~ece845/docs/lardpaper.pdf (page 211).



Hmm. I think the reason there's superlinear speedup in the paper you linked is that the requests must be serviced in order. If you only cared about throughput and could service requests out of order, you could use LARD within a single serial process too, to improve cache locality, and achieve 1/Nth the speed of N-machine LARD. But to serve requests online you can't do that reordering, so with one cache you'd be constantly invalidating it; thus the increased aggregate cache across the various machines is what produces the superlinear speedup.
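
For intuition, the core of LARD's dispatch (sketched from memory, with hypothetical names; the actual paper also reassigns targets when back-end load gets too imbalanced) is just: route each target back to whichever back-end already has it cached, so each back-end's cache stays hot on a stable subset of targets:

    class Dispatcher:
        def __init__(self, n_backends):
            self.assignment = {}           # target -> back-end index
            self.load = [0] * n_backends   # outstanding requests per back-end

        def route(self, target):
            if target not in self.assignment:
                # First time we've seen this target: give it to the
                # least-loaded back-end.
                self.assignment[target] = self.load.index(min(self.load))
            backend = self.assignment[target]
            self.load[backend] += 1
            return backend

        def finish(self, target):
            self.load[self.assignment[target]] -= 1

Each back-end ends up serving a stable subset of targets, which is exactly the aggregate-cache effect described upthread.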

So, mission accomplished! I now believe that superlinear speedup is a real thing, and know of one example!



