
Cache-Oblivious Data Structures: https://cs.au.dk/~gerth/MassiveData02/notes/demaine.pdf

A vaguely related notion is that naive analysis of big-O complexity in typical CS texts glosses over the increasing latency/cost of data access as the data size grows. This can't be ignored, no matter how much we would like to hand-wave it away, because physics gets in the way.

A way to think about it is that a CPU core is like a central point with "rings" of data arranged around it in a more-or-less flat plane. The L1 cache is a tiny ring, then L2 is a bit further out physically and has a larger area, then L3 is even bigger and further away, etc... all the way out to permanent storage that's potentially across the building somewhere in a disk array.

In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's (up to a constant factor) the radius of the growing circle with area 'n'.

Hence, a lot of algorithms that on paper have identical performance don't in reality, because one of them may pick up an extra sqrt(n) factor from memory access.

This is why streaming and array-based data structures and algorithms tend to be faster than random-access, even if the latter is theoretically more efficient. So for example merge join is faster than nested loop join, even though they have the same performance in theory.
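
As a rough illustration of the join point (a toy sketch in C, mine rather than anything from the linked notes; integer keys only, and proper duplicate-key handling is glossed over):

    #include <stddef.h>

    /* Nested loop join: re-scans all of b for every element of a. */
    static size_t nested_loop_join(const int *a, size_t na, const int *b, size_t nb) {
        size_t matches = 0;
        for (size_t i = 0; i < na; i++)
            for (size_t j = 0; j < nb; j++)
                if (a[i] == b[j]) matches++;
        return matches;
    }

    /* Merge join: both inputs already sorted on the key, one streaming pass each. */
    static size_t merge_join(const int *a, size_t na, const int *b, size_t nb) {
        size_t matches = 0, i = 0, j = 0;
        while (i < na && j < nb) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { matches++; i++; j++; }   /* equal keys (duplicates need more care) */
        }
        return matches;
    }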



Real-Time Collision Detection [1] has a fantastic chapter on this with some really good practical examples, if I remember right.

Great book; I used to refer to it as a 3D "data structures" book, which is very much in theme with this thread.

[1] https://www.amazon.com/Real-Time-Collision-Detection-Interac...


The implicit grid data structure from there is a personal favorite of mine. I used it in a game once and it performed incredibly well for our use case.

It's a bit too complicated to totally summarize here, but it uses a bit per object in the scene. Then bit-wise operations are used to perform quick set operations on objects.
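
Roughly, from memory, the shape of it is this (a simplified sketch assuming a hypothetical 64-object scene so one machine word covers everything; the book generalises it to arrays of words):

    #include <stdint.h>

    #define ROWS 16
    #define COLS 16

    static uint64_t row_mask[ROWS];   /* bit i set if object i overlaps that row */
    static uint64_t col_mask[COLS];   /* bit i set if object i overlaps that column */

    /* Objects that may occupy cell (r, c): one AND instead of walking a list. */
    static uint64_t objects_in_cell(int r, int c) {
        return row_mask[r] & col_mask[c];
    }

    /* Candidates for a rectangular region: OR the spanned rows and columns,
       then intersect the two sets. */
    static uint64_t objects_in_region(int r0, int r1, int c0, int c1) {
        uint64_t rows = 0, cols = 0;
        for (int r = r0; r <= r1; r++) rows |= row_mask[r];
        for (int c = c0; c <= c1; c++) cols |= col_mask[c];
        return rows & cols;
    }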

This data structure got me generally interested in algorithms that use bits for set operations. I found the Roaring Bitmap github page has a lot of useful information and references wrt this topic: https://github.com/RoaringBitmap/RoaringBitmap


I spent a lot of time optimizing an octree for ray tracing. It is my favorite "spatial index" because it's the only one I know that can be dynamically updated and always results in the same structure for given object placement.


It's a stellar book. I work on a commercial realtime physics engine, the "orange book" is practically required reading here.


I often recommend this book, RTCD, as a "better intro to data structures and algorithms" (sometimes as "the best"). The advice often falls on deaf ears, unfortunately.


Seconding the RTCD recommendation. Handy code examples, and my favorite part is that the book is real-world-performance-conscious (hence the "real-time" in the title).


> In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.

I was about to write a comment suggesting that if we made better use of three dimensional space in constructing our computers and data storage devices, we could get this extra latency factor down to the cube root of n.

But then, I decided to imagine an absurdly large computer. For that, one has to take into account a curious fact in black hole physics: the radius of the event horizon is directly proportional to the mass, rather than, as one might expect, the cube root of the mass [1]. In other words, as your storage medium gets really _really_ big, you also have to spread it out thinner and thinner to keep it from collapsing. No fixed density is safe. So, in fact, I think the extra factor for latency in galactic data sets is neither the square root nor cube root, but n itself!

[1] https://en.wikipedia.org/wiki/Schwarzschild_radius


The main difficulty with growing computing devices in three dimensions is cooling. All that heat has to be somehow carried away. It's already difficult with 2D devices. One would have to incorporate channels for cooling liquids into the design, which takes a bite out of the gains from 3D.

https://www.anandtech.com/show/12454/analyzing-threadripper-...


This series of blog posts explains this sqrt behaviour quite nicely, covering both theory (including black holes) and practice:

http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html


If it's stored far away enough, re-doing the calculation is faster. Depending on how much accuracy you need, you might as well load a 2022-07-22 07:26:34 earth and have that guy on HN evaluate his thoughts all over again.


How do you accurately include those sqrt(n)’s in your analysis?


I spent a long time looking at an algorithm for long-range force calculations called the Fast Multipole Method, which is O(N). I found that in practice it couldn't compete against the O(N log N) FFT-based method we used: because of cache locality and the like, the coefficient was so large that you'd need to simulate systems way bigger than is feasible before it paid off.


Honestly, that is not too uncommon. For top-level algorithm choice, "rounding" O(log(n)) to O(1), and O(n*log(n)) to O(n), is often pretty reasonable for comparison purposes, even though it is strictly incorrect.

It is not uncommon for a simple O(n*log(n)) algorithm to beat out a more complex O(n) algorithm for realistic data sizes. If you "round" them both to O(n), then picking the simpler algorithm becomes the obvious choice, and it can easily turn out to have better perf.


To be fair, in the general case (particles placed in arbitrary positions), it worked very well. The issue was that in the area of electromagnetics I worked on at the time, we made an assumption about things being placed on a regular grid, and that allowed us to do the force calculation by convolving a kernel with the charges/dipole strengths in Fourier space.


A helpful little heuristic that always stuck with me from a Stanford lecture by Andrew Ng was, "log N is less than 30 for all N".



This implementation and associated paper improve the cache locality to make the algorithm more competitive: https://github.com/dmalhotra/pvfmm

They compare solving toy problems with different methods here https://arxiv.org/pdf/1408.6497.pdf

But as you mention, this is just fiddling with constants, and it still may be too costly for your problem.


I saw an amazing presentation on purely functional data structures, with a really approachable and understandable proof of how a particular structure was asymptotically constant or linear time (I believe). It then followed up by saying that in practice none of it worked as well as a mutable version because of caching and data locality.

Oh well.


Is there by any chance a publicly available recording of this presentation on purely functional data structures?


NE Scala Symposium Feb 2011

http://vimeo.com/20262239 Extreme Cleverness: Functional Data Structures in Scala

or maybe

http://vimeo.com/20293743 The Guerrilla Guide to Pure Functional Programming

(At work, so I can't check the video)

Both of them were excellent speakers


Does anyone actually use cache-oblivious data structures in practice? Not, like, "yes I know there's a cache, I will write a data structure for that", that's common, but specifically cache-oblivious data structures?

People mention them a lot but I've never heard anyone say they actually used them.


To an extent, B-trees (and their variants) are cache oblivious. They smoothly improve in performance with more and more cache memory, and can scale down to just megabytes of cache over terabytes of disk.

The issue is that the last tree level tends to break this model because with some caching models it is "all or nothing" and is the biggest chunk of the data by far.

There are workarounds which make it more truly cache oblivious. How often is this used in production software? Who knows...


There are also Judy arrays: https://en.wikipedia.org/wiki/Judy_array


This sounded promising, but then I saw a performance benchmark [1] that was done on a Pentium(!) with a very suboptimal hash table design compared to [2] and [3].

The Judy site itself [4] is full of phrases like "may be out of date", when the entire site was last updated in 2004. If it was already out of date in 2004...

It seems that Judy arrays are just kind of a forgotten datastructure, and it doesn't look promising enough for me to put in effort to revive it. Maybe someone else will?

[1] http://www.nothings.org/computer/judy/

[2] https://probablydance.com/2017/02/26/i-wrote-the-fastest-has...

[3] https://probablydance.com/2018/05/28/a-new-fast-hash-table-i...

[4] http://judy.sourceforge.net/


Judy arrays are (under the covers) a kind of radix tree, so they are comparable to things like ART (adaptive radix tree), HAMT (hashed array-mapped trie), or my qp-trie. Judy arrays had some weird issues with patents and lack of source availability in the early days. I think they are somewhat overcomplicated for what they do.

Regarding cache-friendliness, a neat thing I discovered when implementing my qp-trie is that it worked amazingly well with explicit prefetching. The received wisdom is that prefetching is worthless in linked data structures, but the structure of a qp-trie makes it possible to overlap a memory fetch and calculating the next child, with great speed benefits. Newer CPUs are clever enough to work this out for themselves so the explicit prefetch is less necessary than it was. https://dotat.at/prog/qp/blog-2015-10-11.html

I guess I partly got the idea for prefetching from some of the claims about how Judy arrays work, but I read about them way back in the 1990s at least 15 years before I came up with my qp-trie, and I don't know if Judy arrays actually use or benefit from prefetch.
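
To illustrate the overlap, here is a stripped-down sketch. It is not the real qp-trie layout, just a hypothetical bitmap-plus-packed-children node, using the GCC/Clang __builtin_prefetch and __builtin_popcountll intrinsics:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct node {
        uint64_t bitmap;          /* which of the 64 possible children exist */
        struct node *children;    /* packed array of the children that do exist */
    } node;

    static node *lookup(node *n, const unsigned char *key, size_t len) {
        for (size_t depth = 0; depth < len; depth++) {
            /* Start pulling the child array into cache now...         */
            __builtin_prefetch(n->children);
            /* ...while the CPU computes which child we actually want. */
            unsigned bit = key[depth] & 63;
            if (!(n->bitmap & (1ULL << bit)))
                return NULL;
            unsigned idx = __builtin_popcountll(n->bitmap & ((1ULL << bit) - 1));
            n = &n->children[idx];
        }
        return n;
    }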


It seems to me that the project is still active: https://sourceforge.net/p/judy/bugs/29/


Cache-obliviousness is an important part of the whole array-of-structs vs structs-of-arrays question. One of the advantages of the struct-of-arrays strategy is that it is cache-oblivious.
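
A minimal sketch of why (my example, not tied to any particular codebase): a pass that only needs one field touches a fraction of the cache lines in the struct-of-arrays layout, and it does so whatever the cache line size happens to be, which is the cache-oblivious part.

    #include <stddef.h>

    /* Array-of-structs: each particle's fields sit together, so summing x
       also drags y, z and mass through the cache. */
    struct particle { float x, y, z, mass; };
    static float sum_x_aos(const struct particle *p, size_t n) {
        float s = 0;
        for (size_t i = 0; i < n; i++) s += p[i].x;
        return s;
    }

    /* Struct-of-arrays: the x values are contiguous, so the same pass touches
       roughly a quarter of the cache lines, regardless of line size. */
    struct particles { float *x, *y, *z, *mass; };
    static float sum_x_soa(const struct particles *p, size_t n) {
        float s = 0;
        for (size_t i = 0; i < n; i++) s += p->x[i];
        return s;
    }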


Cache oblivious data structures are absolutely used in every serious database.

From what I remember/can find online (don't take this as a reference): B-tree: relational DB, SQLite, Postgres, MySQL, MongoDB. B+-tree: LMDB, filesystems, maybe some relational DBs. LSM trees (Log-structured merge tree, very cool) for high write performance: LevelDB, Bigtable, RocksDB, Cassandra, ScyllaDB, TiKV/TiDB. B-epsilon tree: TokuDB, not sure if it exists anymore. COLA (Cache-oblivious lookahead array): I don't know where it's used.

Maybe modern dict implementations can qualify too, e.g. Python.


Yes, LSM trees are specifically designed with this property in mind.


Doing a decent job of analysis would include the sqrt(n) memory access factor. Asymptotic analysis ignores constant factors, not factors that depend on data size. It's not so much 'on paper' as 'I decided not to add a significant factor to my paper model'.

Cache oblivious algorithms attempt to make the memory access term insignificant so you could continue to ignore it. They are super neat.


> the increasing latency/cost of data access as the data size grows

Latency Numbers Everyone Should Know https://static.googleusercontent.com/media/sre.google/en//st...


The ratios are mostly correct, but the absolute numbers are wrong these days. For example, memory read latency can be 45ns (not 100) with a decent DDR4 kit and data center latency can be 10us (not 500us) with infiniband.


Do we mean different things by "merge join" and "nested loop join" ? For me "merge join" is O(n) (but requires the data to be sorted by key) whereas "nested loop join" is O(n^2).


Wouldn't you at least be looking at n log(n) for the sort in the merge join?


Yes, if the data is not already sorted. Thus it's O(n) for already sorted data and O(n log(n) + n) -- which simplifies to O(n log(n)) -- for arbitrary data.


Yeah. I know. Why would the data already be sorted?


In a database you might have already built an index on that data.


I was experimenting a while ago with something that I think is related to this.

If you have a quadratic algorithm where you need to compare each pair of objects in a large array of objects, you might use a loop in a loop like this:

    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            do_something(arr[i], arr[j]);
        }
    }
When this algorithm runs, it will access every cache line or memory page in the array for each item, because j goes through the whole array for each i.

I thought that a better algorithm might do all possible comparisons inside the first block of memory before accessing any other memory. I wanted this to work independently of the size of cache lines or pages, so the solution would be that for each power of 2, we access all items in a position lower than that power of 2 before moving to the second half of the next power of 2.

This means that we need to first compare indexes (0,0) and then (1,0),(1,1),(0,1) and then all pairs of {0,1,2,3} that haven't yet been compared and so on.

It turns out that the sequence (0,0),(1,0),(1,1),(0,1) that we went through (bottom-left, bottom-right, top-right, top-left in a coordinate system) can be recursively repeated by assuming that the whole sequence that we have done so far is the first item in the next iteration of the same sequence. So the next step is to do (0,0),(1,0),(1,1),(0,1) again (but backwards this time) starting on the (2,0) block so it becomes (2,1),(3,1),(3,0),(2,0) and then we do it again starting on the (2,2) block and then on the (0,2) block. Now we have done a new complete block that we can repeat again starting on the (4,0) block and then (4,4) and then (0,4). Then we continue like this through the whole array.

As you can see, every time we move, only one bit in one index number is changed. What we have here is actually two halves of a gray code that is counting up, half of it in the x coordinate and the other half in y. This means that we can iterate through the whole array in the described way by making only one loop that goes up to the square of the size and then we get the two indexes (x and y) by calculating the gray code of i and then de-interleaving the result so that bits 0,2,4... make the x coordinate and bits 1,3,5... make y. Then we compare indexes x and y:

    for (int i = 0; i < size * size; i++) {
        get_gray_xy(i, &x, &y);
        do_something(arr[x], arr[y]);
    }
The result of all this was that the number of cache line/page switches went down by orders of magnitude but the algorithm didn't get faster. This is probably for two reasons: incremental linear memory access is very fast on modern hardware and calculating the gray code and all that added an overhead.


While reading your post I kept thinking "de-interleave the bits of a single counter", since I have used that before to great benefit. It does suffer from issues if your sizes are not powers of two, so your gray code modification seems interesting to me. I'll be looking into that.

FYI this also extends to higher dimensions. I once used it to split a single loop variable into 3 for a "normal" matrix multiplication to get a huge speedup. Unrolling the inner 3 bits is possible too but a bit painful.

Also used it in ray tracing to scan pixels in Z-order, which gave about 10 percent performance improvement at the time.

But yes, I must look at this gray code step to see if it makes sense to me and understand why it's not helping you even with reduced cache misses.


Interestingly I get more cache misses with the gray code solution but still fewer switches between cache lines. This has to be because the cache line prefetcher can predict the memory usage in the simple version but it's harder to predict in the gray code or de-interleave version.

Also, I have never thought about de-interleaving bits without using gray code but that seems to work similarly. The main reason I used gray code is that then we don't have to switch memory blocks as often since only one of the indexes changes every time so one of the current memory blocks can always stay the same as in the previous iteration.


This is similar to what I'm working on.

I am working on machine sympathetic systems. One of my ideas is concurrent loops.

Nested loops are equivalent to enumerating the Cartesian product loop × loop × loop; a single counter N can index into that product.

  for letter in letters:
   for number in numbers:
    for symbol in symbols:
     print(letter + number + symbol)
If len(letters) == 3, len(numbers) == 3 and len(symbols) == 3, you can think of the loop indexes "i", "j" and "k" as digits that begin at 000 and count up in a kind of base 3: 000, 001, 002, 010, 011, 012, 020, 021, 022, 100 and so on.

The formula for "i", "j" and "k" is

  # decode the single counter self.N into one index per nested set
  N = self.N
  combo = []
  for index, item in enumerate(self.sets):
    N, r = divmod(N, len(item))
    combo.append(r)
So self.N is the position in the Cartesian product of the nested loops; you can increment it by 1 each time to get the next combination.
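
For what it's worth, here is a self-contained C version of that decode step (the loop sizes 3, 4 and 5 are just an arbitrary example):

    #include <stdio.h>

    int main(void) {
        int sizes[3] = {3, 4, 5};             /* innermost loop first */
        for (int n = 0; n < 3 * 4 * 5; n++) {
            int idx[3], rest = n;
            for (int d = 0; d < 3; d++) {     /* mixed-radix digits of n */
                idx[d] = rest % sizes[d];
                rest /= sizes[d];
            }
            /* Because n is just a number, a scheduler is free to hand out
               values of n in any order: sequential, round-robin, prioritized... */
            printf("%d %d %d\n", idx[2], idx[1], idx[0]);
        }
        return 0;
    }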

We can load balance the loops and schedule the N to not starve the outer loops. We can independently prioritize each inner loop. If you represent your GUI or backend as extremely nested loops, you can write easy to understand code with one abstraction, by acting as if loops were independent processes.

So rather than that sequence, we can generate 000, 111, 222, 010, 120, 201 and so on.

Load-balanced loops mean you can progress on multiple items simultaneously, concurrently. I combined concurrent loops with parallel loops via multithreading. I plan it to be a pattern for creating incredibly reactive frontends and backends with low latency to processing. Many frontends lag when encrypting, compressing or resizing images; it doesn't need to be that slow. IntelliJ doesn't need to be slow with concurrent loops and multithreading.

See https://github.com/samsquire/multiversion-concurrency-contro...

Loops are rarely first class citizens in most languages I have used and I plan to change that.

I think I can combine your inspiration of gray codes with this idea.

I think memory layout, data structure and algorithm can be 3 separate decisions. I have yet to see any developers talk about this. Most of the time the CPU is idling, waiting for memory, disk or network. We can arrange and iterate data to be efficient.



I decided to try implementing get_gray_xy. I am wondering how you can generalise it to any number of loops and apply it to nested loops of varying sizes, that is, if you have three sets and the sizes are 8, 12, 81.

Wikipedia says gray codes can be found by num ^ (num >> 1)

  a = [1, 2, 3, 4]
  b = [2, 4, 8, 16]
  indexes = set()
  correct = set()

  print("graycode loop indexes")
  for index in range(0, len(a) * len(b)):
    code = index ^ (index >> 1)
    left = code & 0x0003
    right = code >> 0x0002 & 0x0003
  
    print(left, right)
    indexes.add((left, right))
  
  assert len(indexes) == 16
  
  print("regular nested loop indexes")
  
  for x in range(0, len(a)):
    for y in range(0, len(b)):
      correct.add((x,y))
      print(x, y)
  
  
  assert correct == indexes
This produces the sequence:

    graycode loop indexes
    0 0
    1 0
    3 0
    2 0
    2 1
    3 1
    1 1
    0 1
    0 3
    1 3
    3 3
    2 3
    2 2
    3 2
    1 2
    0 2
    regular nested loop indexes
    0 0
    0 1
    0 2
    0 3
    1 0
    1 1
    1 2
    1 3
    2 0
    2 1
    2 2
    2 3
    3 0
    3 1
    3 2
    3 3


Here is the implementation of get_gray_xy that I used:

    /* Compact the even-numbered bits of x into the low 32 bits. */
    static uint64_t deinterleave(uint64_t x) {
        x = x & 0x5555555555555555;               /* keep bits 0,2,4,... */
        x = (x | (x >>  1)) & 0x3333333333333333;
        x = (x | (x >>  2)) & 0x0f0f0f0f0f0f0f0f;
        x = (x | (x >>  4)) & 0x00ff00ff00ff00ff;
        x = (x | (x >>  8)) & 0x0000ffff0000ffff;
        x = (x | (x >> 16)) & 0x00000000ffffffff;
        return x;
    }
    /* x gets the even bits of the gray code, y the odd bits. */
    static void get_gray_xy(uint64_t n, uint64_t *x, uint64_t *y) {
        uint64_t gray = n ^ (n >> 1);
        *x = deinterleave(gray);
        *y = deinterleave(gray >> 1);
    }
I don't think the gray code solution can work well for sets of different sizes because the area of indexes that you go through grows as a square.

But it does work for different number of dimensions or different number of nested loops. For 3 dimensions each coordinate is constructed from every 3rd bit instead of 2nd.

So if you have index 14, then gray code is 9 (1001) which means in two dimensions x=1,y=2 and in three dimensions, x=3,y=0,z=0.
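
For reference, a sketch of that 3-D variant using the standard every-3rd-bit compaction masks (32-bit version, 10 bits per coordinate; for index 14 it gives x=3, y=0, z=0 as above):

    #include <stdint.h>

    /* Keep bits 0,3,6,... of x and pack them into the low 10 bits. */
    static uint32_t deinterleave3(uint32_t x) {
        x &= 0x09249249;
        x = (x ^ (x >> 2))  & 0x030c30c3;
        x = (x ^ (x >> 4))  & 0x0300f00f;
        x = (x ^ (x >> 8))  & 0xff0000ff;
        x = (x ^ (x >> 16)) & 0x000003ff;
        return x;
    }

    static void get_gray_xyz(uint32_t n, uint32_t *x, uint32_t *y, uint32_t *z) {
        uint32_t gray = n ^ (n >> 1);
        *x = deinterleave3(gray);
        *y = deinterleave3(gray >> 1);
        *z = deinterleave3(gray >> 2);
    }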


Thank you for sharing your idea and your code and thoughts and thanks for your reply.

I would like to generalise it for sets of different sizes. Maybe it's something different to a gray codes which you taught me today, it feels that it should be simple.

Maybe someone kind can step in and help me or I can ask it on Stackoverflow.

I need to think it through.

At one point in time I was reading a section on Intel's website about cache optimisation, which used a tool with a GUI that shows you the cache behaviour of the CPU for column-major vs row-major iteration. I found it very interesting but I cannot seem to find it at this time. I did find some Medium article about Cachediff.


deinterleave doesn't actually produce gray code, does it? The GP said to convert to gray code before doing the deinterleave.


> but the algorithm didn't get faster

Could it be branch predictor at play? https://stackoverflow.com/questions/11227809/why-is-processi...




