
Yeah, but the shipping of data back and forth to the GPU is usually a bottleneck no matter how clever you get. Moreover, you're limited to, say, 8 GPUs per box for a total of 100 GB-ish of memory. You can operate on nearly 10 TB on the largest AWS instances using CPUs. With AVX512 intrinsics, that translates into some serious potential on large in-memory datasets, which makes GPUs less appealing.


As long as you are doing something more complicated than O(n), the shipping of data is going to be negligible.

In fact, that's why sorting on a GPU is almost always going to be worthwhile. Sorting is an O(n log n) operation, but the transfer is O(n).

A table join is O(n^2) per join, if I remember correctly. (If you have 5 tables to join, it's a ^2 factor for each one.) That means shipping the data (an O(n) operation) is almost always going to be negligible.

So I'd expect both join and sort to be pushed towards GPUs, especially because both joins and sorts have well known parallel algorithms.
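
For a rough sanity check of that, here's a minimal sketch (assuming a CUDA toolkit with Thrust; the sizes and the int payload are arbitrary choices of mine) that times a device sort including the host-to-device and device-to-host copies against std::sort on the host. The point is just that the O(n) copies ride along with the O(n log n) sort:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>
    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    int main() {
        const size_t n = 1 << 26;                       // ~67M ints, ~256 MB
        std::vector<int> h(n);
        std::mt19937 rng(42);
        for (auto& x : h) x = (int)rng();

        // GPU: copy up, sort, copy back. The copies are the O(n) part.
        auto t0 = std::chrono::steady_clock::now();
        thrust::device_vector<int> d(h.begin(), h.end()); // H2D transfer, O(n)
        thrust::sort(d.begin(), d.end());                 // device sort, O(n log n) work
        thrust::copy(d.begin(), d.end(), h.begin());      // D2H transfer, O(n)
        auto t1 = std::chrono::steady_clock::now();

        // CPU baseline on fresh random data of the same size.
        std::vector<int> h2(n);
        for (auto& x : h2) x = (int)rng();
        auto t2 = std::chrono::steady_clock::now();
        std::sort(h2.begin(), h2.end());
        auto t3 = std::chrono::steady_clock::now();

        printf("GPU sort incl. transfers: %.3f s\n",
               std::chrono::duration<double>(t1 - t0).count());
        printf("CPU std::sort:            %.3f s\n",
               std::chrono::duration<double>(t3 - t2).count());
        return 0;
    }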


No, it's the arithmetic intensity of the operation in a roofline-model sense [1] that indicates whether or not the GPU is worth it, and whether you are in a memory-bandwidth-bound or compute-bound regime.

Asymptotic algorithmic complexity for a serial processor is meaningless here; it provides no indication of how a parallel machine (e.g., a PRAM or some other model, or an attempt to map the problem onto the GPU's unique SIMD model) will perform on the problem.

The arithmetic intensity per byte of memory loaded or stored is low for sort or join. You can exploit memory reuse once data is loaded into the register file for sorting (for a larger working set), but you can do that for sorting on SIMD CPUs in any case with swizzle instructions (e.g., bitonic sorting networks). The GPU is only worth it to exploit the higher memory bandwidth to global memory here, if you're comparing a single CPU to a single GPU.

[1] https://en.wikipedia.org/wiki/Roofline_model


> Asymptotic algorithmic complexity for a serial processor is meaningless here

It's not meaningless. It's a legitimate cap. Bitonic sort, for example, is O(n log^2 n) comparisons, which is more comparisons than the O(n log n) of a classical mergesort.
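
As a rough worked example of that gap, here's a back-of-the-envelope sketch using the standard comparator count for a bitonic network on n = 2^k elements (k(k+1)/2 stages of n/2 comparators each) against roughly n log2 n comparisons for mergesort:

    #include <cmath>
    #include <cstdio>

    int main() {
        for (int k = 20; k <= 30; k += 5) {                  // n = 2^20, 2^25, 2^30
            double n = std::pow(2.0, k);
            double merge   = n * k;                          // ~n log2 n comparisons
            double bitonic = (n / 2.0) * k * (k + 1) / 2.0;  // comparators in the network
            printf("n=2^%d  mergesort ~%.2e  bitonic ~%.2e  ratio %.1fx\n",
                   k, merge, bitonic, bitonic / merge);
        }
        return 0;
    }

The extra-comparison penalty only grows with log n, which is how the fully parallel network can still come out ahead in wall-clock time despite doing more total work.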

-----------

Let me use a more obvious example.

Binary searching a sorted array should almost NEVER be moved to the GPU. Why? Because it is an O(log n) operation, while the memory transfer is an O(n) operation. In effect: it is more work to transfer the array than to perform a binary search.

Only if you plan to search the array many, many, many times (such as in a relational join operator) will it make sense to transfer the data to the GPU.
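
To make that concrete, here's a CUDA sketch of the batched case (array sizes, data, and the missing error handling are placeholders of my own): the sorted array is copied to the device once, and then every thread answers one query against it, so the single O(n) transfer is amortized over m searches.

    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>
    #include <cuda_runtime.h>

    // Each thread lower-bounds one query against the same sorted array.
    __global__ void batch_search(const int* sorted, int n,
                                 const int* queries, int* results, int m) {
        int q = blockIdx.x * blockDim.x + threadIdx.x;
        if (q >= m) return;
        int key = queries[q];
        int lo = 0, hi = n;                      // classic lower_bound over [lo, hi)
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid] < key) lo = mid + 1; else hi = mid;
        }
        results[q] = lo;
    }

    int main() {
        const int n = 1 << 24, m = 1 << 22;      // one array, many queries
        std::vector<int> h_sorted(n), h_queries(m);
        std::mt19937 rng(1);
        for (auto& x : h_sorted)  x = (int)(rng() >> 1);
        for (auto& x : h_queries) x = (int)(rng() >> 1);
        std::sort(h_sorted.begin(), h_sorted.end());

        int *d_sorted, *d_queries, *d_results;
        cudaMalloc(&d_sorted,  n * sizeof(int));
        cudaMalloc(&d_queries, m * sizeof(int));
        cudaMalloc(&d_results, m * sizeof(int));
        // One O(n) transfer, amortized over m independent searches.
        cudaMemcpy(d_sorted,  h_sorted.data(),  n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_queries, h_queries.data(), m * sizeof(int), cudaMemcpyHostToDevice);

        int threads = 256, blocks = (m + threads - 1) / threads;
        batch_search<<<blocks, threads>>>(d_sorted, n, d_queries, d_results, m);
        cudaDeviceSynchronize();

        std::vector<int> h_results(m);
        cudaMemcpy(h_results.data(), d_results, m * sizeof(int), cudaMemcpyDeviceToHost);
        printf("query %d -> insertion index %d\n", h_queries[0], h_results[0]);
        cudaFree(d_sorted); cudaFree(d_queries); cudaFree(d_results);
        return 0;
    }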

-------

In effect, I'm using asymptotic complexity to demonstrate that almost all algorithms of complexity O(n) or faster are simply not worth it on a GPU. The data transfer is an O(n) step. Overall work complexity greater than O(n) seems to benefit GPUs in my experience.

> The GPU is only worth it to exploit the higher memory bandwidth to global memory here, if you're comparing a single CPU to a single GPU.

GPUs have both higher memory bandwidth and more numerous arithmetic units than a CPU.

The $400 Vega 64 has two stacks of HBM2, for roughly 500 GB/s to VRAM. It has 4096 shaders, which provide over 10 TFLOPS of compute.

An $800 Threadripper 2950X has 16 cores / 32 threads. It provides four DDR4 memory controllers for roughly 100 GB/s of bandwidth to RAM and about 0.5 TFLOPS of compute.

Arithmetic intensity favors GPUs. Memory bandwidth favors GPUs. The roofline model says GPUs are better on all counts. Aka: it's broken and wrong to use this model :-)
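
For what it's worth, here's a back-of-the-envelope roofline calculation plugging in the numbers quoted above; the 0.25 flops-per-byte example kernel (roughly a float sum) is my own assumption for illustration, not something from the thread:

    #include <algorithm>
    #include <cstdio>

    // Roofline: attainable GFLOP/s = min(peak compute, arithmetic intensity * bandwidth)
    double attainable(double peak_gflops, double bw_gbs, double flops_per_byte) {
        return std::min(peak_gflops, flops_per_byte * bw_gbs);
    }

    int main() {
        // Numbers from the comments above (approximate, FP32).
        const double vega_flops = 10000.0, vega_bw = 500.0;  // Vega 64: ~10 TFLOPS, ~500 GB/s
        const double tr_flops   = 500.0,   tr_bw   = 100.0;  // 2950X:  ~0.5 TFLOPS, ~100 GB/s

        printf("ridge point (flops/byte): GPU %.1f, CPU %.1f\n",
               vega_flops / vega_bw, tr_flops / tr_bw);

        // Example kernel: summing floats is ~1 flop per 4 bytes = 0.25 flops/byte
        // (my assumption). Both machines are memory-bound there, so the gap is
        // the bandwidth ratio (~5x), not the flops ratio (~20x).
        double ai = 0.25;
        printf("attainable at %.2f flops/byte: GPU %.0f GFLOP/s, CPU %.0f GFLOP/s\n",
               ai, attainable(vega_flops, vega_bw, ai), attainable(tr_flops, tr_bw, ai));
        return 0;
    }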

GPUs are bad at thread divergence. If there's an algorithm with high divergence (e.g., a minimax chess search), it won't be ported to a GPU easily. But if you have a regular problem (sorting, searching, matrix multiplication, table joins, etc.), those tend to work very well on GPUs.
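
A minimal sketch of what that divergence looks like in CUDA (toy kernels with placeholder data, just to show the shape): a data-dependent branch can split a 32-thread warp so both sides execute serially, while a branch that is uniform across the grid costs nothing extra.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Data-dependent branch: threads in the same 32-thread warp can disagree,
    // so the warp executes both sides one after the other (divergence).
    __global__ void divergent(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] > 0.0f) out[i] = in[i] * 2.0f;
        else              out[i] = -in[i];
    }

    // Uniform branch: the whole grid agrees, so there is no serialization.
    __global__ void uniform(const float* in, float* out, int n, bool positive_pass) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (positive_pass) out[i] = in[i] * 2.0f;
        else               out[i] = -in[i];
    }

    int main() {
        const int n = 1 << 20;
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float));   // placeholder data
        int threads = 256, blocks = (n + threads - 1) / threads;
        divergent<<<blocks, threads>>>(d_in, d_out, n);
        uniform<<<blocks, threads>>>(d_in, d_out, n, true);
        cudaDeviceSynchronize();
        printf("launched divergent and uniform kernels on %d elements\n", n);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }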

Many, many algorithms have not been ported to GPUs yet, however. That's the main downside of using them. But it seems like a great many algorithms can, in fact, be accelerated with GPUs. I just read a paper that accelerated linked-list traversals on GPUs, for example (!!!). It was a prefix sum over linked lists.
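
I don't know which technique that particular paper used, but one classic way to get a prefix/suffix sum over a linked list in parallel is Wyllie-style pointer jumping. A rough CUDA sketch (double-buffered so each round reads a consistent snapshot; the toy list and sizes are my own):

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // One pointer-jumping round: every node adds in its successor's partial sum
    // and then skips over it. After ceil(log2 n) rounds, rank[i] holds the sum
    // from node i to the end of the list (a suffix sum in list order).
    __global__ void jump(const int* next_in, const long long* rank_in,
                         int* next_out, long long* rank_out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int nx = next_in[i];
        if (nx >= 0) {
            rank_out[i] = rank_in[i] + rank_in[nx];
            next_out[i] = next_in[nx];
        } else {
            rank_out[i] = rank_in[i];
            next_out[i] = -1;
        }
    }

    int main() {
        const int n = 1 << 20;
        // Toy list: node i points to i+1, each node carries the value 1,
        // so the answer for node i should be n - i.
        std::vector<int> h_next(n);
        std::vector<long long> h_rank(n, 1);
        for (int i = 0; i < n; ++i) h_next[i] = (i + 1 < n) ? i + 1 : -1;

        int *d_next[2]; long long *d_rank[2];
        for (int b = 0; b < 2; ++b) {
            cudaMalloc(&d_next[b], n * sizeof(int));
            cudaMalloc(&d_rank[b], n * sizeof(long long));
        }
        cudaMemcpy(d_next[0], h_next.data(), n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_rank[0], h_rank.data(), n * sizeof(long long), cudaMemcpyHostToDevice);

        int threads = 256, blocks = (n + threads - 1) / threads;
        int cur = 0;
        for (long long span = 1; span < n; span *= 2) {      // ceil(log2 n) rounds
            jump<<<blocks, threads>>>(d_next[cur], d_rank[cur],
                                      d_next[1 - cur], d_rank[1 - cur], n);
            cur = 1 - cur;
        }
        cudaDeviceSynchronize();

        long long head;
        cudaMemcpy(&head, d_rank[cur], sizeof(long long), cudaMemcpyDeviceToHost);
        printf("suffix sum at head = %lld (expected %d)\n", head, n);
        for (int b = 0; b < 2; ++b) { cudaFree(d_next[b]); cudaFree(d_rank[b]); }
        return 0;
    }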


You can't use 10 TFLOPS of compute on a GPU if you can't even feed it data quickly enough. The state of the art for throughput is Nvidia's NVLink, and you're capped at a theoretical max of 160 GB/sec. Given how trivial most analytics workloads are (computing ratios, reductions like sums, means, and variances, etc.), there's simply no way you're going to effectively max out the compute available on a GPU.

Searching sorted arrays is actually very common in these workloads. Why? Analytics workloads typically operate on timestamped data stored in a sorted fashion where you have perfect or near perfect temporal and spatial locality. Thus even joins tend to be cheap.

With Skylake and AMD Epyc closing in on 300 GB/sec, and with much better cost efficiency per GB of memory vs. GPU memory, the case for GPUs in this application seems dubious.

I will grant you that GPUs have a place in more complex operations like sorts and joins with table scans. They also blow past CPUs when it comes to expensive computations on a dataset (where prefetching can mask latencies nicely).


Yeah, it will definitely depend on the workload.

A good example of a dense sort + join GPU workload would be looking for "Cliques" of Twitter or Facebook users. A Clique of 3 would be three users, A, B, and C, where A follows B, B follows C, and C follows A.

You'd perform this analysis with two joins: the follower-followee table against itself, and then that result against the table again (three instances of the table in total).
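
A toy host-side sketch of those two self-joins (hash-join flavored, with made-up edge data; a real engine would batch and parallelize this, on GPU or otherwise):

    #include <cstdio>
    #include <unordered_map>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    using Edge = std::pair<int, int>;   // (follower, followee)

    int main() {
        // Toy follower-followee table; users 0, 1, 2 form the directed triangle.
        std::vector<Edge> edges = {{0, 1}, {1, 2}, {2, 0}, {1, 0}, {3, 1}};

        // Index edges by follower, and keep a set for the final membership test.
        std::unordered_map<int, std::vector<int>> out_edges;
        std::unordered_set<long long> edge_set;
        for (const Edge& e : edges) {
            out_edges[e.first].push_back(e.second);
            edge_set.insert(((long long)e.first << 32) | (unsigned)e.second);
        }

        // Join 1: (a,b) with (b,c). Join 2: check that the closing edge (c,a) exists.
        long long triangles = 0;
        for (const Edge& e : edges) {
            int a = e.first, b = e.second;
            for (int c : out_edges[b])
                if (edge_set.count(((long long)c << 32) | (unsigned)a))
                    ++triangles;
        }

        // Each directed 3-cycle is found once per starting edge, i.e. 3 times.
        printf("directed 3-cycles: %lld\n", triangles / 3);
        return 0;
    }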

----------

So it really depends on your workload. But I can imagine that someone who is analyzing these kinds of tougher join operations would enjoy GPUs to accelerate the task.


PCIe 3.0 x16 can push almost 16 GB/s in each direction. Add 4-6 NVMe drives to deliver that much, for a total of 32-40 PCIe lanes. That's not really an option on an Intel platform, which tops out at 48 lanes per CPU; it makes a bit more sense with the 128 lanes on AMD Epyc.

Alternatively, you can saturate the PCIe lanes with GPUs and load the data from RAM.


Linux can do up to 16 GPUs.



