Any good resources on where GPU DBs offer significant wins? Especially, but not ...

dragontamer · on Jan 29, 2019

I don't know about practical benchmarks yet, but GPUs have superior parallelism and superior memory bandwidth compared to CPUs.

Classical "join" patterns, like merge-join, hash-join, and other such algorithms, have been run on GPUs with outstanding efficiency. Taking advantage of the parallelism.

A GPU merge-join is a strange beast though. See here for details: https://moderngpu.github.io/join.html . So its a weird algorithm, but it is clearly far more efficient than anything a traditional CPU could ever hope to accomplish.

In any case, it is clear that the high-memory bandwidth of GPUs, coupled with its parallel processors, makes the GPU a superior choice over CPUs for the relational merge-join algorithm.

GPUs probably can't be used effectively in all database operations. But they can at LEAST be used efficiently in "equijoin table", one of the most fundamental and common operations of a modern database. (How many times have you seen "JOIN Table1.id == Table2.id"??) As such, I expect GPUs to eventually be a common accelerator for SQL Databases. The only problem now is building the software to make this possible.

arnon · on Jan 29, 2019

At SQream (a GPU-accelerated data warehouse) we use the GPU for many operations, including sorting, aggregating, joining, transformations, projections, etc.

We augment the GPU algorithms with external CPU algorithms when the GPU implementations aren't ideal or can't run due to memory constraints.

We recently talked about this at a session at CMU's Database Group, if you're interested - https://www.youtube.com/watch?v=bSYFwd50EcU

jandrewrogers · on Jan 29, 2019

GPU databases are brilliant for cases where the working set can live entirely within the GPU's memory. For most applications with much larger (or more dynamic) working sets, the PCIe bus becomes a significant performance bottleneck. This is their traditional niche.

That said, I've heard anecdotes from people I trust that heavily optimized use of CPU vector instructions is competitive with GPUs for database use cases.

dragontamer · on Jan 30, 2019

> GPU databases are brilliant for cases where the working set can live entirely within the GPU's memory.

Probably true for current computer software. But there are numerous algorithms that allow groups of nodes to work together on database-joins, even if they don't fit in one node.

Consider Table A (1000 rows), and Table B (1,000,000 rows). Lets say you want to compute A Join B, but B doesn't fit in your memory (lets say you only have room for 5000 rows). Well, you can split Table B into 250 pieces, each with 4000-rows.

TableA (1000 rows) + TableB (4000 rows) is 5000 rows, which fits in memory. :-)

You then compute A join B[0:4000], then A join B[4000:8000], etc. etc. In fact, all 250 of these joins can be done in parallel.

----------

As such, its theoretically possible to perform database joins on parallel systems, even if they don't fit into any particular node's RAM.

shaklee3 · on Jan 30, 2019

If you can afford it, much of the penalty from the pcie bus goes away if you have a system with nvlink. You still need to transfer the data back to the CPU for the final results, but most of the filtering and reduction operations across GPU memory can be done on nvlink only.

fulafel · on Jan 30, 2019

There's also the iGPU case where the GPU is on the system DRAM bus.

elanora96 · on Jan 30, 2019

I'm clueless on the amount of bandwidth needed for larger applications, will the eventual release of PCIe 4 & 5 have a big impact on this? Or will it still be too slow?

vkaku · on Jan 30, 2019

PCIe3 x16, current generation has a bandwidth of 15.75 GB/s (×16)

Assuming you have 16 GB or RAM on the GPU, theoretically ~1 second is all it would take to load the GPU with that amount of data. Unfortunately, when you take into consideration huge data sets, you'll be able to saturate that with 5 M2 SSDs each running at 3200 MB/s, assuming Disk<-RAM DMA->GPU. Those would also require at least 5 PCIe 2x8 ports on a pretty performant setup. RAM bandwidth is assumed to be around ~40-60GB/s, so hopefully no bottlenecks there.

This is assuming your GPU could swizzle through 16GB of data in a second. They have a theoretical memory bandwidth of between 450-970 GB/s.

Now realistically, per vendor marketing manuals, the fastest DB I've seen allows one to ingest at 3TB / hour ~~ 1GB / second.

So there must be more to it than the theoretically 16GB / second business. PCIe4 x16 doubles the speed to 32GB/s but at this time it looks pointless to me.

arnon · on Jan 30, 2019

NVLink, which works between IBM Power CPUs and NVIDIA GPUs is already 9.5x faster than PCIe 3.0...

Not all queries or database operations are bottlenecked by PCIe. Some are compute bound, others are network bound, etc.

eptcyka · on Jan 30, 2019

I can't imagine a modern system where the link between system memory and the CPU would be slower than the PCI link between memory and a peripheral.

scribu · on Jan 29, 2019

This page [1] lists query times on a 1.1 billion row dataset, performed using a variety of systems, from GPU DBs all the way to SQLite.

[1] https://tech.marksblogg.com/benchmarks.html

Edit: Unfortunately, each run is on different hardware, but it at least gives you an idea of what's possible.

walterbell · on Jan 29, 2019

Anyone used this GPU database that ranks at the top of that list, above the open-source MapD?

https://www.brytlyt.com

datumdadum · on Feb 1, 2019

Haven't, but it's worth noting that hardware is probably attributable to them edging out Mapd since they're on a 5-node minsky cluster featuring nvlink, hence as arnon said, are benefiting from 9.5x faster transfer from disk than than PCIe 3.0. That blog has not yet tested Mapd's IBM Power version-- would be interesting to see how it compared on that cluster.

CMU has an interesting lecture if you want to learn more about Brytlyt https://www.youtube.com/watch?v=oL0IIMQjFrs

Couple interesting things to note:

- They attain some of their speed by requiring that data be pre-sorted https://youtu.be/oL0IIMQjFrs?t=1092 https://youtu.be/oL0IIMQjFrs?t=1127

- They've built their database on Postgres for query planning, but for any query which does not match what they've accelerated on GPU, they do not have the ability to failover to utilizing postgres on the CPU. https://youtu.be/oL0IIMQjFrs?t=3260

- Data is brought into GPU memory at table CREATE time, so the cost of transferring data from disk->host RAM->GPU RAM is not reflected. Probably wouldn't work if you want to shuffle data in/out of GPU RAM across changing query workloads. https://youtu.be/oL0IIMQjFrs?t=1310

- The blog had to use data type DATE instead of DATETIME for Brytlyt since it doesn't support the latter. DATETIME was used for the other DBs, which is a heavier computation. https://tech.marksblogg.com/billion-nyc-taxi-rides-p2-16xlar...

So all-in-all it seems like a more carefully constructed, hardware balanced comparison would be needed to see which the quickest would be.

bufferoverflow · on Jan 30, 2019

Note that it's at the top of the list probably because it's running on a cluster. It would be awesome to see such a comparison on some standard hardware, like a large AWS GPU instance (eg1.2xlarge).

Also note that the dataset is 600GB, so it won't fit a sinlge GPU, not even close.

arnon · on Jan 29, 2019

Also worth noting this is a dataset that fits comfortably in-memory

Rapzid · on Jan 30, 2019

And the Postgres run was on 16GB of RAM and a rather slow SSD in a single drive configuration. Would have been interesting to see the results of either in memory or on a faster storage system.

sacheendra · on Jan 30, 2019

The cost of GPUs doesn't make sense for the compute they offer.

According the benchmark, the fastest 8 GPU node takes about 0.5 seconds. The cost of that node on AWS is about 24$/hour. The 21 node spark cluster takes 6 seconds. But, it only costs 4$/hour.

An additional benefit with Spark is that it can be used for a lot more variety of operations than a GPU.

This cost disadvantage restricts GPU processing to niche use cases.

TomVDB · on Jan 30, 2019

> According the benchmark, the fastest 8 GPU node takes about 0.5 seconds. The cost of that node on AWS is about 24$/hour. The 21 node spark cluster takes 6 seconds. But, it only costs 4$/hour.

Using your numbers, the GPU solution has half the cost for similar performance? How does that not make sense?

> This cost disadvantage restricts GPU processing to niche use cases.

All GPU compute applications are niche use cases.

johnvanommen · on Jan 30, 2019

Look man, you don't get it. The GPU case is half the cost, but it's also twelve times faster.

Oh, wait...

latchkey · on Jan 30, 2019

> The cost of GPUs doesn't make sense for the compute they offer.

This assumes AWS pricing. You build a farm of GPUs and buy in bulk, you get much better cost basis. GPU farms are becoming more and more of a thing now and definitely less 'niche'.

dominotw · on Jan 29, 2019

CMU has a lecture series on youtube on GPU databases.

https://db.cs.cmu.edu/seminar2018/

remus · on Jan 29, 2019

Im a bit out of date, but lots of databases are moving towards an in-memory model (or in-memory feature sets) which means that hard drive access times aren't a bottleneck. The AWS EC2 instances you see with 1TB+ of RAM are generally aimed at this sort of thing.

Presumably once you have all your data in memory then the CPU becomes a bottleneck again, and if you can ship out the number crunching to GPUs in an efficient manner (i.e. you don't waste loads of time shuffling data between RAM and GPU) then you'll see performance gains.

alfalfasprout · on Jan 29, 2019

Yeah, but the very shipping of data back and forth to the GPU is usually a bottleneck no matter how clever you get. Moreover, you're limited to say 8 GPUs per box for a total of 100GB-ish in memory. You can operate on nearly 10TB on the largest AWS instances using CPUs. With AVX512 intrinsics, this translates into some serious potential on large in-memory datasets that renders GPUs less appealing.

dragontamer · on Jan 30, 2019

As long as you are doing something more complicataed than O(n), the shipping of data is going to be negligible.

In fact, that's why sorting on a GPU is going to almost always be worthwhile. Sorting is a O(n*log(n)) operation, but the transfer is O(n).

The Table-Join is O(n^2) if I remember correctly per join. (If you have 5 tables to join, its a ^2 factor for each one). That means shipping the data (a O(n) operation) is almost always going to be negligible.

So I'd expect both join and sort to be pushed towards GPUs, especially because both joins and sorts have well known parallel algorithms.

jhj · on Jan 30, 2019

No, it's the arithmetic intensity of the operation in a roofline model sense [1] that indicates whether or not the GPU is worth it, and whether you are in a memory bandwidth or compute bound regime.

Asymptotic algorithmic complexity for a serial processor is meaningless here, it provides no indication on how a parallel machine (e.g., a PRAM or other, or perhaps trying to map it to the GPU's unique SIMD model) will perform on the problem.

The arithmetic intensity per byte of memory loaded or stored for sort or join is low. You can exploit memory reuse when data is loaded in the register file for sorting (for a larger working set), but you can do that for sorting on SIMD CPUs in any case with swizzle instructions (e.g., bitonic sorting networks). The GPU is only worth it to exploit the higher memory bandwidth to global memory here, if you're comparing a single CPU to a single GPU.

[1] https://en.wikipedia.org/wiki/Roofline_model

dragontamer · on Jan 30, 2019

> Asymptotic algorithmic complexity for a serial processor is meaningless here

Its not meaningless. Its a legitimate cap. Bitonic sort for example is O(nlog^2(n)) comparisons, which is more comparisons than O(nlog(n)) for a classical mergesort.

-----------

Let me use a more obvious example.

Binary Searching a sorted array should almost NEVER be moved to the GPU. Why? Because it is a O(log(n)) operation, while Memory-transfers is a O(n) operation. In effect: it is more complex to transfer the array than to perform a binary search.

Only if you plan to search the array many, many, many times (such as in a Relational Join operator), will it make sense to transfer the data to the GPU.

-------

In effect, I'm using asympotic complexity to demonstrate that almost all algorithms of complexity O(n) or faster are simply not worth it on a GPU. The data-transfer is a O(n) step. Overall work complexity greater than O(n) seems to benefit GPUs in my experience.

> The GPU is only worth it to exploit the higher memory bandwidth to global memory here, if you're comparing a single CPU to a single GPU.

GPUs have both higher-ram and more numerous arithmetic structures than a CPU.

The $400 Vega64 has HBM2 x2 stacks, for 500GBps to VRAM. It has 4096 shaders which provide over 10TFlops of compute girth.

A $800 Threadripper 2950x has 16core / 32-threads. It provides 4x DDR4 memory controllers for 100GBps bandwidth to VRAM and 0.5 TFlops of compute.

Arithmetic intensity favors GPUs. Memory-intensity favors GPUs. The roofline model says GPUs are better on all aspects. Aka: its broken and wrong to use this model :-)

GPUs are bad at thread divergence. If there's an algorithm with high divergence (ie: Minimax Chess algorithm), it won't be ported to a GPU easily. But if you have a regular problem (sorting, searching, matrix multiplication, table joins, etc. etc.), they tend to work very well on GPUs.

Many, many algorithms have not been ported to GPUs yet however. That's the main downside of using them. But it seems like a great many number of algorithms can in fact, be accelerated with GPUs. I just read a paper that accelerated linked-list traversals on GPUs for example (!!!). It was a prefix-sum over linked lists.

alfalfasprout · on Jan 30, 2019

You can't use 10TFlops of compute on a GPU if you can't even feed it data quickly enough. The state of the art for throughput is Nvidia's NVLink and you're capped to a theoretical max of 160GB/sec. Given how trivial most analytics workloads are (computing ratios, reductions like sums, means, and variances, etc.) there's simply no way you're going to effectively max out the compute available on a GPU.

Searching sorted arrays is actually very common in these workloads. Why? Analytics workloads typically operate on timestamped data stored in a sorted fashion where you have perfect or near perfect temporal and spatial locality. Thus even joins tend to be cheap.

With Skylake and AMD Epyc nearing in on 300GB/sec and much better cost efficiency per GB of memory vs. GPU memory the case for GPUs in this application seems dubious.

I will grant you that GPUs have a place in more complex operations like sorts and joins with table scans. They also blow past CPUs when it comes to expensive computations on a dataset (where prefetching can mask latencies nicely).

dragontamer · on Jan 30, 2019

Yeah, it will definitely depend on the workload.

A good example of a dense sort + join GPU workload would be looking for "Cliques" of Twitter or Facebook users. A Clique of 3 would be three users, A, B, and C, where A follows B, B follows C, and C follows A.

You'd perform this analysis by performing two joins: the follower-followee table on itself three times.

----------

So it really depends on your workload. But I can imagine that someone who is analyzing these kinds of tougher join operations would enjoy GPUs to accelerate the task.

c2h5oh · on Jan 30, 2019

Pcie 3.0 x16 can push almost 16GB/s in each direction. Add 4-6 nvme drives to deliver that much for a total of 32 - 40 pcie lanes. Not really an option on Intel platform that tops at 48 lanes per cpu.. it makes a bit more sense with 128 lanes on AMD epyc.

Alternatively you can saturate pcie lanes with gpus and load data from ram

PeterCorless · on Jan 30, 2019

Linux can do up to 16 GPUs.

arnon · on Jan 29, 2019

GPUs have a very high memory bandwidth, and can be used to perform memory-intensive operations (think decompression, for example).

You can load compressed data up to the GPU, decompress it there, and run very complicated mathematical functions. This can be very beneficial when you run a JOIN operation.

See a comparison here, with SQream DB (a GPU accelerated data warehouse) vs. some unnamed data warehousing solutions - https://sqream.com/why-im-not-afraid-of-sql-joins/