Data formats like Parquet/Arrow, and query engines like DataFusion, are optimised for high-speed reads/writes (and processing) of large amounts of data, which is generally what you're going to be using them for.
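To make that concrete, here's a rough sketch of scanning a partitioned Parquet dataset with PyArrow; the path, partitioning scheme and column names are made up, and DataFusion gives you a similar SQL interface over the same files:

```python
import pyarrow.dataset as ds

# Scan a hive-partitioned Parquet dataset lazily; only the referenced
# columns and the row groups matching the filter actually get read.
dataset = ds.dataset("/data/events/", format="parquet", partitioning="hive")

table = dataset.to_table(
    columns=["user_id", "amount"],
    filter=ds.field("event_date") == "2024-01-01",
)
print(table.num_rows)
```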
Additionally, as others have mentioned, clustered processing of larger-than-RAM (or larger-than-machine) datasets is a bit easier to manage than setting up a database cluster.
Another benefit is ephemeral compute: we have a Kubernetes cluster, and a particular message on a Kafka topic can kick off a Spark job across several machines in the cluster (possibly triggering autoscaling), which processes the X TB of data it needs to, writes the results out and then finishes. Faster, cheaper and better suited to the workload than keeping a multi-node DB cluster running.
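For a rough idea of what one of those jobs looks like (paths, columns and the aggregation are illustrative; the actual spark-submit against the Kubernetes API is omitted):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical batch job: launched per Kafka trigger, runs to completion, exits.
spark = SparkSession.builder.appName("ephemeral-enrichment").getOrCreate()

# Read one day's raw events from the lake (path is made up).
events = spark.read.parquet("s3a://lake/raw/events/date=2024-01-01/")

daily = (
    events.groupBy("user_id")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count(F.lit(1)).alias("n_events"),
    )
)

# Write the results back out as Parquet, then tear everything down.
daily.write.mode("overwrite").parquet("s3a://lake/curated/daily_totals/date=2024-01-01/")
spark.stop()
```

Once spark.stop() returns, the driver and executor pods go away, so you only pay for the minutes the job actually ran.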
Also lets us run non-SQL stages with fewer bottlenecks: bulk ML scoring, bulk data enrichment, etc.
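The bulk-scoring stage is typically something like mapInPandas: load the model once per partition iterator and stream batches through it. A rough sketch, assuming a scikit-learn-style model shipped in the image (model path, feature columns and schema are made up):

```python
import joblib
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("bulk-scoring").getOrCreate()

# Illustrative feature table; in practice a large Parquet dataset.
features = spark.read.parquet("s3a://lake/curated/features/")

FEATURE_COLS = ["f1", "f2", "f3"]  # placeholder feature columns
schema = StructType([
    StructField("user_id", StringType()),
    StructField("score", DoubleType()),
])

def score_partition(batches):
    # Model file baked into the container image (hypothetical path).
    model = joblib.load("/opt/model/model.joblib")
    for pdf in batches:
        yield pd.DataFrame({
            "user_id": pdf["user_id"],
            "score": model.predict_proba(pdf[FEATURE_COLS])[:, 1],
        })

scores = features.mapInPandas(score_partition, schema=schema)
scores.write.mode("overwrite").parquet("s3a://lake/curated/scores/")
```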