I don't buy this "object storage + Iceberg is all you need for OLAP" hype. If the application has sustained query load, it makes sense to provision servers rather than go pure serverless. If there are already provisioned servers, it makes sense to cache data on them (either in their memory or SSDs) to avoid round-trips to object storage. This is the architecture of the latest breed of OLAP databases such as Databend and Firebolt, as well as the latest iterations of Redshift's and Snowflake's architecture. Also, this is the approach of the newest breed of vector databases, such as Turbopuffer.
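The caching idea above can be sketched in a few lines. This is a minimal, hypothetical read-through LRU cache in front of object storage (my own illustrative names; real systems like the ones mentioned use much more sophisticated tiering across memory and SSD), showing why only cache misses pay the object-storage round-trip:

```python
# Minimal read-through cache sketch. `fetch_from_object_store` stands in
# for the slow S3/GCS round-trip; the OrderedDict is the local tier.
from collections import OrderedDict

class ReadThroughCache:
    def __init__(self, fetch_from_object_store, capacity=128):
        self.fetch = fetch_from_object_store
        self.capacity = capacity
        self.cache = OrderedDict()  # insertion order doubles as LRU order
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as most recently used
            return self.cache[key]
        self.misses += 1                  # only misses touch object storage
        value = self.fetch(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value
```

Under a sustained query load, repeated reads of hot data hit the local tier, so the marginal object-storage cost drops toward zero, which is the point of provisioning servers at all.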
For OLAP use cases with real-time data ingestion requirements, an object-storage-only approach also leads to write amplification. Therefore, I don't think that architectures like Apache Pinot, Apache Paimon, and Apache Druid are going anywhere.
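The write-amplification claim is easy to make concrete with back-of-the-envelope arithmetic. Below is my own illustrative model (not taken from any specific system): streaming ingestion lands every micro-batch as its own small file plus commit metadata, and compaction later rewrites the data into large files, so the same bytes are physically written more than once:

```python
# Rough model: bytes physically written / logical bytes ingested, when
# small micro-batches are committed to an object-store table format and
# then compacted once. Metadata overhead per commit is an assumed constant.

def write_amplification(logical_bytes, batch_bytes,
                        metadata_overhead_bytes=10_000):
    n_batches = logical_bytes // batch_bytes
    # pass 1: each micro-batch is written as a small file + commit metadata
    initial = n_batches * (batch_bytes + metadata_overhead_bytes)
    # pass 2: compaction rewrites all the data again into large files
    compacted = logical_bytes
    return (initial + compacted) / logical_bytes

# Example: 1 GB/day ingested in 1-minute micro-batches, compacted once.
wa = write_amplification(1_000_000_000, 1_000_000_000 // 1440)
print(f"write amplification: {wa:.2f}x")  # just over 2x
```

A single compaction pass already roughly doubles the bytes written; multi-level compaction, frequent commits, and snapshot metadata push the factor higher, which is the overhead that ingestion-oriented engines like Pinot and Druid are designed to avoid.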
Another problem with "open table formats" like Iceberg, Hudi, and Delta Lake is their slow innovation speed.
I buy into some of the hype, but only for majority use cases, not for edge-case specialisation.
As ever, at FANGMA-whatever scale/use cases, yeah I’d agree with you. But the majority of cases are not FANGMA-whatever scale/use cases.
Basically, it’s good enough for most people. Plus it takes away a bunch of complexity for them.
> If the application has sustained query load
Analytical queries in majority cases are not causing sustained load.
It’s a few dashboards a handful of managers/teams check a couple of times through the day.
Or a few people crunching some ad hoc queries (and hopefully writing the intermediate results somewhere so they don't have to keep making the same query — i.e., no sustained-load problem).
> real-time data ingestion requirements
Most of the time a nightly batch job is good enough. Most businesses still work day by day or week by week, and that's at the high-frequency end of things.
> slow innovation speed
Most people don’t want bleeding edge innovative change. They want stability.
Data engineers have enough problems with teams changing source database fields without telling us. We don’t need the tool we’re storing the data with to constantly break too.
I've recently argued about this at greater length here: https://engineeringideas.substack.com/p/the-future-of-olap-t...