
They're different access patterns, though. Are there no concerns about performance and potentially blocking behavior? Decoupling OLTP and analytics is frequently done with good reason: (1) to allow the systems to scale independently, and (2) to help prevent issues with one component from impacting the other (i.e., contain the blast radius). I wouldn't want a failure of my search engine to also take down my transaction system.


You don't need to. Customers usually deploy us on one or more standalone replicas in their Postgres cluster. If a query were to take it down, it would only take down the replicas dedicated to ParadeDB, leaving the primary and all other read replicas dedicated to OLTP safe.


Are you saying that the cluster isn't homogeneous? It sounds like you're describing an architecture that involves a cluster that has two entirely different pieces of software on it, and whose roles aren't interchangeable.


Bear with me, this will be a bit of a longer answer. Today, there are two topologies under which people deploy ParadeDB.

- <some managed Postgres service> + ParadeDB. Frequently, customers already use a managed Postgres (e.g. AWS RDS) and want ParadeDB. In that world, they maintain their managed Postgres service and deploy a Kubernetes cluster running ParadeDB on the side, with one primary instance and some number of replicas. The AWS RDS primary sends data to the ParadeDB primary via logical replication. You can see a diagram here: https://docs.paradedb.com/deploy/byoc

In this topology, the OLTP and search/OLAP workloads are fully isolated from each other. You have two clusters, but you don't need a third-party ETL service since they're both "just Postgres".
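As a rough sketch of the logical replication link in this topology, the snippet below builds the two standard Postgres statements involved: CREATE PUBLICATION on the source and CREATE SUBSCRIPTION on the destination. The publication, subscription, table, and host names are hypothetical, and this is not taken from ParadeDB's docs; the actual setup steps are in the link above.

```python
# Minimal sketch: generate the SQL for a Postgres logical replication link.
# All object names and the connection string are hypothetical placeholders.
# The publisher must run with wal_level = logical for this to work.


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def publication_sql(name: str, tables: list[str]) -> str:
    """Statement to run on the source, e.g. the managed Postgres primary.

    Identifiers are interpolated as-is, so they are assumed to be trusted,
    simple names.
    """
    return f"CREATE PUBLICATION {name} FOR TABLE {', '.join(tables)};"


def subscription_sql(name: str, conninfo: str, publication: str) -> str:
    """Statement to run on the destination, e.g. the ParadeDB primary."""
    return (
        f"CREATE SUBSCRIPTION {name} "
        f"CONNECTION {quote_literal(conninfo)} "
        f"PUBLICATION {publication};"
    )


if __name__ == "__main__":
    print(publication_sql("search_pub", ["products", "orders"]))
    print(subscription_sql("search_sub", "host=rds.example dbname=app", "search_pub"))
```

The subscriber needs the published tables to exist with a matching schema before the subscription is created, since logical replication does not copy DDL.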

- <self-hosted Postgres> + ParadeDB. Some customers, typically larger ones, prefer to self-host Postgres and want to install our Postgres extension directly. The extension is installed in their primary Postgres, and the CREATE INDEX commands must be issued on the primary; however, they may route reads only to a subset of the read replicas in their cluster.

In this topology, all writes could be directed to the primary, all OLTP read queries could be routed to a pool of read replicas, and all search/OLAP queries could be directed to another subset of replicas.
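The routing described here could be sketched at the application layer roughly as follows. This is only an illustration under assumptions: the host names and the three query kinds are made up, and in practice the split is often done by a connection pooler or load balancer rather than in application code.

```python
import itertools


class QueryRouter:
    """Route queries by kind: writes to the primary, OLTP reads and
    search/OLAP reads to two separate replica pools (round-robin)."""

    def __init__(self, primary, oltp_replicas, search_replicas):
        if not oltp_replicas or not search_replicas:
            raise ValueError("each replica pool needs at least one host")
        self.primary = primary
        # itertools.cycle repeats the pool forever, giving round-robin order.
        self._oltp = itertools.cycle(oltp_replicas)
        self._search = itertools.cycle(search_replicas)

    def route(self, kind):
        """Return the host that should serve a query of the given kind."""
        if kind == "write":
            return self.primary
        if kind == "oltp_read":
            return next(self._oltp)
        if kind == "search":
            return next(self._search)
        raise ValueError(f"unknown query kind: {kind!r}")


if __name__ == "__main__":
    # Hypothetical host names.
    router = QueryRouter("pg-primary", ["pg-ro-1", "pg-ro-2"], ["pg-search-1"])
    for kind in ("write", "oltp_read", "oltp_read", "search"):
        print(kind, "->", router.route(kind))
```

Because the search pool is disjoint from the OLTP pool, a runaway search query can only affect hosts in the search pool.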

Both are completely reasonable approaches and depend on the workload. Hope this helps :)


Which of these two is the higher order bit?

* ParadeDB speaks postgres protocol

* These setups don't have a complex ETL pipeline

If you have an ETL pipeline specialized for PG logical replication (as opposed to generic JVM-based Debezium/Kafka setups), you get some fraction of the same benefits. I'm curious about Conduit and its postgres plugin.

That leaves: ParadeDB uses vanilla postgres + rust extension. This is a technology detail. I was looking for an articulation of the customer benefit because of this technologically appealing architecture.


The value props for customers vs. Elasticsearch are:

- ACID w/ JOINs

- Real-time indexing under UPDATE-heavy workloads. Instacart wrote about this: they had to move away from Elasticsearch during COVID because of this problem. https://tech.instacart.com/how-instacart-built-a-modern-sear...

Beyond these two, the added benefits are:

- Infrastructure simplification (no need for ETL)

- Lower costs

Speaking the wire protocol is nice, but it's not worth much.


they both sound like postgres to me, just with different extensions


Since we both worked there: I can think of a few places at Segment where we'd have added more reporting/analytics/search if it weren't such a pain to set up an OLAP copy of our control plane databases. Remember how much engineering effort we spent on teams that did nothing but control plane database stuff?

Data plane is a different story, but not everything is 1m+ RPS.



