
What is the underlying storage structure that scales to TB+?


Just regular B-trees for row and index storage. There's a reason why it's such a popular data structure!

https://www.sqlite.org/arch.html
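For intuition, a B-tree lookup is just a binary search over the sorted keys in each node, plus one descent per level. A minimal sketch (hypothetical Python; SQLite's actual implementation is C, in btree.c):

    # Minimal B-tree lookup sketch. Each node holds sorted keys;
    # interior nodes also hold child pointers. Hypothetical Python,
    # not SQLite's real (C) implementation.
    import bisect

    class Node:
        def __init__(self, keys, values, children=None):
            self.keys = keys          # sorted keys within this node
            self.values = values      # values[i] pairs with keys[i]
            self.children = children  # None for leaf nodes

    def search(node, key):
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]     # exact hit in this node
        if node.children is None:
            return None               # reached a leaf; key absent
        return search(node.children[i], key)  # descend one level

Because each node (a database page) holds many keys, the tree stays shallow: a handful of levels covers billions of rows, so a lookup in a multi-TB file touches only a few pages.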


Not to mention that many filesystems (e.g. Btrfs, NTFS, HFS+) use B-trees internally.


If you want TB+, you are talking about multiple computers (in the context of indefinite scaling).

For most who browse HN that immediately means you reach for Postgres (and you would be wise to do so), but I would argue you can continue to use SQLite indefinitely (one instance per node) if you manage replication and transactions between nodes at your application layer.

I realize this is probably beyond what most developers want to screw with, but I have seen some extremely promising results in experimental implementations of this scheme. Being able to replicate items conditionally, based directly on business needs, is a very powerful capability on paper: e.g., user session data must be synchronously replicated to all nodes, while other entities can be asynchronously replicated based on their use cases. You could handle replication and transactions based upon any business factor or state if you go down this path.
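As a rough sketch of the dispatch involved (hypothetical Python; the policy table and the peer.apply/peer.enqueue helpers are illustrative, not from any real system):

    # Hypothetical per-entity replication policy over one local
    # SQLite instance per node. All names here are illustrative.
    import sqlite3

    POLICY = {
        "sessions": "sync",   # must reach every node before the write returns
        "orders":   "async",  # eventual replication is acceptable
    }

    def write(db: sqlite3.Connection, table: str, sql: str, params, peers):
        db.execute(sql, params)
        db.commit()                        # local durability first
        if POLICY.get(table) == "sync":
            for peer in peers:
                peer.apply(sql, params)    # block until each peer confirms
        else:
            for peer in peers:
                peer.enqueue(sql, params)  # ship in the background

The interesting part is that POLICY can key off anything the application knows about: entity type, tenant, even current load.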


There is a derivative called rqlite which uses the Raft consensus protocol to give you a highly fault-tolerant distributed database.

https://github.com/rqlite/rqlite
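For a sense of what using it looks like: rqlite exposes an HTTP API (endpoints as documented in the project's repo; double-check the current docs, this is from memory), so a minimal Python client might be:

    # Minimal rqlite client sketch over its documented HTTP API.
    # Assumes a node on the default port 4001; verify against the
    # current docs at the repo above.
    import json, urllib.request

    def execute(stmts):
        req = urllib.request.Request(
            "http://localhost:4001/db/execute",
            data=json.dumps(stmts).encode(),
            headers={"Content-Type": "application/json"},
        )
        return json.load(urllib.request.urlopen(req))

    execute(["CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT)"])
    execute([["INSERT INTO foo(name) VALUES(?)", "fiona"]])

Under the hood each statement goes through the Raft log, so a write isn't acknowledged until a quorum of nodes has committed it.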


It’s a great idea and I’m working on something similar, but Raft or any other quorum-based consensus protocol is a poor fit for the data plane.


Can you expand on why you think this is a poor fit? Is this because of added latency due to the multiple round trips involved?


It’s not latency, it’s throughput. Broadcast and incast network communication patterns in the LAN kill throughput, and are present in any quorum protocol. Atomic broadcast protocols like chain replication or LCR, which use strictly sequential communication patterns (via chain or ring topologies), have much better throughput in LAN environments than Paxos/Raft (except for Ring Paxos, which requires IP multicast). In addition, because they outsource group membership and failure detection to a separate (likely Paxos-based) service, they require only f+1 members to tolerate f faults, rather than 2f+1 (see Lamport et al.'s Vertical Paxos work).

But they have two main downsides relative to Paxos/Raft-like protocols: latency grows linearly with cluster size and is bounded below by the latency of the slowest node (so they aren’t a good fit for high-latency WANs), and, as noted above, they can’t “self-host” like Paxos/Raft.
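To make the communication pattern concrete, here is the chain replication data path in schematic form (hypothetical Python; real systems also handle failure and reconfiguration via the external membership service mentioned above):

    # Chain replication, schematically. Writes enter at the head,
    # flow node-to-node to the tail, and are acked by the tail;
    # reads are served by the tail for consistency.
    class ChainNode:
        def __init__(self, successor=None):
            self.store = {}
            self.successor = successor    # next node in chain; None at the tail

        def write(self, key, value):
            self.store[key] = value
            if self.successor is not None:
                # Strictly sequential: each node talks only to its
                # successor, so there is no broadcast/incast hot spot.
                return self.successor.write(key, value)
            return "ack"                  # tail acks: update is on all f+1 nodes

        def read(self, key):
            return self.store.get(key)    # issue reads against the tail

    tail = ChainNode()
    head = ChainNode(successor=ChainNode(successor=tail))
    head.write("k", 1)                    # traverses head -> middle -> tail
    assert tail.read("k") == 1

Note how the write traverses every node before the tail acks, which is exactly why latency grows with chain length and is floored by the slowest member.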

PS: Apparently my previous comment outraged the HN hive mind, judging by the downvotes. I find this hilarious and revealing, given that all I’ve said is obvious to any actual distsys researcher (ask, e.g., the HyperDex or FawnKV folks why they chose CR instead of Paxos). And at one large cloud provider I worked at, this was taken for granted back in 2012. Apparently “distsys” pop culture has yet to catch up.


I should also mention that a great virtue of quorum-based protocols (for majority or basic read/write quorum systems) is their ability to mask tail latency/transient faults. That is difficult to replicate in sequential protocols and is a good reason to use quorum protocols in high/variable-latency and otherwise unreliable environments like geo-distributed databases.
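Schematically, the masking works like this (hypothetical Python; replica objects with a blocking read method are assumed):

    # Fan a read out to all replicas and return once the fastest
    # majority has answered; the slowest replica never sits on the
    # critical path. (Illustrative sketch only.)
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def quorum_read(replicas, key):
        needed = len(replicas) // 2 + 1        # simple majority
        pool = ThreadPoolExecutor(len(replicas))
        futures = [pool.submit(r.read, key) for r in replicas]
        results = []
        for f in as_completed(futures):        # fastest responses arrive first
            results.append(f.result())
            if len(results) == needed:
                break                          # quorum reached; ignore stragglers
        pool.shutdown(wait=False)              # don't block on the slow tail
        return results

With 5 replicas you only wait for the fastest 3, so one GC-pausing or overloaded node adds nothing to observed latency. In a chain or ring, by contrast, every request passes through every node, so one slow member stalls everything.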


Yes, please do expand on this, since Fast Paxos and similar protocols are used in mission-critical distributed databases for exactly this use case.


I’m not sure how often Fast Paxos is really a good tradeoff, since in the LAN it has inferior throughput to the other protocols I mention below, while in the WAN >3/4 quorums expose you to additional tail latency, probably enough to negate the advantage of eliminating a single message delay if clients are distant from the cluster (see the Google SRE book for discussion).
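For concreteness, here's the quorum arithmetic behind that “>3/4” remark (the standard fast-quorum bound from Lamport's Fast Paxos paper; the snippet is just illustrative Python):

    # Classic majority quorums vs Fast Paxos fast quorums
    # (fast quorums must cover ceil(3n/4) of the acceptors).
    import math

    for n in (3, 5, 7, 9):
        classic = n // 2 + 1             # majority quorum
        fast = math.ceil(3 * n / 4)      # fast-round quorum
        print(f"n={n}: classic={classic}, fast={fast}")
    # n=5 -> classic=3, fast=4: waiting on 4 of 5 replies instead of
    # 3 of 5 leaves less headroom to mask a slow or distant replica.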


At some point, though, aren't you essentially reinventing a traditional DB on top of SQLite?

The reason to use dependencies is so you don't have to reinvent complicated wheels.



