I’m making a FOSS analytics system, and ease-of-installation is important. I want to send event data to a separate SQLite database, to keep analytics data separate from the main app’s data.
I’m concerned about scaling, since even a modestly busy website could have 1000+ events per second.
My thought is to store events in memory on the server and then make one batched write every second.
Does this seem like a reasonable way to get around the SQLite limitation where it struggles with lots of DB writes? Any better ideas?
I'd say that's a very reasonable idea; I've implemented similar designs in many other systems. Batching generally reduces the per-entry overhead, which is easy to verify with a benchmark, and you can put all the batched items in a single transaction. And since you're batching, you'll essentially always have one thread that pulls the batch and actually does the write, which matches SQLite's limitation of one concurrent writer at a time.
What you need to worry about is slightly higher complexity (see the sketch after this list for one way to structure it):

(1) What happens when a single batched write doesn't complete within one second?

(2) How large is the in-memory queue of events, and is it bounded or unbounded?

(3) If it is unbounded, are you confident that overloading the server won't get it killed by OOM? (Queueing theory says that when the arrival rate is too high the queue grows without bound, so there must be another mechanism to push back.) If it is bounded, are you comfortable with dropping entries?

(4) If you do decide to drop entries from a bounded queue, which entries do you drop?

(5) For a bounded queue, what is its limit?

These questions arise in almost every system that needs queueing. Thinking about them will help you not only in this instance but also in many future scenarios you may encounter.
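A minimal sketch of how this could look in Python, assuming a bounded queue with a drop-newest policy, a one-second flush interval, and a table named events(name, ts); all of those specifics are illustrative assumptions, not decisions from the thread:

```python
# Sketch: bounded in-memory queue + one dedicated writer thread that flushes
# a batch to SQLite once per second inside a single transaction.
# Table name, columns, queue size, and drop policy are all assumptions.
import queue
import sqlite3
import threading
import time

events = queue.Queue(maxsize=100_000)  # (2)/(5): bounded queue with an explicit limit

def record_event(name, ts):
    try:
        events.put_nowait((name, ts))
    except queue.Full:
        pass  # (3)/(4): under overload, drop the newest event rather than grow forever

def writer_loop(db_path="analytics.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (name TEXT, ts REAL)")
    while True:
        time.sleep(1.0)  # one batched write per second
        batch = []
        while len(batch) < 100_000:
            try:
                batch.append(events.get_nowait())
            except queue.Empty:
                break
        if batch:
            with conn:  # the whole batch in a single transaction
                conn.executemany("INSERT INTO events (name, ts) VALUES (?, ?)", batch)

threading.Thread(target=writer_loop, daemon=True).start()
```

If a single flush takes longer than a second, this version simply falls behind and catches up on the next pass; whether that's acceptable is exactly question (1).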
SQLite doesn't struggle with writes, but it only supports a single write transaction at a time; if you don't trust SQLite's concurrency performance under write contention, you can serialize all your writes on a dedicated thread or process.
Since analytics data is generally write-heavy, I would recommend using ClickHouse. You can use ClickHouse's async-insert[0] feature, so you don't need to worry about batching events on your side. If you are looking for an embedded solution, you can use chDB, which is built on top of ClickHouse.
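For illustration only, here is roughly what that might look like from Python over ClickHouse's HTTP interface, passing async_insert as a query setting; the host, credentials, table name, and event shape are all assumptions:

```python
# Sketch: let ClickHouse buffer and batch inserts server-side via async_insert.
# Assumes a local ClickHouse on :8123 with a default user and an existing
# `events` table; all of those are assumptions, not details from the thread.
import json
import urllib.parse
import urllib.request

def send_event(event: dict) -> None:
    params = urllib.parse.urlencode({
        "query": "INSERT INTO events FORMAT JSONEachRow",
        "async_insert": 1,           # server-side buffering and batching
        "wait_for_async_insert": 0,  # return once the buffer accepts the row
    })
    req = urllib.request.Request(
        f"http://localhost:8123/?{params}",
        data=(json.dumps(event) + "\n").encode(),
    )
    urllib.request.urlopen(req).read()

send_event({"name": "page_view", "path": "/pricing", "ts": "2024-01-01 00:00:00"})
```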
Batching writes is probably a good idea, but by far the best way to do something like this with SQLite is to use WAL and have a single designated writer (and as many readers as you want), probably fed by something like a queue. As long as you do that, I find the performance is often really amazing.
I do something similar for an audit trail at work, with the kind of data where we may need to know who looked at what and when. All those records are stored in a separate SQLite DB (the main DB is Postgres), and I cycle it out once per calendar year. That makes archival trivial, and should a compliance person need to look at it, a simple desktop app can open the file easily.
You can't beat SQLite for ease of use. I'd try it out and simulate some load to see if SQLite can keep up; if you keep your inserts simple, I bet it can.
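If it helps, a rough way to simulate that load from Python might look like this; the table name and event shape are made up for the sketch:

```python
# Rough load simulation: insert batches of synthetic events and measure throughput.
import sqlite3
import time

conn = sqlite3.connect("loadtest.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE IF NOT EXISTS events (name TEXT, path TEXT, ts REAL)")

BATCH = 1000  # roughly the 1000+ events/second from the question
batch = [("page_view", "/pricing", time.time()) for _ in range(BATCH)]

start = time.time()
for _ in range(60):  # simulate a minute's worth of once-per-second batches
    with conn:       # one transaction per batch
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", batch)
print(f"{60 * BATCH / (time.time() - start):,.0f} inserts/sec")
```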
You can write large SQLite databases at disk speed with the right data model and schema.
As others have mentioned, what most people do is either batch writes or send them over some kind of channel (an MPSC queue) to a single thread that is the designated writer; that queue effectively acts as a serialization barrier.
Either can work depending on your latency/durability requirements.
You also absolutely want the WAL journaling mode, which allows reads and writes to happen concurrently (N readers, but still only 1 writer), and you should take a hard look at relaxing the synchronous setting, which otherwise forces SQLite to fsync constantly. In practice that sounds bad, but consider your example: if you make one batched write every second, there is already a 1-second window where data can be lost. There's always a window where uncommitted data can be lost; it's mostly a matter of how small that window is, and whether the system stays internally consistent in the face of a failed write.
In your case, turning synchronous off wouldn't really be that bad, because the loss window you already accept by batching is much larger than the one it introduces. At the same time, disabling synchronous mode can give you an order-of-magnitude performance increase, so it's very well worth thinking about.
TL;DR use a single thread to serialize writes (or batch writes), enable WAL mode, and think about synchronous mode. If you do these things you can hit extremely fast write speeds quite easily.
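As a concrete sketch, the connection setup described above might look something like this in Python; NORMAL vs OFF for synchronous is the trade-off discussed above, and the values here are illustrative rather than a recommendation:

```python
# Sketch of the pragmas discussed above; the specific values are illustrative.
import sqlite3

conn = sqlite3.connect("analytics.db")
conn.execute("PRAGMA journal_mode=WAL")    # N concurrent readers, 1 writer
conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs than FULL; OFF is faster
                                           # still, at the cost of a wider loss window
```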