I’m making a FOSS analytics system, and ease-of-installation is important. I want to send event data to a separate SQLite database, to keep analytics data separate from the main app’s data.
I’m concerned about scaling, since even a modestly busy website could have 1000+ events per second.
My thought is to store events in memory on the server and then make one batched write every second.
Does this seem like a reasonable way to get around the SQLite limitation where it struggles with lots of DB writes? Any better ideas?
I'd say that's a very reasonable idea; I've implemented similar designs in many other systems. Batching generally reduces the per-entry overhead, which is easy to verify with a benchmark, and you can put all the batched items in a single transaction. And since you're batching, you'll essentially always have one thread that pulls the batch and actually does the write, which matches SQLite's limitation of one concurrent writer at a time.
What you need to worry about is slightly higher complexity (see the sketch after this list for one way to structure it):

(1) What happens when a single batched write doesn't complete within one second?

(2) How large is the in-memory queue of events, and is it bounded or unbounded?

(3) If it is unbounded, are you confident that overloading the server won't get it killed by OOM? (Queueing theory says that when the arrival rate is too high the queue grows without bound, so there must be another mechanism to push back.) If it is bounded, are you comfortable with dropping entries?

(4) If you do decide to drop entries from a bounded queue, which entries do you drop?

(5) For a bounded queue, what is its limit?

These questions arise in almost every system that needs queueing. Thinking about them will help you not only in this instance but also in many future scenarios you may encounter.
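A minimal sketch of how this could look in Python, assuming a bounded queue with a drop-newest policy, a one-second flush interval, and a table named events(name, ts); all of those specifics are illustrative assumptions, not decisions from the thread:

```python
# Sketch: bounded in-memory queue + one dedicated writer thread that flushes
# a batch to SQLite once per second inside a single transaction.
# Table name, columns, queue size, and drop policy are all assumptions.
import queue
import sqlite3
import threading
import time

events = queue.Queue(maxsize=100_000)  # (2)/(5): bounded queue with an explicit limit

def record_event(name, ts):
    try:
        events.put_nowait((name, ts))
    except queue.Full:
        pass  # (3)/(4): under overload, drop the newest event rather than grow forever

def writer_loop(db_path="analytics.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (name TEXT, ts REAL)")
    while True:
        time.sleep(1.0)  # one batched write per second
        batch = []
        while len(batch) < 100_000:
            try:
                batch.append(events.get_nowait())
            except queue.Empty:
                break
        if batch:
            with conn:  # the whole batch in a single transaction
                conn.executemany("INSERT INTO events (name, ts) VALUES (?, ?)", batch)

threading.Thread(target=writer_loop, daemon=True).start()
```

If a single flush takes longer than a second, this version simply falls behind and catches up on the next pass; whether that's acceptable is exactly question (1).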
SQLite doesn't struggle with writes, but it only supports a single write transaction at a time; if you don't trust SQLite's concurrency performance under write contention, you can serialize all your writes on a dedicated thread or process.
Since analytics data is generally write-heavy, I would recommend using ClickHouse. You can use ClickHouse's async-insert[0] feature, so you don't need to worry about batching events on your side. If you are looking for an embedded solution, you can use chDB, which is built on top of ClickHouse.
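For illustration only, here is roughly what that might look like from Python over ClickHouse's HTTP interface, passing async_insert as a query setting; the host, credentials, table name, and event shape are all assumptions:

```python
# Sketch: let ClickHouse buffer and batch inserts server-side via async_insert.
# Assumes a local ClickHouse on :8123 with a default user and an existing
# `events` table; all of those are assumptions, not details from the thread.
import json
import urllib.parse
import urllib.request

def send_event(event: dict) -> None:
    params = urllib.parse.urlencode({
        "query": "INSERT INTO events FORMAT JSONEachRow",
        "async_insert": 1,           # server-side buffering and batching
        "wait_for_async_insert": 0,  # return once the buffer accepts the row
    })
    req = urllib.request.Request(
        f"http://localhost:8123/?{params}",
        data=(json.dumps(event) + "\n").encode(),
    )
    urllib.request.urlopen(req).read()

send_event({"name": "page_view", "path": "/pricing", "ts": "2024-01-01 00:00:00"})
```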
Batching writes is probably a good idea, but by far the best way to do something like this with SQLite is to use WAL and have a single designated writer (and as many readers as you want), probably fed by something like a queue. As long as you do that, I find the performance is often really amazing.
I do something similar for an audit trail at work, with the kind of data where we may need to know who looked at what and when. All those records are stored in a separate SQLite DB (the main DB is Postgres), and I cycle it out once per calendar year. That makes archival trivial, and should a compliance person need to look at it, a simple desktop app can open the file easily.
You can't beat SQLite for ease of use. I'd try it out and simulate some load to see if SQLite can keep up; if you keep your inserts simple, I bet it can.
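If it helps, a rough way to simulate that load from Python might look like this; the table name and event shape are made up for the sketch:

```python
# Rough load simulation: insert batches of synthetic events and measure throughput.
import sqlite3
import time

conn = sqlite3.connect("loadtest.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE IF NOT EXISTS events (name TEXT, path TEXT, ts REAL)")

BATCH = 1000  # roughly the 1000+ events/second from the question
batch = [("page_view", "/pricing", time.time()) for _ in range(BATCH)]

start = time.time()
for _ in range(60):  # simulate a minute's worth of once-per-second batches
    with conn:       # one transaction per batch
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", batch)
print(f"{60 * BATCH / (time.time() - start):,.0f} inserts/sec")
```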
You can write large SQLite databases at disk speed with the right data model and schema.
As others have mentioned, what most people do is either batch writes or send them over some kind of channel (an MPSC queue) to a single thread that is the designated writer; that queue effectively acts as a serialization barrier.
Either can work depending on your latency/durability requirements.
You also absolutely want the WAL journaling mode, which allows reads and writes to happen concurrently (N readers, but still only 1 writer), and you should take a hard look at relaxing the synchronous setting, which otherwise forces SQLite to fsync constantly. In practice that sounds bad, but consider your example: if you make one batched write every second, there is already a 1-second window where data can be lost. There's always a window where uncommitted data can be lost; it's mostly a matter of how small that window is, and whether the system stays internally consistent in the face of a failed write.
In your case, turning synchronous off wouldn't really be that bad, because the loss window you already accept by batching is much larger than the one it introduces. At the same time, disabling synchronous mode can give you an order-of-magnitude performance increase, so it's very well worth thinking about.
TL;DR use a single thread to serialize writes (or batch writes), enable WAL mode, and think about synchronous mode. If you do these things you can hit extremely fast write speeds quite easily.
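As a concrete sketch, the connection setup described above might look something like this in Python; NORMAL vs OFF for synchronous is the trade-off discussed above, and the values here are illustrative rather than a recommendation:

```python
# Sketch of the pragmas discussed above; the specific values are illustrative.
import sqlite3

conn = sqlite3.connect("analytics.db")
conn.execute("PRAGMA journal_mode=WAL")    # N concurrent readers, 1 writer
conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs than FULL; OFF is faster
                                           # still, at the cost of a wider loss window
```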