Cron-based backup for SQLite (litestream.io)
370 points by davnicwil on May 15, 2022 | 82 comments


Litestream author here. The motivation for this docs page was two-fold. First, laying out the different options with their trade-offs helps people understand what Litestream is trying to be, which also helps keep my focus narrow. For example, if someone is looking for very high durability guarantees, they can see that rqlite might be a better fit and won't try to fit Litestream into that mold.

Second, Litestream doesn't interfere with other backup methods so you can run it alongside a cron-based backup. I typically run both because I'm overly paranoid and because it's cheap.


Nice post — one suggestion would be to use zstd instead of gzip, though. Out of the box it is much faster and produces smaller files, and it can be tuned to fit whatever memory and CPU budget a given use case has.

Friends don’t let friends use gzip :)
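A drop-in swap in the backup script would look something like this (untested sketch; paths are placeholders):

    # instead of gzipping the snapshot:
    #   gzip /path/to/backup.db
    # compress with zstd; -T0 uses all cores, and a higher level (e.g. -19) trades CPU for size
    zstd -T0 --rm /path/to/backup.db    # produces /path/to/backup.db.zst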


Thanks, I thought about using a different compression tool but everybody knows gzip so I figured I’d keep it simple. Litestream uses LZ4 internally (which is great) but it’s not something a lot of developers come across too often.


Storage is so cheap nowadays that usability and compression speed are more important for most people.


For a given compression ratio, zstd compression is faster than gzip by a good margin (even without using multiple threads). And zstd decompression is pretty much always faster than gzip too.


Here is a nice page for mapping "what first came to mind" to "what's actually fastest in 2021": https://jolynch.github.io/posts/use_fast_data_algorithms/

TL;DR: lz4 is 10x faster than gzip (though it doesn't produce as small a file), and zstd is twice as fast as gzip while compressing better.


I'm not saying that's the logic here, but this is a common pattern because SEO. It's a useful way to end up on potential customers' radar.


crontab isn’t a tool, cron is the tool. It reads a little strange because of that.


Good call. I updated the docs repo and the site should update in a few minutes.


Awesome!


A service that says 'here is an alternative to our service'? Right on their own website? Even though cron backups are basically 'competition' to litestream?

That's it. I'm giving fly.io a try next hobby app I start.

I'm so impressed with their business decisions right now. I was afraid for Litestream, but then I read that the creator was hired to work on it full time. What a splendid development!


> A service that says 'here is an alternative to our service'?

I actually see this quite a bit. I think the reason for it is that it's a good way to show the complexity or moving parts that can break in a process, whereas the service being marketed is just drop-in and start. We can see the SQLite backup process here and it's not too bad, but we can also see there are a few things that can go wrong that I'm sure Litestream takes care of and would allow us to avoid.

For example, backing up the DB onto the local disk before copying it could fail due to lack of disk space, and then we have to deal with the notification and fix for that. I'm sure Litestream is a 10-minute setup that handles a lot of intricacies like that.


It's also in the same spirit as SQLite's own page on if SQLite is a good fit: https://www.sqlite.org/whentouse.html

There is something very powerful about knowing your strength, leaning into it, and building on it.


Don't give us that much credit! Litestream is a FOSS project. Fly.io isn't competitive about it. We just bought it because its existence, like, in the universe makes Fly.io more valuable, and we wanted to make sure Ben could keep working on it. We actually already have offerings that compete with it (Fly.io Postgres). :)

(Ben might feel competitive about Litestream though; it's his baby. So maybe give him some credit.)



Following the announcement of Fly.io acquiring the Litestream project and hiring its creator Ben Johnson I've been looking into it (and am very impressed by what I see!)

Reading through the docs I was delighted to find this full set of instructions including script commands for setting up a simpler regular snapshot backup using cron - for situations where Litestream is (quote) "overkill" - ie where the database is smaller and your durability requirements are lower.

In fact, the docs have a whole alternatives section for different requirements. I think this kind of thing is great, and something more projects should do! Wanted to share it because of this meta thing, but also because it's just pretty useful per se.


I felt the same finding that section: by not "pushing too much their solution" they actually increase the chances that I'll use it someday.


I agree. There are a bunch of benefits for everyone here. Users get a sharper view of the intended use case of the tool (which benefits both sides) and learn more related bits along the way. It also builds trust.

The best marketing for devs is no marketing. Just being honest, clear and helpful is what works best.


This what a mature project looks like: it knows what it is, who it is for, and what it isn’t.

Contrast with so many open source projects that accumulate features and promise the world, so long as you give them a star.


> Contrast with so many open source projects that accumulate features and promise the world, so long as you give them a star.

What a cynical take.


if you run a saas, this is a good move for SEO too. People may be searching "backup sqlite cron" or something like it and realize they'd rather use litestream.


Agreed. In my mind, the fact that the developers understand limitations in their own tools means that future enhancements are likely to be more focussed and effective.


Don't forget to stop writing to the database while doing the backup, otherwise you can run into an infinite loop if you write faster than sqlite3 .backup is doing the backup :D

Learned that the hard way when implementing sqlite3 backups on Gladys Assistant ( open-source home automation platform https://github.com/GladysAssistant/Gladys )


Can you elaborate more? I think the cron solution can't synchronize with your application code to determine when to stop writing, so more background on your solution would be of interest.

As I understand it, while you do the backup, other writes should go to the WAL and only get committed after the backup?


I did some reading, and there are 2 ways to use the SQLite backup API:

1. call backup_init, backup_step with a step size of -1, then backup_finish. This will lock the db the whole time the backup is taking place and backup the entire db.

2. call backup_init, backup_step in a loop until it returns SQLITE_DONE with a positive step size indicating how many pages to copy, then backup_finish.

With method 2, no db lock is held between backup_step calls. If a write occurs between backup_step calls, the backup API automagically detects this and restarts. I don't know if it looks at the commit count and restarts the backup from the beginning or is smart enough to know the first changed page and restarts from there. Because the lock is released, a continuous stream of writes could prevent the backup from completing.

I looked in the sqlite3 shell command source, and it uses method 2. So if using the .backup command with continuous concurrent writes, you have to take a read lock on the db before .backup to ensure it finishes. It would be nice if the .backup command took a -step option. That would enable the -1 step size feature of method 1. The sqlite3 shell uses a step size of 100.

Another option would be to check backup_remaining() and backup_pagecount() after each step, and if the backup isn't making progress, increase the step size. Once the step size is equal to backup_pagecount() it will succeed, though it may have to lock out concurrent writes for a long time on a large db. There's really no other choice unless you get into managing db logs.


It's not what I've experienced!

In my experience, as soon as new data comes into the DB, the .backup command keeps going, and if the writes never stop, the backup never finishes either :D

In Gladys' case, we added a blocking transaction to the application logic to lock writes during the backup. I haven't found any other way to avoid never-ending backups on write-heavy databases.


I'm using VACUUM INTO, which does basically that: https://sqlite.org/lang_vacuum.html#vacuuminto

> The VACUUM command with an INTO clause is an alternative to the backup API for generating backup copies of a live database....The VACUUM INTO command is transactional in the sense that the generated output database is a consistent snapshot of the original database.

EDIT: Litestream docs will also recommend that: https://github.com/benbjohnson/litestream.io/issues/56
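For reference, run from a backup script it's a one-liner (paths are placeholders):

    # note: VACUUM INTO fails if the target file already exists
    sqlite3 /path/to/app.db "VACUUM INTO '/path/to/backups/app-$(date +%Y%m%d).db'"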


I ran into the case of needing to back up a write-heavy database without blocking anything, and came up with a solution: writing a VFS ( https://www.sqlite.org/vfs.html ) that makes the application move the SQLite journal files to a directory for processing instead of deleting them. Another process reads them to see what pages were touched and can very quickly get the changes to update the backup.


to be fair, this sounds like a situation where Litestream is the more appropriate solution!


This surprises me. I was under the impression that SQLite backups run inside a dedicated transaction in order to avoid this.


See the SQLite online backup API documentation [1].

You can finish the backup in one step, but a read-lock would be held during the entire duration, preventing writes. If you do the backup several pages at a time, then

> If another thread or process writes to the source database while this function is sleeping, then SQLite detects this and usually restarts the backup process when sqlite3_backup_step() is next called. ...

> Whether or not the backup process is restarted as a result of writes to the source database mid-backup, the user can be sure that when the backup operation is completed the backup database contains a consistent and up-to-date snapshot of the original. However: ...

> If the backup process is restarted frequently enough it may never run to completion and the backupDb() function may never return.

The CLI .backup command does non-blocking backup IIRC so is subject to restarts.

[1] https://www.sqlite.org/backup.html


This comment cleared that up for me: VACUUM INTO backups are transactional, but the .backup mechanism is not: https://news.ycombinator.com/item?id=31387556

Ben I suggest updating that cron backups documentation page to recommend VACUUM INTO instead!



Good catch. I went ahead and added an issue on the docs. https://github.com/benbjohnson/litestream.io/issues/56


The example uses Dead Man's Snitch; Healthchecks.io is my preference, and their plans offer more for less. Both offer code on GitHub for examination.

https://deadmanssnitch.com/plans vs. https://healthchecks.io/pricing/
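Either way, the integration is just an HTTP request at the end of the backup script, e.g. for Healthchecks (the UUID is a placeholder for your check's ping URL):

    curl -fsS -m 10 --retry 5 https://hc-ping.com/your-check-uuid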


Is there anything self hosted in this space?


Yes, Healthchecks is open-source and can be self-hosted https://github.com/healthchecks/healthchecks


Depending on your environment, i.e. if you have a metrics system in place already, I generally just post a success metric and alert on lack of success metrics.



Indeed and in my experience this has been more reliable/predictable than the official backup API. The official backup API can get "stuck" on a very large and hot database because it, essentially, has to start from scratch if a page is modified. VACUUM INTO doesn't have this problem.


May I chime in with a project around more or less the same idea and use cases: https://github.com/efrecon/sqlite-backup. I'm pretty sure there are dozens of similar-but-not-quite projects on GitHub/GitLab/etc.


If you use cron for anything important you should use Cronitor - there’s a free for life plan that can cover your SQLite backups.

The article mentions calling a “dead man” service and that is fine but Cronitor (built by 3 people, no VC dollars) is a proper monitoring solution for jobs not just a “dead man” alert. Happy monitoring


In terms of restoring from backups: "Be sure to stop your application before copying the database into place"

This intuitively feels right to me, but I'd be interested to understand the mechanics behind it. What can go wrong if you ignore this advice and use "mv" to atomically switch out the SQLite file from underneath your application?

My hunch is that this relates to journal and WAL files - I imagine bad things can happen if those no longer match the main database file.

But how about if your database file is opened in read-only or immutable mode and doesn't have an accompanying journal/WAL?


On Linux at least, if you switch the file, the old file (even if deleted) will still be used until the file descriptor is closed.


If your app is running, it will already have the database file open, and with `mv` the app will keep using the old file.

With SQLite and WAL, you need to replace both the main file and the WAL (/delete the WAL at the time you move the backup in place). A simple `mv` won't do that.
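A safe restore sequence looks roughly like this (a sketch; paths and the service name are placeholders):

    systemctl stop myapp                                   # nothing may hold the old file open
    rm -f /srv/app/db.sqlite-wal /srv/app/db.sqlite-shm    # drop stale WAL/shm from the old database
    cp /backups/db.sqlite /srv/app/db.sqlite
    systemctl start myapp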


I rolled my own terrible sqlite replication system once. I used execution tracing hooks (https://rogerbinns.github.io/apsw/execution.html#tracing) to record all mutation queries and their arguments sent to the database, then periodically copied that log to a replica system and executed the identical statements.

It's rather primitive, but could be considered an intermediate step between this blog's full copying and litestream's WAL-based replication.


I do similar, it's a relatively small but important SQLite database so every five minutes I take a backup using that method, and check the generated SQL files into remote git repositories.

One on GitHub, which is just a snapshot: a single commit in an otherwise empty repo, force pushed. This is for recovery purposes; I don't need the history and would probably run afoul of their service limits if I did.

And the other on Azure DevOps which has the entire commit history for the past few years. This one is a bit trickier because the pack files end up exhausting disk space if not cleaned up, and garbage collection interrupts the backups. So it clones just the latest commit (grafted), pushes the next commit, and wipes the local repo. No idea how this looks on the remote backend but it's still working without any size complaints, and it's good to know there's an entire snapshotted history there if needed. As well as being able to clone the most recent for recovery if GitHub fails.


I've always assumed sqlite database files are resilient to being copied mid-write. Sure, maybe that last transaction will be corrupted, but everything up to that point will be intact, right? The next backup will capture whatever was in the half-complete transaction.

Am I deluded in this?


If you make an atomic copy -- like something equivalent to a power failure, then it's transactionally safe. One way this can be done is with a snapshot-capable filesystem like ZFS, btrfs, or XFS.

But a traditional `cp` will go from left to right in the file; it won't take the contents all at once. That is explicitly documented as a thing that will break SQLite - https://www.sqlite.org/howtocorrupt.html


Litestream author here. Your backup can be corrupt if you simply use "cp" but the details depend on your journaling mode.

If you're using the default rollback journal mode, it works by copying old pages to a "-journal" file and then updating the main database file with new pages. Your transaction finally commits when you delete the journal file. Copying the main database file during a write transaction can give you either a subset of transaction pages or half-written pages.

If you're using the WAL journaling mode, it works by writing new pages to the "-wal" file and then periodically copying those pages back to the main database file in a process called "checkpointing". If you only "cp" the main database file then you'll be missing all transactions that are in the WAL. There's no time bound on the WAL so you could lose a lot of transactions. Also, you can still corrupt your database backup if you "cp" the main database file during the checkpoint.

You could "cp" the main database file and the WAL file and probably not corrupt it but there's still a race condition where you could copy the main file during checkpointing and then not reach the WAL file before it's deleted by SQLite.

tl;dr is to just use the backup API and not worry about it. :)
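From the shell that's just something like (paths are placeholders):

    sqlite3 /path/to/db.sqlite ".backup /path/to/backup.db"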


Do the cron entries work as-is? I’m sitting here trying to decide if cron will break here:

    aws s3 cp /path/to/backup.gz s3://mybucket/backup-`date +%H`.gz
The % is going to be interpreted as a newline; it needs escaping, as in date +\%H

But it’s in backticks (subshell) so now I’m doubting myself and thinking maybe the subshell saves this but my brain is screaming NO NO BUG! :-)

I’m going to have to spin up a VM to test it and seek inner peace here :-)


Yeah, this isn't used in the cron line itself; it's wrapped in a script further down the article and the script is run from cron, which makes sense - no bug.
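For anyone else wondering, the difference looks like this:

    # directly in a crontab, an unescaped % is treated as a newline (everything after it becomes stdin):
    0 * * * * aws s3 cp /path/to/backup.gz s3://mybucket/backup-`date +\%H`.gz
    # inside a script that cron runs, no escaping is needed:
    aws s3 cp /path/to/backup.gz s3://mybucket/backup-`date +%H`.gz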


Question:

> Do not use `cp` to back up SQLite databases. It is not transactionally safe.

I’m curious what happens. Does this mean that it may cause side effects from transactions that are not committed to become visible?

I guess if you could snapshot the database file and WAL and etc. simultaneously this isn’t an issue, right? Because otherwise it would be a problem for a database any time your program or machine crashed.


Ignore the journal and wal for now (though they certainly contribute to backup complexity).

You have a 70GB database file with "apple" near the start and "banana" near the end. You start cp and it copies apple. The application updates apple to "cat", and then banana to "dog". Finally cp copies dog to backup.

You now have a database which contains "apple" and "dog" which never existed at any point in the original timeline.


OK. So CoW snapshotting and other techniques that could take the files at a given point in time would not be subject to this issue? Just to be completely clear. Or, are there other issues?


I always thought that any file based solution was a risk when copying live databases or file systems. You should always pause writes before your snapshot, but CoW does make this process significantly faster.

I’ve never done this with a SQLite database, but I have done it with running VM file systems (with a CoW backend). And the filesystem pause was always very short.


Fun to see that we share some very similar ideas with Replibyte - an open source tool to seed a dev database with real data. https://github.com/Qovery/replibyte


    Do not use `cp` to back up SQLite databases.
    It is not transactionally safe.
What if you disable the journal like this:

PRAGMA journal_mode=OFF;

Can cp be used to backup the DB then?

And do you still have to send "COMMIT" after each query or will each query be executed immediately then?


The main reason is that the backing file is random access, and transactional correctness is ensured using various locking mechanisms. SQLite generally relies on fsync() ensuring that data is flushed to disk at certain points, so conceptually, data is written to an unused portion of the file, then fsync() is done, then various structures are updated so that this data is now live, leaving some unused space in the middle of the file, and this is fsync()'d again. Later that unused space might be reclaimed, but only when all current readers and writers have finished.

If you cp the file, you might end up with different chunks from different points in the transaction history that don't make sense combined together. If you use ".backup", it guarantees that the data in the copy will be consistent with the data as it existed at one of the fsync() calls.

Turning off the journalling will likely increase the chance that your copy will be inconsistent, as there will be more churn in the main data file.


>Do not use `cp` to back up SQLite databases. It is not transactionally safe.

I've read that this is true, but it has always confused me because I would expect that using cp would be equivalent to if the application had crashed / you lost power.


`cp` is not instantaneous. Unless you lock the entire file, one block can change while you're reading another block. You end up with a Frankenstein of a file, compared with a consistent snapshot if you had simply lost power.

If your filesystem supports snapshotting, it would be safe to create a snapshot and `cp` the database off of that snapshot.
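For example with btrfs (a sketch; subvolume paths are placeholders):

    btrfs subvolume snapshot -r /data /data/.backup-snap   # atomic, read-only snapshot
    cp /data/.backup-snap/app.db /backups/app.db
    btrfs subvolume delete /data/.backup-snap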


No, because cp takes a non-zero amount of time and so by the time it has completed the first blocks may have been rewritten, thus the backup is inconsistent. If you take something like a ZFS snapshot, then, yes, that would be a consistent backup.


Would cp --reflink=always work too?


Not an expert but the transaction log is in another file and you’d need both to be in line with each other and backed up, so perhaps that’s a factor


I wrote up a bit longer explanation of the failure modes (which depend on your journaling mode) in this comment: https://news.ycombinator.com/item?id=31387447


If I shouldn't use a file copy to make a backup, what does this mean for the possibility that the plug might be pulled at any moment during a write?


Another option is to use the `.dump` command that produces a series of SQL statements to recreate the database and populate it with data, just like good ol' mysqldump.

I'm not sure whether this is faster or safer than `.backup` which produces an actual database file with indexes and whatnot, but a benefit of the plain text output is that it's very flexible. The output can be piped to gzip or any other program on the fly, even across the network, without waiting for an intermediate file to be written in full. It can also be imported to other types of RDBMS with minimal changes.
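For example, streaming straight to another machine (host and paths are placeholders):

    sqlite3 /path/to/app.db .dump | gzip | ssh backup-host "cat > /backups/app-$(date +%F).sql.gz"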


The .dump command is a great option for smaller databases. The main downsides are that it produces a larger output, it'll take longer to compress that output, and the restore time is much longer since all the commands have to be replayed.

The benefit to using the backup API is that the file is ready-to-go as the database and can be started up immediately.


Beware, `.dump` doesn't preserve pragmas like `user_version`, and some SQLite-using apps use that to manage schema migrations.

    sqlite> PRAGMA user_version=42;
    sqlite> .dump
    PRAGMA foreign_keys=OFF;
    BEGIN TRANSACTION;
    COMMIT;
    sqlite>


> Do not use `cp` to back up SQLite databases. It is not transactionally safe.

Anyone know if committing the sqlite db file to git is unsafe as well?


If you’re not actively writing to the database you can cp or commit it safely. SQLite uses a journal or a write ahead log (WAL) file to process transactions.

If you cp a database mid-transaction you will lose the transaction, as it is processed in the separate journal or WAL file until commit. If you copy mid-commit then you will have incomplete data in the main database file and possibly data corruption.


Anything that only uses the file system copying mechanics is not safe (which includes a git check-in). I wrote a more detailed comment here: https://news.ycombinator.com/item?id=31387447


I was amazed by the Litestream post on fly.io - it wants to become a distributed database. This is huge.


Nitpick: On a modern Linux system, I would probably use a systemd timer instead of cron.
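Something like this, roughly (unit names and the script path are placeholders):

    # /etc/systemd/system/sqlite-backup.service
    [Service]
    Type=oneshot
    ExecStart=/usr/local/bin/backup-sqlite.sh

    # /etc/systemd/system/sqlite-backup.timer
    [Timer]
    OnCalendar=hourly
    Persistent=true

    [Install]
    WantedBy=timers.target

    # then: systemctl enable --now sqlite-backup.timer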


> B-tree databases like SQLite compress well

Curious, why is this?


Litestream author here. When a page gets full, it splits in two so each new node has empty space for inserts and updates. That empty space compresses really well. Also, OLTP data can have duplication when you have a categorical column that you haven't moved into a separate table (e.g. a "type" column that has a couple of unique values).


This question nerd-sniped me. I don't know either but my first guess, supported by hex dumping a medium-sized database, is that sqlite B-tree pages are collections of 32-bit page offsets that the database is designed to cluster (also, just lots and lots of zeroes). Also, the database I looked at was mostly UTF-8 text, which is obviously compressible, but I don't think that's what Ben meant.


If Ben is reading the thread, tiny but important bug in the example script. It should have set -e at the top to ensure that a failed step will actually fail the script. Otherwise the snitch will always get called.

Edit: filed an issue on the docs repo: https://github.com/benbjohnson/litestream.io/issues/54


Thanks for catching that! I fixed it up in a PR and merged in: https://github.com/benbjohnson/litestream.io/pull/55


This is so great - open source in action - I can't think of a more apt thing to come out of this post than an on the fly fix to the docs!

Pun absolutely intended.


If I'm using "-e", I also like to use "set -o pipefail", which will fail the script if any command in a pipeline fails too: in x | y | z, if 'y' fails but 'z' continues to process the output and returns 0, -e alone won't catch that (the example script doesn't use pipes though).


Yep, my standard bash header is set -eo pipefail, with -x if I need to see what's going on. This particular script didn't appear to be using pipes, as you say.
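i.e. the top of the script ends up looking like:

    #!/usr/bin/env bash
    set -eo pipefail   # add -x when you need to trace what's running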



