
Thank you for sharing. A curious read. I am looking forward to the next post.

I've been working on backup and disaster recovery software for 10 years. There's a common phrase in our realm that I feel obligated to share, given the nature of this article.

> "Friends don't let friends build their own Backup and Disaster Recovery (BCDR) solution"

Building BCDR is notoriously difficult and has many gotchas. The author hinted at some of them, but let me try to drive a few of them home.

- Backup is not disaster recovery: In case of a disaster, you want to be up and running near-instantly. If you cannot get back up and running in a few minutes/hours, your customers will lose trust in you and your business will hurt. Being able to restore a system (file server, database, domain controller) with minimal data loss (<1 hr) is vital for the survival of many businesses. See Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

- Point-in-time backups (crash consistent vs application consistent): A proper backup system should support point-in-time backups. An "rsync copy" of a file system is not a point-in-time backup (unless the system is offline), because the system changes constantly. A point-in-time backup is a backup in which each block/file/.. maps to the same exact timestamp. We typically differentiate between "crash consistent backups", which are similar to pulling the plug on a running computer, and "application consistent backups", which involve asking all important applications to persist their state to disk and freeze operations while the backup is happening. Application consistent backups (which are provided by Microsoft's VSS, as mentioned by the author) significantly reduce the chances of corruption. You should never trust an "rsync copy" or even crash consistent backups. (There's a small sketch of the difference right after this list.)

- Murphy's law is really true for storage media: My parents put their backups on external hard drives, and all of r/DataHoarder seems to buy only 12T HDDs and put them in a RAID0. In my experience, hard drives of all kinds fail all the time (though NVMe SSD > other SSD > HDD), so having backups in multiple places (3-2-1 backup!) is important.
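
To make the crash vs. application consistent distinction concrete, here is a minimal sketch using SQLite (chosen only because it ships with Python; file names are made up): copying the live database file underneath the application is at best crash consistent, while asking the application for a copy through its own backup API is application consistent.

    import sqlite3

    # Application consistent: let SQLite produce the copy itself.
    # sqlite3.Connection.backup() (Python 3.7+) yields a transactionally
    # consistent snapshot even while other writers are active; a plain
    # file copy of app.db taken at the same moment would be, at best,
    # crash consistent.
    src = sqlite3.connect("app.db")         # live database (hypothetical path)
    dst = sqlite3.connect("app-backup.db")  # destination for the consistent copy
    with dst:
        src.backup(dst)
    src.close()
    dst.close()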

(I have more stuff I wanted to write down, but it's late and the kids will be up early.)



Ha. That quote made me chuckle; it reminded me of a performance by the band Alice in Chains, where a similar quote appeared.

Re: BCDR solutions, they also sell trust among B2B companies. Collectively, these solutions protect billions, if not trillions, of dollars' worth of data, and no CTO in their right mind would ever allow an open-source approach to backup and recovery. This is also largely because backups need to be highly available. Scrolling through a snapshot list is one of the most tedious tasks I've had to do as a sysadmin. Although most of these solutions are bloated and violate userspace like nobody's business, it is ultimately the company's reputation that allows them to sell products. Although I respect Proxmox's attempt at capitalizing on the Broadcom fallout, I could go on at length about why it may not be able to permeate the B2B market, but it boils down to a simple formula (not educational, but rather from years of field experience):

> A company's IT spend grows linearly with valuation up to a threshold, then increases exponentially within a certain range, and grows polynomially as the company invests in vendor-neutral and anti-lock-in strategies, though this growth may taper as thoughtful, cost-optimized spending measures are introduced.

- Ransomware Protection: Immutability and WORM (Write Once Read Many) backups are critical components of snapshot-based backup strategies. In my experience, legal issues have arisen from non-compliance in government IT systems. While "ransomware" is often used as a buzzword by BCDR vendors to drive sales, true immutability depends on the resiliency and availability of the data across multiple locations. This is where the 3-2-1 backup strategy truly proves its value.
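
For what it's worth, here is a rough sketch of what WORM-style immutability can look like in practice, using S3 Object Lock as one example (bucket and key names are hypothetical, the bucket must have been created with Object Lock enabled, and this is only one of several ways to get immutability):

    import datetime
    import boto3  # assumes boto3 is installed and AWS credentials are configured

    s3 = boto3.client("s3")
    with open("backup-weekly.tar.zst", "rb") as f:  # hypothetical archive file
        s3.put_object(
            Bucket="my-immutable-backups",          # hypothetical bucket name
            Key="backup-weekly.tar.zst",
            Body=f,
            # COMPLIANCE mode: nobody, not even the root account, can delete
            # or overwrite this object version until the retention date passes.
            ObjectLockMode="COMPLIANCE",
            ObjectLockRetainUntilDate=datetime.datetime.now(datetime.timezone.utc)
            + datetime.timedelta(days=90),
        )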

Would like to hear your thoughts on more backup principles!


> An "rsync copy" of a file system is not a point-in-time backup (unless the system is offline), because the system changes constantly. A point-in-time backup is a backup in which each block/file/.. maps to the same exact timestamp.

You can do this with some extra steps in between. Specifically, you need a snapshotting file system like zfs. You run the rsync on the snapshot to get an atomic view of the file system.

Of course if you’re using zfs, you might just want to export the actual snapshot at that point.
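
Roughly what that looks like as a sketch (dataset tank/data, its mountpoint, and backup-host are hypothetical names; assumes zfs and rsync are installed and SSH keys are set up):

    import datetime
    import subprocess

    # Take an atomic snapshot of the dataset first.
    snapname = "rsync-" + datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    subprocess.run(["zfs", "snapshot", f"tank/data@{snapname}"], check=True)

    # ZFS exposes snapshots read-only under <mountpoint>/.zfs/snapshot/<name>,
    # so rsync sees one frozen point in time instead of a changing tree.
    snapdir = f"/tank/data/.zfs/snapshot/{snapname}"
    subprocess.run(
        ["rsync", "-a", "--delete", snapdir + "/", "backup-host:/backups/data/"],
        check=True,
    )

    # Drop the snapshot once the copy is off the box.
    subprocess.run(["zfs", "destroy", f"tank/data@{snapname}"], check=True)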


Unless you are doing more steps, that is still just a crash consistent backup. Better than plain rsync, but still not ideal.


> having backups in multiple places (3-2-1 backup!) is important

Yeah and for the vast majority of individual cybernauts, that "1" is almost unachievable without paying for a backup service. And at that point, why are you doing any of it yourself instead of just running their rolling backup + snapshot app?

There isn't a person in the world who lives in a different city from me (and that "1" isn't protection against a tornado, flood, or wildfire unless it's in a different city) whom I'd ask to run a computer 24/7 and do maintenance on it when it breaks down.


My solution for this has been to leave a machine running in the office (in order to back up my home machine). It doesn't really need to be on 24/7; it's enough to turn it on every few days just to pull the last few backups.


If you aren't at CERN levels of data, you can always rent a VPS or dedicated server for this.

It's a matter of the value of your data. Or how much it would cost you to lose it.


The 3-2-1 analogy is old. We have infinite flexibility in where we can put data, unlike before cloud servers existed.

I'd at least keep file system snapshots locally for easy recovery from manual mistakes, copy the data to a remote location using implementation A and let it snapshot there too, and copy the same data to another location using implementation B and let it snapshot there too. That way you not only get durability, but implementation bugs in any one backup process are also mitigated.

zfs is a godsend for this, and I use Borg as the secondary implementation, which seems enough for almost any disaster.
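
For illustration, a rough sketch of that two-implementation setup (dataset name, repository URL, and paths are hypothetical placeholders; assumes zfs and borg are installed):

    import subprocess

    # Implementation A: local ZFS snapshot (instant, cheap, easy rollback
    # for "oops, deleted the wrong file" moments).
    subprocess.run(["zfs", "snapshot", "tank/data@daily"], check=True)

    # Implementation B: Borg archive to an off-site repository. Borg does
    # its own chunking, dedup, and encryption, so a bug in the ZFS-based
    # pipeline doesn't affect this copy, and vice versa.
    subprocess.run(
        [
            "borg", "create", "--stats", "--compression", "zstd",
            "ssh://backup@offsite-host/./borg-repo::data-{now}",
            "/tank/data",
        ],
        check=True,
    )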


> You should never trust an "rsync copy" or even crash consistent backups.

This leads you to the secret forbidden knowledge that you only need to back up your database(s) and file/object storage. Everything else can be (or has to be, depending on how strong that "never" is) recreated from your provisioning tools. All those Veeam VM backups some IT folks hoard like dragons are worthless.


Exactly. There is no longer any point in backing up an entire "server" or a "disk". Servers and disks are created and destroyed automatically these days. It's the database that matters, and each type of database has its own tooling for creating "application consistent backups".


This strongly depends on your environment and on your RTO/RPO.

Sure, there are environments that have automatically deployed, largely stateless servers. Why back them up if you can recreate them in an hour or two ;-)

Even then, though, if we're talking about important production systems with an RTO of only a few minutes, then having a BCDR solution with instant virtualization is worth its weight in gold. I may be biased though, given that I professionally write BCDR software, hehe.

However, many environments are not like that: There are lots of stateful servers out there with bespoke configurations, lots of "the customer needed this to be that way and it doesn't fit our automation". Having all servers backed up the same way gives you peace of mind if you manage servers for a living. Being able to just spin up a virtual machine of a server and run things from a backup while you restore or repair the original system is truly magical.


For a regular DB like MySQL/PostgreSQL, just snapshot on zfs without thinking.


Databases these days are pretty resilient to restoring from crash consistent backups like that, so yes, you'll likely be fine. It's a good enough approach for many cases. But you can't be sure that it really recovers.

However, ZFS snapshots alone are not a good enough backup if you don't off-site them somewhere else. A server/backplane/storage controller could die or corrupt your entire zpool, or the place could burn down. Lots of ways to fail. You gotta at least zfs send the snapshots somewhere.
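
A minimal sketch of off-siting a snapshot with zfs send over SSH (pool, dataset, and host names are hypothetical; the receiving box also needs ZFS, and SSH keys must be set up):

    import subprocess

    snap = "tank/pgdata@nightly"   # hypothetical snapshot taken earlier

    # Stream the snapshot to the remote pool; "recv -F" rolls the target
    # dataset back to match the incoming stream.
    send = subprocess.Popen(["zfs", "send", snap], stdout=subprocess.PIPE)
    subprocess.run(
        ["ssh", "backup-host", "zfs", "recv", "-F", "backup/pgdata"],
        stdin=send.stdout,
        check=True,
    )
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("zfs send failed")

    # Subsequent runs would use incremental sends
    # (zfs send -i tank/pgdata@previous tank/pgdata@nightly)
    # so only changed blocks cross the wire.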


How do you mean, you can't be sure it recovers? It's not about hoping the DB recovers from inconsistent states; they're supposed to be in a good state with file system snapshotting.

https://serverfault.com/a/806305

https://zrepl.github.io/v0.2.1/configuration/snapshotting.ht...


Ha! I did not expect a reference to `innodb_flush_log_at_trx_commit` here. I wrote a blog post a few years ago about MySQL lossless semi-sync replication [1] and I've had quite enough of innodb_flush_log_at_trx_commit for a lifetime :-)

Depending on the database you're using, and on your configuration, they may NOT recover, or require manual intervention to recover. There is a reason that MSSQL has a VSS writer in Windows, and that PostgreSQL and MySQL have their own "dump programs" that do clean backups. Pulling the plug (= file system snapshotting) without involving the database/app is risky business.

Databases these days are really resilient, so I'm not saying that $yourfavoriteapp will never recover. But unless you involve the application or a VSS writer (which does that for you), you cannot be sure that it'll come back up.

[1] https://blog.heckel.io/2021/10/19/lossless-mysql-semi-sync-r...
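
As a rough illustration of the "let the database do the backup" route (database name, output path, and authentication setup are hypothetical; assumes pg_dump is installed and can connect):

    import subprocess

    # PostgreSQL's own dump tool produces a clean, consistent backup
    # regardless of what's happening on the underlying file system.
    subprocess.run(
        [
            "pg_dump",
            "--format=custom",          # compressed, supports selective pg_restore
            "--file=/backups/appdb.dump",
            "appdb",
        ],
        check=True,
    )

    # The MySQL equivalent would be mysqldump with --single-transaction,
    # which gives a consistent snapshot of InnoDB tables without locking writers.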


Also if you have a NAS, don’t use the same hard drive type for both.


My personal external backup is two external drives in RAID1 (RAID0 wtfff?). One already failed, of course the Seagate one. It failed silently, too - a few sectors just do not respond to read commands and this was discovered when in-place encrypting the array. (I normally would avoid Seagate consumer drives if it wasn't for brand diversity. Now I have two WD drives purchased years apart.)

It's a home backup so not exactly relevant to most of what you said - just wanted to underscore the point about storage media sucking. Ideally I'd periodically scrub each drive independently (can probably be done by forcing a degraded array mode, but be careful not to mess up the metadata!) against checksums made by backup software. This particular failure mode could also be caught by dd'ing to /dev/null.
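
Something like this sketch is what I have in mind for scrubbing against checksums (paths and manifest format are hypothetical; the point is just that reading every byte back and comparing digests surfaces both unreadable sectors and silent corruption):

    import hashlib
    import pathlib

    root = pathlib.Path("/mnt/backup")   # hypothetical mount point
    manifest = {}  # e.g. "relative path -> hex digest", written at backup time

    for path in root.rglob("*"):
        if not path.is_file():
            continue
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)   # reading every byte also hits bad sectors
        expected = manifest.get(str(path.relative_to(root)))
        if expected is not None and expected != h.hexdigest():
            print(f"CORRUPT: {path}")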


ZFS really shines here with its built-in "zpool scrub" command and checksumming.

Even though I am preaching "application consistent backups" in my original comment (because that's what's important for businesses), my home backup setup is quite simple and isn't even crash consistent :-) I do: Pull via rsync to backup box & ZFS snapshot, then rsync to Hetzner storage box (ZFS snapshotted there, weekly)

My ZFS pool consists of multiple mirrored vdevs, and I scrub the entire pool once a month. I've uncovered drive failures and storage controller failures this way. At work, we also use ZFS, and we've even uncovered failures of entire product lines of hard drives.
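
For anyone curious, the monthly scrub is basically just this (pool name is hypothetical; scrubs run in the background and can take hours, so the health check typically happens later, e.g. from a separate cron job):

    import subprocess

    # Kick off a scrub: ZFS re-reads every block and verifies its checksum.
    subprocess.run(["zpool", "scrub", "tank"], check=True)

    # Later: "zpool status -x" prints "all pools are healthy" when nothing
    # is wrong, otherwise it shows the degraded vdevs and checksum errors.
    status = subprocess.run(
        ["zpool", "status", "-x"], capture_output=True, text=True, check=True
    )
    if "all pools are healthy" not in status.stdout:
        print(status.stdout)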



