Here's a 2005 post to the git mailing list from Martin Uecker describing the same idea: https://lore.kernel.org/git/20050416173702.GA12605@macavity/. From the tone of the email, it sounds like he didn't consider the idea new even at that point:
> The chunk boundaries should be determined deterministically from local properties of the data. Use a rolling checksum over some small window and split the file if it hits a special value (0). This is what the rsyncable patch to zlib does.
He calls it a Merkle hash tree.
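For the curious, here's a minimal sketch of that chunking rule in Python. The window size, boundary mask, and Rabin-Karp-style hash are illustrative choices of mine; the post only specifies a rolling checksum over some small window that splits on a special value, and zlib's actual rsyncable patch uses a different, simpler checksum:

```python
WINDOW = 64            # bytes of context the rolling hash looks at
MASK = (1 << 13) - 1   # split when low 13 bits are zero => ~8 KiB average chunks
BASE = 257
MOD = (1 << 31) - 1
POW_OUT = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window

def chunks(data: bytes):
    """Yield chunks of `data`, split at content-defined boundaries.

    Because the boundary test depends only on the last WINDOW bytes,
    an insertion or deletion early in the file only perturbs boundaries
    near the edit; later boundaries (and their chunk hashes) re-synchronize.
    """
    h = 0
    start = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Remove the byte sliding out of the window.
            h = (h - data[i - WINDOW] * POW_OUT) % MOD
        h = (h * BASE + byte) % MOD
        # The "special value" test: here, the hash's low bits are all zero.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]
```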
Edit: here's a post from one day earlier, by C. Scott Ananian: https://lore.kernel.org/git/Pine.LNX.4.61.0504151232160.2763...
> We already have the rsync algorithm which can scan through a file and efficiently tell which existing chunks match (portions of) it, using a rolling checksum. (Here's a refresher: http://samba.anu.edu.au/rsync/tech_report/node2.html ). Why not treat the 'chunk' as the fundamental unit, and compose files from chunks?
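Here's roughly what that rolling checksum looks like: a minimal Python rendering of the weak checksum from the linked tech report (Adler-32 style). This is only the cheap first-pass filter; real rsync pairs it with a strong hash (MD4 in the paper) to confirm candidate matches:

```python
def weak_checksum(block: bytes):
    """Compute the checksum of a block from scratch: a is the byte sum,
    b weights each byte by its distance from the end of the block."""
    a = b = 0
    for i, x in enumerate(block):
        a += x
        b += (len(block) - i) * x
    return a & 0xFFFF, b & 0xFFFF

def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte to the right in O(1): drop `out_byte`,
    append `in_byte`. This is what lets rsync test for a matching block
    at every byte offset of a file without rehashing from scratch."""
    a = (a - out_byte + in_byte) & 0xFFFF
    b = (b - block_len * out_byte + a) & 0xFFFF
    return a, b

# Sanity check: rolling agrees with recomputation at every offset.
data = b"the quick brown fox jumps over the lazy dog"
L = 8
a, b = weak_checksum(data[:L])
for k in range(len(data) - L):
    a, b = roll(a, b, data[k], data[k + L], L)
    assert (a, b) == weak_checksum(data[k + 1:k + 1 + L])
```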