

ylow on Oct 8, 2024 | on: Improving Parquet Dedupe on Hugging Face Hub

Great question! Rsync also uses a rolling hash/content-defined chunking approach to deduplicate and reduce communication, so it will behave very similarly.
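For anyone new to the idea, here is a rough sketch of content-defined chunking with a rolling hash. It is not rsync's algorithm or the Hub's implementation; the gear-style hash, the `GEAR` table, and the names `cdc_chunks` and `shared_fraction` are all made up for illustration, and real systems also enforce minimum and maximum chunk sizes. The point it tries to show is that chunk boundaries depend only on nearby bytes, so a small insert changes only the chunks around it.

```python
import hashlib
import random

_rng = random.Random(0)
# Hypothetical gear table: one fixed pseudo-random 64-bit value per byte value.
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, avg_bits: int = 10) -> list[bytes]:
    """Cut after every byte where the low `avg_bits` bits of the hash are zero.

    Those low bits depend only on the last `avg_bits` bytes, so boundaries are
    decided by local content, not by absolute file offsets.
    """
    boundary_mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64  # older bytes shift out of the low bits
        if (h & boundary_mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared_fraction(old: bytes, new: bytes) -> float:
    """Fraction of chunks in `new` whose SHA-256 already appears among `old`'s chunks."""
    known = {hashlib.sha256(c).digest() for c in cdc_chunks(old)}
    new_chunks = cdc_chunks(new)
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits / len(new_chunks)


# Synthetic test data: random bytes, then the same bytes with 8 bytes inserted.
original = bytes(_rng.getrandbits(8) for _ in range(200_000))
edited = original[:1000] + b"INSERTED" + original[1000:]
print(f"chunks reusable after an 8-byte insert: {shared_fraction(original, edited):.0%}")
```

With fixed-size blocks cut at absolute offsets, the same insert would shift every later block and nothing after it would match.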

kwillets on Oct 8, 2024

One more: do you prefer the CDC technique over using the row groups as chunks (i.e., using knowledge of the file structure)? Is it worth it to build a Parquet-specific diff?

ylow on Oct 8, 2024

I think both are necessary. The CDC technique is file-format independent. The row group method makes Parquet robust to it.
