
Great question! Rsync also uses a rolling hash to find matching blocks, which deduplicates data and reduces communication in much the same way content-defined chunking does. So it will behave very similarly.
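For anyone unfamiliar with the idea, here is a minimal sketch of content-defined chunking with a gear-style rolling hash. It is an illustration only, not rsync's actual algorithm and not the implementation discussed in this thread; the names `GEAR`, `MASK`, `cdc_chunks` and `new_chunks` are made up for the example, and the 8-bit mask (roughly 256-byte average chunks) is an arbitrary choice.

```python
import hashlib
import random

# Hypothetical gear table: one fixed pseudo-random 32-bit value per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]
MASK = 0xFF  # cut when the low 8 bits of the hash are zero (assumed setting)


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at positions chosen by the content, not by fixed offsets."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left lets old bytes fall out of the 32-bit hash, so the
        # hash only depends on the most recent 32 bytes (a rolling window).
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a cut point
    return chunks


def new_chunks(old: bytes, new: bytes) -> list[bytes]:
    """Return the chunks of `new` whose SHA-256 digest is not present in `old`."""
    known = {hashlib.sha256(c).digest() for c in cdc_chunks(old)}
    return [c for c in cdc_chunks(new) if hashlib.sha256(c).digest() not in known]
```

Because the cut points depend only on nearby bytes, inserting a byte in the middle of a file should only change the chunks around the edit; everything before and shortly after it hashes to the same chunks and does not need to be sent again. A real chunker would also enforce minimum and maximum chunk sizes, which this sketch leaves out.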


One more: do you prefer the CDC technique over using the row groups as chunks (i.e., using knowledge of the file structure)? Is it worth it to build a Parquet-specific diff?


I think both are necessary. The CDC technique is file-format independent. The row group method makes Parquet robust to it.



