Computing a SHA1 (Ed: for every password in the file) is hardly free, which sugg...

nine_k · on Dec 17, 2022

Modern CPUs have hardware acceleration for computing SHA1 and SHA2. They can be effectively free if they can be computed by CPU cores faster than the data from disk (likely NVMe now) are committed to RAM.

But a crypto hash like SHA may be a wrong hash here. Something like murmur / city / metro hash may be a better solution, since they compute 3-5 bytes of hash per clock.

Retric · on Dec 17, 2022

You are focusing on the wrong step. “Compute the SHA1 of each password. Sort. Store in a binary file”

Preprocessing the file and adding hashes isn’t free.

le-mark · on Dec 17, 2022

You are being down voted because although your point is valid, it has zero relevance because the goal is fast lookup. Preprocessing cost is not a factor.

Retric · on Dec 17, 2022

Sure, my point was if prepossessing is allowed then arguably so should having a hot cache.

It’s an arbitrary problem, so you could argue that pre computation is fine but you need to assume a cold cache or whatever.

coldtea · on Dec 17, 2022

>Sure, my point was if prepossessing is allowed then arguably so should having a hot cache.

Preprocessing is always an option: just needs some ahead of time effort.

A hot cache for a 37GB file is not always an option.

Retric · on Dec 17, 2022

Preprocessing isn’t an option if you’re given the file at the same time you get the list to check.

Aka are you building a service or grep.

chii · on Dec 17, 2022

it is a factor - you calculate the amortized cost of calculating the hashes (using the total access count as the amortization count).

HyperSane · on Dec 17, 2022

For any indexing operation that is always assumed to be asymptotically zero.

nine_k · on Dec 17, 2022

I'm saying that adding hashes is basically free, given the right hash function.

Sorting is very much not free though. I don't see why would sorting be needed if you have an efficient hash table, or a search tree. Essentially TFA describes just that, an index file + data file approach. They could have used SQLite which readily provides indexed access, instead of pure Python.

Of course if you never do the same search again, you're back to mmap-ing the file and doing a full scan smartly, like GNU grep does (https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...).

lern_too_spel · on Dec 17, 2022

You are being downvoted because the file is given as sorted sha1 hashes, and the benchmark in the article is about speeding up multiple lookups.