Full disk encryption or similarly transparent data encryption at the database al...

CiPHPerCoder · on June 4, 2024

> Not sure what the state the art is in searchable encryption for db indexes, but just trying to do stuff that requires a scan becomes untenable due to having to read and decrypt on the client to find it or aggregate it.

There are a lot of different approaches, but the one CipherSweet uses is actually simple.

First, take the HMAC() of the plaintext (or of some pre-determined transformation of the plaintext), with a static key.

Now, throw away most of it, except a few bits. Store those.

Later, when you want to query your database, perform the same operation on your query.

One of two things will happen:

1. Despite most of the bits being discarded, you will find your plaintext.

2. With overwhelming probability, you will also find some false positives. This will be significantly less than a full table scan (O(log N) vs O(N)). Your library needs to filter those out.

This simple abstraction gives you k-anonymity. The only difficulty is, you need to know how many bits to keep. This is not trivial and requires knowing the shape of your data.

https://ciphersweet.paragonie.com/security#blind-index-infor...

I proposed the same technique to AWS, who adopted it under the name Beacons for the AWS Database Encryption SDK.

https://docs.aws.amazon.com/database-encryption-sdk/latest/d...

adunsulag · on June 4, 2024

I was reading your reply and started thinking, this sounds a lot like what I did to do encrypted search with Bloom Filters and indexes. I click on the first link and find the exact website I used when researching and building our encrypted search implementation for a health care startup. It worked fabulously well, but it definitely requires a huge amount of insight into your data (and fine-tuning if your data scales larger than your initial assumptions).

That's awesome that AWS has now rolled it into their SDK. I had to custom build it for our Node.JS implementation running w/ AWS's KMS infrastructure.

Are you the author of the paragonie website? The coincidence was startling. If so, I greatly thank you for the resource.

Edit After going back and re-reading the blog post, looks like you are the author. Again thank you, you were super helpful .

CiPHPerCoder · on June 4, 2024

> Are you the author of the paragonie website? The coincidence was startling. If so, I greatly thank you for the resource.

Thanks. Yes, I'm one of the authors.

withinboredom · on June 4, 2024

One way I’ve seen (eg, searching by zip code) is to encrypt all possible buckets you would search by (prefixes/suffixes) using a different (search) key, then encrypting the relationship foreign keys. Then the application searches for the encrypted values and decrypts the foreign keys.

kbolino · on June 4, 2024

This strategy provides only obfuscation, not encryption. If the same plaintext always "encrypts" to the same ciphertext, it becomes possible (sometimes even trivial) for an attacker with access to large amounts of related information (such as the entire database) to use correlations and inference to effectively decipher it.

withinboredom · on June 4, 2024

One-time-pads effectively save you here. The application knows zipcode 1234 == "AWER", but the database doesn't and there isn't any way to derive that without outside information. The technique is a pseudo-anonymization technique, not encryption.

kbolino · on June 4, 2024

Assuming you want "find all users in zipcode 12345" to be a supported query, it does not matter what encryption scheme you use, you will have one of these two problems:

On the one hand, you can require that 12345 always maps to AWERQ, in which case an attacker can use frequency analysis, metadata chaining, etc. to determine with some confidence that AWERQ = 12345. Calling this "pseudo-anonymization" is definitely more accurate than calling it "encryption", but you might as well just use a one-way hash function instead. It doesn't do anything against determined attackers with prolonged or broad exposure to the data; I don't see the value except perhaps for compliance with poorly thought out or outdated regulations.

On the other hand, you can require that 12345 always maps to a different string every time, but that means you need a different key/salt/IV/nonce for every row or cell, defeating indexing and aggregation, and so all queries become full table scans. This significantly frustrates an attacker, but also significantly frustrates legitimate operations.

withinboredom · on June 5, 2024

With something like zip codes, so long as all the data is encrypted, there's very little chance someone can work out what that zipcode is (and if there isn't any information in the column name, even less). The only way someone could determine that it is a zipcode is by looking at commonality with known decrypted data. Even if they were to determine that it was a zipcode, they would only know which zipcode it was for the users they had decrypted. In other words, the blast radius is very small and compartmentalized, while still allowing searches.

kbolino · on June 5, 2024

If an attacker has a full data dump, they probably have a recent/frequent queries dump too. Column name obfuscation won't go very far.

Zipcodes are short. If not given extra padding, their ciphertexts will still be short.

Zipcodes are also low cardinality. Unless you use multiple salts/nonces/IVs/keys, the frequency of ciphertexts will match the frequency of plaintexts.

In many situations, a prepared attacker will be able to insert their own data beforehand, allowing them to perform a chosen-plaintext attack and potentially decipher much of the data. The best protection against this is to not reuse salts/nonces/IVs/keys and thus again defeat performant searches.

None of this is to necessarily say it's not worth it but rather to keep in mind the article's point: know your threat model.