Keeping our users’ data safe with Bloom filters

Here’s a story that I’m sure is familiar to you. You want to do something, so you sign up for a new service to help you. How long do you spend carefully crafting the password for that new site?

If you’re reading this, you would probably tell me it takes a few seconds for you to summon your password manager, type in the master password and generate a new password.

For many people though, it is simply the time it takes for them to remember their only password. The one they use for everything from email to that random SaaS startup in some far-off country.

Today it has become almost normal for businesses to be hacked and passwords published online.

Imagine you’re a hacker. You want to log in to other peoples’ accounts as quickly as possible to spread pictures of your cat all around the internet because she’s the best.

You could try every character for every account but that’d take forever.

Other hackers before you have already posted millions of passwords that they acquired online. These are passwords that people really use. You could try these, or the most common ones and then decide it is too much work and move on to the next account.

This is what the landscape looks like today. We take our users’ data very seriously and take a multitude of measures to keep it safe. As a developer at Ometria looking at this threat, what can I do to protect our users and their data?

We can use these same lists to help our users. Thankfully Troy Hunt has ethically acquired many lists and collated them for us. He has even gone so far as to provide an API!

Wait, you want me to send my users passwords to an API? Not quite. We can provide the first few hex characters of a hash and it can return the remainder of the hashes for us to check against our hash.

This is great for most people but for some we’re still a little uneasy about sending our users passwords externally in any form even if it is just part of a hash.

Troy provides the hashes (currently 501m of of them) to download. Wait, 501 million? How many GiB does that take? 8.8 compressed.

Do you want that in a docker container?

As I didn’t really want a 9GiB docker container in our Kubernetes cluster I decided to try something else. What if I don’t actually store the passwords? I can choose to reduce it to 2.1GiB and accept a less than 1 in 1,000,000 chance of being wrong.

Enter the Bloom filter. Invented by Burton Howard Bloom, it can store data and, when queried, can say that the item was never stored or that it was probably stored.

What do you mean “probably?” A Bloom filter works by hashing data and using the hash to determine which bits to set or “turn on”. When checking a value, it is hashed and the same bits are checked. If a bit is not set or “off” then the value has not been added to the Bloom filter. If it is set, then this hash or another hash has set this bit. If this sounds interesting, I encourage you to investigate Bloom filters further.

I said earlier that Troy provides hashes of passwords and the Bloom filter hashes the values. It is important to choose a hashing algorithm that has a uniform distribution for a Bloom filter and in my tests this appears to be true of SHA1 (the hash used by Troy). While some hashes are faster than others, hashing is slow and the fastest algorithm does no work. I decided to eliminate the second hashing step. Passwords are sent with a weak hash to the password checking service and it simply looks up the hash in the filter.

It turns out that loading 2 GiB of data from disk into RAM is not super fast so I had to extend the initial delay for our readiness and liveness probes. I also had to decide on some wording for a validation message to try and explain to a less-technical user why they couldn’t use that password.

After I developed this software, I integrated it into the Ometria platform and we now prevent our users from changing their password to something that is already known by the internet and any would-be hackers.

No system is perfect though. There is a very small chance that we scare our users without cause and they are unfortunate enough that their password hash matches other hashes but isn’t actually leaked. This list will become out of date but the passwords will still remain leaked.

Helping our users choose secure passwords is one of the many ways that we look after our customers’ data.

If you’re interested in solving problems like this to help make the way retailers communicate with their customers more personalised and engaging, we are hiring! Check out our careers page.