Keeping our users’ data safe with Bloom filters

Here’s a story that I’m sure is familiar to you. You want to do something, so you sign up for a new service to help you. How long do you spend carefully crafting the password for that new site?

If you’re reading this, you would probably tell me it takes a few seconds for you to summon your password manager, type in the master password and generate a new password.

For many people though, it is simply the time it takes for them to remember their only password. The one they use for everything from email to that random SaaS startup in some far-off country.

Today it has become almost normal for businesses to be hacked and passwords published online.

Imagine you’re a hacker. You want to log in to other peoples’ accounts as quickly as possible to spread pictures of your cat all around the internet because she’s the best.

You could try every character for every account but that’d take forever.

Other hackers before you have already posted millions of passwords that they acquired online. These are passwords that people really use. You could try these, or the most common ones and then decide it is too much work and move on to the next account.

This is what the landscape looks like today. We take our users’ data very seriously and take a multitude of measures to keep it safe. As a developer at Ometria looking at this threat, what can I do to protect our users and their data?

We can use these same lists to help our users. Thankfully Troy Hunt has ethically acquired many lists and collated them for us. He has even gone so far as to provide an API!

Wait, you want me to send my users passwords to an API? Not quite. We can provide the first few hex characters of a hash and it can return the remainder of the hashes for us to check against our hash.

This is great for most people but for some we’re still a little uneasy about sending our users passwords externally in any form even if it is just part of a hash.

Troy provides the hashes (currently 501m of of them) to download. Wait, 501 million? How many GiB does that take? 8.8 compressed.

Do you want that in a docker container?

As I didn’t really want a 9GiB docker container in our Kubernetes cluster I decided to try something else. What if I don’t actually store the passwords? I can choose to reduce it to 2.1GiB and accept a less than 1 in 1,000,000 chance of being wrong.

Enter the Bloom filter. Invented by Burton Howard Bloom, it can store data and, when queried, can say that the item was never stored or that it was probably stored.

What do you mean “probably?” A Bloom filter works by hashing data and using the hash to determine which bits to set or “turn on”. When checking a value, it is hashed and the same bits are checked. If a bit is not set or “off” then the value has not been added to the Bloom filter. If it is set, then this hash or another hash has set this bit. If this sounds interesting, I encourage you to investigate Bloom filters further.

I said earlier that Troy provides hashes of passwords and the Bloom filter hashes the values. It is important to choose a hashing algorithm that has a uniform distribution for a Bloom filter and in my tests this appears to be true of SHA1 (the hash used by Troy). While some hashes are faster than others, hashing is slow and the fastest algorithm does no work. I decided to eliminate the second hashing step. Passwords are sent with a weak hash to the password checking service and it simply looks up the hash in the filter.

It turns out that loading 2 GiB of data from disk into RAM is not super fast so I had to extend the initial delay for our readiness and liveness probes. I also had to decide on some wording for a validation message to try and explain to a less-technical user why they couldn’t use that password.

After I developed this software, I integrated it into the Ometria platform and we now prevent our users from changing their password to something that is already known by the internet and any would-be hackers.

No system is perfect though. There is a very small chance that we scare our users without cause and they are unfortunate enough that their password hash matches other hashes but isn’t actually leaked. This list will become out of date but the passwords will still remain leaked.

Helping our users choose secure passwords is one of the many ways that we look after our customers’ data.

Using pg_repack with AWS Postgres RDS

If you are using PostgresSQL as your database, you will be familiar with the concepts of table bloat and using CLUSTER indexes to improve scan performance. pg_repack is a very useful tool that allows tables to be repacked (equivalent to a CLUSTER or VACUUM FULL depending on configuration) without taking any exclusive locks, and therefore leaving the table online for both reads and writes.

For a while, users of Amazon’s RDS postures service were left unable to use pg_repack, as the extension was not whitelisted by RDS. Luckily support for the pg_repack extension was added to RDS mid 2017. However, multiple users have reported issues running pg_repack on RDS due to the unusual superuser permission setup on RDS (its not true superuser, it’s a proxy rds_superuser that’s almost the same but not quite). Therefore, we experienced this issue as well, when pg_repack was running, some other transactions would fail with this error:

ProgrammingError: permission denied for relation log_29334438
CONTEXT: SQL statement "INSERT INTO repack.log_29334438(pk, row)

pg_repack works by creating a temporary ‘log’ table in the repack schema for each repack operation, that holds updates applied to the source table during the operation. A database trigger is used to log these entries, and at the end of the table rebuild these logged modifications are applied to the repacked table. The core of the issue appears to be that the log tables did not have the right permissions when created on RDS via the rds_superuser role. For us the following fix works well:


Therefore, assigning PUBLIC write access to repack log tables created in the future.

Update on Meltdown and Spectre

On the 3rd January 2018, a new class of attacks against modern CPU microarchitectures was disclosed to the public. Two specific demonstrable attacks were disclosed: Meltdown and Spectre. The general impact of the exploit is that a malicious attacker who can run code on a particular computer may be able to observe memory contents outside of the memory areas it’s authorised to access. Of particular concern was the Meltdown attack in conjunction with modern virtualization schemes, that allow several workloads to share the same server to potentially observe each other’s memory contents. Additionally, it is theoretically possible that malicious JavaScript code executing inside a web-browser could access memory belonging to other tabs in the same browser session, thus allowing a malicious website to ‘steal’ cookie values, credentials, PII or similar. More information on the disclosure can be found here:

Ometria runs on Amazon Web Services (AWS). Soon after the disclosure (4th January 2018), AWS issued an update to all running server hosts in their data centres and these patches were automatically applied by our systems. Therefore we believe our systems to be protected against this attack. We will continue to monitor this issue and will continue to automatically apply system updates as needed.

It’s important to stress that the vulnerabilities disclosed are theoretical, and the steps required to actively exploit them are hugely complex. Given the complexity of exploiting them, the short time between disclosure and security updates being applied, it is highly unlikely they were utilised in any way. Also, it’s important to note that all Software as a Service (SAAS) providers are similarly affected, as this issue affects the CPUs used in the majority of servers worldwide.

It’s also worth noting that individual computers and mobile phones may also be vulnerable, although to a lesser degree, therefore it’s worth making sure that any systems you or your team use to access sensitive information are fully up to date. Specifically you should:

  • Update your operating systems with the latest patches. Mac OS X will release a fix shortly. Windows has released an update.
  • Update your browsers. Browsers are continually releasing new features and updates. As a best practice, you should enable automatic updates on your browser.