A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
The base data structure of a Bloom filter is a Bit Vector. Here’s a small one we’ll use to demonstrate:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Each empty cell in that table represents a bit, and the number below it its index. To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the index of those hashes to 1.
It’s easier to see what that means than explain it, so enter some strings and see how the bit vector changes. Fnv and Murmur are two simple hash functions:
Enter a string:
fnv:
murmur:
Your set: []
When you add a string, you can see that the bits at the index given by the hashes are set to 1. I’ve used the color green to show the newly added ones, but any colored cell is simply a 1.
To test for membership, you simply hash the string with the same hash functions, then see if those values are set in the bit vector. If they aren’t, you know that the element isn’t in the set. If they are, you only know that it might be, because another element or some combination of other elements could have set the same bits. Again, let’s demonstrate:
Test an element for membership:
fnv:
murmur:
Is the element in the set? no
Probability of a false positive: 0%
And that’s the basics of a bloom filter!
Advanced Topics
Before I write a bit more about Bloom filters, a disclaimer: I’ve never used them in production. Don’t take my word for it. All I intend to do is give you general ideas and pointers to where you can find out more.
In the following text, we will refer to a Bloom filter with k hashes, m bits in the filter, and n elements that have been inserted.
Hash Functions
The hash functions used in a Bloom filter should be independent and uniformly distributed. They should also be as fast as possible (cryptographic hashes such as sha1, though widely used therefore are not very good choices).
Examples of fast, simple hashes that are independent enough3 include murmur, the fnv series of hashes, and HashMix.
To see the difference that a faster-than-cryptographic hash function can make, check out this story of a ~800% speedup when switching a bloom filter implementation from md5 to murmur.
In a short survey of bloom filter implementations:
Chromium uses HashMix. (also, here’s a short description of how they use bloom filters)
It’s a nice property of Bloom filters that you can modify the false positive rate of your filter. A larger filter will have less false positives, and a smaller one more.
Your false positive rate will be approximately (1-e-kn/m)k, so you can just plug the number n of elements you expect to insert, and try various values of k and m to configure your filter for your application.2
This leads to an obvious question:
How many hash functions should I use?
The more hash functions you have, the slower your bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives.
Since you have to pick k when you create the filter, you’ll have to ballpark what range you expect n to be in. Once you have that, you still have to choose a potential m (the number of bits) and k (the number of hash functions).
It seems a difficult optimization problem, but fortunately, given an m and an n, we have a function to choose the optimal value of k: (m/n)ln(2)2, 3
So, to choose the size of a bloom filter, we:
Choose a ballpark value for n
Choose a value for m
Calculate the optimal value of k
Calculate the error rate for our chosen values of n, m, and k. If it’s unacceptable, return to step 2 and change m; otherwise we’re done.
How fast and space efficient is a Bloom filter?
Given a Bloom filter with m bits and k hashing functions, both insertion and membership testing are O(k). That is, each time you want to add an element to the set or check set membership, you just need to run the element through the k hash functions and add it to the set or check those bits.
The space advantages are more difficult to sum up; again it depends on the error rate you’re willing to tolerate. It also depends on the potential range of the elements to be inserted; if it is very limited, a deterministic bit vector can do better. If you can’t even ballpark estimate the number of elements to be inserted, you may be better off with a hash table or a scalable Bloom filter4.
What can I use them for?
I’ll link you to wiki instead of copying what they say. C. Titus Brown also has an excellent talk on an application of Bloom filters to bioinformatics.
My eyes crack open. 7am. Roll over. Grab my phone. Start scrolling…
Check Product Hunt. Ahh, shit. Someone just launched an app similar to my product (and we’re still in beta).
Scroll through Twitter. Shit! This person I’m jealous of just announced a another success.
Read Medium. Fuck. Someone wrote a post almost identical to what I wrote months ago and they’re getting more traction.
My envy grows large, my blood pressure goes up. I feel like I’m waking up in a heavy cloud; an agitated haze. Already on edge, and I haven’t even made toast yet.
Have you ever felt like this?
Stress is our body’s defense against bad news. It was useful when our ancestors were running around the jungle about to be eaten. It’s much less helpful now.
The problem is that the worry itself can harm you as much as the outcome you’re worried about. While you’re stressing over what might happen, your body is releasing adrenaline and cortisol as if you were actually in danger.
These hormones are what cause intense feelings: jealousy, anger, sadness, despair.
But even worse, they reduce your ability to make great stuff.Instead of putting your energy into creating, you’re obsessing about things you can’t change.
Here’s how to get out of that negative downward spiral:
Quit worrying about what everyone else is doing. Focus on how you’re helping people.
Agonizing over your competition doesn’t help you serve your customers better. Being jealous of your peers won’t improve your craft.
There’s only two things that will improve your situation:
Concentrate on your users, audience, customers, fans.
Figure out what they want. Develop a deep connection with them.
Improve your skill, expertise, competence, product.
How can you get better? How can you make your product better for the people who use it?
The best product doesn’t always win. The one everyone uses wins. Later we all tell ourselves a story about why and how that product was really the best all along in an impressive display of survivorship bias. In fact, the history of technology is littered with quantitatively superior products that lost in the marketplace because they weren’t well timed, well marketed, or well supported: VHS vs Betamax. Gasoline vsElectric. English vs Esperanto. Metric vs Imperial.
“Habit and user expectation remains a stronger moat than people appreciate.” – Ben Thompson
Winning products tend to continue to win until major disruption or mismanagement. The reason for that is simple: everyone else has to conform before they can compete. You become the de facto standard against which other things are measured. It is hard to differentiate for the better when people have already collectively endorsed a fiction that the best solution has already been found (and, what luck, it’s the one that won). Society is a dynamic part of product/market fit, and once a large number of humans have learned come to expect a certain behavior, there just isn’t enough value in learning a new system even if it is better. This isn’t just about switching costs between products but rather more fundamentally about interaction paradigms. Do you really think an app garden is the apex of all possible phone UIs? Is a vertical scrolling feed really the best way to stay up to date on friends? Does the QWERTY arrangement on a keyboard make any sense today?
This is the reason companies focus so much on growth. It isn’t just that growth signifies a fit with a market. But rather because growth actually makes the market fit stronger. Even when the product doesn’t involve network effects directly it still benefits from the habits and expectations of the population at large that accompany growth.