Bloom filter

A bloom filter is a probabilistic data structure used to test whether an element is a member of a set. It is similar to a Set, in that you can add items to it and check whether items are present, but with a critical difference: it cannot give a definitive "yes." A bloom filter can say with certainty that an item is not in the set, but when it reports that an item is present, that answer is only a "maybe."

A false negative – reporting an item absent when it was actually added – is impossible in a standard bloom filter.
A false positive – reporting an item present when it was never added – can happen.

This asymmetry is what makes bloom filters useful. When an application can tolerate occasional false positives but must never miss a true negative, a bloom filter can offer a dramatic reduction in memory usage compared to storing the full set.

When to use a bloom filter

The canonical motivating example is a web browser protecting users from malicious links. Storing 1,000,000 known malicious URLs at 20 characters each would require around 20MB. A bloom filter representing the same data with a false positive rate of 0.01% (1 in 10,000) would require a fraction of that memory. False positives merely cause an unnecessary warning to the user – an acceptable trade-off. To eliminate those spurious warnings entirely, the browser can confirm any "maybe" result with a server-side API call. The bloom filter still saves a network request for the vast majority of links that are definitively clean. Google Chrome used a bloom filter for exactly this purpose until 2012.

Other real-world uses:

Akamai uses bloom filters to avoid caching one-off page accesses. A page is only written to cache if the bloom filter indicates it has been seen before, preventing the cache from filling with pages that are never revisited.
Google BigTable uses bloom filters to avoid unnecessary disk reads. When a key lookup arrives, the in-memory bloom filter is checked first. A definitive "not present" response saves a disk access entirely.

How bloom filters work

At its core, a bloom filter is a fixed-size array of bits, all initialized to 0. It uses k independent hash functions.

Adding an item: The item is passed through each of the k hash functions. Each hash value is taken modulo the array length to produce a bit position. All k positions are set to 1.

Checking for an item: The same k hash functions are applied to the query item. If all of the resulting bit positions are 1, the filter returns "maybe present." If any position is 0, the filter returns "definitely not present."

A false positive occurs when all of the bit positions for an item happen to have been set by other previously added items. This becomes more likely as the filter fills up.

False positive rate

The false positive rate grows as more bits are set. If x is the fraction of bits currently set to 1 and k is the number of hash functions, the approximate false positive rate is x^k. At half capacity (x = 0.5) with 3 hash functions, the false positive rate is 0.5^3 = 12.5%.

Using more hash functions lowers the false positive rate when the filter is relatively empty – each additional hash function reduces the chance that all bit positions happen to be set by coincidence. However, more hash functions also set more bits per item, filling the filter faster. There is an optimal number of hash functions for any given filter size and expected number of items.

Tuning a bloom filter

If the number of items to be stored (n) and the desired false positive rate (p) are known in advance, the optimal number of bits and hash functions can be calculated:

Number of bits: m = -(n * ln(p)) / (ln(2))^2
Number of hash functions: k = (m / n) * ln(2)

In practice, libraries handle this calculation. The key input is an accurate estimate of n. When uncertain, it is better to over-provision: a larger filter costs very little memory relative to modern hardware, and the consequences of under-provisioning (a rising false positive rate) can be significant.

Deletion

Items cannot be removed from a standard bloom filter. Because multiple items can share bit positions, setting a bit back to 0 to remove an item would risk corrupting the representation of other items.

Counting bloom filters

A counting bloom filter is a variant that replaces each bit with a small counter. Insertion increments the relevant counters, and deletion decrements them. This makes deletion possible, but at a cost:

Counters occupy more memory than bits (typically 8× more).
Counters introduce the possibility of false negatives. If two items share bit positions and one is removed, the counter for a shared position drops to zero, making the filter incorrectly report the other item as absent – eliminating the core guarantee of a standard bloom filter.

Counting bloom filters should only be used when deletion is genuinely required and the trade-offs are acceptable.