What is a Bloom Filter?
A Bloom filter is a probabilistic data structure that is based on hashing. It is an extremely space-efficient data structure and is typically used to add elements to a set and test if an element is in a set.
When testing if an element is present in the bloom filter, false positives are possible. It will either say that an element is definitely not in the set or that it is possible the element is in the set.
Working of Bloom Filters
An empty bloom filter is a bit array of m bits, all of which are initially set to zero. A bit array is an extremely space-efficient data structure with each position in the array set to either 0 or 1. A Bloom filter also includes a set of k hash functions with which we hash incoming values. These hash functions must all have a range of 0 to m−1. If these hash functions match an incoming value with an index in the bit array, the bloom filter will make sure the bit at that position in the array is 1.
Lets us understand this through an example. We have taken 3 hash functions and a bit array of size 10.
Initially
Now, the word geeks is entered, which gives h1(“geeks”)%10=4, h2(“geeks”)%10=7, h3(“geeks”%10)=1
Now,the word nerd is enteredwhich gives h1(“nerd”)%10= 5 , h2(“nerd”)%10=4, h3(“nerd”)%10=3
False Positives
False Positive claims that an element is present in the set, although that element was never inserted into the set.
Continuing with the previous example, let’s try to insert cat, which gives h1(“cat”)%10=7, h2(“cat”)%10=3, h3(“cat”)%10=1
As bits at calculated indices are already set by some other item, the Bloom filter erroneously claims that “cat” is presently generating a false positive result
Probability of False positivity: Let m be the size of the bit array, k be the number of hash functions, and n be the number of expected elements to be inserted in the filter, then the probability of false positive p can be calculated as:
p=(1-[1-1/m]kn)k
Size of Bit Array: If the expected number of elements n is known and desired false positive probability is p, then the size of bit array m can be calculated as :
m=-nlnp/(ln2)2
Applications of Bloom filters
- Medium uses bloom filters for recommending posts to users by filtering posts which the user has seen.
- The Google Chrome web browser used to use a Bloom filter to identify malicious URLs
- Google BigTable, Apache HBase, Apache Cassandra, and Postgresql use Bloom filters to reduce the disk lookups for non-existent rows or columns.