# Bloom filter – Python

A **Bloom Filter** is a* space-efficient probablistic data structure*, to test whether an element is a member of a set. This algorithm for data representation ensures the element is in the dataset or not. This may have false positive matches but not false negative result.

The most common use of Bloom Filter Algorithm is to see if an elements is in the disk before performing any operations. This *reduces the I/O for lookups* dramatically over large datasets.

Consider we have to check a email address to search from large dataset of emails of millions. Searching all the emails in memory definitely is very inefficient and takes lots of time. We can create bloom filter bitarray, which is very small compared to the original dataset and has almost same required result. This is what used in Yahoo Mail. When you log into Yahoo mail, the browser page requests a **bloom filter** representing your contact list from Yahoo servers. The Bloom Filter is compact and easily fits in your browser cache. In Yahoo, email is verified if it is in existed in your contact list.

Another scenario, consider we have to *get unique counts in the dataset*, then we can use bloom filter to test if certain pattern or element has already been seen or not in the dataset. Of course this creates some false positives, but this might be much efficient than to compare everything from the memory.

*Apache HBase* uses bloom filter to boost read speed by filtering out unnecessary disk reads of HFile blocks which do not contain a particular row or column.

* Quora implemented a sharded bloom filter* in the feed backend to filter out stories that people have seen before. It is much faster and more memory efficient than previous solutions (Redis, Tokyo Cabinet and DB) and saves hundreds of ms on certain types of requests.(Tao Xu, Engineer at Quora)

* Transactional Memory (TM) *has recently applied Bloom filters to detect memory access conflicts among threads.(Mark Jeffrey, modeled Bloom filters for concurrency conflict detection).

Not to mention **Facebook** ( Typeahead Search Tech Talk (6/15/2010) ),, **LinkedIn** ( Cleo: the open source technology behind LinkedIn’s typeahead search | LinkedIn Engineering), **Bit.ly** (dablooms – an open source, scalable, counting bloom filter library ) have implemented their own Bloom Filter.

More examples ? Go here https://en.wikipedia.org/wiki/Bloom_filter#Examples

Ok then, enough of the usage of Bloom Filter. We will be developing our own for searching email address from a list in Python.

importnumpyas np frombitarrayimport bitarray importrandom# The coefficients coff_a and coff_b are randomly chosen integers less than the maximum value of value. # hashtable_size is a prime number slightly bigger than the maximum value of value. # @param coff_a,coff_b: hashing coefficients # @param val: value to be hashed # @param prime: size of hash table # @return: hash def(coff_a, coff_b, value, hashtable_size): return (coff_a * value + coff_b) % hashtable_size # set True to @position in @bitar:bitarray defhash_function(position, bitarr): bitarr[position] = 1 # find next prime number defupdate_bloom(n): return prime_range(n, 2 * n) defnext_prime(a, b): for p in range(a, b): for i in range(2, p): if p % i == 0: break else: return p return None defprime_range(email): return bloom_filter[hash_function(a, b, sum([ord(char) for char in email]), hashtable_size)]check_for_emailemail_list= ["trustgeek1@gmail.com", "trustgeek2@gmail.com", "trustgeek3@outlook.com", "thisisnewemail@gmail.com", "thiswebsite@hotmail.com", "registeredemail@yahoo.com"] # calculate ordinal sum of characters for every email-address - ascii code sum of charactersemail_ascii_sum_list= [sum([ord(char) for char in email]) for email in email_list] # since we have fixed size of input, the hashtable could have 1.0 load factor # yet, 0.75 load has much less colisions choose next prime number from the number # of elements in email list to avoid collisionshashtable_size= next_prime(int(round(len(email_ascii_sum_list) * 1.25))) print("Hashtable Size: " + str(hashtable_size)) # generate two random numbers as hashing coefficientsa= random.randint(1, max(email_ascii_sum_list) - 1)b= random.randint(2, max(email_ascii_sum_list) - 1) print ("Coefficient a: " + str(a) + ", b: " + str(b)) # hash emails with the hashing functionhash_list= [hash_function(a, b, unicode_sum, hashtable_size) for unicode_sum in email_ascii_sum_list] # assign hash_values & unicode-sums to emails print([(email, "->", unicode_sum, "->", hashp) for email, unicode_sum, hashp in zip(email_list, email_ascii_sum_list, hash_list)]) # initialise bloom filterbloom_filter=(hashtable_size) # set all bits to 0bitarraybloom_filter[:] = False print ("Bloom Filter: ", bloom_filter) # update bloom filter [update_bloom(index, bloom_filter) for index in hash_list] # for every email in the list check if it has already been seen by bloom filter ifcheck_for_email("trustgeek@hotmail.com"): print ("Exists") else: print("Not Exists") ifcheck_for_email("trustgeek1@gmail.com"): print("Exists") else: print("Not Exists") # todo dynamic update new email added to list

Excellent blog! Do you have any tips and hints for aspiring writers? Kerrill Lenard Norbie

I could not refrain from commenting. Well written! Lesli Cecilius Crawford