The Hashing Functions

Each item added to a bloom filter must be hashed by some number of hash functions which are completely independent of each other. Each hashing function must also be evenly distributed over the range of bit indices in the bit array. This second requirement is true of hashing functions for hash sets and hash tables as well. Uniform distribution is guaranteed by the built-in hash functions of Python and most other languages. In the examples above, three hashing functions were required. Sometimes the required number of hashing functions can be much higher, depending on the number of items being inserted and the number of bits in the bit array. Creating the required number of independent, uniformly distributed hashing functions might seem like a daunting problem, but it can be solved in at least a couple of ways. Some hashing functions allow a seed value to be provided. In this case, different seed values could be used to create different hashing functions.

Another equally effective way of generating independent hashing functions is to append some known value to the end of each item before it is hashed. For instance, a 0 might be appended to the item before hashing it to get the first hash function. A 1 could be appended to the item before hashing to get the second hash function. Likewise, a 2 might be appended to get the third hash function value. So looking up rabbit in the bloom filter is accomplished by first hashing rabbit0, rabbit1, and rabbit2 with the same hashing function. Since the hashing function is uniformly distributed, the values returned by the three hashed values will be independent of each other. And, all items with 0 appended will themselves be uniformly distributed. Likewise for items with 1 appended and with 2 appended.

The Bloom Filter Size

It is possible to find the required bloom filter size given a number of items to insert and a desired false positive probability. The probability of any one location within a bloom filter not being set by a hash function while inserting an item is given by the following formula where the filter consists of m bits.

If the bloom filter uses k hash functions, then the probability that a bit in the bit array is not set by any of the hash functions required for inserting an item is given by this

formula.

If n items are inserted into the bloom filter then raising this formula to n will provide the probability that a bit within the bloom filter's bit array is still a zero after inserting all n items. So we have

So, the probability that a bit in the bloom filter is a 1 after inserting n items while using k hashing functions is given by this formula.

Now consider looking up an item that was not added to the bloom filter. The probability that it will report a false positive can be found by computing the likelihood that each location within the bloom filter is a 1 for all k hashing functions. This is expressed as follows.

This formula contains a sequence that can be approximated using the natural log [8] as

Using this formula it is possible to solve for m given an n and desired probability, p, of false positives. The formula is as follows.

Finally, solving for k above results in the following formula.

These two formulas tell us how many bits are required in our filter to guarantee a maximum specified rate of false positives. We can also compute the required number of hash functions. For instance, for an English dictionary containing 109,583 words and a desired false postive percentage of no more than 1 % (expressed as 0.01 in the formula) requires a bit array of 1,050,360 bits and seven hashing functions.

The number of bits in this example may seem excessive. However, recall that they are bits. An efficient implementation requires roughly 128 KB of storage. The number of characters in the English dictionary used in these examples totals 935,171. Assuming 1 byte per character, storing all these words would require a minimum of 914 KB. The bloom filter represents quite a savings in space. In addition, during experiments the lookup time using the bloom filter never took longer than 160 ┬Ás. The lookup time is bounded by the number and efficiency of the hash functions used to compute the desired values. Assuming that the hash functions are dependent on the length of the string being hashed, then the lookup time is O(lk) where l is given by the length of the item being looked up and k is the number of hash functions.

Drawbacks of a Bloom Filter

Besides the obvious false positive potential, the bloom filter can only report yes or no. It can't suggest alternatives for items that might be close to being spelled correctly. A bloom filter has no memory of which bits were set by which items so a yes or no answer is the best we can get with even a yes answer not being correct in some circumstances. The next section presents a Trie data structure that will not report false positives and can be used to find alternatives for incorrectly spelled words.

 
< Prev   CONTENTS   Next >