Does Bloom Filter is the efficient Data Structure for large DataBase Lookup?


Happy to see all the software enthusiastic landing to this post to know about Bloom Filter. There are various Data Structures available but what depends is the use case for which a particular Data structure will serve the purpose efficiently. 


Recently I was going through the Data structure to quickly lookup the database content and explored many props and conclusions of various data structures according to my requirement. During the process, I heard about "Bloom Filter", an efficient data structure for a database lookup. As the name suggests it is primarily targeted to filter out some set of lists. 

Similarly, like other Data Structures Bloom Filter also has some disadvantages. Before diving into this let's understand the logic behind this.

The core logic is to get a hash from the input data and set a bit in the bit array while inserting data into the list. When lookup generates the hash key and checks the bit, if it is set, then data is present.

for example, if we are storing a list of employees/students in the database with the key being the employee name and the value being the details of the employee.

1st entry - Person1


for the above data generate the hash key and suppose let's say the hash key is 9 and 5

hash1("Person1") = 9;

hash2("Person1") = 5;


Now let's have bit array of size 32 bits

bit array = 0000 0000 0000 0000 0000 0000 0000 0000


enable/set bit 9 and 5

bit array = 0000 0000 0000 0000 0000 0001 0001 0000


Now let's take a second input

  1st entry - Person1


for the above data generate the hash key and suppose let's say the hash key is 1 and 5

hash1("Person2") = 1;

hash2("Person2") = 5;


set bit 1 and 5

bit array = 0000 0000 0000 0000 0000 0001 0001 0001


With the bit array now we can quickly do person lookup. Generate the hash key for the input person and check whether the bit is set or not. If the bit is set then the person entry is there in the list.


The disadvantage of this algorithm is some data may produce the subset of the key which is already set in the list. With this, there may be a wrong assumption that even though the data is not present we'll get the output as the data is present. This is not highly reliable for large mission-critical database lookups.

As earlier said pick an algorithm as per your need considering its advantages and disadvantages. With this, I'll end my short blog on this. Let's catch up in the next blog........:-)






Comments

Popular posts from this blog

RCU Kernel Implementation

PostgreSQL Write-Ahead-Logging(WAL) Archiving Functionality

Linux FTRACE setup using Terminal and GUI Application created from Python Framework