Improved Pattern Generation for Bloom Filters with Bit Patterns
B. Hedström and I. Josefsson
Master's Thesis, Chalmers University of Technology, Jun 2018.
To see if an item is a member of a set is one of the most commonly occurring problems in computer science. The problem is deceptively simple: Given an item and a collection of several items, does the given collection host the provided item? Examples of this problem can be found all around in several different types of applications from web stores to computationally heavy research software. Consider a toy example of any service that requires authentication through login. Then for the user to access the service it is required that the user is registered in the user database, which the software needs to check before granting the user access. Applications of this problem can, however, be much more sophisticated than the above example. A more specific application is using set-membership to speed up database queries. A database query can be quite costly in terms of computational power, and to spend that time to try to fetch an item which does not exist can be an expensive operation. An idea to speed up this process can be to check if the key is actually present in the database before querying it. This can be done by having some structure hosting only the existing keys, which in turn may be so significantly smaller than the entire database that the structure fit in local memory. This allows for a fast "filtering" of queries that would otherwise just result in a negative response, and may in turn speed up the application significantly. Another application of set-membership is to detect duplicate data in large datasets efficiently. In this case, an application may iterate over a dataset and test each item against a different, initially empty set and insert the item into the set if the query is negative. Hence if the query returns positive, then the item has been encountered previously and can as such be reported as a duplicate. The above examples illustrate just a small subset of applications of set-membership, but regardless of application they have one thing in common: There must be some data structure for the set. A popular structure for this is a hash table which assigns each item to a unique key and allows one to fetch/check the item in constant time by providing the corresponding key. But as the set grows larger, so does the memory requirements for the hash table. When the memory usage reaches a critical point it might be impossible to accommodate the table on a quick-to-access memory such as the cache, which may result in costly operations just to query the table. However, for specific applications, this can be avoided by using a probabilistic data structure instead. In these cases, one can often significantly reduce the amount 1 1. Introduction of memory needed by allowing the structure sometimes to claim that an item is a member of a set when it is not. While this is not always applicable, for example, one would never want to claim that a user might have administrative rights on one’s system, in many cases a small number of false positives can be tolerated to allow the structure to fit on a quicker access memory. To illustrate this, we yet again consider the example of speeding up database-queries. If the database contains a large number of keys, it might be impossible to accommodate the entire set-structure in the cache memory. In those cases to see if the key exists requires a computationally expensive disc-read, which might only slow the overall application down. However, by allowing a small rate of false positives, the application might be able to filter out the vast majority of queries that would not give any result. The time wasted searching the database for the non-existing keys that slipped through the filter would then be dwarfed by the speed-gains of successfully filtering the vast majority of non-existing keys.
A reprint is available as PDF.