eXept Software AG Logo

Smalltalk/X Webserver

Documentation of class 'BloomFilter':

Home

Documentation
www.exept.de
Everywhere
for:
[back]

Class: BloomFilter


Inheritance:

   Object
   |
   +--Collection
      |
      +--BloomFilter

Package:
stx:libbasic2
Category:
Collections-Unordered
Version:
rev: 1.8 date: 2024/03/13 09:02:10
user: cg
file: BloomFilter.st directory: libbasic2
module: stx stc-classLibrary: libbasic2

Description:


A Bloom filter is a space-efficient probabilistic data structure, 
conceived by Burton Howard Bloom in 1970 ('Space/Time Tradeoffs in Hash Coding with Allowable Errors'),
that represents a set of elements and allows you to test if an element is a membership.

A BloomFilter (see also in https://de.wikipedia.org/wiki/Bloomfilter)
is a Set-like data structure which can quickly answer whether
an element is DEFINITELY NOT in the set, but might generate FALSE POSITIVES
(i.e. it might answer true, although the element is not actually present).
Thus a true answer from the 'íncludes:' query MUST be interpreted as a 'maybe'

Bloom filters are used for tuning, when accesses (eg. to the disk or database)
are expensive, and can be avoided in many cases (in the NOT present case),
and false positives only lead to additional useless queries.

Bloom filters cannot remove or enumerate elements. 
The only useful operations are adding and inclusion tests (includes:).

For example, a database query (eg. 'is something present'), 
might make use of a bloom filter, to avoid the query in many cases.

A Bloom filter's false positive rate degrades as the number of elements is added,
thus a good choice of its initial size is mandatory (number of bits).

See the example, where the false positive rates are:
2 hashes:
    100% with 1000 bits,  
    73% with 5000 bits,  
    37% with 10000 bits,
    15% with 20000 bits
    5.6% with 30000 bits
    4.0% with 50000 bits
3 hashes:
    100% with 1000 bits,  
    85% with 5000 bits,  
    43% with 10000 bits,
    14% with 20000 bits
    5.9% with 30000 bits
    1.2% with 50000 bits

[complexity:]
    adding      O(1)
    testing     O(1)
    removing    -
    searching   O(1)

[caveat:]
    inheriting from Collection is not a good idea;
    we may need something like a CollectionLike abstract superclass.

copyright

COPYRIGHT (c) 2019 by Claus Gittinger All Rights Reserved This software is furnished under a license and may be used only in accordance with the terms of that license and with the inclusion of the above copyright notice. This software may not be provided or otherwise made available to, or used by, any other person. No title to or ownership of the software is hereby transferred.

moreDocumentation

Donald Knuth in his famous 'The Art of Computer Programming' writes the following about Bloom filters: Imagine a search application with a large database in which no calculation needs to be done if the search was unsucessful. For example, we might want to check somebody's credit rating or passport number, and if no record for that person appears in the file we don't have to investigate further. Similarly in an application to computerized typesetting, we might have a simple algorithm that hyphenates most words correctly, but it fails on some 50,000 exception words; if we don't find the word in the exception file we are free to use the simple algorithm. Also Broder and Mitzenmacher stablished the following principle: Whenever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated. Examples from the Real World - Medium uses Bloom filters to avoid recommending articles a user has previously read. - Google Chrome uses a Bloom filter to identify malicious URLs. - Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to reduce the disk lookups for non-existing rows or columns. - The Squid Web proxy uses Bloom filters for cache digests. Basic understanding The Bloom Filter can store a large set very efficiently by discarding the identity of the elements; i.e. it stores only a set of bits corresponding to some number of hash functions that are applied to each added element by the algorithm. Practically, the Bloom Filter is represented by a bit array and can be described by its length (m) and number of different hash functions (k). The structure supports only two operations: - Adding and element into the set, and - Testing whether an element is or is not a member of the set. The Bloom Filter data structure is a bit array where at the beginning all bits are equal to zero, meaning the filter is empty. For example, consider the following Bloom filter example created to represent a set of 10 elements with a false positive probability (FPP) of 0.1 (i.e. 10%). This empty filter will be backed by a bit array of length m=48 (storageSize) and 4 hash functions to produce values in the range {1, 2, ..., m}. To insert an element x into the filter, for every hash function h we compute its value j=h(x) on the element x and set the corresponding bit j in the filter to one. As an example, we are going to insert some names of cities into the previous filter. Let's start with 'Madrid': In order to find the corresponding bits to be set for 'Madrid', the filter computes the corresponding 4 hash values. Now assume that filter has set the bits 9, 18, 39 and 48 (* the bit numbers depend on the hash algorithm, and the numbers written here will not correspond to the ST/X Bloom filter's actual bits, as it uses a different hash algrithm and different filter size than the one presented in the text). It is possible that different elements can share bits, for instance, let's add another city, 'Barcelona', to the same filter. After adding the String 'Barcelona' to the previous Bloom filter only the bits 30, 36 and 42 have been set, meaning that the elements 'Madrid' and 'Barcelona' share one of their corresponding bits. To test if a given element x is in the filter, all k hash functions are computed and check bits in the corresponding positions. If all bits are set to one, then the element x *may exist* in the filter. Otherwise, the element x is *definitely not* in the filter. The uncertainty about the element's existence originates from the possibility of situations when some bits are set by different elements added previously. Consider the previous Bloom filter example with the elements 'Madrid' and 'Barcelona' already added. To test if the element 'Barcelona' is a member, the filter computes its 4 hash values and check that the corresponding bits in the filter are set to one, therefore the String 'Barcelona' may exist in the filter and the includes: 'Barcelona' returns true: Now if we check the element 'Berlin', the filter computes its hashes in order to find that the corresponding bits in the filter are not all in the filter, therefore the 'Berlin' element is definitely not in the filter, so the includes: method returns false: A Bloom filter can also result in a false positive result. For example, consider the element 'Roma', whose hash values collision in a bit that is already set in our filter example, so the result of the includes: method is that the element may exist in the filter: As we know, we have not added that element to the filter, so this is an example of a false positive event. In this particular case, the bit was set previously by the 'Barcelona' element.

Class protocol:

defaults
o  defaultHashes
the default here is to use two hashes

o  defaultNumberOfBits
the default here is to use 1000 bits

instance creation
o  new
the default here is to use 1000 bits and two hashes

o  new: initialNumberOfBits
the default here is to use two hashes

o  new: initialNumberOfBits hashes: hashActionBlocks
return a filter using the given number of bits and hashes


Instance protocol:

adding
o  add: aString
add the argument, aString to the set

blocked
o  do: aBlock
although I implement add, I do not remember previous elements,
and therefore cannot enumerate them.
all I do is to provide a definite 'not included' answer

o  remove: aString
raise an error - elements cannot be removed

initialization
o  numberOfBits: initialNumberOfBits hashes: hashingBlocksArg

printing
o  displayOn: aStream
cannot enumerate; thus use Object's displayOn: method

o  printOn: aStream
cannot enumerate; thus use Object's printOn: method

queries
o  includes: aString
returns false, if the argument, aString is *definitly NOT* in the set.
However, it may return true as a false positive


Examples:


<<END
    |filter|

    filter := BloomFilter new.
    filter add:'Madrid'.
    filter add:'Barcelona'.
    filter add:'Stuttgart'.
    filter add:'Müncheh'.
    Transcript show:'Madrid: '  ; showCR:(filter includes:'Madrid'). 
    Transcript show:'Barcelona: '; showCR:(filter includes:'Barcelona').     
    Transcript show:'Berlin: '   ; showCR:(filter includes:'Berlin').     
    Transcript show:'Roma: '     ; showCR:(filter includes:'Roma').     

    |filter set nFalsePositive|

    #(1000 5000 10000 20000 30000 50000) do:[:nrOfBits |
        filter := BloomFilter new:nrOfBits.
        set := Set new:10000.

        "/ add all words from the Collection class's source code
        Collection sourceStream 
            linesDo:[:eachLine |
                eachLine 
                   splitOn:[:ch | ch isSeparator]
                   do:[:word | 
                      filter add:word.
                      set add:word.
                   ].
            ];
            close.

        "/ check some which are known to be present;
        "/ the filter MUST return true
        set do:[:w |
            self assert:(filter includes:w)
        ].

        "/ check for false positives
        "/ the filter will return true for some words not in the set
        nFalsePositive :=
            (1 to:1000) count:[:n |
                |randomWord isFalsePositive|

                isFalsePositive := false.
                randomWord := (1 to:6) collect:[:i | ($a to: $z) atRandom] as:String.
                (set includes:randomWord) ifFalse:[
                    (filter includes:randomWord) ifTrue:[
                        "/ Transcript showCR:'false positive: ',randomWord.
                        isFalsePositive := true.
                    ].    
                ].
                isFalsePositive
            ].
        Transcript show:'bits: '; showCR:nrOfBits.
        Transcript show:nFalsePositive; show:' false positives'.
        Transcript show:' ('; show:(nFalsePositive asPercentFrom:1000.0); showCR:' %)'.
        Transcript showCR:'----------'.
    ]
END>>"

ST/X 7.7.0.0; WebServer 1.702 at 20f6060372b9.unknown:8081; Sat, 21 Dec 2024 16:33:58 GMT