|
Class: BloomFilter
Object
|
+--Collection
|
+--BloomFilter
- Package:
- stx:libbasic2
- Category:
- Collections-Unordered
- Version:
- rev:
1.8
date: 2024/03/13 09:02:10
- user: cg
- file: BloomFilter.st directory: libbasic2
- module: stx stc-classLibrary: libbasic2
A Bloom filter is a space-efficient probabilistic data structure,
conceived by Burton Howard Bloom in 1970 ('Space/Time Tradeoffs in Hash Coding with Allowable Errors'),
that represents a set of elements and allows you to test if an element is a membership.
A BloomFilter (see also in https://de.wikipedia.org/wiki/Bloomfilter)
is a Set-like data structure which can quickly answer whether
an element is DEFINITELY NOT in the set, but might generate FALSE POSITIVES
(i.e. it might answer true, although the element is not actually present).
Thus a true answer from the 'íncludes:' query MUST be interpreted as a 'maybe'
Bloom filters are used for tuning, when accesses (eg. to the disk or database)
are expensive, and can be avoided in many cases (in the NOT present case),
and false positives only lead to additional useless queries.
Bloom filters cannot remove or enumerate elements.
The only useful operations are adding and inclusion tests (includes:).
For example, a database query (eg. 'is something present'),
might make use of a bloom filter, to avoid the query in many cases.
A Bloom filter's false positive rate degrades as the number of elements is added,
thus a good choice of its initial size is mandatory (number of bits).
See the example, where the false positive rates are:
2 hashes:
100% with 1000 bits,
73% with 5000 bits,
37% with 10000 bits,
15% with 20000 bits
5.6% with 30000 bits
4.0% with 50000 bits
3 hashes:
100% with 1000 bits,
85% with 5000 bits,
43% with 10000 bits,
14% with 20000 bits
5.9% with 30000 bits
1.2% with 50000 bits
[complexity:]
adding O(1)
testing O(1)
removing -
searching O(1)
[caveat:]
inheriting from Collection is not a good idea;
we may need something like a CollectionLike abstract superclass.
copyrightCOPYRIGHT (c) 2019 by Claus Gittinger
All Rights Reserved
This software is furnished under a license and may be used
only in accordance with the terms of that license and with the
inclusion of the above copyright notice. This software may not
be provided or otherwise made available to, or used by, any
other person. No title to or ownership of the software is
hereby transferred.
moreDocumentationDonald Knuth in his famous 'The Art of Computer Programming' writes the following about Bloom filters:
Imagine a search application with a large database in which no calculation needs to be done
if the search was unsucessful.
For example, we might want to check somebody's credit rating or passport number,
and if no record for that person appears in the file we don't have to investigate further.
Similarly in an application to computerized typesetting,
we might have a simple algorithm that hyphenates most words correctly,
but it fails on some 50,000 exception words;
if we don't find the word in the exception file we are free to use the simple algorithm.
Also Broder and Mitzenmacher stablished the following principle:
Whenever a list or set is used, and space is at a premium,
consider using a Bloom filter if the effect of false positives can be mitigated.
Examples from the Real World
- Medium uses Bloom filters to avoid recommending articles a user has previously read.
- Google Chrome uses a Bloom filter to identify malicious URLs.
- Google BigTable, Apache HBase and Apache Cassandra use Bloom filters to reduce the disk lookups for non-existing rows or columns.
- The Squid Web proxy uses Bloom filters for cache digests.
Basic understanding
The Bloom Filter can store a large set very efficiently by discarding the identity of the elements;
i.e. it stores only a set of bits corresponding to some number of hash functions that are applied
to each added element by the algorithm.
Practically, the Bloom Filter is represented by a bit array and can be described by its length
(m) and number of different hash functions (k).
The structure supports only two operations:
- Adding and element into the set, and
- Testing whether an element is or is not a member of the set.
The Bloom Filter data structure is a bit array where at the beginning all bits are equal to zero,
meaning the filter is empty.
For example, consider the following Bloom filter example created to represent a set of 10 elements with a
false positive probability (FPP) of 0.1 (i.e. 10%).
This empty filter will be backed by a bit array of length m=48 (storageSize) and 4 hash functions
to produce values in the range {1, 2, ..., m}.
To insert an element x into the filter, for every hash function h we compute its value j=h(x) on the element x
and set the corresponding bit j in the filter to one.
As an example, we are going to insert some names of cities into the previous filter. Let's start with 'Madrid':
In order to find the corresponding bits to be set for 'Madrid', the filter computes the corresponding 4 hash values.
Now assume that filter has set the bits 9, 18, 39 and 48
(* the bit numbers depend on the hash algorithm, and the numbers written here will not correspond to the ST/X Bloom filter's
actual bits, as it uses a different hash algrithm and different filter size than the one presented in the text).
It is possible that different elements can share bits, for instance, let's add another city, 'Barcelona',
to the same filter. After adding the String 'Barcelona' to the previous Bloom filter only the bits 30, 36 and 42
have been set, meaning that the elements 'Madrid' and 'Barcelona' share one of their corresponding bits.
To test if a given element x is in the filter, all k hash functions are computed and check bits in the corresponding
positions. If all bits are set to one, then the element x *may exist* in the filter.
Otherwise, the element x is *definitely not* in the filter.
The uncertainty about the element's existence originates from the possibility of situations when some bits are set
by different elements added previously.
Consider the previous Bloom filter example with the elements 'Madrid' and 'Barcelona' already added.
To test if the element 'Barcelona' is a member, the filter computes its 4 hash values and check that the corresponding bits
in the filter are set to one, therefore the String 'Barcelona' may exist in the filter and the includes: 'Barcelona'
returns true:
Now if we check the element 'Berlin', the filter computes its hashes in order to find that the corresponding bits in the filter
are not all in the filter, therefore the 'Berlin' element is definitely not in the filter,
so the includes: method returns false:
A Bloom filter can also result in a false positive result.
For example, consider the element 'Roma',
whose hash values collision in a bit that is already set in our filter example,
so the result of the includes: method is that the element may exist in the filter:
As we know, we have not added that element to the filter, so this is an example of a false positive event.
In this particular case, the bit was set previously by the 'Barcelona' element.
defaults
-
defaultHashes
-
the default here is to use two hashes
-
defaultNumberOfBits
-
the default here is to use 1000 bits
instance creation
-
new
-
the default here is to use 1000 bits and two hashes
-
new: initialNumberOfBits
-
the default here is to use two hashes
-
new: initialNumberOfBits hashes: hashActionBlocks
-
return a filter using the given number of bits and hashes
adding
-
add: aString
-
add the argument, aString to the set
blocked
-
do: aBlock
-
although I implement add, I do not remember previous elements,
and therefore cannot enumerate them.
all I do is to provide a definite 'not included' answer
-
remove: aString
-
raise an error - elements cannot be removed
initialization
-
numberOfBits: initialNumberOfBits hashes: hashingBlocksArg
-
printing
-
displayOn: aStream
-
cannot enumerate; thus use Object's displayOn: method
-
printOn: aStream
-
cannot enumerate; thus use Object's printOn: method
queries
-
includes: aString
-
returns false, if the argument, aString is *definitly NOT* in the set.
However, it may return true as a false positive
<<END
|filter|
filter := BloomFilter new.
filter add:'Madrid'.
filter add:'Barcelona'.
filter add:'Stuttgart'.
filter add:'Müncheh'.
Transcript show:'Madrid: ' ; showCR:(filter includes:'Madrid').
Transcript show:'Barcelona: '; showCR:(filter includes:'Barcelona').
Transcript show:'Berlin: ' ; showCR:(filter includes:'Berlin').
Transcript show:'Roma: ' ; showCR:(filter includes:'Roma').
|
|filter set nFalsePositive|
#(1000 5000 10000 20000 30000 50000) do:[:nrOfBits |
filter := BloomFilter new:nrOfBits.
set := Set new:10000.
"/ add all words from the Collection class's source code
Collection sourceStream
linesDo:[:eachLine |
eachLine
splitOn:[:ch | ch isSeparator]
do:[:word |
filter add:word.
set add:word.
].
];
close.
"/ check some which are known to be present;
"/ the filter MUST return true
set do:[:w |
self assert:(filter includes:w)
].
"/ check for false positives
"/ the filter will return true for some words not in the set
nFalsePositive :=
(1 to:1000) count:[:n |
|randomWord isFalsePositive|
isFalsePositive := false.
randomWord := (1 to:6) collect:[:i | ($a to: $z) atRandom] as:String.
(set includes:randomWord) ifFalse:[
(filter includes:randomWord) ifTrue:[
"/ Transcript showCR:'false positive: ',randomWord.
isFalsePositive := true.
].
].
isFalsePositive
].
Transcript show:'bits: '; showCR:nrOfBits.
Transcript show:nFalsePositive; show:' false positives'.
Transcript show:' ('; show:(nFalsePositive asPercentFrom:1000.0); showCR:' %)'.
Transcript showCR:'----------'.
]
| END>>"
|