Bloom Filters

Ғылым және технология

Thanks to Hostinger: hostinger.com/mcoding
Use coupon code MCODING at checkout to get an additional 10% off!
Bloom filters are a simple probabilistic kind of set that exemplify how using probability to your advantage can result in huge performance wins at scale. By giving up some of the flexibility of a full set interface and even getting the wrong answer to "is this element in the set?" on a small percentage of elements, we can reduce memory usage in our example web service from over 1 GB down to 10 MB to keep a "set" of 10M links, which in turn makes it practical to store the data on-server instead of calling out to an (imaginary) costly third party API, allowing us to respond to clients immediately 98% of the time, resulting in much faster response times and a lower API bill.
― mCoding with James Murphy (mcoding.io)
Source code: github.com/mCodingLLC/VideosS...
Bloom filters: en.wikipedia.org/wiki/Bloom_f...
Confused about new Python 3.12 syntax? • Python 3.12 is HERE!
Made with Manim: docs.manim.community/
Sponsored by Hostinger: hostinger.com/mcoding
SUPPORT ME ⭐
---------------------------------------------------
Sign up on Patreon to get your donor role and early access to videos!
/ mcoding
Feeling generous but don't have a Patreon? Donate via PayPal! (No sign up needed.)
www.paypal.com/donate/?hosted...
Want to donate crypto? Check out the rest of my supported donations on my website!
mcoding.io/donate
Top patrons and donors: Jameson, Laura M, Dragos C, Vahnekie, Neel R, Matt R, Johan A, Casey G, Mark M, Mutual Information, Pi
BE ACTIVE IN MY COMMUNITY 😄
---------------------------------------------------
Discord: / discord
Github: github.com/mCodingLLC/
Reddit: / mcoding
Facebook: / james.mcoding
CHAPTERS
---------------------------------------------------
0:00 Intro
0:54 Sponsored message
1:50 Bloom filter interface
2:21 Why bloom filters?
3:10 How does it work?
4:31 Analyzing the False Positve Rate
6:33 Where to get k hashes?
7:30 Bloom filter implementation
8:44 Bit Arrays
9:32 Running the example code
11:11 Thanks

Пікірлер: 74

  • @mCoding
    @mCoding6 ай бұрын

    Thanks to this video's sponsor, Hostinger: hostinger.com/mcoding. Don't forget to use coupon code MCODING at checkout to get an additional 10% off!

  • @Shaft0

    @Shaft0

    6 ай бұрын

    Code for this one does not seem to be uploaded to the usual github location

  • @mCoding

    @mCoding

    6 ай бұрын

    The code is up now! github.com/mCodingLLC/VideosSampleCode/blob/master/videos/132_bloom_filter/bloom_filter.py@@Shaft0

  • @garyantonyo
    @garyantonyo6 ай бұрын

    Love that you went through the math (even if it was a bit hand wavey for brevity). This feels like an actually useful data structure, and one that I was not aware of.

  • @mCoding

    @mCoding

    6 ай бұрын

    Details are left as an exercise to the reader :)

  • @Zaniahiononzenbei
    @Zaniahiononzenbei6 ай бұрын

    Shadows are a brilliant comparison for bloom filters. I love this, thanks!

  • @TheOriginalSentack
    @TheOriginalSentack6 ай бұрын

    That was a fascinating video and I can think of a few places in my code base that could use this to help reduce the number of unnecessary calls to an API. This also feels like it should be a Python package someone implements in C on the back-end.

  • @aonodensetsu

    @aonodensetsu

    6 ай бұрын

    there's a rust-for-python package called rbloom

  • @loganrussell48
    @loganrussell485 ай бұрын

    I’ve always used a single hash function, with several smaller bloom “filters”(bit arrays), each with a length that is prime, and mod the hash by the length of each “filter”. Assuming that your hash function is good, this makes the false positive calculation very easy. You take the proportion of set bits in each bit array and multiply them all together. Ex: your filters are of size 11, 13, and 17 (small for demonstration). And the number of set bits are 4, 4, 6, then the false positive rate is (4/11) * (4/13) * (6/17) = 96/2431 = ~ 4% If your application involves likely getting a lot of “definitely not” responses, you could sort the set of bit arrays by fullness, so the subarray that is most likely to say “definitely not” is tried first, and so on down the line, to speed up ever so slightly the decision process. The other benefit here is if your memory is super constrained (or your bloom filters massively large) and you don’t have a big enough chunk of memory available for the full bit array contiguously in memory, several smaller bit arrays may still be able to be allocated in the spaces of memory available. Very niche use case there, but still.

  • @arisweedler4703
    @arisweedler47036 ай бұрын

    I love your videos. High information density to explain the main idea so simply and so perfectly and so fast. Then minutes on practical explanation

  • @glorytoarstotzka330
    @glorytoarstotzka3306 ай бұрын

    thanks a lot for making the explanations dark mode, it is eye burning for me to use white mode, mostly cuz I got an eyes issue. it doesn't matter how hard I use a blue filter or how much ambient light I have in the room, full white screen hurts within 10 mins max

  • @quintrankid8045

    @quintrankid8045

    6 ай бұрын

    Does your browser or OS have a method to invert color?

  • @glorytoarstotzka330

    @glorytoarstotzka330

    6 ай бұрын

    ​@@quintrankid8045 I am not sure about that. I didn't try to search. I just use Dark Reader to darkmode everything and set a shortcut to toggle it very fast when it breaks

  • @djixi98

    @djixi98

    6 ай бұрын

    ​@@quintrankid8045mine does, one of the few reasons why I exclusively use Opera GX and their force dark mode for all websites lol (other being mouse gestures). Win11 has a nice dark theme built in. Additionally, all GUIs I write have dark themes as well lol

  • @csanadtemesvari9251

    @csanadtemesvari9251

    6 ай бұрын

    Skill issue

  • @arisweedler4703
    @arisweedler47036 ай бұрын

    I failed to parse “numerical hot loop” even though I understood the components. Imma break it down for my own edification * loop -> runs multiple times * hot -> on the critical path * numerical -> doing numeric computations. Why is this even a relevant word? Well because Python addition (and all other numeric operations) are slower than addition accelerated by a library like numpy (especially for parallelizable stuff like working with arrays) because you have to update the values in the python VM (through doing them in the CPU then storing them in memory) vs. numpy or other libs that let you just do more in the CPU. Also the GIL stops multiple processes from working at once. So a “numerical hot loop” -> slower-than-necessary computations on the critical path and you run it multiple times.

  • @nadavgover6017
    @nadavgover60176 ай бұрын

    Best video of the week if not the month. Every second I was fascinated, even the sponsorship was spot on

  • @sufilevy
    @sufilevy6 ай бұрын

    I have to say, in general I'm not a big fan of Python, but the way that you write code kind of makes me like it again!

  • @MrTrebor2
    @MrTrebor26 ай бұрын

    Outstanding! Big motivation to learn algos & data structures. Usecase and shadow comparison priceless!

  • @Khushpich
    @Khushpich6 ай бұрын

    great video introduction to bloom filtrers!

  • @jpay
    @jpay6 ай бұрын

    At 6:50 it's stated the hash functions must be at least 27 bits long. Then at 7:10 the 256-bit hash function gets divided into four chunks of 48 bits - could we have divided this into *nine* chunks of 27 bits (with some room leftover)? Where did the 48 bits come from - was it because we want 5 hash functions and also want to maximally use the big 256-bit function??

  • @volodymyrchelnokov8175

    @volodymyrchelnokov8175

    6 ай бұрын

    if you have an array size M = 80,000,000 like in the example you need 26.25 hash bits, but now if you take 27 bit hash you get numbers from 0 to 2^27-1 = 134,217,727 so you need to take this number modulo M to get the array index. But then the indices from 0 to 54,217,727 are twice as probable as other ones, and the filter becomes less efective - to avoid this kind of problem it is better to take larger chunks of bits before division. Ofc you can take M = 2^27 and forget both about this problem and the need to divide.

  • @luizzeroxis
    @luizzeroxis6 ай бұрын

    That's really smart, thanks for showing the idea on the video, it's a good one to have in your pocket

  • @shrayesraman5192
    @shrayesraman51926 ай бұрын

    Bloom filters have major application in computational biology. So funny that you posted this video. I was talkiny about this earlier haha.

  • @ShredWithKyle
    @ShredWithKyle4 ай бұрын

    Really appreciate this video, thank you

  • @osmarperez2341
    @osmarperez23416 ай бұрын

    Awesome content! Thank you so much.

  • @siddharth-gandhi
    @siddharth-gandhi6 ай бұрын

    if you made explanations for algorithms like this, maybe i'd get some motivation to do DS/Algo for interview prep. Great video!

  • @Not_Even_Wrong
    @Not_Even_Wrong6 ай бұрын

    Amazing stuff. Cheers.

  • @quintencabo
    @quintencabo6 ай бұрын

    There are also hash functions where you can set the output lenght.

  • @andremaldonado7410
    @andremaldonado74106 ай бұрын

    I swear, I always learn something new from you. Greatest gift one can get is experience, and I'm very grateful you are generous with yours

  • @user-hk3ej4hk7m
    @user-hk3ej4hk7m6 ай бұрын

    This is really cool, thanks for sharing. Since python ints are essentially any length, do you think a more performant implementation could use the a really large int as the memory, or would it mess with the way int is implemented?

  • @roelant8069

    @roelant8069

    6 ай бұрын

    Accidentally sent my earlier comment earlier (and I don't see it so idk if it's even there) Anyway, a reason not to use python ints for this is that you need to do bit manipulations on them and I don't know how receptive they are to that (you can definitely do that but having a special class for it will make it a lot less abstract)

  • @user-ex8dk3ic3x
    @user-ex8dk3ic3x6 ай бұрын

    Fascinating stuff as always. Could you do a coding vid on the enhanced trial division method to see how much faster it is than trial division?

  • @lukekurlandski7653
    @lukekurlandski76536 ай бұрын

    Facinating.

  • @JonathanZigler
    @JonathanZigler6 ай бұрын

    I'm sure one could keep a handful of known good link filters too to eliminate any fuzzed links

  • @yiannifunataoldotcom
    @yiannifunataoldotcom6 ай бұрын

    That was really impressive. Thanks for sharing!

  • @jpay
    @jpay6 ай бұрын

    I'm misunderstanding something at 5:22. The probability that a bit is set after n insertions [call this p] is p = 1 - (1 - 1/m)^(kn). But then we say the probability is a false positive rate [call this r] is just r = p^k. But isn't this something that's *exactly* represented by binomial probability? I.e. the probability of one false positive (call this t_1) is t_1 = (n choose 1) * (p)^1 * (1-p)^(n-1), and the probability of two false positives is t_2 = (n choose 2) * (p)^2 * (1-p)^(n-2), and so on. So the false positive rate is *actually* equal to the sum of all the probabilities of getting each number of false positives. i.e. r = t_1 + t_2 + t_3 + ... + t_k. What am I misapplying here? Are they equivalent? I couldn't get the algebra to work showing they are equal.

  • @volodymyrchelnokov8175

    @volodymyrchelnokov8175

    6 ай бұрын

    for false positive rate we consider a probability that some element not added to the filter, is recognized as already added false positive = all k bits with indices generated by the hashes are set - probability P = p^k true negative = at least one bit is not set - probability P' = 1 - p^k = sum (k choose i) * (1-p)^i * p^(k-i) for i in 1..k

  • @Bigfootdu78
    @Bigfootdu785 ай бұрын

    Very nice video as always. One question: I don't understand why many hash functions are needed. The example hash function from your video set a bit uniformly. Why not just call k times this function ?

  • @mCoding

    @mCoding

    5 ай бұрын

    Great question, since hash functions are not truly random but rather deterministic functions, applying the same hash k times will yield the same value each time, setting only 1 bit of shadow instead of k bits of shadow. This violates the independence assumption we made to perform the analysis and will result in the performance characteristics of a bloom filter with k=1, which is not necessarily bad, but you could do much better with k=5.

  • @vsolyomi
    @vsolyomi6 ай бұрын

    Where do you learn about those to begin with? In CS grade course? In the wild? Is there a book of spells for such things?

  • @yudhiesh1997
    @yudhiesh19976 ай бұрын

    Cool thing is that Redis has a built-in Bloom Filter for you to use instead of implementing and managing it yourself.

  • @crystalvoyager2495
    @crystalvoyager24956 ай бұрын

    "for a fixed bit, the chance that one hash of a single element will set that bit is 1/m": I don't get this. This would mean that the hash only sets one bit, hence the chance of a fixed bit to be set would be 1/m. But my understanding is that a hash will set many bits, in fact, on average half of the bits, so I would expect the chance of a bit to be 1 when hashing one element is 1/2. can anyone explain please?

  • @piguyalamode164

    @piguyalamode164

    6 ай бұрын

    I think the idea is that when we add an element e to the set, we set the bit at index k_i(e) to 1 for each i. So, in other words, the normal hash output(maybe with some mod) is the index of the bit we set, not the bits we set

  • @crystalvoyager2495

    @crystalvoyager2495

    6 ай бұрын

    @@piguyalamode164 "the hash output is the index of the bit we set" that's what I misunderstood initially, thanks a lot for clearing that up!

  • @tchittesh

    @tchittesh

    6 ай бұрын

    @@crystalvoyager2495 I was also confused by this! I think I got thrown off by the animation at 3:45 , not understanding that the three arrows were created by each independent hash function. Thanks for the explainer @piguyalamode164

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff6 ай бұрын

    03:37 is using 1 small buffer PER hash function worse than 1 big buffer for all hash functions?

  • @KappaHakka
    @KappaHakka6 ай бұрын

    I don't understand something. Considering this use case wouldn't it mean that you would need to have the whole dataset to initialize the bloom filter? Otherwise how can you know if a site is definitely not in the set without querying the service?

  • @HTWwpzIuqaObMt
    @HTWwpzIuqaObMt6 ай бұрын

    Nice

  • @hoteny
    @hoteny6 ай бұрын

    Soooo the more memory the more accuracy or no? (4:27)

  • @SpeedingFlare
    @SpeedingFlare6 ай бұрын

    If I had enough disk on the server, I'd check the "probably" response against a local sqlite db of the strings indexed by URL. And if the links service had an API for new bad links, I'd pull from that to keep the bloom filter and db updated. I dunno

  • @efperel
    @efperel6 ай бұрын

    What if the array is all 1s? Won’t it return a false positive for any element?

  • @Pownyan

    @Pownyan

    6 ай бұрын

    That is why you get worse results by using too many hash functions. At infinite hash functions all bits are set to 1

  • @NateROCKS112

    @NateROCKS112

    6 ай бұрын

    It would, but that is impossible if the number of hashes is less than the number of bits per element.

  • @skellt
    @skellt6 ай бұрын

    It seems like these cool "memory-saving" data structures are becoming less and less used as more and more companies just rely on increasing cloud resources. Thanks for including the mathematics on the parameter variables, I think it's much more interesting than just running brute force tests.

  • @gregorymorse8423

    @gregorymorse8423

    6 ай бұрын

    It seems you dont know what you are talking about. Any cloud engineer worth their weight in salt knows when and how to apply such techniques. It seems you are making irrational assumptions without evidence.

  • @t_kon

    @t_kon

    6 ай бұрын

    You're just shifting bloom filter to the cloud provider. In the background, such technique are all used in cloud infrastructure as well.

  • @VeejayRampay
    @VeejayRampay6 ай бұрын

    because you don't delete from bloom filters, i feel like _any number_ of zeros at the locations given by the hash functions indicates absence (you don't need to check all of them)

  • @mCoding

    @mCoding

    6 ай бұрын

    Yep, you got it! The contains method returns True if they are all set, so it returns False if *any* are not set.

  • @anon_y_mousse
    @anon_y_mousse6 ай бұрын

    I'd swear I learned this algorithm with a different name, but I can't seem to find it. Maybe the source made up a name because they didn't know where it came from and I should've always known it as a Bloom filter. Weird either way.

  • @kenny-kvibe
    @kenny-kvibe6 ай бұрын

    Looks like Subnet Masking, but instead of passing by bit:0, it passes by bit:1 through OR comparison, kinda wierd..what happens if the "possible" is acutally "no", is it checked again ??

  • @bearwolffish

    @bearwolffish

    6 ай бұрын

    Usually some more computationally expensive check if it's a maybe. Sometimes will be nested bloom filters. Think ethereum block bloom filters. To scan for a swap topic in a transaction, rather than check every transaction of every block you can check the block filter and if all bits are set then maybe there is a swap event in there. Then check the transaction bloom filters from the maybe's to find which may have the topic, and only on these maybe's will we bother pulling a transaction receipt and matching hashes to find what we want to decode.

  • @danielglauche6095
    @danielglauche60956 ай бұрын

    splunk is using bloomfilters heavily

  • @stan4676
    @stan46766 ай бұрын

    I want more

  • @jlhjlh
    @jlhjlh6 ай бұрын

    Please consider this video's sponsor... me! That's right I'm sponsoring myse.... WAIT, WHAT? Not this time?

  • @QuantumHistorian
    @QuantumHistorian6 ай бұрын

    That was... fast. Especially the section starting at 3:30 that explains the main idea. The meat of the whole video is in the 30 seconds that follow, and honestly just felt super rushed. Took me a few attempts to understand that you don't store each hash in memory directly, but that each hash is an address to where in memory you set a bit to 1.

  • @Mateusz143
    @Mateusz1434 ай бұрын

    I was like ooooh damn now we surely got to implement this in C or Rust…. Then you hit us with Python and my heart stopped 😢

  • @simonnjoroge933
    @simonnjoroge9336 ай бұрын

    I also want a PhD.

  • @bereck7735
    @bereck77356 ай бұрын

    discord gang.

  • @zizzyballuba4373
    @zizzyballuba43736 ай бұрын

    i understood nothing

  • @mikeshardmind
    @mikeshardmind6 ай бұрын

    This one's a bit disappointing to lead people to, as there are better structures for this purpose now. Cuckoo filters and binary fuse filters each have better characteristics.

  • @muhdiversity7409
    @muhdiversity74096 ай бұрын

    This sounds like it's voiced by AI

  • @palapapa0201
    @palapapa02016 ай бұрын

    8:44 Or stop using python

Келесі