Adding Nested Loops Makes this Algorithm 120x FASTER?

Science & Technology

In the last video, I introduced the concepts of compute-bounded and memory-bounded tasks. This video takes a step further and uses the theory we discussed to optimize a famous memory-bounded algorithm.
Many of these tricks are counterintuitive but highly effective. By the end of the video, you'll find we can make the algorithm around 120x faster than the naive implementation.
Join the Discord server: / discord
My GitHub profile: github.com/fangjunzhou
The source code used in this video: github.com/fangjunzhou/blas-p...
Motion Canvas official repository: github.com/motion-canvas/moti...
My custom fork of Motion Canvas: github.com/fangjunzhou/motion...

Comments: 175

  • @bernardcrnkovic3769
    9 months ago

    dude, this video is so beautifully animated. i can't wrap my head around how much time you spent on this.

  • @mhavock

    9 months ago

    yeah, cache you l8ter! 🤣

  • @minhuang8848

    9 months ago

    The animation is sweet, but do people realize that this is a native ultrawide vid? IMMEDIATE subscription for choosing by far the best aspect ratio... y'know, the one we collectively failed to adopt. Seriously, not opting for 21:9 is the big tech fumble of our time, it's wild how narrow and tight 16:9 looks these days if you have some UW experience.

  • @gordonlawrenz7635

    9 months ago

    @@minhuang8848 I like horizontal space. Give me a 32-40 inch 4:3. 4:3 is crazy good to work with.

  • @ImSidgr

    9 months ago

    Good ol’ Canva

  • @itellyouforfree7238

    9 months ago

    @@minhuang8848 21:9 is one of the worst inventions of our time

  • @Zullfix
    8 months ago

    The fact that optimizing for cache hits and enabling SIMD was able to bring a given matrix multiplication operation from 2000ms to 15ms is wild.
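The kind of cache-friendly rewrite behind numbers like these can be sketched as a loop interchange. This is an illustrative example, not the video's actual code:

```c
#include <stddef.h>

/* Illustrative sketch: the classic (i, j, k) loop order walks B down a
 * column with stride n, missing cache constantly. Swapping to (i, k, j)
 * makes every access in the hot inner loop stride-1, which both reuses
 * whole cache lines and lets the compiler auto-vectorize with SIMD. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double a = A[i * n + k]; /* invariant over j */
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```

Combined with blocking and compiler vectorization, this reordering is a large part of where such speedups come from.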

  • @okie9025

    8 months ago

    It reminds me of when Matt Parker made a unique wordle combination finding algorithm that took a month to complete, and then his viewers managed to optimize it so it only takes a few ms.

  • @amirnuriev9092

    7 months ago

    @@okie9025 This is not the same at all. Both algorithms here are O(n^3); the speedup comes from constant-factor optimizations. In Matt Parker's video he basically wrote the most naive, brute-force approach.

  • @MichaelPohoreski

    7 months ago

    @@okie9025 Matt Parker *isn't a programmer*, so he implemented a naive solution in a dog-slow language, Python. It took me one day to write a C solution; it took 20 mins to run. The next morning I optimized it down to 20 seconds. By the evening I had multithreading working on my 24-core/48-thread Threadripper; it took ~2 seconds to find all 538 solutions. Understanding not only data access but also the calculations is paramount to getting high performance.

  • @BosonCollider

    3 months ago

    @@amirnuriev9092 Right, this is the kind of speedup that programmers with only Python experience would assume to be impossible: multiple orders of magnitude of improvement despite having the same asymptotic complexity, using the same language, and not even having any stupid premature abstractions in the initial version that could slow things down.

  • @foolriver
    9 months ago

    You can get better performance by unrolling the loops (do it aggressively; for example, unroll the whole inner loop by the block size, which in GCC is #pragma GCC unroll 8). However, it would still be about 3 times slower than Intel MKL.
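For reference, the pragma mentioned here goes directly above the loop it unrolls. A small sketch; the pragma is GCC-specific (GCC 8+) and is ignored by other compilers, so the function stays correct either way:

```c
#include <stddef.h>

/* Ask GCC to unroll the fixed-size inner loop by the block size.
 * Unrolling removes loop overhead and exposes independent multiply-adds
 * that the CPU can execute in parallel. */
double dot_block8(const double *a, const double *b)
{
    double sum = 0.0;
#pragma GCC unroll 8
    for (size_t i = 0; i < 8; i++)
        sum += a[i] * b[i];
    return sum;
}
```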

  • @axe863

    8 months ago

    Why wouldn't the compiler pick that up? Make the compiler more aggressive?

  • @blipman17

    8 months ago

    @@axe863 GCC in particular doesn't like unrolling loops and vectorizing them as much as LLVM does, in my experience. Although it seems to have caught up a lot in the last few versions.

  • @axe863

    8 months ago

    @blipman17 Thanks for the info.

  • @davidr2421

    7 months ago

    @@blipman17 Yeah this is some info I was looking for. Based on what I've seen, LLVM would just automatically do these SIMD things without needing to be explicitly told, right?

  • @blipman17

    7 months ago

    @@davidr2421 Errr… yes and no. It really depends. Moving data into SIMD registers and back has overhead compared to the regular single-instruction, single-data (SISD) code we normally write, so some SISD assembly implementations can simply be faster than the equivalent SIMD algorithm, depending on the instruction set.

    Also, heat. All instructions generate some amount of heat, which affects clock speed. So while an algorithm might be faster with SIMD, on a given processor it might be slower because the CPU thermal-throttles. This is (or was) extremely apparent with AVX-512 SIMD instructions on Intel, and it's also part of why GCC used to be (or still is) hesitant about SIMD instructions.

    Furthermore, any consumer or server CPU of the last 20 years can run instructions out of order and in parallel. This is extremely apparent in for-loops, where for example the 6th iteration of the loop is started before the first iteration has even finished. That's parallelism you get for free. And all of this changes drastically per specific model of CPU + RAM, and sometimes even between different cooling solutions. So… benchmark before making any such statements. This is extremely difficult.

  • @RoyerAdames
    8 months ago

    I saw your video on ThePrimeTime, and it is epic. Very well explained for such high-level topics.

  • @just-a-hriday

    3 months ago

    I believe it's actually rather low level.

  • @Affax
    9 months ago

    This video flew entirely above my head (still getting into this math in Uni haha), but the presentation is stunning, and I still watched through all of it because these optimizations are weirdly beautiful to me. Awesome job!

  • @stxnw

    8 months ago

    u only need basic algebra bro..

  • @Val-vm6qu
    8 months ago

    The final part where you say "Oh but there is that library that does everything for you for that case" just made me laugh my head off, great vid full of deep commentary, really great job!

  • @trustytrojan
    9 months ago

    great animations and visual style! you've found yourself a great side hobby :)

  • @erikjohnson9112

    9 months ago

    I was thinking the same. The matrix multiplication was very effective. I remembered the first matrix row gets a dot product with 2nd matrix's first column, so I could see that it was lower left being the first matrix and upper right being the second (lower right obviously being the result).

  • @1000percent1000
    9 months ago

    I've been teaching myself how to use SIMD for the past 3 weeks and I can't even tell you how helpful this video was. I was baffled when my serial implementation of some image processing code was 10x faster than my naive SIMD implementation; it took me quite a while to understand how that was possible. This video has made me greatly appreciate the simplicity of Rust's (experimental) portable SIMD library. Also, I did not know what OpenMP was, and it seems somewhat similar to Rust's library. Absolutely incredible video!

  • @shimadabr
    9 months ago

    That's awesome! I've been studying parallel programming, and a lot of these strategies I had no idea were possible. I wish my university had courses on HPC/parallel programming like yours. The course you mentioned at the end of the video seems great.

  • @cefcephatus
    8 months ago

    I love how this in-depth video makes me feel smart. I know that only a few people can make sense of content like this, but you make it feel like many more people can get close to it.

  • @fenril6685
    9 months ago

    Subscribed. I can't believe you don't have more subscribers already. Any software engineer dealing with matrix math should watch this video.

  • @giordano7703
    9 months ago

    I love it when videos like these give me inspiration to delve into a topic I'm not familiar with. I admire people like you for coming up with such elegant and beautiful ways to communicate these concepts.

  • @Dayal-Kumar
    9 months ago

    I did this exact thing in my intern project this summer at TI. Their processor even had a special instruction to do the matrix multiplication of two small blocks. I was able to achieve around 80% of the rated capacity of the processor.

  • @MahdeenSky
    9 months ago

    I loved the presentation, and the animations really helped me grasp the concepts.

  • @nordicus666
    8 months ago

    Great videos. You clearly know what you're talking about, and you share it while keeping it interesting. Keep it up; you deserve more subscribers.

  • @Otomega1
    9 months ago

    love this stuff, keep up the great work!

  • @naj0916
    8 months ago

    Very good. I haven't thought about optimization for a long time since I was doing Assembly in the 90's. This brings back memories and the feeling. Very good!

  • @Antonio-yy2ec
    9 months ago

    Pure gold! Thanks for the video!

  • @U2VidWVz
    8 months ago

    Cool videos. Interesting information and great graphics/animations. Thanks!

  • @EricGT
    8 months ago

    This is a well-done video and explains the idea of leveraging hardware and machine code to optimize cache lookups super well. But I also want to shout out what may be the best animation of a matrix dot product I've ever seen. This feels like the first time I watched a video that got me to understand what monads are.

  • @LV-ii7bi
    7 months ago

    This seems really cool. Good work, you're on the right path.

  • @danh9002
    7 months ago

    Awesome video! Thank you so much for sharing this!

  • @brucea9871
    8 months ago

    Until now I only considered the efficiency of the algorithm for optimizing program execution. Although I admit I didn't understand everything in this video it demonstrates that knowledge of the computer's hardware and taking advantage of it can significantly speed up execution.

  • @tansanDOTeth
    8 months ago

    Visualizations were freaking amazing! Loved those! Could you make a video on your process for editing these videos?

  • @camofelix
    8 months ago

    Very well done! I make a living optimizing BLAS routines, this will probably become my default “what do you do” to send people

  • @amadzarak7746
    8 months ago

    GEMM is the perfect example to demonstrate these concepts. Wonderful video my friend. You earned a subscriber

  • @karim-gb5nx
    9 months ago

    Interesting subject, well presented, thank you.

  • @megaxlrful
    9 months ago

    Great video! But I feel like I will never get to do this kind of work during my job, since we use a scripting language, and all bets are off when you store everything on the heap.

  • @wiktorwektor123

    9 months ago

    While the scripting language is a huge problem here, the memory isn't. Stack and heap are just different names for how memory is allocated by the OS kernel; both sit in the same RAM chips, so there is no inherent difference in access speed or locality with respect to the CPU cache. While the stack is almost always hot in the cache, every time you access a heap address the CPU pulls into cache not just that value but the surrounding block of data. If you have an array of numbers and you read element zero, the CPU loads not only that element but a run of the elements after it.
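The spatial-locality effect described in this comment can be demonstrated with two traversals of the same array. A sketch for illustration; both return the same sum, but on large matrices the column-wise walk generates far more memory traffic:

```c
#include <stddef.h>

/* Both functions add up the same n*n row-major matrix. The row-wise
 * walk touches memory sequentially, so each 64-byte cache line serves
 * 8 consecutive doubles; the column-wise walk jumps n elements per
 * access and wastes most of every cache line it pulls in. */
double sum_rowwise(size_t n, const double *m)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += m[i * n + j];
    return s;
}

double sum_colwise(size_t n, const double *m)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += m[i * n + j];
    return s;
}
```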

  • @xdevs23
    8 months ago

    This is so well explained and the animations are SO good! Also, it would be interesting to see how clang performs compared to gcc.

  • @MohammedAbdulatef
    9 months ago

    The most captivating and insightful presentation I have ever witnessed.

  • @joshpauline
    8 months ago

    Probably one of the best programming videos I have ever seen. As a more senior developer, I find there is a lack of content with this level of production quality explaining complex ideas.

  • @dakata2416
    8 months ago

    Great video!

  • @user-uz3fc3ty2n
    8 months ago

    This is so beautiful I nearly cried, and at the end, when you revealed that the library had a 1000x optimization, I died and came back to life a better person.

  • @jacob_90s
    3 months ago

    Transposition works really well if you can fold it into a previous operation. For instance, if the matrix about to be multiplied is the result of adding two matrices, then while adding the elements, reverse the order of the indices so that the result is the transposition of the actual answer.
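The fusion this comment describes might look like the following sketch (hypothetical helper name, square row-major matrices assumed):

```c
#include <stddef.h>

/* Compute T = (A + B)^T in one pass by writing each sum to the
 * transposed index, instead of storing A + B and then transposing it
 * in a separate pass over memory. */
void add_transposed(size_t n, const double *A, const double *B, double *T)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            T[j * n + i] = A[i * n + j] + B[i * n + j];
}
```

The transpose then comes essentially for free, since the elements were already being touched by the addition anyway.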

  • @gara8142
    9 months ago

    This was a great watch, thank you. That library you used, MKL, can you make a video about it? It sparked my interest.

  • @HadiAriakia
    8 months ago

    Love the video. Where have you been until now, mate? 😂

  • @itellyouforfree7238
    9 months ago

    very nicely done

  • @laughingvampire7555
    3 months ago

    man, thank you for your work, it's amazing.

  • @woolfel
    9 months ago

    I could be wrong, but NVidia's linear algebra library does these types of optimizations to optimize utilization. The cublas library does a bunch of re-ordering and transposing by default.

  • @Proprogrammer001
    8 months ago

    This has so fucking deeply re-ignited my passion for computer science. Gosh I am on fire right now

  • @homematvej
    9 months ago

    We need to go deeper!

  • @rvoros
    9 months ago

    Nice video. I love Cities XL music too. ;)

  • @DeathSugar
    9 months ago

    There's also a computational optimization similar to Karatsuba multiplication but for matrices; the proprietary lib probably uses it.

  • @errodememoria
    8 months ago

    man what a great video

  • @axe863
    8 months ago

    Beautiful.

  • @flightman2870
    8 months ago

    hats off to the effort and thanks to @primetime for showing us this gem

  • @Malte133
    8 months ago

    Great work, and it's true: optimizing code by splitting up instructions in assembler is a pain. Saw videos on that too.

  • @sawinjer
    9 months ago

    Software engineers can have two mindsets: one of being the best and the other of being the worst. After watching this video, I quickly fell into the second. 10 out of 10.

  • @ElZafro_
    9 months ago

    Impressive! I never heard of these optimizations during my computer engineering degree.

  • @Gefionius
    4 months ago

    Incredible video. I feel like I'm getting a PhD in optimization just by watching, and I haven't really coded for 20 years 😂

  • @insu_na
    8 months ago

    great video

  • @skope2055
    9 months ago

    Great!

  • @stavgafny
    9 months ago

    Just seen ThePrimeagen react to your last video lol

  • @simian3455
    9 months ago

    I can't lie, I do not program by any stretch of the imagination... but your videos are something else; the clean and clear explanations are just amazing.

  • @rabbitcreative
    9 months ago

    I have a real-world example that feels similar to what is described here. It involves a Postgres install on a machine with a RAID-5 setup. Benchmarking showed a peak of ~15k TPS for about 30 seconds, then TPS would degrade to ~2k and fluctuate wildly up and down for about 30 seconds, and then resume at ~15k TPS for another 30 seconds. This behavior was steady over longer benchmarks (10 minutes). I solved it with Postgres's table partitioning. First I used 10 partitions, which made no difference. Then I used 100 partitions, and while peak throughput capped at ~13k TPS, it was steady for the entire 10-minute benchmark.

    Another, more human example is writing the same sentence over and over on a chalkboard. It's easier to write "I, I, I, I, will, will, will, will, never, never, never, never, chew, chew, chew, chew, gum, gum, gum, gum" than "I will never chew gum" over and over. It's all related.

  • @birdbrid9391
    9 months ago

    best visualizations i've seen gj

  • @codeChuck
    5 months ago

    This lvl of perf is amazing! The depth of knowledge and understanding is stunning! Is it possible to do something like this in Typescript / React / Next / tRPC?

  • @samuelwaller4924

    4 months ago

    If you mean in JS, then realistically no. You can maybe do some of these optimizations using some of the newer arraybuffers or whatever they are called to use "real" arrays, but at the end of the day you can't guarantee that the runtime or interpreter won't mess something up. It just wouldn't be worth it, just use a library

  • @ChaoticNeutralMatt
    8 months ago

    Once you reminded me of the context of the last video, I believe I know where this is going.

  • @ujjwalaggarwal7065
    8 months ago

    just amazing

  • @NIghtmare8791
    8 months ago

    It's a great video on optimizing matrix multiplication from a purely technical/computing-science standpoint. But one improvement you could still make is to multiply the matrices with something like the Strassen algorithm (nowadays there are even faster algorithms): matrix multiplication doesn't need to have a cubic n^3 runtime. You can actually do it in slightly less, for example n^2.8, which should give significant improvements when multiplying big matrices.
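For reference, Strassen's scheme at the 2x2 level uses 7 multiplications instead of 8; in a full implementation each "element" below is a sub-matrix and the scheme is applied recursively. A scalar sketch of the identities:

```c
/* One level of Strassen: multiply 2x2 matrices (row-major a, b into c)
 * with 7 multiplies instead of 8, at the cost of extra additions. */
void strassen2x2(const double a[4], const double b[4], double c[4])
{
    double m1 = (a[0] + a[3]) * (b[0] + b[3]);
    double m2 = (a[2] + a[3]) * b[0];
    double m3 = a[0] * (b[1] - b[3]);
    double m4 = a[3] * (b[2] - b[0]);
    double m5 = (a[0] + a[1]) * b[3];
    double m6 = (a[2] - a[0]) * (b[0] + b[1]);
    double m7 = (a[1] - a[3]) * (b[2] + b[3]);
    c[0] = m1 + m4 - m5 + m7;
    c[1] = m3 + m5;
    c[2] = m2 + m4;
    c[3] = m1 - m2 + m3 + m6;
}
```

Saving one multiply per level, applied recursively, is what lowers the exponent from 3 to log2(7) ≈ 2.807, though as other comments note it only pays off for fairly large matrices and accumulates more floating-point error.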

  • @wiipronhi
    9 months ago

    OMFG I love these videos

  • @saltyowl3229
    8 months ago

    Please tell me the next one will only be a month or two. I love learning this level of optimization, and it is conveyed so well. I will openly admit I'm being selfish because I don't wanna wait XD, though I do understand if that can't be the case; I've hardly got a lick of free time with my own courses as well.

  • @MNNoxMortem
    9 months ago

    Amazing.

  • @mNikke
    5 months ago

    It would be cool if you followed up on this using the Strassen algorithm with Z-ordering (Morton ordering), and maybe even AVX-512 instructions. I think you would get pretty close for large matrices.

  • @jupiterbjy
    8 months ago

    Kinda wonder why our university barely has such lectures. Amazing video.

  • @mfinixone1417
    9 months ago

    Amazing

  • @Mefistofy
    9 months ago

    Beautiful video, understandable even for an EE grunt like me. The take-home message at the end was a little disheartening though: AFAIK only Intel and Nvidia churn out hyper-optimized libraries, binding you to their tools. I'm not sure about the impact of using Python for ML, but I guess all the nice libraries do something similar or use MKL directly (numpy does this with specific versions). Just a user of all the shiny tools, building new shiny tools.

  • @69k_gold
    9 months ago

    DSA pro: "You get 1900ms, take it or leave it."
    Guy with an Intel chip and an ASM course: "Hold my cache."

  • @shakibrahman
    9 months ago

    Anybody got an online course/book similar to the one DepthBuffer mentions he took at university for highly optimized code?

  • @simonl1938

    9 months ago

    I'm also interested in that :)

  • @winda1234567

    9 months ago

    same @@simonl1938

  • @leshommesdupilly

    9 months ago

    also interested :D

  • @depth-buffer

    9 months ago

    We are not using any textbooks during the course, and the course is also not recorded. But the slides are published here: pages.cs.wisc.edu/~sifakis/courses/cs639-s20/

  • @shakibrahman

    9 months ago

    @@depth-buffer awesome! thank you

  • @TylerHallHiveTech
    8 months ago

    Great video! Do you mind sharing the source for your Motion Canvas project? I like the implementation of the graph animations. Just getting used to Motion Canvas and was inspired by how you used it. Thanks!

  • @depth-buffer

    8 months ago

    Currently, my repository contains source code for my next video so I can’t make it public. But I’m using my own custom fork of motion canvas and some effects in the video require those custom features. You can check it out if you want: github.com/fangjunzhou/motion-canvas

  • @tarn_jihas
    9 months ago

    subbed :)

  • @AREACREWBMX
    7 months ago

    Genius

  • @Kotfluegel
    9 months ago

    There is an error in how the multiplication of the block matrices is animated. As the current result block is being calculated, the blocks of the input matrices should iterate through rows and columns of blocks, just like in regular matrix multiplication.
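The block iteration described here (each result block accumulating over a row of blocks of A and a column of blocks of B, exactly like scalar matrix multiplication lifted to blocks) can be sketched as follows. Block size and layout are illustrative, and n is assumed to be a multiple of the block size:

```c
#include <stddef.h>

#define BS 2 /* block size; real kernels tune this to the cache */

/* Blocked matmul: for every (bi, bj) result block, the bk loop walks
 * a row of blocks in A and a column of blocks in B, accumulating into
 * C. Each small block fits in cache, so its elements are reused many
 * times before being evicted. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (size_t bi = 0; bi < n; bi += BS)
        for (size_t bj = 0; bj < n; bj += BS)
            for (size_t bk = 0; bk < n; bk += BS)
                for (size_t i = bi; i < bi + BS; i++)
                    for (size_t k = bk; k < bk + BS; k++) {
                        const double a = A[i * n + k];
                        for (size_t j = bj; j < bj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```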

  • @__samuel-cavalcanti__
    9 months ago

    @depthBuffer Can you share the books or materials you read to understand this topic? BTW, blazingly beautiful video.

  • @ericraio7145
    8 months ago

    Came here from primeagen! :D

  • @manuelmoscardi6606
    8 months ago

    just bruh, what's this video, why you have only 8k sub, why the animation are so precise, why I'm here

  • @aeebeecee3737
    7 months ago

    I just subbed to this channel

  • @marketsmoto3180
    8 months ago

    this dude's channel bout to get huge now

  • @huanxu5140
    8 months ago

    Maybe Intel uses AVX vector intrinsics under the hood if your CPU supports them, and that leads to the perf difference.

  • @groff8657
    9 months ago

    This is the first time I've ever heard the term 'giga-flops', and I quite like it.

  • @minhuang8848

    9 months ago

    If you're new to the field, which seems to be a fair assumption... I highly recommend checking out machine learning papers and their ridiculous names. A tiny selection:
    - We used Neural Networks to Detect Clickbaits: You won't believe what happened Next!
    - Learning to learn by gradient descent by gradient descent
    - BERT has a mouth, and it must speak
    - Gotta Learn Fast: A New Benchmark for Generalization in RL
    - Everything involving BERT and ERNIE and the full range of Sesame Street characters

    Yeah, there's lots of whimsical stuff going on in that particular space, and it often is pretty relevant to the actual topic at hand, too.

  • @geckoo9190
    9 months ago

    The other day I was watching a video implying that using fewer conditions makes code faster, but then I ran some tests in assembly and saw that without the conditions the code was longer.

  • @Erlisch1337

    9 months ago

    that would depend entirely on what you are doing

  • @poproporpo
    9 months ago

    Motion canvas!

  • @4SnapIT
    3 months ago

    So the first optimization is all about cache line utilization, not so much about cache hits. Makes sense when you think it through: a big share of each cache line is "garbage" data in the vanilla implementation.

  • @thanatosor
    8 months ago

    Imagine how much underlying low-level mechanism is hidden from programmers. They aren't supposed to need to know it unless they have to optimize to the limit.

  • @Rudxain
    8 months ago

    6:24 That's like constant/static recursion, which is much faster than dynamic recursion (especially if there's no TCO). 11:16 Can NUMA help with this?

  • @stacksmasherninja7266
    8 months ago

    Isn't there an algorithm that uses 7 multiplications instead of 8 for the 2x2 GEMM? Can't we exploit that to always work in 2x2 blocks? Also, sweet animations dude! What tools do you use for these? Is that just manim?

  • @depth-buffer

    8 months ago

    I’m using motion canvas

  • @colonthree
    8 months ago

    Shoutouts to CinemaScope. :3

  • @user-gt2th3wz9c
    8 months ago

    Would it even be possible in higher-level languages like JavaScript and Python?

  • @thanatosor
    8 months ago

    Memory access is now the bottleneck for CPUs.

  • @abbasballout4441
    6 months ago

    Wait, what's the music?

  • @machinelearningTR
    5 months ago

    Anybody know how I can make animations like in this video? Which app should I use?

  • @imlassuom
    8 months ago

    🌟🌟🌟🌟🌟

  • @snoozy355
    7 months ago

    What's your IDE at 13:55 ?

  • @somniad
    8 months ago

    this video is w i d e

  • @vbregier
    8 months ago

    Please, if you mention MKL, do mention that it is a non-portable, Intel-only library! It won't work at all on any non-x86-compatible architecture, such as ARM, and it has poor performance on AMD CPUs. There are open-source, portable alternatives to MKL (BLIS, libflame). Don't trade portability for performance on a single CPU family!

  • @dalicodes
    8 months ago

    from Prime

  • @Vaaaaadim
    9 months ago

    Maybe a factor in why MKL is 8x faster than your implementation is that it uses something like the Strassen algorithm when it recognizes that you are just doing matrix-matrix multiplication of matrices with real values?

  • @dmitrykargin4060

    9 months ago

    You need larger sizes for Strassen to bring such an advantage.

  • @Vaaaaadim

    9 months ago

    ​@@dmitrykargin4060 I only said it might be a factor, a few recursions could be applied whilst still using the typical algorithm. Quoting from en.wikipedia.org/wiki/Strassen_algorithm "In practice, Strassen's algorithm can be implemented to attain better performance than conventional multiplication even for matrices as small as 500x500". And yeah wikipedia is not necessarily always right, but looking at the paper cited for this statement, it does seem to support the claim.

  • @dmitrykargin4060

    9 months ago

    @@Vaaaaadim It is not a 3x difference for this size; you need far larger matrices to get that. I bet it was AVX-512 with smart reductions that got that 3x boost.

  • @theexplosionist2019

    8 months ago

    No. Strassen accumulates errors.

  • @fulbone2848
    8 months ago

    STRASSENS!!

  • @Ximaz-
    8 months ago

    What's your CPU?

  • @sammtanX
    8 months ago

    Wait, what is your CPU? How do you know how many GFLOPS your CPU has?
