mCoding
Күн бұрын
76,799
1

C++ cache locality and branch predictability

Ғылым және технология

Cache me outside, how bout that?
People always talk about Big O time for analyzing speed, but Big O isn't the only important factor in writing performant code. Two important things to keep in mind are cache locality (locality of reference) and branch predictability. In this video, we go over these ideas and through examples we see the huge impact that they can have on performance.
― mCoding with James Murphy (mcoding.io)
Source code: github.com/mCodingLLC/VideosS...
Branch prediction: en.wikipedia.org/wiki/Branch_...
Cache locality: en.wikipedia.org/wiki/Localit...
Awesome CppCon talk: • CppCon 2017: Chandler ...
SUPPORT ME ⭐
---------------------------------------------------
Patreon: / mcoding
Paypal: www.paypal.com/donate/?hosted...
Other donations: mcoding.io/donate
Top patrons and donors: Jameson, Laura M, Dragos C, Vahnekie, John Martin, Casey G, Pieter G, Krisztian M, Sigmanificient
BE ACTIVE IN MY COMMUNITY 😄
---------------------------------------------------
Discord: / discord
Github: github.com/mCodingLLC/
Reddit: / mcoding
Facebook: / james.mcoding
CHAPTERS
---------------------------------------------------
0:00 Sorting
1:26 Cache locality
2:41 Vector traversal
4:39 Matrix multiplication
7:04 Branch predictability
9:56 Branchless optimization

Пікірлер: 147

@spaghettihair2 жыл бұрын
Why does the ternary expression at the end not count as branching code?
@robinm7079
2 жыл бұрын
With a true branch, the statements that execute afterward the branch are different depending on the result of the condition being checked. So with the first example, the if statement, we could have added a bunch more lines of code in the first block (after p *= 2 * x), that wouldn't be executed if the condition was false. With the ternary expression, there's still a condition being checked, but it doesn't effect the statements afterward, the processor knows what's going to happen next and can start loading the next instruction, there's no uncertainty
@mCoding
2 жыл бұрын
What I was trying to say here was that the logic could be refactored where the contents of the if are the same with just a different constant. In that case, the ternary can be implemented with a value array instead of a branch. Note that the expression (x > thresh) is a comparison, not in itself a jump. So you can do arr[x > thresh] to read from index 0 to set factor to 1.5, or read from index 1 to set factor to 2.0, which is what clang does. Technically the compiler could use a branch or not use a branch for either the ternary or the if statement as the two are equivalent. I was presenting the ternary as a kind of visual stand-in to superficially show the branch being removed, which is of course not how C++ actually works. I see now that this storytelling only caused additional confusion so here is the actual assembly to keep in mind: godbolt.org/z/zoGWxcoeb. Notice the line mulsd xmm2, qword ptr [8*rdx + .LCPI0_1] that indexes into a 2-element in-binary array .LCPIO_1 instead of using a conditional jump. In the link, clang produces an identical binary whether you use the ternary or the if and it does not use a branch. On the same code gcc produces identical binary whether you use the ternary or the if, except it uses a branch for both.
@MithicSpirit
2 жыл бұрын
@@mCoding on gcc was it compiled with or without optimizations (-O2 flag)?
@mCoding
2 жыл бұрын
Both at -O3.
@sebastianmestre8971
2 жыл бұрын
The thing that branch prediction predicts is not the outcome of comparisions, but the destination of jumps. With a plain if statement, a naive compiler might emit a comparison followed by a conditional jump, whose destination the CPU will try to predict at runtime. With a simple ternary like that, a compiler might have an easier time compiling it down to a single conditional move instruction, which is not followed by any jumps. This way, there is no branch (no different destinations for jumps) in the loop, even though a conditional DOES occur. (there are other ways to compile stuff into conditional-jump-free code, like using look up tables, etc)
@Broniath2 жыл бұрын
My first thought seeing this code: This is some weird ass python library, it almost looks like C++. Several seconds later: Oh.......
@ronyWeeb2 жыл бұрын
It's a proper coding channel. Every other coding channel's I had seen are like 'pop culture' coding. But this is different. Great work man.
@mCoding
2 жыл бұрын
Thanks for your kind words!
@10e999
2 жыл бұрын
I completly agree with you. This is a real software engineering channel. Not recycled dogmatic advices.
@creator-link
2 жыл бұрын
What are some pop culture coding channels? I want to watch some
@tannerted
2 жыл бұрын
Agreed. Also look at Jonathan Blow, Molly Rocket, and Cherno. All good channels like this one.
@QuantumHistorian2 жыл бұрын
Always nice to get something a little closer to the metal than python from time to time
@mytech67792 жыл бұрын
C++20 has added the attribute keywords Likely and Unlikely so the compiler can do some of its own predictive branching optimizations. It make me wonder if CPUs will start to include an instruction to make even more use of this in a few years.
@Qbe_Root
Жыл бұрын
PowerPC conditional branch instructions include hints for prediction, so beq+ means the branch is likely and beq- means it's unlikely. I assume other architectures have the same feature but I'm not familiar with them
@LennyTheSniper2 жыл бұрын
I watched the video 10 times. Now, my code is going 1024 times faster. Thanks a lot man!
@szaszm_2 жыл бұрын
I love the simplicity of your sorting algorithm implementations.
@mCoding
2 жыл бұрын
Thanks, I stole all of them from various cppreference docs.
@Spiderboydk2 жыл бұрын
Great point. When optimizing, *always* measure, people! Merely picking a particular algorithm, because it's supposed to be fast, is nothing more than superstition. You *can't* know without measuring.
@mCoding
2 жыл бұрын
Always always always. "If you aren't measuring, you don't actually care about performance."
@Yazan_Majdalawi
Жыл бұрын
@@mCoding What are the tool/s you used in this video?
@philipp65892 жыл бұрын
Since you are doing Python and C++ topics, would be nice to see some video about developing C/C++ extension modules for Python. Is there any easy-to-use framework which allows just-in-time compilation of C++ code snippets inside of Python?
@mCoding
2 жыл бұрын
I don't know about JIT for C++ code, but there is Jax, which JIT compiles Python to XLA and offers a numpy-like interface. There's also Cython, which allows you to embed C and C++ into Python, but this is pre-compiled, not JIT.
@FobazF
2 жыл бұрын
Check Numba
@pranavnyavanandi9710
Жыл бұрын
What is this just in time compilation? Do you mean an interpreter? Are there any other types? I thought compilation was just compilation.
@brunoais2 жыл бұрын
9:20 In some AMD (I don't know the latest intel yet) CPU, the CPU actually computes both multiplications at the same time and then waits for the condition to finish to choose which one to use. I'm unsure if it was intoduced in Zen or only Zen2
@spicybaguette7706
2 жыл бұрын
I believe It's called speculative execution, but this leads to some very funky side-effects that can be abused by attacks such as spectre and meltdown
@wChris_2 жыл бұрын
it would be nice to have shown the Compiler Explorer output for the branch predicting example, as a ternary is technically a branch that might get optimized into a conditional move
@mCoding
2 жыл бұрын
See the pinned comment!
@fabricio5p2 жыл бұрын
Man, I wish you did more of these low level optimization videos. I don't work with Python but I subscribed just because of these types of videos
@mCoding
2 жыл бұрын
More on the way!
@prashantrana10892 жыл бұрын
this guys offer so good content for free. this guy is awesome
@lefttraces72072 жыл бұрын
I love how I only know two languages and these happen to be the only two you cover, haha. Great video as always!
@mCoding
2 жыл бұрын
Lucky both of us!
@mauricio-poppe2 жыл бұрын
Awesome, thank you for an educational video straight to the point.
@mCoding
2 жыл бұрын
Very welcome!
@johnjoyce2 жыл бұрын
Nice soft intro to practical examples of performance tuning.
@danielbelsky88369 ай бұрын
That was sooo clear!! Thank You!!
@alexander32932 жыл бұрын
love this kind of videos!!
@xvA00005 ай бұрын
these will age like wine. thanks for the vid
@TDskirvin2 жыл бұрын
very informative & succinct, thank you
@mCoding
2 жыл бұрын
You're very welcome!
@yash11527 ай бұрын
1:11 > _"sorting small things is actually very common"_ i didnt know. thanks
@Khushpich2 жыл бұрын
Great video as always.
@mCoding
2 жыл бұрын
I appreciate that!
@robertbrummayer49082 жыл бұрын
Excellent video as usual, James!
@mCoding
2 жыл бұрын
Thanks again!
@venkateshhariharan43412 жыл бұрын
great video bro, helped alot thank you so much ❤️❤️❤️
@mCoding
2 жыл бұрын
Glad to hear it!
@ABrainrotAwayFromHeaven2 жыл бұрын
holy shit. What an amazing channel! I wish i could learn even more of this.
@mCoding
2 жыл бұрын
Thanks very much! I will do my best to make more useful content! Check out my other C and C++ videos! I do have a few.
@rotbjorn2 жыл бұрын
Interesting video! What program did you use to view the graphs & plots and what library did you use to create them?
@mCoding
2 жыл бұрын
The plots are viewable in any modern browser. I made them in Python using plotly express. You can see the code for making the plots on github!
@Swedishnbkongu2 жыл бұрын
How much can the compiler do loop reordering, or arithmetic reordering (for more ILP), or reducing the number of branches? I've been taught to trust the optimizer more than myself. If i can confirm in compiler explorer that it is different when i micro optimize, maybe I'll do it
@kyyrabb11212 жыл бұрын
Please do more c++ videos!
@harveyroper55262 жыл бұрын
great video, it would have been really useful if you had gone into a little more detail on why cache locality works, ie storing multiple addresses into a single cache line. other than that i really enjoyed seeing a high level application of this concept
@Windeycastle6 ай бұрын
I loved to see std::sort laughing at all other implementations. It's a good call to use the standard library when you can, as it's often more optimised then what you'd write yourself (with some exceptions).
@kivicode2 жыл бұрын
Great video as always! Only one question: what editor theme are you using?
@mCoding
2 жыл бұрын
Monokai Pro Theme
@aratakarkosh95882 жыл бұрын
Holy smokes I already knew it, first time on this channel :DD
@mCoding
2 жыл бұрын
You're becoming wiser!
@ciCCapROSTi Жыл бұрын
What is missing is that YOU (the coder) yourself should prefer branchless code. Your example could have been written as an arithmetic formula. The compilers are smart, but we can help it. It's even better at optimizing algebra than optimizing branches.
@mk-mc8yx2 жыл бұрын
In another article I read that we shouldn't be optimizing our code for branch-prediction. If that's true why should one be concerned about it, if so it seems like a good to know thing. If that's not correct, are there any good coding practices that optimize our code to take advantage of branch prediction ? Btw, I found out this channel just few days ago. Very informative and lots to learn for me from here. Thankyou very much.
@mCoding
2 жыл бұрын
Most articles that recommend this are not recommending that you avoid optimizing branches, but rather that there are most likely other more important things you should focus on first, and that you should profile your code to find out what is actually slow before blindly trying to optimize branches. Of course, in a high performance systems not otherwise limited by network or io, eventually heap allocations, branches mispredictions, and cache misses start to become the dominating factors that you need to optimize.
@marcusmors84852 жыл бұрын
Good video! Thanks
@mCoding
2 жыл бұрын
Glad you liked it!
@kinershah4648 ай бұрын
How does the type of cache (k-way or associative or k-eay set associative) play any role in performance? Or does it?
@Yupppi5 ай бұрын
And this is why for example's intel's algorithm is far better than writing loops by yourself (the I assume famous video of matrix multiplication algorithm taking it from like 2000 ms to 2 ms). I love branchless solutions too, they feel like magic.
@ProjectPhysX2 жыл бұрын
9:12 how about the ternary operator ?: or bit masking?
@danielrhouck8 ай бұрын
1:00 I tried to read the example code for the length of the vector, but unless it’s 0 I don’t know the benchmarking library enough to have a guess. But I think I was on the right track and the vectors were too short.
@Zooiest Жыл бұрын
Can you make a video about vectorization and/or multithreading?
@louisfghbvc2 ай бұрын
impressive about cache locality...
@dcorderoch2 жыл бұрын
"depending on ... the time of day" at first I laughed, then I thought... depending on where the computer is, the temperature could affect the performance... nighttime might be better for performance
@venkateshhariharan43412 жыл бұрын
how these results will be affected when using with constexpr for ordered input?, probably get faster I don't know, I will be trying that soon. thanks for letting us know these amazing concepts with practical examples
@mCoding
2 жыл бұрын
constexpr could allow you to do some of the computation at compile-time, so that would be a big improvement over doing it at runtime, but depending on the use case this wouldn't necessarily be possible. As of C++20, vector is not constexpr. However, it could be possible with array.
@ieatgarbage87713 ай бұрын
How does prayer impact cache hit rate?
@aadityabhetuwal59902 жыл бұрын
Which IDE do you use?
@---hw6up2 жыл бұрын
Why do you usr auto for type deduction in function declarations? Looks infernal
@mCoding
2 жыл бұрын
std::vector::iterator is too long to fit on screen
@---hw6up
2 жыл бұрын
@@mCoding Didn't realise the sorting was done using iterators. My bad
@thiagosannafreiresilva43662 жыл бұрын
So normally how much will compilers flag and / or optimize this sort of thing? I'm just moving from interpreted to compiled languages and I don't quite have a grasp yet to the extent of optimization done by compilers.
@mCoding
2 жыл бұрын
Compilers are fairly decent at optimizing branches away when possible. A compiler will almost never do an optimization like exchanging ijk for ikj in a for loop though because without knowledge of the whole program (e.g. we were computing the sum of integers which is commutative) changing the order of things might change the behavior of the program. Basic things like removing unused variables and not recomputing things that it can prove haven't changed will usually be done. I highly recommend Prof Leiserson's course on performance engineering in C (available on youtube, it's mit ocw) if you want to learn more. Also any CppCon talk.
@thiagosannafreiresilva4366
2 жыл бұрын
@@mCoding thanks!
@szaszm_
2 жыл бұрын
@@mCoding GCC can perform the ijk -> ikj exchange with -floop-interchange (enabled at -O3) or with -floop-nest-optimize (which includes even more nested loop optimizations) if compiled with --with-isl.
@pengchengwu4473 ай бұрын
std::sort uses insertion sort for small sizes too I believe, why is it faster than yours?
@aadithyavarma2 жыл бұрын
Hi, I love your video. Are these principles applicable for Python also?
@mCoding
2 жыл бұрын
Yes but to a much lesser degree. Because every object in Python is a pointer, practically every access of an object behaves like the random order access example and is a cache miss, which is one of the reasons for Python's general slowness.
@switchblade6226
Жыл бұрын
@@mCoding WARNING: Resurrecting old comment section to expand a bit on that :P Modern CPUs do preform unconditional branch target prediction via multi-level prediction tables (more info can be seen in Agner Fog's optimization papers), and those do help with dynamic languages' pointer hell, but obviously only to a degree. Also, hypothetically, if you JIT python you would be able to potentially take advantage of conditional branch prediction too. And if you have an extremely sophisticated JIT that would flatten and simplify python structures, that can improve cache locality. But that would require some genie to make a miracle JIT happen lol
@echoptic7752 жыл бұрын
Ik curious what are ur thought about rust lang?
@mCoding
2 жыл бұрын
Haven't tried it but I've heard it has a lot of nice properties
@veritas70107 ай бұрын
Would be awesome if the use of compiler opt -O 0,1,2,3 was benched
@empireempire3545 Жыл бұрын
Thats great, but how to benchmark where your code is actually having problems with cache?
@mCoding
Жыл бұрын
Great question and unfortunately i don't have a great answer for you, at least not a definitive one. Generally you want to start with a higher level view of your application. Benchmark/profile the thing as a whole to get a feel for which parts are slow. If you have certain end to end tasks such as producing a complex value or rendering a frame, create benchmarks for those individual tasks you care about and profile them individually. This might involve engineering benchmark code into your application itself (turned on or off with a macro). The profile should help you determine at least broadly where any slowness is. Then you can start to hone in on those slow parts. Look at them and ask yourself WHY are they slow? What data is going in and out, how is it laid out in memory, are your branches predicatable, do calculations that happen near eachother live near each other in memory, and does as much data for any given calculation fit inside a cache line? You can also use tools like perf (linux only, but im sure windows has something similar). Your hardware has counters that track how many cache misses, branch mispredictipns, etc your program caused, and perd will report these numbers to you. You can then use perf and other such tools to get a feel for whether cache or branch predictions are the problem or something else, like a deeper architectural issue or inherently slow algorithm. In general there is no one answer and no one tool that works in all cases, it's just something you get better at with experience. Hope this helps.
@puddleglum56102 жыл бұрын
One note -- you should never optimize your code prematurely! A lot of times, it creates more bugs and development overhead than its worth. Optimization should be for when you're having performance issues. But, nonetheless, super great video! This brings me back to my old Computer Architecture classes in college.
@colin3989 ай бұрын
Pretty sure std::sort defaults to some type of insertion sort for small sizes
@TheJackal9172 жыл бұрын
Nice japanese explanations, man. Cool.
@netfri252 жыл бұрын
what ide do you use?
@mCoding
2 жыл бұрын
This is CLion.
@netfri25
2 жыл бұрын
@@mCoding thanks a lot!
@ItzSwiftyBoy_Gaming2 жыл бұрын
Damn, you're coding genius.
@10e9992 жыл бұрын
10:22 How is the line 11 not a branch?
@Yazan_Majdalawi
Жыл бұрын
See the pinned comment :)
@lightify972 жыл бұрын
Please make a C++ course in the same style.
@BenjaminWheeler05102 жыл бұрын
I’m confused. What are all these curly braces??? Is this some weird Python fork?
@mCoding
2 жыл бұрын
Not sure if joking 🤔. This is C++.
@padnomnidprenon9672
2 жыл бұрын
Im not the only one who told to myself : "Wow new word auto and var declaration, what is this package". And then i saw static void
@adicide9070
2 жыл бұрын
@@mCoding yeah must have been joking. and a good one :D python fork :D:D:D
@hievery96782 жыл бұрын
But is it always worth code readability? I mean, getting x2 performance is cool and all, but maintaing sometimes almost unreadable code because of such optimizations may be even worse. @mCoding May be you have some production experience with them? Have you used them there? How your coworkers reacted to such weird shananigans with code? Or you tried to use them only when readability wasn't suffering?
@mCoding
2 жыл бұрын
I would say that writing code with good cache locality and branch predictability rarely involves any degradation in readability. There are exceptions, particularly when performance is highly critical, e.g. in electronic trading, that you might design your whole application around these properties, causing readability to suffer. However, in those cases the readability of the code was already chosen to be second priority to the performance due to the nature of the code.
@irwainnornossa4605 Жыл бұрын
{ belongs to a new line! We say „2 times“, not „2 x“!
@andrewf83662 жыл бұрын
Why didn't you plot more points in your comparison? You mentioned that the largest "kink" was at 8MB... but you also didn't have another point until 16, so it's not as good a demonstration as could be used if you had more points to show.
@mytech67792 жыл бұрын
Only 8M of cache? My 386 had 4M of RAM. (About $1000 of RAM adjusted for inflation.) And it could run MSwindows.
@Pa_Nic2 жыл бұрын
If you watch this video 1000 times, you can break RSA-4096
@mCoding
2 жыл бұрын
Shhhh :)
@CAMOBAP7952 жыл бұрын
Interesting should I thing about Branch Prediction when I work with High-level languages (Python, Java)? P.S. Few years ago I saw some presentation about "cache locality aware" coding in Java
@subnumeric Жыл бұрын
IEEE 754 floating point numbers are NOT commutative! You've used templates in the get_sum function, so depending on the data this implementation might return very incorrect sums. This is in fact what the unsafe -Ofast optimizations do (in gcc), as in some cases it doesn't produce significant errors and the speedup is worth it. I know averaging isn't really the focus of this video, but I felt like i had to mention it.
@mCoding
Жыл бұрын
Yes I'm aware of the that these implementations are not general purpose production ready. Indeed, even implementing the average of 2 numbers correctly is a monumental task, so I hope you unserstand why I wouldn't want to attempt this kind of thing unless it is was a key point of the video. But yes, always watch out for noncommitativity hiding in traditionally commutative operators like +!
@subnumeric
Жыл бұрын
@@mCoding I do understand, it's just that these sort of "details" get you thrown out of university exams, and much worse, neuter rocket defense systems and cost people their lives. I simply felt obligated to mention it, which I in turn hope you understand. Alas, this is indeed a confusing university-level topic and you have to simplify it into a 10 minute video, so I suppose details being cut is to be expected. Please, do not misunderstand my capitalized "NOT" and the exclamation mark as some harsh critique, but rather as a "passionately highlighted" footnote in a hefty textbook. I appreciate the effort you've put into you video, which is clearly demonstrated by your response to a comment on a several years old video. =)
@gregorymorse84232 жыл бұрын
?: is branching so is && and ||. int c=x > thresh; 2&c|1.5&!c would be the branch less form. Which is how the compiler would optimize it effectively. Also small sizes being faster for less complex algorithm is a known constant threshold which is hard to precisely compute given the complexity of the state machine in the processor. You engineer micro optimizations based on data sizes. Merge sort is not as fast as an in place merge sort, and you didnt specify this detail
@mCoding
2 жыл бұрын
See the pinned comment for what I was trying to convey with the ternary. I'm also curious why you would call a 2x performance difference that I demonstrated across multiple different problems a micro optimization.
@gregorymorse8423
2 жыл бұрын
@@mCoding yes I suppose table vs. Bitwise logic is debatable depending on exactly what is being optimized. But if there are side effects on the true or false expression, control flow required!!
@axe8639 ай бұрын
Cache me outside .... really bro
@ThePC0072 жыл бұрын
10:29 Ternary operators are just syntactic sugar for an if-else statement. They don't eliminate branching at all. That said, given that booleans are just 0s and 1s this code could probably be optimized to a branchless multiplication, which is what I believe eliminated the branching in this case.
@mCoding
2 жыл бұрын
See the pinned comment where I explain what I was trying to suggest here!
@ssspud35452 жыл бұрын
I didn't even notice it was 7 hours long until the closing notes... I was recommended this after never being recommended anything from pyrocynical in over a year while being subscribed. I have to admit, these documentary type videos that review or cover a TV show or piece of media are extremely popular now and basically equal free watch time. I mean you have all the work cut out for you. Just watch the show and have a comprehensive knowledge about it and writing skills; which you can learn in college or even a US highschool(I meant college in Europe and in the US because community colleges are basically the same concept in both regions).
@mCoding
2 жыл бұрын
7 hours long? I think you might have commented on the wrong video.
@coder436 Жыл бұрын
7:30 Actually no
@falxie_2 жыл бұрын
I really should probably go to college so I can learn techniques like this
@chainingsolid
2 жыл бұрын
Just use some of the terms from this video and google. Much can be learned online and testing stuff your self. If your curious message me and I'll dig up some info (aka more youtube videos).
@trag1czny2 жыл бұрын
discord gang 🤙🤙
@MithicSpirit2 жыл бұрын
Discord gang
@lunarmagpie43052 жыл бұрын
Discord gang