Rust multi-threading code review
Science & Technology
My 3D cellular automata simulation, written in Rust with multi-threading, was sub-optimal. Using benchmarks written with Criterion and help from my community, I learnt where the bottlenecks in this application are.
Today I share the journey I went through to optimize my multi-threading code!
Let's talk about the code... ;)
original video 3D cellular automata video:
• 3D Cellular Automata -...
3D Cellular automata project links:
Source code: github.com/TanTanDev/3d_cellu...
IsseW gpu implementation: github.com/IsseW/cas
Dstaatz atomics implementation: github.com/dstaatz/3d_celluar...
BENCHMARKS I made for this video:
github.com/TanTanDev/cellular...
My discord group:
/ discord
Want to support me?
Patreon: / tantandev
Monero: 43Ktj1Bd4Nkaj4fdx6nPvBZkJewcPjxPB9nafnepM7SdGtcU6rhpxyLiV9w3k92rE1UqHTr4BNqe2ScsK1eEENvZDC3W1ur
Credits:
Music Prototype by Wintergatan Build Tracks
Music Valentine by Wintergatan
Music The Rocket by Wintergatan
Music Non-Euclidian Geometry by Punch deck
Music Brazilian Street Fight by Punch deck
Music See The Light by Punch deck
Wintergatan link: wintergatan.net/
License can be found on website
Punch deck link: / punch-deck
Punch deck license: creativecommons.org/licenses/...
#rust #multithreading #review
Comments: 243
Very valuable video. Some youtubers and bloggers keep on insisting that efficiency doesn't matter cause "new compooter phast" and have a considerable influence on new learners who get stressed and look to validate if what they are learning is worth it or not. And, at that point if they come across people like those, it could completely misdirect them from becoming better engineers. While what has been done here to compare is not a usual application, it goes to demonstrate how much efficiency matters and how all those little nanoseconds and microseconds add up quick.
@thoriqadillah7780
2 years ago
I bet the people you mean are JavaScript developers. No offense lol
@ItsMeChillTyme
2 years ago
@@thoriqadillah7780 Mostly but I've seen it come from new developers for everything, even C++. Though JS and Python have it most since they are beginner friendly.
@nagoranerides3150
2 years ago
@@thoriqadillah7780 I first saw the idea that efficiency isn't important because "new compooter phast" in the early '80s. The truth is that efficiency of coding is sometimes more important than efficiency of execution. I write 90% of my code in Perl, because I write a lot of small utility scripts for solving specific issues at work and quite often the code is single-use. Shaving a day off development and testing time is far more useful than even shaving an hour off run-time. If I have a task that will take weeks to run, then I write it in something fast so the return outweighs the time lost in coding in a more formal language. And something that requires realtime doesn't leave you much option.
@thoriqadillah7780
2 years ago
@@nagoranerides3150 but we're talking about game dev here. And the "javascript dev" thing is about frontend web, or even mobile. I think in game dev the need for speed is critical because we're dealing with the customer's PC. Remember the "but can it run Crysis?" meme? Yeah, I remember that. And for frontend web dev I think speed is somewhat important, because we're dealing with the user's internet speed and data
@dealloc
2 years ago
You have to set constraints on what hardware you aim to support, otherwise we'd never move forward. For example, Valve has done a tremendous job at creating an engine that is versatile across low- and high-end hardware (and Source 2 is heavily targeting VR applications, so speed and varied hardware support is of utmost importance), but it is also very limited in its capabilities when compared to other modern engines that may not be suitable for low-end hardware. Of course, that doesn't mean you should max out everything and only aim for high-end hardware, since that can also have drawbacks in and of itself.
For the boolean array you could also try putting it all into a single 32 bit integer, where each bit represents one value. The check would then be something like (rule & 1
@Tantandev
2 years ago
Solid idea! Faster or not... we still save some bytes haha
@anderdrache8504
2 years ago
The compiler can't do memory layout optimizations like that. As a rule of thumb: it can only reorder struct fields, nothing else.
@zyansheep
2 years ago
Or you could use bitvec :)
@nilstrieb
2 years ago
It will be slower, not faster, since now there are several instructions to get the value. While the array will get slightly smaller, it doesn't really matter, since it's quite small anyways, so you won't get any benefit from the smaller cache usage. But it would be interesting to benchmark it
@quantumdeveloper2733
2 years ago
@@nilstrieb Trying to prove you wrong I tried to verify my hypothesises by googling: Originally I thought that the processor couldn't adress anything smaller than a word and therefor would have to do bitshifts to adress a byte anyways, but that doesn't seem to be the case for x86 anymore. And I also thought that my variant would allow the compiler to do more SIMD optimization because there was no SIMD memory lookup, but in avx2 there is an instruction for that as well. So I guess on modern hardware you are right. I also made a benchmark to be sure, and indeed on my avx2-capable desktop pc I get a 15% performance decrease. But on the raspberry pi 3b, which has an ARM processor, I get a 30% performance increase. It might be interesting to check more modern ARM processors as well, but I don't own one.
This video reminded me the mantra "make it work, make it right, make it fast". As always interesting and entertaining content, thank you.
@Leonardo-nf5jc
2 years ago
@JM Coulon you may have not implemented error handling, for example; it works but it's not the right way to code
This is an amazing video and I'm super glad you opened up about the issues that were made and how it was fixed. Super healthy for the community and understanding programming in general.
The nested vector is not as messy as you made it look. For initializing you can do: vec![vec![vec![0u8; SIZE]; SIZE]; SIZE]. For indexing, you can do it just like the array: v[0][4][3]. The flat vector is definitely a better idea tho.
@biglexica7339
A year ago
even if it was messy, just write a macro
@shambhav9534
A year ago
Yep, a flat vector would be better. I think that's how it is intended to be used.
@erikwg3814
A year ago
Ye, especially the .get(0).unwrap().get(4).unwrap() and so on felt quite dishonest. If you aren't doing any checking anyway there is absolutely no point in using the get function
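For reference, the two layouts this thread compares can be sketched as follows; the SIZE and index order are arbitrary choices for illustration.

```rust
const SIZE: usize = 10;

/// x + SIZE * (y + SIZE * z): one multiply-and-add chain instead of
/// three pointer chases.
fn flat_index(x: usize, y: usize, z: usize) -> usize {
    x + SIZE * (y + SIZE * z)
}

fn main() {
    // Nested: clean to write with vec!, but three heap indirections per access.
    let nested = vec![vec![vec![0u8; SIZE]; SIZE]; SIZE];
    let _cell = nested[0][4][3];

    // Flat: one contiguous allocation, one computed index.
    let mut flat = vec![0u8; SIZE * SIZE * SIZE];
    flat[flat_index(3, 4, 0)] = 1;
    assert_eq!(flat[flat_index(3, 4, 0)], 1);
    println!("ok");
}
```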
This is rare great content, as most YouTubers only cover the frontend. Thanks for this great resource on low-level performant code.
It's just so interesting to see you explain and talk about different concepts! I love these videos.
Mad props for publicly sharing your mistakes for the rest of us to learn from. It speaks to your depth of character. Thank you!
You've quickly become one of my favorite Rust centered channels. Well done!
That's why I love coding. No matter how good you think you've made something, there's always room for improvement and learning new things.
You can count the number of neighbours with convolution. In 2D it would mean sliding a 3x3 matrix [[1, 1, 1], [1, 0, 1], [1, 1, 1]] over every position of the cell matrix, multiplying the values in the same position, summing them, and putting the result in the center position. There are very fast algorithms to run this on the CPU and even faster ones on the GPU
@brdrnda3805
2 years ago
Can you please elaborate a little bit more? I really would like to understand.
@RottenFishbone
A year ago
@@brdrnda3805 Convolution is a pretty common technique for running some kind of matrix-defined algorithm in parallel on a large dataset, usually images. Each value of the matrix is basically an offset of a target cell holding a value to multiply by. So, given that the matrix is 3x3, the top row of the matrix would multiply [above-left, above, above-right] by 1's and sum them. The middle row would be [left, self, right], the value of 'self' is ignored, because the matrix is 0 at that position. The end result is you add 1 for each active neighbour i.e. sum of neighbours. You do this on each cell to build a new 2D array with the value of each cell being the sum of its neighbours. The benefit of convolution is that no cell's calculation depends on the calculation of any other cell. As such, you can have a kernel executing on each cell in parallel, potentially ALL at the same time. This is trivial to do in GPU API like CUDA or Vulkan.
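A minimal CPU version of the 2D convolution described above, assuming a small fixed-size grid where out-of-bounds neighbours count as zero (boundary handling is a free choice; wrap-around would also work):

```rust
const W: usize = 4;
const H: usize = 4;

/// Slide the 3x3 kernel [[1,1,1],[1,0,1],[1,1,1]] over the grid: each
/// output cell becomes the sum of its 8 neighbours (centre weight is 0).
fn neighbour_counts(grid: &[[u8; W]; H]) -> [[u8; W]; H] {
    let mut out = [[0u8; W]; H];
    for y in 0..H as isize {
        for x in 0..W as isize {
            let mut sum = 0;
            for dy in -1isize..=1 {
                for dx in -1isize..=1 {
                    if dx == 0 && dy == 0 {
                        continue; // kernel centre: the cell itself is ignored
                    }
                    let (nx, ny) = (x + dx, y + dy);
                    if nx >= 0 && ny >= 0 && nx < W as isize && ny < H as isize {
                        sum += grid[ny as usize][nx as usize];
                    }
                }
            }
            out[y as usize][x as usize] = sum;
        }
    }
    out
}

fn main() {
    let mut grid = [[0u8; W]; H];
    grid[1][1] = 1; // one live cell
    let counts = neighbour_counts(&grid);
    assert_eq!(counts[0][0], 1); // diagonal neighbour of the live cell
    assert_eq!(counts[1][1], 0); // the live cell itself has no live neighbours
    println!("{counts:?}");
}
```

As the comment notes, no output cell depends on any other output cell, which is what makes the kernel trivially parallel on a GPU.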
This is my favourite video of yours! Diving into code, showing alternatives, and explaining exactly why one is better for a certain goal. That's such a great way to actually learn and level up as a programmer!
0:00 Eyyy It's me!
This was a great video. Great at explaining how one should go about optimizations and performance tuning. A bit light on how to actually measure performance, but hey, that could be another video 🙂 Anyway, the message you wanted to convey is pretty clear and that's cool. It's also really cool how you handle feedback, it seems. I suck at handling feedback, especially feedback that criticizes my code. I need to become a lot better at that... Again, thanks for a great video!
Really loved that you took time to explain where you could have done better. Not a lot of youtube content where they dissect a cool project to look for optimizations. Great job:)
Just found your channel about a week ago. Cube World was my jam, full support for this game of yours! Keep up the awesome work, you're keeping the hopes of many fans alive.
@scoringdigitsson.5194
2 years ago
checkout veloren
Fantastic to see how much can be optimised! It seems like you've learnt some great tips here and I can't wait to see how it applies to your voxel game.
To actually see an influencer not get big headed and reject all possible advice by saying they don't need it, is really good. Usually you'd have so many people tell you how good you are that you'd stop caring enough for those who might help you improve. I think it's great that you're open to feedback and took it in the right way. Great stuff, easily one of my favourite channels
Hmm, I wonder how this would look with Rayon. For testing, I did a simple mandelbrot program, and then simply plugged in Rayon, and it scaled really well. Maybe a project for next weekend :D
@joshxwho
2 years ago
I love rayon, for my CLI apps I build it makes multiprocessing super easy.
@m.sierra5258
2 years ago
Mandelbrot is embarrassingly parallel though, it's a whole different story if you have a border exchange like in cellular automata. That's where the real difficulty of efficient multithreading begins :)
@SbF6H
A year ago
Rayon? That shit which is for toy programs? Sure.
@adamhenriksson6007
A year ago
Rayon feels kinda like using go func in Go. So simple and easy
@Andrew90046zero
A year ago
I've been trying to use Rayon to multithread another type of simulation I'm making. And I used it on one part to see if there was any improvement. But it seems like it only reduced the performance of the simulation! I think there is still a lot I have to figure out...
A good story with a proper lesson, you're a great storyteller Tantan! Thank you for the quality entertainment.
6:11 2.05 us = 2050 ns, not 205,000 ns, so it is a 7.5x improvement, which is still pretty worthy
@JackEnneking
3 months ago
@rishabhkhemu He's correct on hashmaps at 3:10: 35 us = 35,000 ns.
This code review style video is a really insightful way of understanding code. More please.
0:01 my typo will live forever! Great video! I read through the whole repo after the last one. Wish I could have contributed but I came after all the easy work was taken 😂. I’d love to see more videos like this! ❤
I really enjoyed this breakdown. Also nice way of looking at the code with editing.
Let's summarize what I've seen: You had an idea and shared your approach on your YT channel. You managed to write functional code and you managed to deliver a result. Others (in your team, bc c'mon, we're all happy coder buddies) felt the urge to improve your well-intended path to success with a more robust and reliable solution. I'm impressed to see you entertaining us with your learning, thank you for that! Because that story would otherwise be a fantastic and totally realistic representation of why 80% of all software projects fail: skipping non-functional requirements, skipping testing, skipping error handling, skipping the path from knowledge to understanding, skipping quality at all, keeping the total bus factor below 1. So congrats, you just got a new subscriber who's speaking about such topics at community events. :)
Along the lines of your "Measure before you optimize" insight, Gene Amdahl (chief architect of the IBM S/360 mainframe in the 1960s) gave us Amdahl's Law, paraphrased: the maximum speedup you can get from optimizing one part of a system is bounded by the fraction of time the system spends in that part.

(Speedup here means the ratio of runtimes before and after optimizing. If you reduce your runtime from ten seconds to five, the speedup is 10/5 = 2.0x; reducing from ten seconds to one second is a speedup of 10/1 = 10x.)

Say you have a choice between halving disk IO time or cutting CPU time by 90%, and your system spends 80 seconds out of 100 in IO, 10 seconds in CPU processing, and another 10 in miscellaneous work. It actually makes far more sense to optimize the IO subsystem: halving the IO time gives 100 / (40 IO + 10 CPU + 10 misc) = 100 / 60, a speedup of about 1.67x, while the CPU optimization only gives 100 / (80 + 1 + 10) = 100 / 91, about 1.10x.

These are things you'd only know -- where your program is spending its time, and doing what in which inefficient ways -- if you've properly benchmarked or otherwise measured your application's performance characteristics. An insight like Amdahl's Law also gives you a place to start looking for optimization opportunities, even if you never calculate an actual speedup: if your program spends 90% of its time in one subroutine, then it makes little sense to start anywhere else in your search for optimization targets, since you could eliminate every other part of the program entirely and only get about a 1.11x speedup.
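The arithmetic in the comment above can be checked with a tiny helper (the function name and signature are illustrative):

```rust
// Amdahl's Law as a calculation: `part` seconds of a `total`-second run
// are made `part_speedup` times faster; the rest is untouched.
fn speedup(total: f64, part: f64, part_speedup: f64) -> f64 {
    total / (total - part + part / part_speedup)
}

fn main() {
    // 80 s of IO made 2x faster: 100 / (20 + 40) ≈ 1.67x overall.
    let io = speedup(100.0, 80.0, 2.0);
    // 10 s of CPU made 10x faster: 100 / (90 + 1) ≈ 1.10x overall.
    let cpu = speedup(100.0, 10.0, 10.0);
    println!("optimize IO: {io:.2}x, optimize CPU: {cpu:.2}x");
}
```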
5:30 You could also just use a 27-bit bitset stored in a 32-bit value. Moving 27 booleans (27 bytes) around is not efficient, and larger values make cache misses, which are expensive to resolve, more likely.
@DanDeebster
2 years ago
Yup, all you need to do is decide whether to write something yourself or use bitvec or bit-vec or bitvector or bv or bitvec_simd or bitsvec or fixedbitset or smallbitvec or awint or bitvec-rs or indexed_bitvec or ...
@user-dh8oi2mk4f
A year ago
@@DanDeebster just use a u32?
i love these kinds of programming videos! they're so informative and entertaining
Glad you're taking advantage of the Wintergatan creator license
Some awesome conclusions everyone can learn from :D . Cool!
duuuuuuuuude this video is awesome and made so many connections for me from my computer science education. super cool content and quality!!!
@Tantandev
A year ago
thank you thank you!
Nice video! You could probably multithread the full calculation if you use a "double buffer" system: *two* copies of the simulation data, one is the "previous" (read-only from all threads) the other is "next" (each thread mutably borrows a section). For each step, swap the buffers, split_mut the write buffer, and fan out to N threads that each read from the read buffer and writes cell states to the write buffer for its borrowed section.
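A minimal sketch of the double-buffer scheme this comment describes, reduced to 1D for brevity and using std::thread::scope from the standard library; the toy rule and chunking are illustrative, not the video's actual automaton.

```rust
use std::thread;

/// One simulation step: every thread reads all of `prev` immutably and
/// writes only its own disjoint chunk of `next`. No locks, no atomics.
fn step(prev: &[u8], next: &mut [u8], threads: usize) {
    let n = prev.len();
    let chunk = (n + threads - 1) / threads;
    thread::scope(|s| {
        for (i, slice) in next.chunks_mut(chunk).enumerate() {
            s.spawn(move || {
                for (j, cell) in slice.iter_mut().enumerate() {
                    let idx = i * chunk + j;
                    // Toy 1D rule: sum of the two neighbours, wrapping.
                    *cell = prev[(idx + n - 1) % n] + prev[(idx + 1) % n];
                }
            });
        }
    });
}

fn main() {
    let prev = vec![0u8, 1, 0, 0, 1, 0, 0, 0];
    let mut next = vec![0u8; prev.len()];
    step(&prev, &mut next, 4);
    println!("{next:?}");
    // For the following generation, swap the roles of the two buffers.
}
```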
I came across you while I was looking for an OpenGL manual. You're a very nice guy
dude, your videos are crazy good!
I was just about to make the same mistakes as you when I saw this video! I was able to take this and apply it on my project. Took it from 18s to 450ms!
great video! Generally when optimizing code I look, in order, for: 1) Can we do less work (e.g. fewer iterations) 2) Can we avoid doing work (e.g. caching) 3) Can we spread the workload out (e.g. concurrency)
Microbenchmarking is useful for measuring if your optimisations actually helped, the other important tool to use is a profiler. Profiling your application will tell you where most time is being spent. Profiling your application for the first time almost always finds some things that could be more efficient.
I didn't know much about what's in this video, but it rejuvenates my love for programming. It made me remember why I fell in love with programming in the first place. Thanks!
"What gets measured, gets managed." - Some body building quote I have applied to most aspects of my life.
Arrays continue to blow my mind
Hash maps only give you a performance increase in the use case where you need to do a fast search and there is no way of deriving the value's location in memory from a previously accessed one; the alternative being iterating through an entire array just to find the location of a single key/value pair
Sweet tutorial fam !
Great video as always!
that was quick, nice vid!
@Tantandev
2 years ago
My unofficial mentor. Thank you so much for the help!
@leddoo
2 years ago
@@Tantandev any time, my man!
@leddoo
2 years ago
looking forward to the new voxel stuff 👌
The code doesn't look like "spaghetti" any more. That "optimized" method for example, looks quite beautiful. It's clear you've got a lot of experience now, even though we always keep learning how to do some things better.
The simple way of parallelizing is having two data buffers: read from one, write to the other. No mutexes or atomics needed, just par_iter on the mutable array. For next generation, swap the buffers.
Loved the video, thank you! Also can't wait for the voxel game video, sounds super exciting! :D
Good thing I never have to worry about writing the optimal solution 😎
this was awesome. learn a lot! thanks!
Another optimization (which you kinda already touched on) is to avoid looping through the neighbours to count them and instead do that using constant offsets and adding them all together in one line. Saves branching and writes.
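One way to sketch this constant-offset idea for a flat SIZE^3 grid: the 26 neighbour offsets become compile-time constants, so an interior cell's neighbour sum is a fixed-trip-count loop with no coordinate branching, which the compiler can unroll. Boundary cells are deliberately ignored in this sketch.

```rust
// Flat x + SIZE*(y + SIZE*z) layout; the 26 neighbour offsets are
// computed once at compile time.
const SIZE: isize = 8;

const fn make_offsets() -> [isize; 26] {
    let mut out = [0isize; 26];
    let mut i = 0;
    let mut dz = -1;
    while dz <= 1 {
        let mut dy = -1;
        while dy <= 1 {
            let mut dx = -1;
            while dx <= 1 {
                if !(dx == 0 && dy == 0 && dz == 0) {
                    out[i] = dx + SIZE * (dy + SIZE * dz);
                    i += 1;
                }
                dx += 1;
            }
            dy += 1;
        }
        dz += 1;
    }
    out
}

const OFFSETS: [isize; 26] = make_offsets();

/// Neighbour sum for an interior cell: a fixed 26-iteration loop.
fn neighbour_sum(grid: &[u8], i: usize) -> u32 {
    let mut s = 0u32;
    for &off in OFFSETS.iter() {
        s += grid[(i as isize + off) as usize] as u32;
    }
    s
}

fn main() {
    let mut grid = vec![0u8; (SIZE * SIZE * SIZE) as usize];
    let center = (4 + SIZE * (4 + SIZE * 4)) as usize;
    grid[center + 1] = 1; // one live neighbour
    assert_eq!(neighbour_sum(&grid, center), 1);
    println!("ok");
}
```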
Love your videos Tantan, I laughed at the Vec, that was great
@Tantandev
2 years ago
Truly beautiful code
Awesome stuff. Subscribed!!
Your video absolutely amazing entertaining and educational
Speaking of addressing and layout strategies, I wonder if you could encode the data on a hilbert curve. The "lindel" crate has functions that can efficiently convert between points in the cartesian space to an index along the curve. This might be nice for cache coherency.
So many great lessons you've learned. I can *definitely* co-sign the last one at 11:00. This should be taught at Uni. Unfortunately their teaching is generally bad.
Found your channel yesterday and somehow it is perfect for my ADHD
Wisdom at 3:25: "Oh it's empty! I will get some water instead, then." Much better than too many cups!
Definitely a nitpick, but the "slow" code at 2:40 uses HashMap::iter, which is very different from indexing the elements individually in order, which is what you appear to have benchmarked (indexing is what was actually being done in the simulation code, where HashMap::iter (i.e. one-by-one iteration through cells in arbitrary order) definitely wouldn't have been applicable, which is why I say this is a nitpick). I haven't done benchmarks to confirm this, but I would expect map.iter() to be about as fast as iterating through a slice of map.capacity()* elements, each mem::size_of::<(K, V)>() bytes in size. (The std hash map implementation appears to use a flat array of (K, V), so compared to a Vec<V> the need to store the keys makes it a bit bigger and thus less cache-efficient, and that's about it.) Of course none of this applies to one-by-one indexing, where hashing and perhaps the chaotic memory access pattern are the main sources of overhead. *Actually, the number of buckets. Apparently these are not the same thing...
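The distinction this comment draws (a linear walk over the map's backing storage vs. a hash and probe per key) can be seen with a quick, unscientific timing sketch. Absolute numbers vary by machine, so nothing is asserted about them here.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Linear walk over the map's backing storage.
fn sum_by_iter(map: &HashMap<usize, u64>) -> u64 {
    map.values().sum()
}

/// Hash + probe on every single access.
fn sum_by_index(map: &HashMap<usize, u64>, n: usize) -> u64 {
    (0..n).map(|k| map[&k]).sum()
}

fn main() {
    let n = 100_000;
    let map: HashMap<usize, u64> = (0..n).map(|k| (k, k as u64)).collect();

    let t = Instant::now();
    let a = sum_by_iter(&map);
    let iter_time = t.elapsed();

    let t = Instant::now();
    let b = sum_by_index(&map, n);
    let index_time = t.elapsed();

    assert_eq!(a, b);
    println!("iter: {iter_time:?}, per-key indexing: {index_time:?}");
}
```

For trustworthy numbers, a harness like Criterion (as used in the video's benchmarks) is the right tool rather than one-shot Instant timings.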
i feel at home with these coding videos that i don't understand but know will make sense after some time
What is your greenscreen? It looks well made and I'm looking for one!
Haven't watched the whole video yet, but I have to address an issue. If you want to create a multidimensional Vec and initialize it with a specific default value, the code at 2:00 and 2:05 is unnecessarily bloated. First, there is the vec! macro that works exactly like array initialization. Second, you can index Vecs the same way you index arrays. Given these facts, the code can be written like this:

fn main() {
    const SIZE: usize = 10;
    let states: Vec<Vec<Vec<u8>>> = vec![vec![vec![0; SIZE]; SIZE]; SIZE];
    let cell = states[0][4][3];
    println!("{}", cell);
}

which is pretty much the same as with arrays.
Wow changing wrapper around data can make code run faster!? Mindblowing
Great Video
I haven't tried Rust yet, but this videos is interesting as hell
Excellent !! Thank you. As Knuth said: "Premature Optimization is the work of the devil"
One of the first things my first boss taught me about development was, "Make it work, make it fast, make it beautiful." It's just by far the best approach. Since then, my development workflow has been hacking together something that just does the job, then I optimize it until it does the job well. Then I throw it all away (refactor it) into an architecture that fits the solution.
Amazing attitude!
lol classic example of pre-optimization going horribly wrong! Good thing you got it and listened to people trying to help
That Wintergatan background music... :)
@Tantandev
2 years ago
many people comment when I use his music. Happy to see other Wintergatan fans :)
So, funny thing, both the YouTube algorithm and me thought this was about the video game Rust. I thought: weird topic, but okay, I'll bite. (It doesn't help that the thing in the thumbnail looks like the heightmap for a video game island)
hell yeah, he talked about the code!
You can optimize it even more if you consider that it's safe to run threads on cells that are 1 cell apart (No colliding neighbors) without any locks
Regarding how to store the data: I would use a hashmap of blocks (say 16x16x16). It would easily parallelize as well, as each block could be a task. And we could avoid atomics IMO, perhaps by presaving boundary data, which is an order of magnitude smaller than the actual data.
Look ma! I'm in a niche internet video!
Awesome video
man I love those boolean uses
nice video!
oh man, Wintergatan as BGM? I’m in.
Are you planning on adding examples in the repo? :-)
Could you make a video about your vscode setup?
Hello, simple question, but can't you just use one single array allocated with like 1x10^N slots and access the element like elem[203] for elem[2][0][3]? It would be more memory consuming but not time consuming, right?
A nested vector is no problem if you just put it in a wrapper type with a simplified API... like get_cell(x, y, z).unwrap() or expect_cell(x, y, z) if you want to drop the unwrap
ayyyy, Wintergatan Music...
awesome. I learned a lot
I can tell you from looking at that that the cached vectors insanity is slower than just returning the values if you don't do something crazy.
nice video :)
if the bounds are limited and the maximum bound a user can select does not exceed an absurd amount of memory, then you could still do the array method; just always allocate the maximum, right?
you could store cells in a 3d matrix as a chunk, and have a hash map against those chunks
All I was thinking when you used hashmaps and arrays and double passes was: convolution. Use convolution kernels! Instant speed and threading.
lmao the music when the multi threading code comes up.
3:30 the bane of every coffee addict. Get a cup and it basically is empty before you even sit down and start programming
Consider using a hashmap not to store the voxels themselves, but to store chunks of voxels. This way you can have expandable chunks of area without paying the performance cost of a hashmap lookup for every single voxel
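A sketch of this chunked layout, with an illustrative 16^3 chunk size: one hash lookup finds the chunk, then voxels inside it are a flat array. div_euclid/rem_euclid keep negative world coordinates working. All names here are made up for the example.

```rust
use std::collections::HashMap;

const CHUNK: i32 = 16;
const VOL: usize = (CHUNK * CHUNK * CHUNK) as usize;

// Hypothetical wrapper: one hash lookup per chunk, flat indexing inside.
struct World {
    chunks: HashMap<(i32, i32, i32), Box<[u8; VOL]>>,
}

impl World {
    fn new() -> Self {
        Self { chunks: HashMap::new() }
    }

    /// Split a world coordinate into (chunk key, flat index inside chunk).
    fn split(x: i32, y: i32, z: i32) -> ((i32, i32, i32), usize) {
        let key = (x.div_euclid(CHUNK), y.div_euclid(CHUNK), z.div_euclid(CHUNK));
        let (lx, ly, lz) = (x.rem_euclid(CHUNK), y.rem_euclid(CHUNK), z.rem_euclid(CHUNK));
        (key, (lx + CHUNK * (ly + CHUNK * lz)) as usize)
    }

    fn set(&mut self, x: i32, y: i32, z: i32, v: u8) {
        let (key, idx) = Self::split(x, y, z);
        let chunk = self.chunks.entry(key).or_insert_with(|| Box::new([0; VOL]));
        chunk[idx] = v;
    }

    fn get(&self, x: i32, y: i32, z: i32) -> u8 {
        let (key, idx) = Self::split(x, y, z);
        self.chunks.get(&key).map_or(0, |c| c[idx]) // missing chunk = empty
    }
}

fn main() {
    let mut w = World::new();
    w.set(-5, 100, 3, 7);
    assert_eq!(w.get(-5, 100, 3), 7);
    assert_eq!(w.get(0, 0, 0), 0);
    println!("chunks allocated: {}", w.chunks.len());
}
```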
Very interesting vid. Thanks for sharing :) slightly random question: if one wanted to learn more about the various mathematics required for this kind of thing what resources would you recommend? Asking for a friend
You are storing an int, FFS. Just make the array as large as your potential maximum, allocate it at startup and move on. A single megabyte gives you 100*100*100 8-bit ints. We have done this in the real world: a stock exchange gateway risk checker, processing 20,000 messages per second on each of up to 20 bi-directional sessions, with wire-to-wire latency of less than 20 microseconds, sometimes less than 1 microsecond (non-commodity hardware). We had a multi-dimensional array of financial instruments containing market data (pointers). We just allowed for double the number of possible symbols and allocated a 15 GB array on startup. If you want performance, use static memory allocation or face doom.
@1over137
A year ago
Also, remember (while this might not work in Rust)... you can allocate a flat single-dimensional array of data and then create as many indexes as you want to access those pointers in whichever way is performant. Obviously this works best when you know the bounds of that data first, though.
It would be great if you created a blog with articles about optimisations
Instead of a hashmap you could use an array set to the maximum size you're planning to use, and then make a getter/setter that clamps your values to the bounds set in the program. Should certainly be faster to iterate over than a hashmap, and depending on the hash function could likely be less memory used.
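A sketch of this clamped getter/setter idea, with an assumed maximum size: the backing array is allocated once at the maximum, and coordinates beyond the bounds are clamped rather than causing a panic.

```rust
// Hypothetical fixed-capacity grid with clamping accessors.
const MAX: usize = 64;

struct Grid {
    cells: Vec<u8>, // MAX^3 cells, allocated once up front
}

impl Grid {
    fn new() -> Self {
        Self { cells: vec![0; MAX * MAX * MAX] }
    }

    fn idx(x: usize, y: usize, z: usize) -> usize {
        // Clamp instead of bounds-checking; out-of-range coordinates
        // collapse onto the boundary cell.
        let (x, y, z) = (x.min(MAX - 1), y.min(MAX - 1), z.min(MAX - 1));
        x + MAX * (y + MAX * z)
    }

    fn get(&self, x: usize, y: usize, z: usize) -> u8 {
        self.cells[Self::idx(x, y, z)]
    }

    fn set(&mut self, x: usize, y: usize, z: usize, v: u8) {
        self.cells[Self::idx(x, y, z)] = v;
    }
}

fn main() {
    let mut g = Grid::new();
    g.set(1000, 0, 0, 9); // x clamped to MAX - 1
    assert_eq!(g.get(MAX - 1, 0, 0), 9);
    println!("ok");
}
```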
tantan, 2.05us is not 205000 ns, it's 2050 ns. 1us = 10^-6 s and 1ns = 10^-9 s, so 2.05us = 2.05·10^-6 s = 2.05·10^3·10^-9 s = 2050·10^-9 s = 2050 ns. I never comment on YouTube but this triggered me, sorry lol. I love your videos
@mfyoutube5003
A year ago
ok so this is why you finish the video before commenting friends.
incremental iteration rocks :)
2:00. Why instantiate an object to 0 in O(n), when there is calloc()?
Waiting for C++23's and C++26's flat_map and flat_set. Until then we can probably use a map to map the key to an index into a vector or flat array, so you still have a flat array to iterate if you choose, plus constant-time access.
@ronaldweasly561
2 years ago
O(1)?
@multiHappyHacker
2 years ago
@@ronaldweasly561 that sounds more like unordered_map territory
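Translated to Rust (this thread is about C++, but the video's code is Rust), the interim workaround described above, a map from key to index into a flat vector, might look like this sketch. Names are illustrative, and duplicate keys are not handled.

```rust
use std::collections::HashMap;

// Hypothetical container: values live contiguously; the map only stores
// an index into the flat Vec.
struct IndexedVec<V> {
    index: HashMap<String, usize>,
    values: Vec<V>,
}

impl<V> IndexedVec<V> {
    fn new() -> Self {
        Self { index: HashMap::new(), values: Vec::new() }
    }

    fn insert(&mut self, key: &str, value: V) {
        self.index.insert(key.to_string(), self.values.len());
        self.values.push(value);
    }

    fn get(&self, key: &str) -> Option<&V> {
        self.index.get(key).map(|&i| &self.values[i])
    }
}

fn main() {
    let mut m = IndexedVec::new();
    m.insert("a", 1);
    m.insert("b", 2);
    assert_eq!(m.get("b"), Some(&2));
    // Iteration touches only the flat, cache-friendly Vec:
    let total: i32 = m.values.iter().sum();
    println!("total = {total}");
}
```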