How to write efficient and fair multi-threaded programs?

Ғылым және технология

System Design for SDE-2 and above: arpitbhayani.me/masterclass
System Design for Beginners: arpitbhayani.me/sys-design
Redis Internals: arpitbhayani.me/redis
Build Your Own Redis / DNS / BitTorrent / SQLite - with CodeCrafters.
Sign up and get 40% off - app.codecrafters.io/join?via=...
In the video, I discussed the importance of writing good multi-threaded programs for optimal performance. I emphasized the need for correctness, fairness, and optimality in multi-threaded code. By using locking mechanisms and atomic instructions, we can ensure correctness by avoiding race conditions. Additionally, ensuring fairness among threads is crucial to optimize performance. Through examples of counting prime numbers sequentially versus using multiple threads, I demonstrated how fairness can significantly enhance program efficiency. By distributing work evenly among threads, we can achieve faster and more efficient multi-threaded programs.
Recommended videos and playlists
If you liked this video, you will find the following videos and playlists helpful
System Design: • PostgreSQL connection ...
Designing Microservices: • Advantages of adopting...
Database Engineering: • How nested loop, hash,...
Concurrency In-depth: • How to write efficient...
Research paper dissections: • The Google File System...
Outage Dissections: • Dissecting GitHub Outa...
Hash Table Internals: • Internal Structure of ...
Bittorrent Internals: • Introduction to BitTor...
Things you will find amusing
Knowledge Base: arpitbhayani.me/knowledge-base
Bookshelf: arpitbhayani.me/bookshelf
Papershelf: arpitbhayani.me/papershelf
Other socials
I keep writing and sharing my practical experience and learnings every day, so if you resonate then follow along. I keep it no fluff.
LinkedIn: / arpitbhayani
Twitter: / arpit_bhayani
Weekly Newsletter: arpit.substack.com
Thank you for watching and supporting! it means a ton.
I am on a mission to bring out the best engineering stories from around the world and make you all fall in
love with engineering. If you resonate with this then follow along, I always keep it no-fluff.

Пікірлер: 71

@sourabhsingh5286 Жыл бұрын
Thanks for bringing back the joy of learning about computer science. It had become too much about placements/packages etc… 💪🏼🙏
@yogeshedekar60787 ай бұрын
This is so darn interesting honestly and moreover I was coding side by side in java as you were explaining. I have been running away from threads right from my college days and it seems like first time some one has explained the intricacies of multi-threading to me after speding 15 years in IT. The reveletion seems fantastic. I am running JDK 15 on a 64 bit windows 10 :) and it took me 38 seconds with fair parallelism with 10 threads. Thanks a lot for this wonderful series.
@biswaprakashmishra398 Жыл бұрын
Had to watch the 3rd approach 2 to 3 times to understand how the atomic increment is happening and each goroutine is getting a number to determine if it is prime or not. Absolutely brilliant. Excited for this series.
@Aryan0 Жыл бұрын
This video is super helpful. I'm a CS sophomore but I have also learned a lot from your journey and also the importance of getting the basics right in computer science field.
@MurtazaLokhandwala7 ай бұрын
Very well explained! This is really helpful. Thanks a lot!
@shishirchaurasiya7374 Жыл бұрын
Its really really amazing the last 3rd explanation where you explained keeping the threads busy all the time utilizing the CPU cycles as much as we can was the pure WOAH !! wala moment... Thanks a lot
@subhamtripathi19979 ай бұрын
this is just another level for attention to details!
@saxena11ashish7 ай бұрын
And this is why I love engineering. Perfect explaination for something that is valuable but not given enough thought 💯
@tedcse7 ай бұрын
Beautifully explained!
@piyushtyagi2131 Жыл бұрын
Brilliant Brilliant explanation ❤️ thanks for sharing.
@akshayrr3061 Жыл бұрын
Amazing content! Waiting for more videos on concurrency
@nithinsastrytellapuri291 Жыл бұрын
thanks for making this free
@akshaymuthal3693 Жыл бұрын
Watching your videos is like an eye opening experience in the world of CS!
@AsliEngineering
Жыл бұрын
Exactly my motive. I want people to fall in love with computer science and become better engineers. People have become really shallow and I want to change it.
@harshitbangar123 Жыл бұрын
Another strategy can be for n threads picking nth number. Each thread (or worker) maintain its own version of counter (thread local) and do the aggregation in the end.
@jayaramans7398 Жыл бұрын
Well curated content. Can you please explain how this use resources in machine cpu
@sachinjindal4921 Жыл бұрын
Awesome
@faizm4765 Жыл бұрын
Just like always beautifully explained Arpit! Learnt something new and definitely going to marinate and play around with it more.
@adwaithks
Жыл бұрын
😊
@adwaithks
Жыл бұрын
😊
@sraynitjsr10 ай бұрын
Awesome 🔥
@Pro3512 Жыл бұрын
Awesome @Arpit.Just one thing can we not use lock for incrementing prime numbers instead of atomic for mutilthreading ?
@NaveenSingh-ri4mu10 ай бұрын
Thanks!
@ShrihariShetty11 Жыл бұрын
Really Loved your Explanation❤. Fairness among threads is something I never thought of, you just unlock areas in brain🧠 to get deeper understanding /perspective on problems and patterns. I went ahead and implemented the above in Java, made a lot of mistakes but learned a lot in that process. I tried a slightly different approach 🤔 where I gave an additional responsibility to individual threads. The responsibility was that the thread should also return the computed value(count of prime numbers) for the batch of numbers that was assigned to it. Lastly I just sum up the individual values to get the final answer. The program took 4 seconds to compute all the prime numbers between 1 and 100 million. Number of threads used were 5. Environment: Mac M1 8 core. It will be really helpful if you also share your thoughts(best practices/scope) on using Intrinsics (SIMD instructions) for solving a similar problem.
@AsliEngineering
Жыл бұрын
I am so happy reading this. thanks a ton :) 5 seconds seems too less :) may be it is M1's magic :) also, given you have 8 cores, you should have at least 8 threads. also, by returning the count from each thread, you are reducing the overhead of locking and/or atomic update, which is a good decision.
@shaurya3284
Жыл бұрын
@@AsliEngineering i too did all the 3 variations in java, while the first two(single thread - 62 seconds, batched multithreading 11 seconds on a macbook pro m1 (2022)) worked exactly how i'd expect it to, while trying out the fairness logic, it is taking even more time because each time the checkPrime(num) is called, in order to ensure thread safety we need to add this logic under synchronized in java else we might get a wrong ans, if i am wrong please do correct me and i believe this is why it is also taking more time (253 seconds)
@PiyushBajaj05
11 ай бұрын
@shaurya3284 Is it possible for your Java solution for the same, want to validate on this?
@MrSnippyv Жыл бұрын
The idea of using atomic integer to distribute the integers to the threads in the fair implementation was pretty neat 😅, I had the idea of using a wait-free, lock-free queue to distribute the numbers (I was not seeing them as just numbers, I was seeing them as "inputs", so using a queue to distribute them made sense to me when I thought about it). I was kinda surprised when I saw the difference in the execution time of fair and the unfair implementations, it was not as substantial as I imagined, may be it will be more substantial if the limit increases. Also, would it make a difference if the checkPrime() didn't use a shared atomic integer, rather each thread will gets its own int variable to increment, and you add up the values of all those integers at the end? I should take a look at the assembly of atomic add.
@vibhuyadav6841 Жыл бұрын
Thank you for Sharing this. This is such a beautiful explanation. I have learnt a lot from this it was really helpful. I also went ahead and implemented the same in Java. Mind blowing Results ❤. Awesome.
@PiyushBajaj05
11 ай бұрын
@vibhuyadav6841 Is it possible for your Java solution for the same, want to validate on this?
@1711nitish Жыл бұрын
Another issue with spawning unnecessary threads is this could lead to CPU spikes in large scale applications .
@abhishekmehandiratta4241 Жыл бұрын
We could also have a channel with a buffer size of 100million and push every number to that channel. This could serve as the input channel for the goroutines and each of those goroutines could iterate over values of that channel and process the number. The main goroutine could close the channel immediately after pushing all the 100mil values. This way we won't need a global shared variable to keep track of the current number. This approach is very similar to a queue based approach with Kafka/RabbitMQ. Great video btw!
@nikhilneela
9 ай бұрын
wow! brilliant idea. Maintaining queue might be memory overhead for such small application, but yes can be a great idea for something big. this is a lock free solution
@chapristuff8460 Жыл бұрын
Hi arpit, interesting video. Would love to know how you would have sized your threadpool for a particular problem.
@AsliEngineering
Жыл бұрын
Size of threadpool is always a factor of outbound work ( writing to network, disk, or cpu computation) and it depends on the usecase at hand. I typically run a small test to tune the threadpool by having a very aggressive monitoring setup. Start small, and then increase until my IO maxes out.
@shaktijain85603 ай бұрын
Hi Arpit, I really enjoyed your amazing video and appreciated the clear explanation of concepts. However, I couldn’t help but notice the relatively low view count. A small suggestion from my end would be to include code in the video, particularly in Java or C++, as I believe it would help viewers relate more.
@nandagopal.057 ай бұрын
This is really helpful. A small change in approach makes a huge difference. On a similar note, i was trying to code where i will be able to use near 100% of my cpu capacity. So basically if i have some work to do that i can distribute among processes and threads, how i can achieve this in a minimum amount of time by utilising 100% of my computing power. I am not sure whether my thinking is correct but would love to get some clarification to my thinking here
@adsharma916 Жыл бұрын
Would be nice to see how this scales with concurrency factor. There should be an upper bound. So what happens when we have 10, vs 20 vs 50
@harishreddy3310 Жыл бұрын
I think it would be faster with input channel which just inputs each number and a list of workers taking the numbers from the channel and incrementing their local counter and at the end outputting their count to another channel, you won't need to do atomic operations so frequently.
@anishdabhane913220 күн бұрын
Hi Aprit sir, i discovered this video through ur recent tweet on "NON-ATOMIC NATURE OF COUNT++" , the video was really fascinating . Recently I got exposed to mutex and semaphore in college and assembly language was taught in 2nd year itself . Through this video i ended up diving deep into how cpu works (cores and threads) and after spending 2-3 days finished C++ implemention of the same. Thanks a ton for the awesome content! I'm excited to finish the CONCURRENCY playlist and write some blogs about it too.
@AsliEngineering
15 күн бұрын
Keep digging deep, because that's where the magic happens :) Glad to see you are chasing the right things in the early stages.
@anishdabhane9132
15 күн бұрын
@@AsliEngineering 💯
@liquidmetal718 Жыл бұрын
Aren't you using go routines instead of threads ? Go scheduler has this m:n mapping between threads and go routines right ? Also, go-routines get scheduled using that work stealing algorithm on top of those threads. We didn't modify the thread count & only changed the go routine count. Maybe we can make it better by modifying both. In Java or Rust we use threads but not co-routines. Do let me know if i'm wrong.
@srirangamikshwak9471 Жыл бұрын
How to decide on number of threads that we can use in any scenario? Does it always depend on number of cores? Any reference would be great.
@krsingh.shubham3 ай бұрын
Amazing
@rahulsangvikar7973Ай бұрын
Absolutely amazing video, however, I'd like to point out a slightly different way of writing a lock-free fair implementation. Say you're using n threads. Start ith thread from i, and increment the counter by n each time. This way you will always calculate isPrime for a unique number in all the threads and the numbers will be fairly distributed among all of them.
@shaifibansal502211 ай бұрын
Beautifully explained video but I have few doubts : can we know which method would be better for a function ? with threads or without threads, without running it and also why are threads always faster in this case as there is context switching between threads which also takes time but if we don't use threads then won't that thread switching time should get saved? Also there must be a point after which threads might become costly like in this case if we create threads equal to the number of digits for which we want to check for prime or not then it won't be efficient. So how can we find using how many threads will be best ? Really hoping for a response
@photon628 Жыл бұрын
Nice, write the examples in Go 👍
@DharanAditya Жыл бұрын
This was insightful video. A small great. In the 3rd approach, I see you are breaking the loop when x > MAX_INT. But since we are checking for prime number till 100M, should we breaking when x > 100M. Was it just for demonstration purpose to show the execution speed ? Or there is something i'm missing out
@AsliEngineering
Жыл бұрын
Max int is set to 100m it is not integer max. That's that global variable set.
@sachinmukherjee29 Жыл бұрын
Very good explanation. Is it possible to share this code repository?
@AsliEngineering
Жыл бұрын
not sure how to share them with members-only. let me figure out the logistics. meanwhile I would highly recommend you to write the same code and make mistakes. your understanding will multi-fold.
@parthtrehan838714 күн бұрын
But the atomic section still remains inside the parallel region. I think pulling it out would give us more benefit here.
@syedfarhan7105 Жыл бұрын
If we only have one CPU then will multi threaded approach gives any benefit? I think this will only work in a multi core settings. Is my understanding correct?
@eren-qu9uu7 ай бұрын
average of second approach is 45 second while in 3rd approach average is 51 so i think there is overhead of 5-6 seconds to pick next number using go automic for 100 million number.
@krishnarajsivaraj Жыл бұрын
the sum of the execution time of all threads in the 3rd approach is more than that of the 2nd approach. the increase in execution time is due to the atomic increment. We can improve the execution time by reducing the atomic increment as much as possible. You can see an improved version here - gist.github.com/kriznaraj/466291a46621507a09e9c3988a59147c ~15-20% improvement is seen in the overall execution time
@pratikkedar4476
Ай бұрын
Yeah its true, I wrote c++ routine to validate this at my end, second approach took 28 sec for 100M numbers where i see variable execution time for each thread, with approach 3 each thread is taking same time but overall time has increased, i believe this is due to atomic variable access! @arpit any comments ??
@mayanknegi24717 ай бұрын
Every thread is reading value from shared variable x in 3rd approach. Why aren't there duplicate counts? What if multiple threads read same x value and increase count multiple times for same prime number?
@AsliEngineering
7 ай бұрын
Because no two threads are picking up the same number, ever.
@miralshah476110 ай бұрын
Fare way is to use queue the numbers and each thread dequeue each number and process it , No need to increment number.
@arihannnt14 күн бұрын
Interesting! Thanks for sharing this, I was doing it wrong :|
@ayushanand90137 ай бұрын
Incredible, the 3rd approach is a standard use of mutexes. Although, as far as the problem is concerned (and limited to this problem only), can we do something in line with natural distribution of primes over the number line. This wikipedia article gives out an approximate number of prime numbers within say number N. Given, if we know the number of Worker Threads, and the maximum limit (which is the case in the question), we can distribute the number range with a logarithmic decay. Although that might be too much maths for a problem from a CS perspective. 😅 Ref: en.wikipedia.org/wiki/Prime_number_theorem
@AsliEngineering
7 ай бұрын
Yes. Totally.
@carlosluque475310 ай бұрын
Wooow that is a seriously good explanation!
@7guitarlover Жыл бұрын
wouldn't a number be prime if its not divisible by all the prime numbers that came before it ? ( barring the exception of 2 ) wont that be a better approach to find prime numbers instead of iterating over to sqrt( number) ?
@AsliEngineering
Жыл бұрын
yes but that's not the point.
@lakshminarayanannandakumar128610 ай бұрын
Hi Arpit, I was trying out your code with minor tweak as below From ``` x := atomic.AddUint64(&CURRENT_NUM, 1) // Breaking condition if x > N { break } ``` To ``` atomic.AddUint64(&CURRENT_NUM, 1) // Breaking condition if CURRENT_NUM > N { break } ```` However, the program yields inconsistent results. Could you please explain what is causing such a big difference?
@sankeerthnalam512 Жыл бұрын
One important thing to note here is that the example taken in the video is 100% CPU intensive(no I/O required). In these cases the performance of the application will depend on the no of cores in the machine.If the machine only has 1 core then no mater whatever approach we follow we will end up with approximately same time.
@rahulsangvikar7973
Ай бұрын
The unfair threading code will always run slower than the fair one, irrespective of the number of cores being used. The flaw is in the software itself, not the hardware.
@sarthakjain45717 ай бұрын
Awesome