Thundering Herd Problem and How not to do API retries

Science & Technology

System Design for SDE-2 and above: arpitbhayani.me/masterclass
System Design for Beginners: arpitbhayani.me/sys-design
Redis Internals: arpitbhayani.me/redis
Build Your Own Redis / DNS / BitTorrent / SQLite - with CodeCrafters.
Sign up and get 40% off - app.codecrafters.io/join?via=...
In the video, I explained the Thundering Herd problem that occurs when numerous clients simultaneously retry API calls, overwhelming the server. I discussed the implications of this issue and presented a solution involving adding random jitter and exponentially spacing out retries. This strategy prevents server overload and allows for better recovery. It's crucial to implement these techniques in retry logic to avoid server strain. The video aimed to educate viewers on this common problem and offer practical solutions for effective system design.
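For anyone who wants to see the retry strategy in code, here is a minimal sketch of exponential backoff with random jitter, assuming a generic call_api() callable and illustrative delay values (not taken from the video):

    import random
    import time

    def call_with_backoff(call_api, max_retries=5, base_delay=0.5, max_delay=30.0):
        # Retry call_api() with exponentially spaced delays plus random jitter.
        for attempt in range(max_retries):
            try:
                return call_api()
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after the last attempt
                # Exponential spacing: 0.5s, 1s, 2s, 4s, ... capped at max_delay.
                backoff = min(max_delay, base_delay * (2 ** attempt))
                # Full jitter: sleep a random amount in [0, backoff] so that
                # clients that failed together do not all retry together.
                time.sleep(random.uniform(0, backoff))

The jitter is what breaks up the herd: without it, every client that saw the same failure would wake up and retry at exactly the same instant.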
Recommended videos and playlists
If you liked this video, you will find the following videos and playlists helpful
System Design: • PostgreSQL connection ...
Designing Microservices: • Advantages of adopting...
Database Engineering: • How nested loop, hash,...
Concurrency In-depth: • How to write efficient...
Research paper dissections: • The Google File System...
Outage Dissections: • Dissecting GitHub Outa...
Hash Table Internals: • Internal Structure of ...
Bittorrent Internals: • Introduction to BitTor...
Things you will find amusing
Knowledge Base: arpitbhayani.me/knowledge-base
Bookshelf: arpitbhayani.me/bookshelf
Papershelf: arpitbhayani.me/papershelf
Other socials
I keep writing and sharing my practical experience and learnings every day, so if you resonate then follow along. I keep it no fluff.
LinkedIn: / arpitbhayani
Twitter: / arpit_bhayani
Weekly Newsletter: arpit.substack.com
Thank you for watching and supporting! It means a ton.
I am on a mission to bring out the best engineering stories from around the world and make you all fall in love with engineering. If you resonate with this then follow along, I always keep it no-fluff.

Comments: 28

  • @santoshreddy9628 · 1 year ago

    Thank you Arpit, it's very clearly and simply explained 👍

  • @aashishgoyal1436 · 1 year ago

    Thanks a lot Arpit. You explained it really well.

  • @architshukla8076 · 1 year ago

    Thanks Arpit, very good information!!

  • @hiteshbitscs · 1 year ago

    One thought Arpit: what if we use a circuit breaker? In case of many failures it goes to the OPEN state (meaning don't call downstream), and after a configurable time it goes to the HALF_OPEN state. If calls are still failing it goes back to OPEN and won't call downstream. If the downstream has recovered, it goes to the CLOSED state, which means calls are allowed again. Do you think this is also one of the solutions?
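    A minimal sketch of the state machine this comment describes (CLOSED → OPEN → HALF_OPEN); the class name, thresholds, and timeouts are illustrative assumptions, not something shown in the video:

        import time

        class CircuitBreaker:
            CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

            def __init__(self, failure_threshold=5, reset_timeout=30.0):
                self.failure_threshold = failure_threshold
                self.reset_timeout = reset_timeout
                self.failures = 0
                self.state = self.CLOSED
                self.opened_at = 0.0

            def call(self, downstream):
                if self.state == self.OPEN:
                    if time.time() - self.opened_at < self.reset_timeout:
                        raise RuntimeError("circuit open: not calling downstream")
                    self.state = self.HALF_OPEN  # allow one trial request through
                try:
                    result = downstream()
                except Exception:
                    self.failures += 1
                    if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
                        self.state = self.OPEN  # trip (or re-trip) the breaker
                        self.opened_at = time.time()
                    raise
                self.failures = 0
                self.state = self.CLOSED  # downstream recovered
                return result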

  • @kishorkumrbendi · 1 year ago

    This is also called the circuit breaker design pattern in microservices. We need to handle the API failures by using retries and jitter until the service is up, and meanwhile return a service-unavailable response to the client. Am I correct, Arpit?

  • @multi21seb · 3 months ago

    Hey Arpit, this is great, thanks. But what about the server's perspective? Suppose it's your server that is facing the thundering herd problem, how would you handle it in that case? We don't have control over the clients; they may or may not retry with backoff, or may just keep bombarding our server with requests. I felt this was missing from the video. A few things come to mind, do let me know what else I'm missing: rate limiting (start dropping requests altogether), tweaking the LB to distribute the load across different servers, the power-of-two-choices algorithm(?), caching, and auto-scaling your replicas could also help.
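    Of the server-side mitigations listed above, rate limiting is the simplest to sketch; here is a rough token-bucket example with made-up capacity and refill numbers (just an illustration, not from the video):

        import time

        class TokenBucket:
            # Shed excess load instead of queueing it: when the bucket is empty,
            # the caller should reject the request (e.g. 429/503) rather than serve it.
            def __init__(self, capacity=100, refill_per_sec=50):
                self.capacity = capacity
                self.refill_per_sec = refill_per_sec
                self.tokens = float(capacity)
                self.last = time.monotonic()

            def allow(self):
                now = time.monotonic()
                elapsed = now - self.last
                self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                return False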

  • @rajdave7357 · 5 months ago

    Awesome 👍❤ bro

  • @jaisamtani303 · 1 year ago

    My organization doesn't follow what is mentioned in this video; it handles it in a much better and simpler way, which we call "Activating/Deactivating Emergency mode".

  • @jilvasheth7127 · 1 year ago

    Can you please explain that in brief? Would like to know the other approach as well.

  • @jaisamtani303 · 1 year ago

    @jilvasheth7127 Sorry bro, I can't tell. If my company shares a technical blog then I will share the link with you.

  • @ajnabee01 · 1 year ago

    @jaisamtani303 Why not? That is not confidential.

  • @gmmkeshav · 1 year ago

    Brother, you are great 🙏

  • @adityasheth · 1 year ago

    Hi Arpit, thanks for the amazing video. I have a question: shouldn't the range of the random jitter you mention also increase exponentially as the number of retries increases? Otherwise the window of time in which the spike happens will still fall within a fairly small range, right?

  • @AsliEngineering · 1 year ago

    Not necessarily. But implementation can vary.
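    The two variants being discussed, sketched side by side (the constants are arbitrary): a fixed-width jitter window added on top of the exponential delay, versus "full jitter" where the random window itself grows with each retry:

        import random

        def delay_fixed_jitter(attempt, base=0.5, jitter=1.0):
            # Exponential backoff plus a jitter window of constant width.
            return base * (2 ** attempt) + random.uniform(0, jitter)

        def delay_full_jitter(attempt, base=0.5, cap=30.0):
            # The entire delay is random, so the spread widens with every retry.
            return random.uniform(0, min(cap, base * (2 ** attempt)))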

  • @Aditya-rs5dj · 11 months ago

    You teach amazingly well, brother.

  • @soumyaranjanpatel1346 · 1 year ago

    Now I know why the retries are scheduled like this. Your videos are amazing as always.

  • @Mrchnks · 11 months ago

    Nice

  • @LeoLeo-nx5gi · 1 year ago

    I believe we can use a queue/pipeline-based approach in that case (to avoid such bombarding), please correct me if I am wrong... great short video though!!

  • @AsliEngineering · 1 year ago

    Not everything can be modelled in a queue.

  • @gauravaws20 · 1 year ago

    This is what you do on the client side. Of course you still have to do something on the server side to avoid this. Using a queue could be one of the approaches, but then you're getting into a totally different kind of architecture. Load shedding, throttling, and server-side circuit breaking could be other approaches too.

  • @sarveshdubey9347 · 1 year ago

    I guess you would have learned this first at Amazon? We use exponential backoff + jitter for any retry at Amazon, and even the AWS SDK has it built in.

  • @AsliEngineering · 1 year ago

    No. I learnt it as a fresher while working on a latency-sensitive service.

  • @a..b1343 · 1 year ago

    Isn't there some amount of jitter already with different clients having different network speeds, machine specs etc?

  • @AsliEngineering · 1 year ago

    Yes, but that's still not enough in real-world scenarios.

  • @a..b1343 · 1 year ago

    Oh, got it. Thanks for clarifying!

  • @technogeek8306 · 1 year ago

    Even if we retry with jitter, what about the new requests that keep coming to our APIs? Our API server will still be flooded with requests waiting to be served, right? (Just trying to understand how we can solve that problem.)

  • @AsliEngineering · 1 year ago

    Yes, but not all retries will come at the same time, which gives the server the necessary breathing space.

  • @asifmohammed5436 · 1 year ago

    Do the following: at the server, if you receive more than x requests from clients within a period, send service unavailable as the response. At the client end, if you receive this status code, use exponential backoff. This will optimize both ends. Also make sure to have an auto-scaling deployment configuration with a min and max number of servers, in case the number of connections increases by some x percentage.
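    The client half of this suggestion could look roughly like the sketch below; the use of the requests library, the URL handling, and the constants are assumptions for illustration:

        import random
        import time

        import requests  # assumed HTTP client; any would do

        def get_with_backoff(url, max_retries=5, base=0.5, cap=30.0):
            # Back off exponentially (with jitter) whenever the server sheds load with 503.
            for attempt in range(max_retries):
                resp = requests.get(url, timeout=5)
                if resp.status_code != 503:
                    return resp
                time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
            return resp  # still unavailable after all retries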
