Spark Out of Memory Issue | Spark Memory Tuning | Spark Memory Management | Part 1
This video is part of the Spark Interview Questions Series.
Spark memory issues are one of the most common problems faced by developers, so during Spark interviews this is a very common question. In this video we will cover the following:
What is Memory issue in spark
What components can face Out of memory issue in spark
Out of memory issue in Driver
out of memory issue in Executor
How Spark's performance is impacted by Dynamic Partition Pruning
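Of the topics above, driver out-of-memory is the easiest to picture: actions like collect() pull every partition into the single driver process. A minimal plain-Python sketch of that pattern (no Spark required; the collect/take functions here are illustrative stand-ins, not Spark APIs):

```python
# Plain-Python sketch of the driver OOM pattern: a collect()-style
# action gathers EVERY partition into the one driver process.
# Pretend each list is a partition held by a different executor.
partitions = [list(range(1000)) for _ in range(8)]

def collect(parts):
    """Mimics rdd.collect(): concatenate all partitions on the driver."""
    out = []
    for p in parts:
        out.extend(p)          # driver memory grows with total data size
    return out

def take(parts, n):
    """Mimics rdd.take(n): stop as soon as n rows are gathered."""
    out = []
    for p in parts:
        for row in p:
            out.append(row)
            if len(out) == n:
                return out
    return out

driver_rows = collect(partitions)
print(len(driver_rows))            # 8000 rows now live on the driver
print(len(take(partitions, 10)))   # 10, regardless of total data size
```

The take-style access pattern is why sampling a few rows is safe while collecting the full dataset is not: the driver's footprint stays bounded by n rather than by the data.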
Here are a few Links useful for you
Git Repo: github.com/harjeet88/
Spark Interview Questions: • Spark Interview Questions
If you are interested to join our community. Please join the following groups
Telegram: t.me/bigdata_hkr
Whatsapp: chat.whatsapp.com/KKUmcOGNiix...
You can drop me an email for any queries at
aforalgo@gmail.com
#apachespark #sparktutorial #bigdata
#spark #hadoop #spark3
Comments: 94
So well explained, even the images were very useful. Thank you very much!
It is a great video. Content is very useful. Keep it up man 👍🏻👍🏻👍🏻
Great, please don't stop uploading new content!!
Pure content, great topic, informative, interactive and simple. Thank you!!
Can you please also show code to repartition and increase executor on dummy process by changing values so that you can show us the impact on the run time of the jobs ? That will be really great to understand concepts
very well explained, thank you
Thank you so much. I have been facing this question many times in recent days. 👍
@DataSavvy
3 years ago
Thanks :)
Great video.. perfect explanation
Is the 2nd part not there yet? Your videos are AWESOME!!! :D
Lots of respect for your content ❤️
@DataSavvy
3 years ago
Thanks mate
Neatly explained, thank you....
Great information...... 👏👏👏
recently discovered this channel. this is gold
@DataSavvy
3 years ago
Thanks Nikhil :)
@nivedita5639
3 years ago
True
Very useful videos. Thank you :)
@DataSavvy
3 years ago
Thanks Nisha
Very useful. Please keep making more such videos
@DataSavvy
3 years ago
Thanks Viraaj :)
Hi Sir, could you please make a video on the factors that decide the number of tasks, stages, and jobs created after submitting our application?
Your videos on troubleshooting are pretty good.
@DataSavvy
3 years ago
Thanks Sree Ram... :)
Great video. Can you share the source of information for further reading?
Can you explain to me the difference between YARN memory overhead vs Spark reserved and user memory?
Thank you. Can you make a video about what Azure SQL is?
Very nice video!! thank you
@DataSavvy
2 years ago
Thanks Ajay
Great video! I had a question regarding the yarn memory overhead. When a pyspark job runs, my understanding is that python worker processes are started within the memory allocated to the executor. JVM then sends data back and forth to these python processes. Won't the allocated python objects use the memory of these python processes instead of the yarn memory overhead?
@Fresh-sh2gc
2 years ago
The worker nodes run on resources granted by YARN. YARN normally runs on a shared cluster, so there is always a tug of war between the tenants of the cluster for memory; as a result, one cannot always use too much memory. However, when there is ample YARN memory, there is a process called preemption which gets more memory for the executors.
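On the original question about Python worker memory: a related knob worth knowing is spark.executor.pyspark.memory (a real Spark 2.4+ property), which gives the Python workers their own cap; when it is unset, their usage is accounted against the memory overhead. A hedged sketch — the values and the pyspark_job.py filename are illustrative only:

```shell
# Illustrative values only -- tune for your cluster.
# spark.executor.pyspark.memory (Spark 2.4+) caps Python workers separately;
# without it, Python worker memory competes with everything else in the overhead.
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.executor.pyspark.memory=1g \
  pyspark_job.py
```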
Is there any real-time Spark project? Please upload a video on it. It would be helpful.
Very helpful
You are one of the best mentors I have ever seen on YouTube. The way you explain is awesome, and all real-time questions. If my cluster memory is 10 GB and the data we want to process is 20 GB, will it process the data? Sir, can you please explain this topic?
@medotop330
3 years ago
No, you cannot process it
@medotop330
3 years ago
You can do it using MapReduce if it is in the batch layer and does not use iterative algorithms like machine learning algos
Waiting for Part 2 :)
@DataSavvy
3 years ago
Will come soon :)
amazing
Can you please give an example of each OOM you have explained here? Lots of blogs give the same explanations; what is extra here? Please provide examples. It would be great.
Nice video, Sir. And a mostly asked question in interviews. Could you please make one video related to other issues we face in Spark?
@DataSavvy
3 years ago
Sure Sambit... Do you have any other suggestions on questions?
@sambitkumardash9585
3 years ago
@@DataSavvy Could you please explain how to deal with semi-structured data, from ingestion to computation?
Recruiters say that you don't have production experience and POC Spark work will not help. How can we convince them despite having a good understanding of PySpark? Please suggest
When loading a file to a data frame you get an OOM error; how will you rectify it? Can we get a demo?
Waiting for part 2! :🙈
@DataSavvy
3 years ago
Working on it... Will post in a few weeks. I need to explain one related concept first before that video
I have one doubt: are reserved memory and YARN overhead memory the same? Because reserved memory also stores Spark internals. Thank you for your time.
Nice video again Harjeet :). Hey, can you make videos on test cases in Spark/Scala as well? I have seen no one talk about it.
@DataSavvy
3 years ago
Hi Ravi, test cases are generally functional and use-case specific...
@rajlakshmipatil4415
3 years ago
Ravishankar, maybe you can try using holdenkarau
@DataSavvy
3 years ago
Thanks for suggesting... Looks like a good resource... I will go through this github.com/holdenk/spark-testing-base
Hi all, I just got to know about the wonderful videos on the DataSavvy channel. In the executor OOM - big partitions slide: in Spark, every partition is of block size only, right (128 MB)? Then how can a big partition cause an issue? Can someone please explain this? A little confused here. Even if there is a 10 GB file, when Spark reads it, it creates around 80 partitions of 128 MB. Even if one of the partitions is large, it cannot exceed 128 MB, right? Then how does OOM occur?
Hello. I have 16 crore records on which I want to use a window function. But order by is taking a huge time and giving memory issues. Is there any alternative approach?
Good videos
@DataSavvy
3 years ago
Thanks Ravi :)
Dear Data Savvy, could you please clarify: if we go for a broadcast join, it copies the small file into all available executors' memory, right? How come it causes a driver out-of-memory exception?
@DataSavvy
3 years ago
That file is first brought to the driver and merged (if it has multiple partitions), then it is sent to the executors
@ANUKARTHIM
3 years ago
@@DataSavvy Thanks for the answer
@DataSavvy
3 years ago
Thanks
@svsvikky
3 years ago
@@DataSavvy Isn't broadcast done executor-to-executor, similar to BitTorrent? Please correct me if I am wrong
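For reference, the driver-side step described in this thread can be sketched in plain Python (illustrative names, not Spark APIs): the small table's partitions are first collected and merged on the driver, and only then does each executor receive a copy. If the "small" table is not actually small, it is the merge step that fails.

```python
# Plain-Python sketch of why a broadcast join can OOM the *driver*.
# Pretend each inner list is one partition of the small table.
small_table_partitions = [
    [(1, "a"), (2, "b")],
    [(3, "c")],
    [(4, "d"), (5, "e")],
]

# Step 1: the driver collects and merges every partition -- this is the
# step that blows up if the broadcast side is too large.
merged_on_driver = {}
for part in small_table_partitions:
    for key, value in part:
        merged_on_driver[key] = value

# Step 2: each executor receives a full copy of the merged table.
executors = [dict(merged_on_driver) for _ in range(3)]

print(len(merged_on_driver))                      # 5 rows held on the driver
print(all(e == merged_on_driver for e in executors))  # True
```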
Nice video. Question: when we call coalesce(1), does it cause any OOM issues in either the driver or the executor? If calling this operation does not throw any OOM, what could be the reason? Please clarify.
@DataSavvy
3 years ago
You are right... Coalesce can also cause a memory breach in a few situations...
@kiranmudradi26
3 years ago
@@DataSavvy Thanks. In that case the OOM will happen at the executor side, not the driver side. Is my understanding correct?
@DataSavvy
3 years ago
Yes...
@DataSavvy
3 years ago
Wait... A correction here... repartition(1) can cause the issue, not coalesce(1), as coalesce will not cause a shuffle and data will stay on the same machines...
@kiranmudradi26
3 years ago
@@DataSavvy Thanks. I was about to ask the same question; you replied in time. Kudos
Question: if I use PySpark, do I still get these errors? Another question: instead of collect, what other command can we use?
@vikaschavan6118
3 years ago
saveAsTextFile instead of collect
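The idea behind the suggestion above, sketched in plain Python (illustrative, no Spark required): a save-style action lets each partition be streamed to storage one at a time, so peak memory is one partition rather than the whole dataset on the driver.

```python
import os
import tempfile

# Pretend each inner list is one partition of an RDD/DataFrame.
partitions = [[f"row-{p}-{i}" for i in range(100)] for p in range(4)]

out_dir = tempfile.mkdtemp()
for idx, part in enumerate(partitions):
    # Each "task" writes its own part file, the way saveAsTextFile does;
    # only one partition is in memory at a time, never the full dataset.
    path = os.path.join(out_dir, f"part-{idx:05d}")
    with open(path, "w") as f:
        for row in part:
            f.write(row + "\n")

print(sorted(os.listdir(out_dir)))
# ['part-00000', 'part-00001', 'part-00002', 'part-00003']
```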
Spark on Kubernetes works completely differently. This works only for Spark on Hadoop.
Where is the second part ?
Issue: container killed by YARN, Spark application exited with code 1. This is the most common issue in AWS Glue or any Spark job. Increasing spark.yarn.executor.memoryOverhead and spark.executor.memory will help, but make sure the total doesn't exceed the yarn.nodemanager memory, or else there will be a configuration issue.
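A note on the property names in the comment above (hedged sketch; the values and your_job.py are illustrative): since Spark 2.3 the supported name is spark.executor.memoryOverhead, with spark.yarn.executor.memoryOverhead kept as the deprecated alias, and executor memory plus overhead must fit within yarn.nodemanager.resource.memory-mb on each node.

```shell
# Illustrative values only -- executor memory + overhead must stay within
# yarn.nodemanager.resource.memory-mb on each node, or YARN kills the container.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=2g \
  your_job.py
```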
Why use RDDs in all the questions? Why not DataFrames?
groupByKey can also cause out of memory, right?
@DataSavvy
3 years ago
You are right... If there is skewness in the data, in the case of groupByKey we can end up facing memory issues
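The skew point can be sketched in plain Python (illustrative, no Spark required): a groupByKey-style task must hold every value of a hot key in memory at once, while a reduceByKey-style task keeps only a running aggregate per key, no matter how skewed the key is.

```python
from collections import defaultdict

# Skewed data: one hot key with 10_000 values, one cold key with 10.
pairs = [("hot", 1)] * 10_000 + [("cold", 1)] * 10

# groupByKey-style: every value for a key is materialised in one list,
# so the task handling "hot" must hold all 10_000 values at once.
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# reduceByKey-style: values are combined as they arrive, so memory per
# key is a single running total regardless of skew.
reduced = defaultdict(int)
for k, v in pairs:
    reduced[k] += v

print(len(grouped["hot"]), reduced["hot"])  # 10000 10000
```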
How would we know which file is small and which file is larger? An interviewer asked me this question.
@DataSavvy
6 months ago
You can list the files in the folder and see the size of each file... hdfs dfs -ls is the command
@user-dl3ck6ym4r
6 months ago
thank you@@DataSavvy
@user-dl3ck6ym4r
6 months ago
But I am using an S3 bucket, so... @@DataSavvy
Part 2??
I can't join your WhatsApp group. I am facing some issues on my local machine while setting up Spark; please let me know where to post my query.
@DataSavvy
3 years ago
Please join the telegram group and send your query there... We have moved to Telegram... http://t.me/bigdata_hkr
@amitpadhi2717
3 years ago
@@DataSavvy I already dropped a mail to aforalgo@gmail.com; could you please check the issue I faced?
I am learning the concepts, but without real-time experience I am not able to practice data collection from various sources. I am able to clean the data well using PySpark and can do ML using Spark ML via the MLlib library. But please suggest some sources to practice data collection from various sources. Thank you
@DataSavvy
3 years ago
Sure, let me look into this and I will share some links... You can join our document library and Data Savvy group... You will get a lot of relevant information there
How to avoid the collect operation?
@DataSavvy
6 months ago
You usually don't need collect... Can you give an example where you are using it? I can suggest how to avoid it and code it correctly
The whatsapp group is full
@DataSavvy
3 years ago
Yes... Please join telegram group
@RajuSharma-qd2uv
A year ago
Can you please share your telegram group name?
Who is the person who disliked this video? I think he is frustrated with life or wife... 😀😀😀
@DataSavvy
3 years ago
Ha ha ha 😀