Spark Job, Stages, Tasks | Lec-11
In this video I talk about how jobs, stages, and tasks are created in Spark. If you want to optimize your Spark workloads, you need a solid understanding of these concepts.
Directly connect with me on:- topmate.io/manish_kumar25
Comments: 157
In Apache Spark, the spark.read.csv() method is neither a transformation nor an action; it initiates the reading of CSV data into a Spark DataFrame as part of the data-loading phase of Spark's processing model. The actual reading and processing of the data occur later, driven by Spark's lazy evaluation model.
@ChetanSharma-oy4ge
4 months ago
Then how are jobs being created from actions? I mean, if the number of actions equals the number of jobs, what is the better way to find out?
@roshankumargupta46
A month ago
If you don't explicitly provide a schema, Spark will read a portion of the data to infer it. This triggers a job to read data for schema inference.
Thanks, Manish bhai... please keep making these videos.
Wow, what a clear explanation; first time I understood it in one go.
I really liked this video... nobody has explained it at this level.
Bro, that was a next-level explanation... thanks for sharing your great knowledge. Keep up the good work. Thanks.
Didn't find such a detailed explanation anywhere else. Kudos!
Nice explanation; you make each and every concept clear. Keep it up.
This was so beautifully explained.
One of the best videos ever. Thank you for this. Really helpful.
Great explanation ❤
Explained so well, and bit by bit at that 👏🏻
Really awesome explanation! You won't find an explanation like this anywhere else. Thank you so much.
Great Manish. I am grateful to you for making such rare content with so much depth. You are doing a tremendous job by contributing towards the Community. Please keep up the good work and stay motivated. We are always here to support you.
glad to see you brother
great job bro, you are doing well.
Very well explained bhai.
Bhai, I was eagerly waiting for your videos.
Your channel will grow immensely bro. Keep it up❤
nice explanation.
Nicely explained
Bhai, you explain things really well.
Very good content. Please make detailed videos on Spark job optimization.
Very good explanation, bro.
Thank you so much Manish
excellent!!
thank you sir..
VERY VERY HELPFUL
Bhaiya, you are great.
Start a playlist with guided projects, so that we can apply these things in real life.
1 job for read, 1 job for print 1, 1 job for print 2, 1 job for count, 1 job for collect: total 5 jobs according to me, but I have not run the code, so I'm not sure.
Wow!
I remember wide dependency you explained in shuffling.
Bro amazing explanation
Glad to see you, Manish... Bhai, any update on the project details?
Bro, how many more days will the Spark series take, and will you make a complete DE project with Spark at the end? BTW, I watched and implemented all your theory and practical videos. Great work ❤
If you don't explicitly provide a schema, Spark will read a portion of the data to infer it. This triggers a job to read data for schema inference. If you disable schema inference and provide your own schema, you can avoid the job triggered by schema inference.
@arpanscreations6954
A day ago
Thanks for your clarification
One question: after groupBy, by default 200 partitions will be created, where each partition holds the data for individual keys. What happens if there are fewer keys, say 100? Will that lead to only 100 partitions instead of 200? And what happens if there are more than 200 distinct keys; will it create more than 200 partitions?
❤
What if spark.sql.shuffle.partitions is set to some value? In that case, what will be the number of tasks in the groupBy stage?
Num Jobs = 2; Num Stages = 4 (job 1 = 1, job 2 = 3); Num Tasks = 204 (job 1 = 1, job 2 = 203)
I have one doubt: there are 3 actions there (read, collect and count), so why is it creating only 2 jobs?
@tejathunder
2 months ago
In Apache Spark, the read operation is not considered an action; it is a transformation.
Can an executor have 2 partitions, or do partitions mean the data will be on two different machines?
Are there 200 default tasks in groupBy even if there are only 3 distinct ages? If so, what will be in the rest of the 197 tasks (which age group will they hold)?
I had one question: what should the order of operations be? I mean, if we are doing filter/select/partition/groupBy/distinct/count or anything else, what should be written first?
Hi Manish, in the second job there were 203 tasks and in the first job there was 1, so are there 204 tasks in total in the complete application? I am a bit confused between 203 and 204. Kindly clarify.
@manish_kumar_1: correction: job 2, stage 2 runs till the groupBy, and job 2, stage 3 runs till the collect.
Hi sir, I have one doubt: will collect create a task and a stage or not? Because you mentioned 203 tasks.
.count on a grouped dataset is a transformation, not an action, correct? If it were employee_df.count(), then it would be an action.
Hi bhaiya, why haven't we considered collect() as a job creator in the program you discussed?
Hi @manish_kumar_1, I have one question: in the wide transformation you said that the groupBy stage 3 will have 200 tasks, corresponding to the 200 partitions. But can you tell me why these 200 partitions were created in the first place?
What command did you use to run the job?
Hi Manish, count() is also an action, right? If not, can you please explain what count() does?
Sir, please make playlists for other data engineering tools also.
I have been waiting for your video. How many more days will it take to complete the Spark series?
Bhai, job 2 starts from collect, right? So from after the read up to the collect, only job 1 runs, right?
Thanks Sir
Can I ask about executors? How many executors will there be in a worker node? And does the number of executors depend on the number of cores in the worker node?
I tried the same example and in my case 4 jobs are created. Is there any other config needed?
One query, @Manish: spark.read is a transformation and not an action, right?
What would happen if there were one more narrow transformation after the filter, like filter -> flatMap -> select? How many tasks would that create?
After repartition(3), will the DAG still show the 200 default partitions, sir?
Nice explanation 👌 👍 👏 but can you please explain in English? Then everyone all over the world 🌎 ✨ can watch.
Manish bhai, how many videos in total will the theory and practical series have?
At 7:40, print is an action, so there should be 4 jobs in the given code, right? Correct me if I am wrong.
Count is also an action, so shouldn't there be 3 jobs?
Hi Manish, in Databricks also, when groupBy() is invoked, does it create 200 tasks by default? How can we reduce those 200 tasks when using groupBy() to optimize a Spark job?
@manish_kumar_1
A year ago
There is a configuration that can be set. Just google how to set a fixed number of partitions after a join.
Count is also an action.
@manish_kumar_1
2 months ago
And a transformation too.
Do you have English versions of your videos?
Sir, in line 14 we have .groupBy and .count. Count is an action, right? Not sure if you missed it by mistake or if it doesn't count as an action 🙁
@tanmayagarwal3481
4 months ago
I had the same doubt. Did you get the answer to this question? As per the UI also, it mentions only 2 jobs, whereas count should be an action :(
Please enable subtitles in all videos, bro.
When I ran it in a notebook, it gave 5 jobs like below for this snippet of code, not only 2. Can you explain?
Job 80 (Stages: 1/1): Stage 95: 1/1
Job 81 (Stages: 1/1): Stage 96: 1/1
Job 82 (Stages: 1/1): Stage 97: 1/1
Job 83 (Stages: 1/1, 1 skipped): Stage 98: 0/1 skipped; Stage 99: 2/2
Job 84 (Stages: 1/1, 2 skipped): Stage 100: 0/1 skipped; Stage 101: 0/2 skipped; Stage 102: 1/1
Bhai, I'm using Spark 3.4.1, and when I group data using groupBy (I have 15 records in a dummy dataset) it creates 4 jobs to process 200 partitions. Why? Is this a recent enhancement? I observed the same thing not only in the latest version but also in Spark 3.2.1. Could you please explain?
@villagers_01
27 days ago
Bhai, a job is created on an action, but the number of actions != the number of jobs, because a job is created whenever a new RDD is needed. In groupBy we need to shuffle the data, and since RDDs are immutable, a new RDD has to be created after the shuffle. So whenever a new RDD is needed, a job is created. In your case: 1 job for the read, 1 job for the schema, 1 job for the shuffle, and 1 job for the display.
Could you please upload the required files? I just want to run it and see for myself.
Why not stage 4? You said each job has at least one stage and one task, so why doesn't job 3 get a stage and a task?
@manish_kumar_1 In the previous videos you described count() as an action, but in this video you are not treating it as an action. Why?
I have a question: if one job has 2 consecutive wide transformations, then 1 narrow transformation, and then another wide one, how many stages will be created? Suppose repartition, then groupBy, then filter, and then join: how many stages will this create?
@jaisahota4062
7 days ago
Same question.
Hi Manish, I have a doubt: in groupBy, count is also an action, so why is it not counted as an action?
@manish_kumar_1
A month ago
It will become clear in some of the upcoming videos.
Hi Manish, count is also an action, and you have written count just after the groupBy in the code snippet; why is count not considered a job here?
@manish_kumar_1
11 months ago
You will get the explanation in upcoming videos. Count is both an action and a transformation; when it behaves as which is explained in detail in later videos.
So whenever there is a groupBy, should we assume 200 tasks?
@manish_kumar_1
2 months ago
Yes, if AQE is disabled. If it is enabled, then the count depends on the data volume and the default parallelism.
Why was count() not counted as an action while counting the jobs?
@manish_kumar_1
A month ago
You will understand in a later lecture.
No problem bhaiya, just please finish this series completely, because nobody on YouTube has explained it in this much detail.
Hi Manish, will count() not be considered an action?
@manish_kumar_1
8 months ago
Count is both an action and a transformation. It will become clear in later lectures.
Please enable subtitles, bro.
Where do I get the file's data? Where have you shared the data?
In the code snippet shown at the start, count is also a job, right?
@manish_kumar_1
A month ago
No, you will find out why in later lectures.
Hi Manish, why doesn't the collect() method create a new stage (stage 4) in job 2, since it needs to send the data from the 200 partitions to the driver node?
@manish_kumar_1
A year ago
Collect is an action not a transformation
@venumyneni6696
A year ago
@@manish_kumar_1 Thanks for the reply, Manish. What happens after the groupBy in this case? Spark transfers the data in 200 partitions to the driver, right? Don't we need any tasks for that process? Thanks in advance.
@manish_kumar_1
A year ago
@@venumyneni6696 I think you are missing some key points about the driver and executors. Please clear up your basics: read multiple blogs or watch my videos in sequence.
@AAMIRKHAN-qy2cl
7 months ago
Correct, but you said that one stage will be created for every action, so the total stages should be 5, @@manish_kumar_1
The number of actions equals the number of jobs. In the mentioned code snippet there were three actions (read, count, collect). As per the theory, three job IDs should be created, but in the Spark UI only two jobs are created. Can you help me with this?
@rajukundu1308
A year ago
Why were three job IDs not created?
@manish_kumar_1
A year ago
Count is both a transformation and an action. In the given example it works as a transformation, not as an action. I will be uploading the aggregation video soon; there you will learn more about count's behavior.
@rajukundu1308
A year ago
@@manish_kumar_1 Thanks for the prompt response. Sure, eagerly waiting for your new video.
I executed the same job, but it created 4 jobs, each with 1 stage and 1 task. I think it created a new job for every wide transformation. Please confirm and guide.
@villagers_01
27 days ago
Bhai, a job is created on an action, but the number of actions != the number of jobs, because a job is created whenever a new RDD is needed. In groupBy we need to shuffle the data, and since RDDs are immutable, a new RDD has to be created after the shuffle. So whenever a new RDD is needed, a job is created.
Bro, if I have created 3 partitions, then will it create 3 tasks? Is that correct?
@manish_kumar_1
A year ago
Yes
Manish bhai, count() is also an action, right? Why wasn't a separate job created for it?
@manish_kumar_1
9 months ago
I have already explained in a lecture why count shows 2 kinds of behavior.
@amlansharma5429
9 months ago
@@manish_kumar_1 Sorry, I've fallen a bit behind... I'll catch up soon ♥️
Hi, thank you for the detailed video. I tried the same code on my laptop, but I get 5 jobs in total (0 to 4), and the final task count is just 1 even though the default partition setting is 200. Could you please explain why I see multiple jobs in the DAG and a final task count of just 1? Thank you once again.
@user-zd7fg5to7g
A year ago
And I also get the same output on Databricks, GCP and PyCharm.
@manish_kumar_1
A year ago
First of all, use collect in place of show. In the case of show, I have seen it report skipped jobs in the count. And why it isn't working even after setting 200, I am not sure.
@rpraveenkumar007
A year ago
Experiencing the same.
@manish_kumar_1
A year ago
@@rpraveenkumar007 Share your code with me.
@mahnoorkhalid6496
9 months ago
@@manish_kumar_1 I am having the same issue: with the same code it created 4 jobs, each with 1 stage and 1 task. Maybe it's because of the wide transformation.
Around 6:35 it is wrong: read can avoid acting as an action if you pass a manually created schema containing a list of all the columns. Refer to the practical video: kzread.info/dash/bejne/iICdm7mMaLawdrw.html
Where do I get the Spark practical videos?
@manish_kumar_1
3 months ago
There is a playlist called Spark Practical.
@prachideokar7639
2 months ago
Hello, I want 1:1 communication... is it possible? I have some doubts related to a data engineering project. @@manish_kumar_1
Sir, count() is also an action, right?
@manish_kumar_1
8 months ago
You will find out in later lectures. It is both an action and a transformation.
Bhai, it shows me 3 jobs, 3 stages and 4 tasks: Job 0 for the load with 1 task, Job 1 for collect with 2 tasks, and Job 3 for collect with 1 task, but shown as skipped. I don't get what's wrong; I used the same code but different data, about 7 MB in size.
@manish_kumar_1
8 months ago
No need to be too rigid here. Spark does a lot of optimization, and sometimes some jobs get skipped. I did this in a controlled environment to show how it works. During project development you are not going to count how many jobs, stages or tasks there are, so even if you don't get the same numbers, just chill.
Count is also an action, so why was no job created for it?
@manish_kumar_1
2 months ago
Count is both an action and a transformation. You will find out in the later lectures.
So can we say an action and a stage are the same?
@manish_kumar_1
11 months ago
Then why did I teach this at all 🤔 In short, no: an action and a stage are 2 different things. Please watch the video carefully.
203 tasks or 204 tasks? Please clear this up.
Bhai, count() is also an action, right?
@manish_kumar_1
8 months ago
It is both an action and a transformation. You will find this in later videos.