Master Reading Spark DAGs

Spark Performance Tuning
In this tutorial, we dive deep into the core of Apache Spark performance tuning by exploring the Spark DAG (Directed Acyclic Graph).
We cover the DAG for a range of operations: reading files, narrow and wide transformations with examples, and aggregations using groupBy count and groupBy count distinct. We also look at the differences between sort merge and broadcast joins, and analyze the DAG from different perspectives with practical examples.
This video is a treasure trove for both beginners and experienced Spark users looking to optimize their code and understand the inner workings of Apache Spark. We examine the DAG, input batches, and partitions in detail, understand the significance of metadata, and explore how Spark optimizes the execution of jobs and stages.
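As a rough illustration of the operations covered, here is a minimal PySpark sketch; the file paths and column names are placeholders, not the exact ones used in the video or the GitHub repo:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-dag-walkthrough").getOrCreate()

# Placeholder datasets; the video uses its own transactions/customers files.
transactions = spark.read.parquet("data/transactions.parquet")
customers = spark.read.parquet("data/customers.parquet")

# Narrow transformation: filter needs no shuffle, so it stays within one stage.
high_value = transactions.filter(F.col("amount") > 100)

# Wide transformation: groupBy().count() adds a shuffle ('Exchange') to the DAG.
txn_counts = transactions.groupBy("customer_id").count()

# Join: sort merge join by default; a small enough customers table may be broadcast instead.
joined = transactions.join(customers, on="customer_id", how="inner")

txn_counts.show()   # an action, which is what actually triggers jobs, stages, and tasks
joined.explain()    # print the physical plan to compare against the Spark UI DAG
```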
📄 Complete Code on GitHub: github.com/afaqueahmad7117/sp...
🎥 Full Spark Performance Tuning Playlist: • Apache Spark Performan...
🎥 Link to Spark Query Plan Video: • Master Reading Spark Q...
🔗 LinkedIn: / afaque-ahmad-5a5847129
Chapters:
00:00 Introduction
00:34 Module imports
00:51 Topics covered
01:54 Spark DAG for Reading a file
07:36 DAG for Narrow transformations
11:17 Wide transformations introduction
11:24 DAG for Sort Merge join (wide transformation)
18:30 DAG for Broadcast join (narrow transformation)
20:15 DAG for Aggregations Group by count (wide transformation)
24:41 DAG for Aggregations Group by sum (wide transformation)
25:44 DAG for Aggregations Group by count distinct (wide transformation)
#ApacheSpark #SparkPerformanceTuning #DataEngineering #SparkDAG #SparkOptimization

Comments: 56

  • @HarbeerKadian-m3u
    @HarbeerKadian-m3u · 17 days ago

    Amazing. This is just too good. Will share with my team also.

  • @gabriells9074
    @gabriells9074 · 7 months ago

    This is probably the best explanation I've seen on Spark DAGs. Please keep up the amazing content! Thank you.

  • @niladridey9666
    @niladridey9666 · 11 months ago

    Again, in-depth content. Thanks a lot. Please discuss a scenario-based question on today's topics.

  • @OmairaParveen-uy7qt
    @OmairaParveen-uy7qt · 11 months ago

    Explained so well!! Crystal clear!

  • @Fullon2
    @Fullon2 · 11 months ago

    Nice series about performance, waiting for more videos, thanks.

  • @saravananvel2365
    @saravananvel2365 · 11 months ago

    Amazing explanation. Waiting for more videos from you.

  • @ManaviVideos
    @ManaviVideos · 11 months ago

    It's a really informative session, thank you!!

  • @afaqueahmad7117
    @afaqueahmad7117 · 11 months ago

    🔔🔔 Please remember to subscribe to the channel folks. It really motivates me to make more such videos :)

  • @lunatyck05
    @lunatyck05 · 9 months ago

    Done - awesome videos, will watch the rest of the series. Would be great to get some Databricks-oriented videos too when possible.

  • @BuvanAlmighty
    @BuvanAlmighty · 8 months ago

    Beautiful content. Crystal clear explanation. Thank you for doing this. ❤❤

  • @afaqueahmad7117
    @afaqueahmad7117 · 8 months ago

    Glad you enjoyed it!

  • @yuvrajyuvas4730
    @yuvrajyuvas4730 · 5 months ago

    Bro, can't thank you enough... This is exactly what I was looking for... Thanks a ton bro... 🎉

  • @nayanroy13
    @nayanroy13 · 10 months ago

    awesome explanation.👍

  • @user-ye6ke9er9d
    @user-ye6ke9er9d · 4 months ago

    Doing fantastic work bro.... Keep this up 💪❤

  • @CharanSaiAnnam
    @CharanSaiAnnam · 4 months ago

    Very good explanation, thanks. You earned a new subscriber.

  • @zulqarnainali6560
    @zulqarnainali6560 · 11 months ago

    beautifully explained!

  • @varunparuchuri9544
    @varunparuchuri9544 · 2 months ago

    @Afaque as usual, amazing video bro. It's been more than a month and we are dying waiting for videos from you.

  • @afaqueahmad7117
    @afaqueahmad7117 · 2 months ago

    A new playlist coming soon brother :)

  • @pachamattlajyothi249
    @pachamattlajyothi249 · 21 days ago

    @@afaqueahmad7117 waiting

  • @CoolGuy
    @CoolGuy · 9 months ago

    Done with the second video on this channel. See you tomorrow again.

  • @i_am_out_of_office_
    @i_am_out_of_office_ · 4 months ago

    very well explained!!

  • @Learner1234-hv4be
    @Learner1234-hv4be · 3 months ago

    Great explanation bro, thanks for the great work you are doing.

  • @afaqueahmad7117
    @afaqueahmad7117 · 2 months ago

    Thank you @Learner1234-hv4be, really appreciate it :)

  • @balakrishna61
    @balakrishna61 · 2 months ago

    Nice explanation. Great work. Thank you. Liked and subscribed.

  • @afaqueahmad7117
    @afaqueahmad7117 · 2 months ago

    Thank you @balakrishna61, appreciate it :)

  • @muhammadhassan1640
    @muhammadhassan1640 · 10 months ago

    Excellent bro

  • @viswanathana3759
    @viswanathana3759 · 3 months ago

    Amazing content. Keep it up

  • @afaqueahmad7117
    @afaqueahmad7117 · 3 months ago

    Appreciate it :)

  • @ankursinhaa2466
    @ankursinhaa2466 · 9 months ago

    Thank you bro!! Your videos are very informative and helpful. Can you please make one video explaining setting up Spark on a local machine? That will be very helpful.

  • @afaqueahmad7117
    @afaqueahmad7117 · 9 months ago

    Thanks @ankursinhaa2466, videos on deployment (local and cluster) coming soon :)

  • @pratiksatpati3096
    @pratiksatpati3096 · 11 months ago

    Superb ❤

  • @subaruhassufferredenough7892
    @subaruhassufferredenough7892 · 7 months ago

    Could you also do a video on Spark SQL and how to read DAGs/Execution Plans for that? Amazing video btw, subscribed!!

  • @afaqueahmad7117
    @afaqueahmad7117 · 7 months ago

    Hey @subaruhassufferredenough7892, thank you for the kind words, really appreciate it :) On Spark SQL: the DAGs/execution plans for both Spark SQL and non-SQL (Python) code are the same, as they are compiled/optimized by the same underlying engine, the Catalyst optimizer.
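A quick way to see this for yourself is to compare the plans directly; a sketch reusing the hypothetical `spark` session and `transactions` DataFrame from the snippet in the description above:

```python
# Register the DataFrame as a temp view so the same query can be written in SQL.
transactions.createOrReplaceTempView("transactions")

sql_df = spark.sql(
    "SELECT customer_id, COUNT(*) AS cnt FROM transactions GROUP BY customer_id"
)
api_df = transactions.groupBy("customer_id").count()

# Both queries pass through the same Catalyst optimizer, so the physical
# plans printed below are typically identical.
sql_df.explain()
api_df.explain()
```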

  • @SHUBHAM_707
    @SHUBHAM_707 · 1 month ago

    Please make a dedicated video on shuffle partitions... how they behave when increased or decreased from 200.

  • @afaqueahmad7117
    @afaqueahmad7117 · 1 month ago

    Hey @SHUBHAM_707, have you watched this - kzread.info/dash/bejne/o2WA1qSOj8bHYpM.html

  • @tejasnareshsuvarna7948
    @tejasnareshsuvarna7948 · 14 days ago

    Thank you very much for the explanation. But I want to know what is your source of knowledge. Where do you learn these things?

  • @jdisunil
    @jdisunil · 6 months ago

    Your expertise and explanations are like "filtered gold in one can". Can you make a quick video on AQE in depth please? 1000 thanks.

  • @afaqueahmad7117
    @afaqueahmad7117 · 6 months ago

    Thanks @jdisunil for the kind words. There's already an in-depth video on AQE. You can refer here: kzread.info/dash/bejne/lIaeuMNwfcrZcrA.html

  • @tahiliani22
    @tahiliani22 · 3 months ago

    At 16:49, as part of the AQE plan for the larger dataset, the way I understood it is that 1 skewed partition was split into 12, so we finally had 24 + 12 = 36 partitions. We see the same on Job Id 9 at 13:40, which had 36 tasks. But I heard you say that 36 partitions have been reduced to 24. Can you please help clear the confusion? Thank you.

  • @rambabuposa5082
    @rambabuposa5082 · 3 months ago

    I think in that AQE step, AQEShuffleRead reads 200 partitions (as per the previous node) from the customers dataset, then coalesces them to 24, and then something happens that makes them 36, which is why the right-side node shows "number of partitions 36". On the left side, for the transactions dataset, "number of partitions 36" appears as the last value, whereas on the right side, for the customers dataset, it appears as the first value. But I'm not sure what that "something" is???

  • @ComedyXRoad
    @ComedyXRoad · 2 months ago

    thank you

  • @afaqueahmad7117
    @afaqueahmad7117 · 2 months ago

    Appreciate it :)

  • @TechnoSparkBigData
    @TechnoSparkBigData · 11 months ago

    Thanks for this. When is the next video coming sir?

  • @afaqueahmad7117
    @afaqueahmad7117 · 10 months ago

    Coming soon in the next few days! :)

  • @rambabuposa5082
    @rambabuposa5082 · 3 months ago

    Hi Afaque Ahmad, at 7:24 you were saying that a batch is a group of rows and it's not the same as a partition. Shall we assume something like a group of rows read from one or more partitions available on one or more executors (not from all executors) to match that df.show() count?

  • @rambabuposa5082
    @rambabuposa5082 · 3 months ago

    Hi Afaque Ahmad, at 13:37 you were saying that there is a separate job for each shuffle operation: one job for the transactions dataset shuffle and one for the customers dataset. I'm a bit confused about why they need a separate job. As per my understanding, when Spark encounters a shuffle operation, it just creates a new stage within that job, right? When I execute the same code snippet, it creates 5 jobs in total: two for metadata (expected), two for the shuffle operations (not expected), and a final one for the join operation. Many thanks.

  • @RishabhTakkar-o6l
    @RishabhTakkar-o6l · 10 days ago

    How do you access this Spark UI?

  • @tandaibhanukiran4828
    @tandaibhanukiran4828 · 4 months ago

    Hello bro, I have a doubt. At the "23:30 min" mark, it was mentioned that AQEShuffleRead coalesced partitions into 1; then will the other worker nodes sit idle? In the video it is mentioned that even after the shuffle, all A's will be in one partition and B's in another. Can you please explain what you actually mean by "Number of Coalesced Partitions = 1"?

  • @abdulraheem2874
    @abdulraheem2874 · 10 months ago

    Bhai, can you make a video on Spark architecture as well, for beginners?

  • @afaqueahmad7117
    @afaqueahmad7117 · 10 months ago

    Yes bhai, it will come in some time :)

  • @satheeshkumar2149
    @satheeshkumar2149 · 5 months ago

    While stages are created whenever a shuffle occurs, how are jobs created?

  • @afaqueahmad7117
    @afaqueahmad7117 · 5 months ago

    Hey @satheeshkumar2149, jobs are created whenever an action is invoked. Examples of actions in Apache Spark are collect() and count().
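To make the distinction concrete, a small sketch (assuming the hypothetical `transactions` DataFrame with an `amount` column from the snippet in the description):

```python
from pyspark.sql import functions as F

# Transformations are lazy: no job is created by this line.
filtered = transactions.filter(F.col("amount") > 100)

# Each action below triggers at least one job, visible in the Spark UI's Jobs tab.
filtered.count()     # one job
filtered.collect()   # another job
filtered.show(5)     # show() may itself run one or more jobs to fetch the first rows
```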

  • @satheeshkumar2149
    @satheeshkumar2149 · 5 months ago

    @@afaqueahmad7117, but in some cases we have more than one job being created. This is where I find it difficult to understand.

  • @user-dx9qw3cl8w
    @user-dx9qw3cl8w · 8 months ago

    Why are shuffle partitions made 200 when we have only 13 partitions max, at 14:55?

  • @afaqueahmad7117
    @afaqueahmad7117 · 8 months ago

    By default, shuffle partitions are 200, hence you see that in the 'Exchange' step. The reduction (optimization) to fewer partitions takes place in the 'AQEShuffleRead' step below.
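For reference, a short sketch of the settings involved (the value 64 is illustrative; assumes an active `spark` session):

```python
# Default number of partitions produced by a shuffle (the 'Exchange' step).
print(spark.conf.get("spark.sql.shuffle.partitions"))   # '200' unless overridden

# Lower it explicitly for smaller data...
spark.conf.set("spark.sql.shuffle.partitions", "64")

# ...and/or let AQE coalesce small shuffle partitions (the 'AQEShuffleRead' step).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```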

  • @NiranjanAnandam
    @NiranjanAnandam · 1 month ago

    No clarity is provided on when a job is created. Stages are the result of shuffles, and a task is just a unit of execution.