RDD vs DataFrame vs Datasets | Spark Tutorial Interview Questions

As part of our Spark interview question series, we want to help you prepare for your Spark interviews. We will discuss various topics about Spark such as lineage, reduceByKey vs groupByKey, YARN client mode vs YARN cluster mode, etc. In this video we cover the
difference between RDD, DataFrame, and Dataset.
Please subscribe to our channel.
Here is link to other spark interview questions
• Spark Interview Questions
Here is link to other Hadoop interview questions
• 1.1 Why Spark is Faste...
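As a quick sketch of the three abstractions compared in the video (the session setup and sample data here are illustrative, not from the video itself):

```scala
import org.apache.spark.sql.SparkSession

object AbstractionsDemo {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-df-ds").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD: low-level, object-oriented API; no Catalyst optimization
    val rdd = spark.sparkContext.parallelize(Seq(("Ann", 30), ("Bob", 25)))

    // DataFrame: Dataset[Row]; schema known, Catalyst + Tungsten optimize queries
    val df = rdd.toDF("name", "age")

    // Dataset: typed rows; same optimizer, plus compile-time type checking
    val ds = df.as[Person]

    ds.filter(_.age > 26).show()
    spark.stop()
  }
}
```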

Comments: 74

  • @ganeshdhareshwar6053 · 4 years ago

    Nicely explained. Thank you for your effort in gathering this information and publishing it. Much-needed videos.

  • @apekshatrivedi8689 · 3 years ago

    Very nice explanation. Your videos really help me while preparing for interviews. Highly recommend. Thank you!

  • @bhargavhr1891 · 6 years ago

    Again a very nice video, thanks. It would be great if you provided pseudocode or simple code syntax for each abstraction so the understanding is very clear.

  • @TusharKakaiya · 3 years ago

    Really helpful content. Much appreciated.

  • @rameshgangabathula6221 · 4 years ago

    Nice explanation. Can you please explain, in another video, how to do checkpointing and how to resume a failed Spark job (due to an action/transformation failure or exceeded executor memory)?

  • @someshmungikar4466 · 3 years ago

    cooollll great answer sir... thanks !!!

  • @ravinderkarra3187 · 6 years ago

    DataFrames also serialize the data into off-heap storage in binary format and then perform transformations directly on off-heap memory, since Spark understands the schema. They also provide the Tungsten physical execution back-end, which explicitly manages memory and dynamically generates byte-code for expression evaluation. So memory management is done better here.
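One way to see the Tungsten back-end mentioned here is to print the physical plan of a query; stages marked `WholeStageCodegen` (prefixed with `*(n)`) run as generated byte-code. The session setup and data below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tungsten-plan").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

// The printed physical plan shows which operators were fused into
// Tungsten whole-stage code generation
df.filter($"id" > 1).select($"value").explain()
```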

  • @akp7-7 · 1 year ago

    Yes, so is DataFrame faster compared to Dataset?

  • @RahulRawat-wu1vv · 5 years ago

    Will it serialize the data or deserialize it? As far as I know, deserialization is the conversion of a byte stream to a Java object. Please correct me if I am wrong.

  • @souravsinha5330 · 1 year ago

    Nice and clear explanation. To the point, thanks.

  • @DataSavvy · 1 year ago

    Glad it was helpful!

  • @rahulshandilya880 · 4 years ago

    When to use a DataFrame, when to use a Dataset, when to use an RDD, and when Spark SQL / SparkSession?

  • @shubhamkumar-uz7ux · 5 years ago

    Very informative... just one thing, the voice is too low in the video.

  • @max6447 · 3 years ago

    Thanks your videos are very useful !

  • @DataSavvy · 3 years ago

    Thanks Mayank :)

  • @naresh5273 · 5 years ago

    Thank you. Last time in my interview, the interviewer asked me the same question...

  • @DataSavvy · 5 years ago

    Thanks Kartik... I am happy this content was useful to you... Can you share the other questions asked by your interviewer?

  • @nehabansal677 · 5 years ago

    Great content... Very helpful for interviews

  • @DataSavvy · 5 years ago

    Thanks... Please watch the full Spark interview series.

  • @RajKumar-zw7vt · 5 years ago

    Nice video bro...

  • @DataSavvy · 5 years ago

    Thanks Raj...

  • @arundhingra4536 · 5 years ago

    @Data Savvy - A small correction: at 8:10 you mentioned that we cannot do map, join, and other operations on a DataFrame.

  • @sharathchandra5314 · 5 years ago

    Data Savvy should have said that if we use map, join, and other operations that take higher-order functions, then we forgo optimization by the Spark framework.
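The point about higher-order functions can be illustrated like this: a typed lambda is opaque to Catalyst, while an equivalent `Column` expression can be analyzed and optimized. The session setup and case class below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("lambda-vs-column").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Person("Ann", 30), Person("Bob", 15)).toDS()

// Opaque to the optimizer: Catalyst cannot look inside the JVM lambda,
// so optimizations like predicate pushdown are lost for this step
val viaLambda = ds.filter(p => p.age > 18)

// Transparent to the optimizer: the expression tree is fully analyzable
val viaColumn = ds.filter($"age" > 18)
```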

  • @TheBjjninja · 5 years ago

    Can you fix your volume please?

  • @anilcvs1 · 6 years ago

    Please show some real-time scenarios in the videos.

  • @DataSavvy · 6 years ago

    Hi Anil, thanks for the comment... Give me some examples and I will create a video for them... Please subscribe to the channel.

  • @akshathab.s6751 · 5 years ago

    Hi, real-time scenarios like industry-level data processing. I mean, for performance tuning when there is a large amount of data to process, which component is preferred: DataFrame, Dataset, or RDD? In which situation is which methodology suitable?

  • @Pratik0917 · 4 months ago

    Then why aren't people using Dataset everywhere?

  • @ajaypratap4025 · 5 years ago

    When to use a DataFrame and when to use a Dataset?

  • @owaisshaikh3983 · 3 years ago

    When you need strict data types, use a Dataset; otherwise a DataFrame is more convenient.
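The practical difference can be sketched as follows: with a typed Dataset a field typo fails at compile time, while with a DataFrame a column-name typo only fails at runtime. The session setup and names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("ds-vs-df").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Person("Ann", 30)).toDS()
ds.filter(_.age > 18)      // a typo like _.agee would not even compile

val df = ds.toDF()
df.filter($"agee" > 18)    // same typo only fails at runtime (AnalysisException)
```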

  • @SuperSazzad2010 · 5 years ago

    Hi, please throw some light on the fact that DataFrames make use of Java serialization. Also, what is the use of off-heap?

  • @DataSavvy · 5 years ago

    DataFrame can use Java serialization or Kryo... Off-heap is used for shuffle.
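A minimal sketch of those settings: Kryo as the serializer and off-heap memory enabled (the values shown are illustrative, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("serialization-conf")
  .master("local[*]")
  // Kryo is used for shuffle/cache serialization of JVM objects
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Off-heap memory for Tungsten-managed storage and execution
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "1g")
  .getOrCreate()
```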

  • @akp7-7 · 1 year ago

    @@DataSavvy So does DataFrame use Java serialization, or is it used by Dataset?

  • @rakeshkumarsharma3920 · 5 years ago

    How can we automate incremental import in Sqoop?

  • @lokeshmvs · 5 years ago

    Use "sqoop job --create" to create a job for the incremental import, so that the Sqoop job stores the metadata of the incremental load.

  • @rakeshkumarsharma3920 · 5 years ago

    @@lokeshmvs I am asking: if I have to do an incremental import for 50 tables, and that job executes at midnight, then how do I achieve it? Please let me know with examples.
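One common approach (the table names, connection string, and paths below are hypothetical) is to create one saved Sqoop job per table so the metastore tracks `--last-value` for each, then drive all of them from a cron-scheduled script:

```shell
# One-time setup: a saved incremental-import job per table; Sqoop's
# metastore remembers the last value between runs
for t in customers orders payments; do   # ...extend the list to all 50 tables
  sqoop job --create "incr_${t}" -- import \
    --connect jdbc:mysql://dbhost/sales \
    --username etl --password-file /user/etl/.dbpass \
    --table "${t}" \
    --incremental append --check-column id --last-value 0 \
    --target-dir "/warehouse/sales/${t}"
done

# Nightly runner, scheduled via cron: 0 0 * * * /opt/etl/run_incr.sh
for t in customers orders payments; do
  sqoop job --exec "incr_${t}"
done
```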

  • @chiranjeevikatta8116 · 3 years ago

    I am new to the Spark and big data world. I chose to use/learn PySpark because I am familiar with Python. I got to know that Python is not type-safe and does not support Datasets. Can someone say whether PySpark is used in building real-world applications, or do I need to learn Scala/Java? Thanks. - Great video

  • @DataSavvy · 3 years ago

    PySpark is used for a lot of real-world projects... I have generally seen people doing ML or data-analysis projects using PySpark... Data-ingestion teams use Scala... However, this is not always true... It usually boils down to the comfort level of the developers and the team composition.

  • @chiranjeevikatta8116 · 3 years ago

    @@DataSavvy Thank you. Good work. I took online courses, but I got more clarity after watching your videos.

  • @DataSavvy · 3 years ago

    Thanks Chiranjeevi... Very happy to hear that :)

  • @yeoreumkwon · 5 years ago

    If I understood correctly, PySpark does not support Datasets because Python is not a type-safe language, right?

  • @DataSavvy · 5 years ago

    You are right, my friend... The Dataset philosophy is different from the philosophy of the Python language.

  • @yeoreumkwon · 5 years ago

    @@DataSavvy So I will have to learn Scala to use Spark Datasets. Thank you very much for your effort. I really enjoy your series.

  • @DataSavvy · 5 years ago

    If you are particular about using Datasets, then yes, use a JVM language, Scala or Java... I am happy that you like the series... Please suggest more topics that you are interested in.

  • @ambikaiyer29 · 5 years ago

    Hi - can you please share details on why the Dataset API is not available in Python?

  • @DataSavvy · 5 years ago

    Because Python is not type-safe... and Datasets are type-safe.

  • @vinodmani3900 · 5 years ago

    Thanks @@DataSavvy. I was behind this for some time, wondering why all the APIs in PySpark are for DataFrame. So it means that in PySpark we need to code with DataFrames, right?

  • @luckyomprakash8437 · 5 years ago

    What questions can one ask to check Spark RDD experience?

  • @meswapnilspal · 5 years ago

    The volume is very low.

  • @iftekharkhan3254 · 10 months ago

    Sound is low

  • @dharmendrabhojwani · 5 years ago

    Very low voice.

  • @akashchaudhary6953 · 2 years ago

    Sir, take my interview and make me famous.. 💌

  • @raviyadav-dt1tb · 7 months ago

    Please provide AWS questions and answers.

  • @DataSavvy · 6 months ago

    Planning that

  • @raviyadav-dt1tb · 6 months ago

    @@DataSavvy Can you please provide interview questions on Scala programs? Several times I have been rejected because of the Scala program.

  • @DataSavvy · 6 months ago

    I will add that to my list... Need to work out when I can start that.

  • @raviyadav-dt1tb · 6 months ago

    @@DataSavvy Please do, thank you 🙏

  • @DataSavvy · 6 months ago

    Thank you

  • @alexanderkorchagin67 · 4 years ago

    ERROR! Actually DataFrame, Dataset, RDD - that is the correct order of performance, from most effective to least effective. DF has better performance than DS because it does not use serialization and deserialization when working with the data.

  • @DataSavvy · 4 years ago

    Hi Alexander, excuse me if the explanation was not clear. The message was that DS uses encoders, which are a more efficient way of serializing and deserializing data than Kryo or the default serialization... so shuffle operations will be more efficient, as they involve serialization and deserialization. In general there is not much difference in the performance of DataFrame and Dataset these days... Could you elaborate on DF not using serialization and deserialization? I did not get what you meant there.

  • @akp7-7 · 1 year ago

    @@DataSavvy I recently learnt that in a DF, serialization is managed by the Tungsten binary format (encoders), whereas in a DS, serialization is managed by Java serialization, so DF performance is a little faster than DS.
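For what it's worth, both a DataFrame (which is just `Dataset[Row]`) and a typed Dataset serialize through Tungsten encoders rather than plain Java serialization; the typed API's extra cost comes from materializing JVM objects inside lambdas. The encoder for a case class can be inspected directly (the case class here is illustrative):

```scala
import org.apache.spark.sql.Encoders

case class Sale(id: Long, amount: Double)

// A product encoder converts case-class instances to and from Tungsten's
// binary row format; java.io.Serializable is not involved
val enc = Encoders.product[Sale]
println(enc.schema)  // prints the StructType derived from the case class
```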

  • @knightganesh · 4 years ago

    The voice is very low, please work on it.

  • @DataSavvy · 4 years ago

    You are right... I have improved in the new videos.

  • @5669ashish · 5 years ago

    Bhai, can you make the videos in Hindi? They can be understood in one go.

  • @DataSavvy · 5 years ago

    Not everyone will understand Hindi, bhai... Anyway, what did you not understand in one go? I can help.

  • @sergeibatiuk3468 · 4 years ago

    How many times did he say 'you know'?

  • @mikecmw8492 · 6 years ago

    Please remake this video with real examples. For example, open a Spark 2 REPL and load a file full of data. Show how to create an RDD, DF, and DS. Then show some operations with each. Having just text on the screen will not help in an interview. Most interviews are now hands-on, especially with big data. Thank you.

  • @DataSavvy · 5 years ago

    Sure Mike, will do this... adding this to my next steps... I appreciate these suggestions.

  • @rishabhjain8558 · 3 years ago

    You talk extra, not from the topics. Also, your concepts are not clear. And try to give some examples; you are always showing a PPT and dictating.

  • @DataSavvy · 3 years ago

    Thanks for your wise comments... Will try to improve...