Data Engineering Interview | Apache Spark Interview | Live Big Data Interview

This video is part of the Spark Interview Questions Series.
Many subscribers have asked me to show what an actual Big Data interview looks like. In this video we cover what usually happens in a Big Data / data engineering interview.
There will be more videos covering different aspects of Data Engineering Interviews.
Here are a few links that may be useful to you:
Git Repo: github.com/harjeet88/
Spark Interview Questions: • Spark Interview Questions
If you are interested in joining our community, please join the following groups:
Telegram: t.me/bigdata_hkr
Whatsapp: chat.whatsapp.com/KKUmcOGNiix...
You can drop me an email for any queries at
aforalgo@gmail.com
#apachespark #sparktutorial #bigdata
#spark #hadoop #spark3 #bigdata #dataengineer

Comments: 233

  • @tradingtransformation346
    @tradingtransformation346 1 year ago

    Questions:
    1) Why did you shift from MapReduce development to Spark development?
    2) How is the Spark engine different from the Hadoop MapReduce engine?
    3) What are the steps for Spark job optimization?
    4) What are an executor and an executor core? Explain with reference to processes & threads.
    5) How do you identify that your Hive script is slow?
    6) When do we use partitioning and bucketing in Hive?
    7) Small file problem in Hive? ---> Skewness
    8) How do you improve a high-cardinality issue in a dataset, with respect to Hive?
    9) How do you handle code merging with other teams? Explain your development process.
    10) Again, the small files issue in Hadoop?
    11) Metadata size of Hadoop?
    12) How is Spark differentiated from MapReduce?
    13) Given a class having 3 fields (name, age, salary) and a series of objects created from this class, how do you compare the objects? (I didn't get the question exactly)
    14) Scala: what is === in join conditions? What does it mean?
    Hope it helps!

  • @BigDataWithSky
    @BigDataWithSky 1 year ago

    Yes Thank you 👍

  • @chandanpradhan5389
    @chandanpradhan5389 1 year ago

    Thank you,..

  • @jana-kl7hv
    @jana-kl7hv 1 year ago

    Thank you

  • @vigneshjanarthanan6514
    @vigneshjanarthanan6514 1 year ago

    Question 13 is not clear even to me.

  • @bramar1278
    @bramar1278 3 years ago

    I must really appreciate you posting this interview in the public domain. This is a really good one.. it would be really great to see a video on the process to optimize a job.

  • @Nasirmah
    @Nasirmah 1 year ago

    Thank you guys, you are a big reason why I got a job as an AWS data engineer. Spark and optimizations are the most-asked questions, partitioning and bucketing with Hive as well. I would also add that the interviewers are similar to a real setting, because they usually point you in the direction of the answer they are looking for, so always listen to their follow-ups.

  • @karna9156
    @karna9156 1 year ago

    How are you feeling now? Did you transition your career from some other tech? Do you face complexities in your day-to-day activities?

  • @AhlamLamo
    @AhlamLamo 3 years ago

    Amazing job, really interesting. Thank you for sharing this interview with us.

  • @sukanyapatnaik7633
    @sukanyapatnaik7633 3 years ago

    Awesome video. Thank you for putting this out. It's helpful.

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Sukanya

  • @ramkumarananthapalli7151
    @ramkumarananthapalli7151 3 years ago

    Very much helpful. Thanks a lot for uploading.

  • @MageswaranD
    @MageswaranD 3 years ago

    How do you optimize a job?
    - Check the input data size and output data size and correlate them with the operating cluster memory
    - Check the input partition size, output partition size, and number of partitions along with shuffle partitions, and decide the number of cores
    - Check for disk/memory spills during stage execution
    - Check the number of executors used for the given cluster size
    - Check the available cluster memory and the memory in use by the application/job
    - Check the average run time of all stages in the job, to identify any data-skewed stage tasks
    - Check whether the table is partitioned or bucketed by column
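
    To make the checklist concrete, here is a minimal sketch of where some of those knobs live; the values and input path are illustrative assumptions, not recommendations:

    ```scala
    import org.apache.spark.sql.SparkSession

    object TuningSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative settings only; real values depend on data and cluster size.
        val spark = SparkSession.builder()
          .appName("TuningSketch")
          .master("local[*]")
          .config("spark.sql.shuffle.partitions", "400") // shuffle partition count
          .config("spark.executor.memory", "8g")         // memory per executor
          .config("spark.executor.cores", "5")           // cores per executor
          .getOrCreate()

        val df = spark.read.parquet("/data/input")       // hypothetical path
        println(s"Input partitions: ${df.rdd.getNumPartitions}")
      }
    }
    ```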

  • @ankushbisen1185
    @ankushbisen1185 2 years ago

    perfect!

  • @naraharisettysai5059
    @naraharisettysai5059 2 years ago

    Great..

  • @shaikshavalikadapa619
    @shaikshavalikadapa619 1 year ago

    Hello. Can you guide me on where I can learn all these concepts?

  • @neelbanerjee7875
    @neelbanerjee7875 1 year ago

    Is it possible to explain all of this in detail? Best with an example.

  • @sujaijain4511
    @sujaijain4511 2 years ago

    Thank you very much, this is very useful!!!

  • @amansinghshrinet8594
    @amansinghshrinet8594 3 years ago

    @Data Savvy It can be watched in one stretch. Really helpful. 👍🏻🙌🏻

  • @mayanksrivastava4121
    @mayanksrivastava4121 1 year ago

    Amazing .. thanks @Data Savvy for your efforts :)

  • @rohitrathod8150
    @rohitrathod8150 3 years ago

    Awesome, Harjeet sir!! I could watch a thousand such videos at a stretch 😁 Very informative!!! Can't wait for long, please upload as many as you can, sir.

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Rohit... Yes, I will try my best :)

  • @JaiGurujiAshwi
    @JaiGurujiAshwi 2 years ago

    Hi sir, it's really helpful for me because I had issues with lots of the questions you asked there, thank you so much sir. Please make more videos, an advanced-level Spark series please.

  • @deepikalamba1058
    @deepikalamba1058 3 years ago

    Hey, it was really helpful. Thank you 👍

  • @sathyansathyan3213
    @sathyansathyan3213 3 years ago

    Keep up the excellent work👍 expecting more such videos.

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks :)

  • @chaitanya5869
    @chaitanya5869 3 years ago

    Your interview is very helpful. Keep up the good work 👍👍👍

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Chaitanya :)

  • @rahulpandit9082
    @rahulpandit9082 3 years ago

    Very Informative.. Thanks a lot Guys...

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Rahul... Sathya and Arindham helped with this :)

  • @vidyac6775
    @vidyac6775 3 years ago

    I like all your videos :) very informative

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Vidya... I am happy that these videos are useful for you :)

  • @MoinKhan-cg8cu
    @MoinKhan-cg8cu 3 years ago

    Very, very helpful, and please do one or two more interviews of the same level. Great effort by interviewer and interviewee.

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks for your kind words... Yup more interviews are planned

  • @surajyejare1627
    @surajyejare1627 2 years ago

    This is really helpful

  • @kranthikumarjorrigala
    @kranthikumarjorrigala 3 years ago

    This is very useful. Please make more videos like this.

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Kranthi... We will create more videos like this

  • @nivedita5639
    @nivedita5639 3 years ago

    Thank you sir ..it is very helpful

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Nivedita

  • @ajithkannan522
    @ajithkannan522 2 years ago

    Since this is a mock interview, the interviewers should have given feedback at the end of the call itself, so it's helpful for viewers.

  • @rohitkamra1628
    @rohitkamra1628 3 years ago

    Awesome. Keep it up 👍🏻

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks mate

  • @tradingtexi
    @tradingtexi 3 years ago

    Really great video. It would have been even better if you could answer the questions the candidate was not able to answer, like: what are the symptoms of a job from which you decide to increase the number of executors or the memory per executor? Can anyone please answer here, so that it may benefit candidates? Thanks a lot for this video.

  • @shivankchaturvedi210
    @shivankchaturvedi210 1 year ago

    Brother, he has already made a video on how to set executor configs. See that video and you will get the answer.

  • @shubhamshingi4657
    @shubhamshingi4657 3 years ago

    It would be really helpful if you could make more such mock interviews. I think we have only 3 live interviews on the channel yet.

  • @kaladharnaidusompalyam851
    @kaladharnaidusompalyam851 3 years ago

    Hadoop is meant for handling big files in small numbers, and the small file problem arises when file size is less than the HDFS block size [64 or 128 MB]. Moreover, handling a bulk number of small files may increase pressure on the NameNode when we have the option to handle big files instead. So in Hadoop, file size matters a lot, which is why partitioning and bucketing came into the picture. Correct me if I made a mistake.

  • @sank9508
    @sank9508 3 years ago

    Partitioning and bucketing relate to YARN (the processing side of Hadoop). HDFS small files explained: blog.cloudera.com/the-small-files-problem/ (the storage side of Hadoop). Also, to handle a huge number of small files we need to increase the NameNode heap (~1 GB per 1 million blocks), which then causes GC issues and makes things complicated.
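
    On the Spark side, a common mitigation is to compact small files by rewriting them into fewer, larger ones. A minimal sketch; the paths and output file count are illustrative assumptions:

    ```scala
    import org.apache.spark.sql.SparkSession

    object CompactSmallFiles {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("CompactSmallFiles")
          .master("local[*]")
          .getOrCreate()

        // Read a directory full of small files and rewrite it with a
        // controlled number of output files (16 here, an arbitrary choice).
        val df = spark.read.parquet("/data/small_files")   // hypothetical path
        df.coalesce(16).write.mode("overwrite").parquet("/data/compacted")
      }
    }
    ```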

  • @davidgupta110
    @davidgupta110 3 years ago

    Good logical questions 👌👌

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks David :)

  • @abhinee
    @abhinee 3 years ago

    Also, Spark does dynamic executor allocation, so you don't need to pass 800 executors as input. Size your job by running test loads.
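
    For reference, a minimal sketch of enabling dynamic allocation on Spark 3.x; the executor bounds are illustrative assumptions, and the settings only take effect on a real cluster manager:

    ```scala
    import org.apache.spark.sql.SparkSession

    object DynamicAllocationSketch {
      def main(args: Array[String]): Unit = {
        // Dynamic allocation lets Spark grow/shrink the executor count with load.
        // Shuffle tracking avoids needing an external shuffle service (Spark 3+).
        val spark = SparkSession.builder()
          .appName("DynamicAllocationSketch")
          .config("spark.dynamicAllocation.enabled", "true")
          .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
          .config("spark.dynamicAllocation.minExecutors", "2")    // illustrative
          .config("spark.dynamicAllocation.maxExecutors", "100")  // illustrative
          .getOrCreate()

        spark.range(1000000L).selectExpr("sum(id)").show()
      }
    }
    ```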

  • @venkataramanak8264
    @venkataramanak8264 2 years ago

    In Spark it is not possible to apply bucketing without partitioning the tables. So if we do not find a suitable column to partition the table, how will we proceed ahead with optimization?
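
    For what it's worth, Spark's DataFrameWriter does accept bucketBy without partitionBy when saving as a table; a minimal sketch, with table and column names as illustrative assumptions:

    ```scala
    import org.apache.spark.sql.SparkSession

    object BucketingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("BucketingSketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val people = Seq(("Asha", 30), ("Ravi", 25)).toDF("name", "age")

        // Bucketing with no partitionBy at all: rows are hash-distributed
        // into 8 buckets on "name" and persisted as a managed table.
        people.write
          .bucketBy(8, "name")
          .sortBy("age")
          .saveAsTable("people_bucketed")
      }
    }
    ```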

  • @kashifanwar4034
    @kashifanwar4034 3 years ago

    How can we make only the name the deciding factor for removing duplicates in a Scala Set, instead of all the fields it takes?
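
    One possible answer (a sketch, not necessarily what the interviewer had in mind): define equality on the name field only, so a Set keys on it, or dedupe explicitly with groupBy:

    ```scala
    object DedupByName {
      // Equality/hashCode defined on name only, so a Set treats two
      // persons with the same name as duplicates.
      class Person(val name: String, val age: Int, val salary: Double) {
        override def equals(other: Any): Boolean = other match {
          case p: Person => p.name == name
          case _         => false
        }
        override def hashCode: Int = name.hashCode
      }

      def main(args: Array[String]): Unit = {
        val people = Set(new Person("Asha", 30, 100.0), new Person("Asha", 31, 120.0))
        println(people.size) // 1: the second "Asha" is considered a duplicate

        // Alternative without touching equality: dedupe a collection by key.
        val list = List(new Person("Ravi", 25, 90.0), new Person("Ravi", 26, 95.0))
        val deduped = list.groupBy(_.name).values.map(_.head).toList
        println(deduped.size) // 1
      }
    }
    ```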

  • @rajeshkumardash1222
    @rajeshkumardash1222 3 years ago

    @Data Savvy Nice one, very informative.

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Rajesh... More videos like this will be posted

  • @enishaeshwar7617
    @enishaeshwar7617 2 years ago

    Good questions

  • @sssaamm29988
    @sssaamm29988 2 years ago

    What is the answer to the Scala question at 31:00, eliminating duplicate objects in a set on the basis of name?
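
    If a Set is not mandatory, Scala 2.13's distinctBy is probably the shortest route; a sketch assuming a simple case class:

    ```scala
    object DistinctByExample {
      case class Person(name: String, age: Int, salary: Double)

      def main(args: Array[String]): Unit = {
        val people = List(
          Person("Asha", 30, 100.0),
          Person("Asha", 31, 120.0),
          Person("Ravi", 25, 90.0)
        )

        // distinctBy (Scala 2.13+) keeps the first element seen per key.
        val unique = people.distinctBy(_.name)
        println(unique) // List(Person(Asha,30,100.0), Person(Ravi,25,90.0))
      }
    }
    ```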

  • @arindampatra6283
    @arindampatra6283 3 years ago

    Wish I didn't have the haircut that day😂😂😀😀😂😂😂

  • @DataSavvy
    @DataSavvy 3 years ago

    This is fine Bro :)

  • @anudeepyerikala8517
    @anudeepyerikala8517 3 years ago

    arindam patra the king of datasavvy

  • @kiranmudradi26
    @kiranmudradi26 3 years ago

    Nice video. The purpose of using '===' while joining is to make sure that we are comparing the right values (the join key value) and the right data type as well. Please correct me if my understanding is wrong.

  • @DataSavvy
    @DataSavvy 3 years ago

    You are right... Using more keywords here would help in giving a better answer.

  • @deekshithnag1655
    @deekshithnag1655 1 year ago

    Using three equals (===) calls a method defined on the Column class in Spark's Scala API that is specifically designed to compare columns in DataFrames.
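
    A minimal sketch of the distinction; the DataFrames and column names are illustrative assumptions:

    ```scala
    import org.apache.spark.sql.SparkSession

    object TripleEqualsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("TripleEqualsSketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val emp  = Seq((1, "Asha", 10), (2, "Ravi", 20)).toDF("id", "name", "deptId")
        val dept = Seq((10, "Sales"), (20, "HR")).toDF("deptId", "deptName")

        // === (Column.equalTo) builds a column expression evaluated per row.
        emp.join(dept, emp("deptId") === dept("deptId")).show()

        // Plain Scala == compares the two Column objects themselves and
        // returns a Boolean immediately, which is never what a join wants.
        println(emp("deptId") == dept("deptId")) // false
      }
    }
    ```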

  • @Varnam_trendZ
    @Varnam_trendZ 8 months ago

    @@deekshithnag1655 Hi.. are you working as a data engineer?

  • @RahulRawat-wu1vv
    @RahulRawat-wu1vv 3 years ago

    Can you do interview questions on Scala? I believe these are really important for cracking tough interviews.

  • @DataSavvy
    @DataSavvy 3 years ago

    Yes Rahul... I will plan for that

  • @saeba7528
    @saeba7528 3 years ago

    Sir, can you please make a video on which language is best for data engineering, Scala or Python?

  • @Karmihir
    @Karmihir 3 years ago

    This is good, but these are just basic questions for DE. It would be great if you shared some code and advanced logic questions from daily DE use.

  • @DataSavvy
    @DataSavvy 3 years ago

    There will be more videos... With more complex problems.

  • @vijeandran
    @vijeandran 3 years ago

    Hi Data Savvy, unable to join Telegram, authorization issue....

  • @digwijoymandal5216
    @digwijoymandal5216 2 years ago

    Seems like most questions are on how to optimize jobs, not much on the technical side. Do data engineer interviews go like this, or are other technical questions asked too?

  • @msbhargavi1846
    @msbhargavi1846 3 years ago

    Hi, we can use the distinct method in Scala for reading unique names, right??

  • @arindampatra6283
    @arindampatra6283 3 years ago

    Absolutely.
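
    In DataFrame terms that would look roughly like this (a sketch; dropDuplicates keeps one arbitrary row per name):

    ```scala
    import org.apache.spark.sql.SparkSession

    object UniqueNames {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("UniqueNames")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val df = Seq(("Asha", 30), ("Asha", 31), ("Ravi", 25)).toDF("name", "age")

        df.select("name").distinct().show() // just the unique names
        df.dropDuplicates("name").show()    // one full row per unique name
      }
    }
    ```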

  • @KahinBhiKuchBhi
    @KahinBhiKuchBhi 1 year ago

    We can use bucketing when there are a lot of small files... Correct me if I'm wrong...

  • @chetankakkireni8870
    @chetankakkireni8870 3 years ago

    Sir, can you please do more interviews like this as it is helpful ..

  • @DataSavvy
    @DataSavvy 3 years ago

    Yes Chetan, I am already planning for more videos on this

  • @shivamgupta-bc7fn
    @shivamgupta-bc7fn 3 years ago

    Can you guys tell which companies would be interviewing in this pattern?

  • @tusharsonawane3055
    @tusharsonawane3055 3 years ago

    Hello sir, this is the first time I am getting in touch with you. It was a great interview; I have seen so many tricky questions. I am preparing for a Spark administrator interview. Do you have some Spark tuning interview questions and some advanced interview questions related to Spark?

  • @DataSavvy
    @DataSavvy 3 years ago

    Hi Tushar... I am happy this video is useful for you. There is a playlist on my channel for Spark performance tuning, and I am also working on a new series... Let me know if you need anything extra.

  • @ShashankGupta347
    @ShashankGupta347 2 years ago

    The default block size is 128 MB; when small files are created by partitioning, a lot of storage goes to waste, and more horizontal scaling is required (defeating the purpose of distribution).

  • @kartikeyapande5878
    @kartikeyapande5878 1 month ago

    But we can configure the block size as well, right?

  • @ShashankGupta347
    @ShashankGupta347 1 month ago

    @@kartikeyapande5878 Yes. When dealing with small files in a distributed storage system with a default block size of 128 MB, there can indeed be inefficiencies and wasted storage space due to the space allocated for each block. This issue is commonly known as the "small files problem" in distributed storage systems like Hadoop's HDFS. Here are a few strategies to mitigate it:
    1. **Combine Small Files**: Periodically combine small files into larger ones, a process often referred to as file compaction or consolidation. By combining multiple small files into larger ones, you reduce the overhead of storing metadata for each individual file and make better use of the storage space.
    2. **Adjust Block Size**: Depending on your workload, you might consider adjusting the default block size. While larger block sizes are more efficient for storing large files, smaller block sizes can be more suitable for small files. However, this adjustment requires careful consideration, since smaller block sizes increase the overhead of managing metadata and may impact performance.
    3. **Use Alternate Storage Solutions**: Depending on your specific requirements, you might explore storage solutions better suited to managing small files. For example, a distributed object store like Amazon S3 or Google Cloud Storage might be more efficient for storing and retrieving small files than traditional block-based storage systems.
    4. **Metadata Optimization**: Optimizing the metadata management mechanisms within your distributed storage system can reduce the overhead associated with storing small files. Techniques such as hierarchical namespace structures, metadata caching, and efficient indexing improve the performance and scalability of the system when dealing with small files.
    5. **Compression and Deduplication**: Compression and deduplication techniques can reduce the overall storage footprint of small files. By compressing similar files, or identifying duplicate content and storing it only once, you optimize storage utilization and reduce wastage.
    6. **Object Storage**: Consider object storage solutions designed to efficiently store and retrieve small objects. These systems typically offer fine-grained metadata management, scalable architectures, and optimized data access patterns for small files.
    Each of these strategies has its own trade-offs in terms of complexity, performance, and overhead. The most suitable approach depends on the specific requirements and constraints of your application and infrastructure.
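
    On the block-size point above: HDFS block size is a client-side, per-write setting rather than a fixed cluster constant, so it can be overridden per application. A sketch; the 256 MB value and output path are illustrative assumptions:

    ```scala
    import org.apache.spark.sql.SparkSession

    object BlockSizeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("BlockSizeSketch")
          .master("local[*]")
          .getOrCreate()

        // dfs.blocksize is read by the HDFS client at write time.
        spark.sparkContext.hadoopConfiguration.set("dfs.blocksize", "268435456") // 256 MB

        spark.range(1000000L).write.mode("overwrite").parquet("/data/out") // hypothetical path
      }
    }
    ```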

  • @bhavaniv1721
    @bhavaniv1721 3 years ago

    Can you please explain the roles and responsibilities of a Spark and Scala developer?

  • @biswadeeppatra1726
    @biswadeeppatra1726 3 years ago

    Can you please suggest a correct way to determine executor cores and executor memory by looking at the input data, without hit and trial, instead of the thumb rule that assumes 5 is the optimal number of cores? Any other way to determine it?

  • @sank9508
    @sank9508 3 years ago

    It purely depends on the size of the input data and the kind of processing/computation, like a heavy join versus a simple scan of the data. In general, for worker nodes (data nodes) of size {40 cores (80 with hyperthreading); 500 GiB memory}, figure ~1 vCore for every 5 GB.
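
    Taking that rule of thumb at face value, a worked example; the 100 GB input size is an illustrative assumption:

    ```scala
    object ExecutorSizing {
      def main(args: Array[String]): Unit = {
        val inputGb      = 100 // illustrative input size
        val gbPerCore    = 5   // rule of thumb from the comment above
        val coresPerExec = 5   // common starting point per executor

        val coresNeeded   = math.ceil(inputGb.toDouble / gbPerCore).toInt        // 20
        val executorCount = math.ceil(coresNeeded.toDouble / coresPerExec).toInt // 4

        println(s"~$coresNeeded cores -> ~$executorCount executors of $coresPerExec cores each")
      }
    }
    ```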

  • @MrRemyguy
    @MrRemyguy 3 years ago

    I'm moving from web development to Spark development. Any inputs on that please!! Can I switch without any experience working with Spark?

  • @yugandharpulicherla4078
    @yugandharpulicherla4078 3 years ago

    Nice and informative video. Can you please answer the question asked in the interview: how to compare two Scala objects based on one variable's value?

  • @DataSavvy
    @DataSavvy 3 years ago

    Hi... Answers to all the questions are available as different videos on the Data Savvy channel... Let me know if you find anything missing and I will add it... I will add an answer to the Scala questions also.

  • @abhinee
    @abhinee 3 years ago

    Please cover estimating Spark cluster size on cloud infrastructure like AWS.

  • @DataSavvy
    @DataSavvy 3 years ago

    Sure Abhinee... Thanks for suggestion

  • @rheaalexander4798
    @rheaalexander4798 3 years ago

    Could you please answer... how to achieve optimisation in a Hive query with columns that have high cardinality?

  • @sakshamsrivastava9894
    @sakshamsrivastava9894 3 years ago

    Maybe we can also use vectorization in such scenarios, and as he said, bucketing on the id column can help drastically; apart from that, choosing the right file format can work as well.
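
    A sketch of what bucketing a high-cardinality key might look like as DDL issued through Spark; the table and column names and the bucket count are illustrative assumptions:

    ```scala
    import org.apache.spark.sql.SparkSession

    object HighCardinalitySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HighCardinalitySketch")
          .master("local[*]")
          .getOrCreate()

        // Bucketing spreads a high-cardinality key (e.g. id) across a fixed
        // number of buckets, instead of one partition per distinct value.
        spark.sql("""
          CREATE TABLE IF NOT EXISTS events_bucketed (id BIGINT, payload STRING)
          USING PARQUET
          CLUSTERED BY (id) INTO 64 BUCKETS
        """)
      }
    }
    ```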

  • @nitishr5197
    @nitishr5197 3 years ago

    Informative. Also, it would be good if the correct answers were also mentioned.

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Nitish... Most of the answers are available as dedicated videos on the channel.

  • @chaitanyachimakurthi2370
    @chaitanyachimakurthi2370 3 years ago

    @@DataSavvy Sorry, I couldn't follow; do we have a separate video with answers?

  • @dramaqueen5889
    @dramaqueen5889 2 years ago

    I have been working on big data for quite some time now, but I don't know why I can't clear interviews.

  • @hgjhhj3491
    @hgjhhj3491 2 years ago

    That broadcast join example looked cooked up 😆😆

  • @call_me_sruti
    @call_me_sruti 3 years ago

    Hey 👋.. thank you for this awesome initiative. Btw, one thing: the WhatsApp link does not work!!

  • @raviteja1875
    @raviteja1875 3 years ago

    Attach a feedback video to the same; it would go a long way in knowing what should have been answered better.

  • @mohitupadhayay1439
    @mohitupadhayay1439 1 year ago

    This guy was interviewing for a TCS data science role once.

  • @DataSavvy
    @DataSavvy 1 year ago

    Which guy?

  • @nivedita5639
    @nivedita5639 3 years ago

    Can you explain this question with a video: what is the best way to join 3 tables in Spark?

  • @DataSavvy
    @DataSavvy 3 years ago

    Sure Nivedita...
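
    In the meantime, one common pattern for a 3-table join (a sketch assuming one large fact table and two small dimension tables) is to broadcast the small sides:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object ThreeWayJoin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ThreeWayJoin")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val facts = Seq((1, 10, 100), (2, 20, 200)).toDF("id", "custId", "prodId")
        val cust  = Seq((10, "Asha"), (20, "Ravi")).toDF("custId", "custName")
        val prod  = Seq((100, "Pen"), (200, "Book")).toDF("prodId", "prodName")

        // Broadcasting the small dimension tables ships them to every executor,
        // so the large fact table is never shuffled for either join.
        val joined = facts
          .join(broadcast(cust), "custId")
          .join(broadcast(prod), "prodId")

        joined.show()
      }
    }
    ```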

  • @Anonymous-fe2ep
    @Anonymous-fe2ep 9 months ago

    Hello, I was asked the following questions in an AWS developer interview:
    Q1. We have *sensitive data* coming in from a source and an API. Help me design a pipeline to bring in the data, clean and transform it, and park it.
    Q2. So where does PySpark come into play in this?
    Q3. Which libraries will you need to import to run the above Glue job?
    Q4. What are shared variables in PySpark?
    Q5. How do you optimize Glue jobs?
    Q6. How do you protect sensitive data in your data?
    Q7. How do you identify sensitive information in your data?
    Q8. How do you provision an S3 bucket?
    Q9. How do I check if a file has been changed or deleted?
    Q10. How do I protect my file with sensitive data stored in S3?
    Q11. How does KMS work?
    Q12. Do you know S3 Glacier?
    Q13. Have you worked on S3 Glacier?

  • @SouthMoviesAll42
    @SouthMoviesAll42 7 months ago

    Please add videos for freshers also.

  • @rajdeepsinghborana2409
    @rajdeepsinghborana2409 3 years ago

    Sir.. can you please provide us Hadoop & Spark developer with Scala videos, beginner to expert? It would be very, very useful for us, sir.. because I have checked all kinds of videos on YouTube and no one does it... And sir, whatever does exist is paid courses like Edureka, Intellipaat, Simplilearn, etc. So sir, please make it soon.. students need it a lot.

  • @rajdeepsinghborana2409
    @rajdeepsinghborana2409 3 years ago

    Students are able to learn and gain more and more knowledge but don't have money 😭

  • @DataSavvy
    @DataSavvy 3 years ago

    Sure Rajdeep.. I will plan a Spark course.

  • @msbhargavi1846
    @msbhargavi1846 3 years ago

    Hi sir, will you please explain the difference between map and foreach....

  • @DataSavvy
    @DataSavvy 3 years ago

    Will create video on this...
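
    In short, the usual answer: map is a transformation that returns a new collection (and in Spark is lazy), while foreach is run only for its side effects and returns Unit (in Spark it is an action). A sketch:

    ```scala
    object MapVsForeach {
      def main(args: Array[String]): Unit = {
        val nums = List(1, 2, 3)

        // map transforms each element and RETURNS a new collection.
        val doubled: List[Int] = nums.map(_ * 2) // List(2, 4, 6)

        // foreach runs a side effect per element and returns Unit.
        nums.foreach(n => println(n))

        println(doubled)
      }
    }
    ```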

  • @gauravlotekar660
    @gauravlotekar660 3 years ago

    It's fine.. but it should be more of a discussion than a question-answer session.

  • @DataSavvy
    @DataSavvy 3 years ago

    Hi Gaurav... Are you suggesting the way of interviewing is not appropriate?

  • @gauravlotekar660
    @gauravlotekar660 3 years ago

    @@DataSavvy No no.. definitely not that. I was saying a discussion style of interviewing is more effective, in my opinion. I feel more comfortable and able to express myself that way.

  • @Anandhusreekumar
    @Anandhusreekumar 3 years ago

    Very informative. Thanks :) Can you suggest some small Spark projects for portfolio building?

  • @DataSavvy
    @DataSavvy 3 years ago

    What is your profile and which language do you use?

  • @Anandhusreekumar
    @Anandhusreekumar 3 years ago

    @@DataSavvy I'm a Scala Spark engineer. I am familiar with Cassandra, Kafka, HDFS, Spark.

  • @sindhugarlapati2776
    @sindhugarlapati2776 3 years ago

    Same request... can you please suggest some small Spark project using Python?

  • @DataSavvy
    @DataSavvy 3 years ago

    I am currently working on creating videos explaining an end-to-end project..

  • @RanjanKumar-ue5id
    @RanjanKumar-ue5id 3 years ago

    Any resource link for a Spark-related mini project?

  • @DataSavvy
    @DataSavvy 3 years ago

    I am creating a series for spark project... Will post soon

  • @darshitsuthar6653
    @darshitsuthar6653 3 years ago

    Sir, this was really helpful and informative.... I'm a fresher seeking an opportunity to work with big data technologies like Hadoop, Spark, Kafka, etc..... Please guide me on how to enter the corporate world starting with these technologies, as there are very few firms that hire freshers for such technologies.....

  • @satyanathparvatham4406
    @satyanathparvatham4406 3 years ago

    Have you done any projects??

  • @darshitsuthar6653
    @darshitsuthar6653 3 years ago

    @@satyanathparvatham4406 Worked on a Hive project, and other than that I keep practicing some scenarios (Spark) on Databricks.

  • @lavuittech3136
    @lavuittech3136 3 years ago

    Can you teach us big data from scratch? Your videos are really useful.

  • @DataSavvy
    @DataSavvy 3 years ago

    Sure... Which topics are you looking for?

  • @lavuittech3136
    @lavuittech3136 3 years ago

    @@DataSavvy from basics.

  • @DataSavvy
    @DataSavvy 3 years ago

    Ok... Planning that :)

  • @lavuittech3136
    @lavuittech3136 3 years ago

    @@DataSavvy Thanks..waiting for it.

  • @rudraganesh1507
    @rudraganesh1507 3 years ago

    @@DataSavvy Sir, please do it. Love you in advance.

  • @adhikariaman01
    @adhikariaman01 3 years ago

    Can you answer the question of how you decide when to increase executors or memory, please? :)

  • @DataSavvy
    @DataSavvy 3 years ago

    You have to find out if your job is compute-intensive or IO-intensive.. you will get hints of that in the log files... I realised after taking this mock interview that I should create a video on that... I am working on it.. thanks for asking this question 😀

  • @NithinAP007
    @NithinAP007 3 years ago

    @@DataSavvy I do not completely agree with what you said, or maybe the question looks a bit off. To start with, increasing executor memory has a limit restricted by the total memory available (depending on the instance type you are using). Memory-usage tuning at the executor level needs to consider 1) the amount of memory used by your objects (you may want your entire dataset to fit in memory), 2) the cost of accessing those objects, and 3) the overhead of garbage collection (if you have high turnover in terms of objects). Now, when we say increasing the number of executors, I consider this as the scaling needed to meet the job's requirements. IO-intensive doesn't directly mean increasing the number of executors, but rather increasing the level of parallelism (dependent on the underlying part files/sizes etc.), which starts at the executor level. So I would rather look at this as optimizing the executor config on a (considerably) small dataset first, and then assessing the level of scaling needed, viz. increasing the number of executors to meet the scale of the actual data. I would like to discuss this further with you.

  • @Manisood001
    @Manisood001 3 years ago

    Please make a course on the Databricks certification for PySpark.

  • @DataSavvy
    @DataSavvy 3 years ago

    Sure Mani... Added to my list. Thanks for the suggestion :)

  • @Manisood001
    @Manisood001 3 years ago

    And please make: 1. hands-on schema evolution with all formats, 2. Databricks Delta Lake, 3. how to connect with different data sources. You are the only creator we can expect this from.

  • @ravurunaveenkumar7987
    @ravurunaveenkumar7987 3 years ago

    Hi, can you do an interview on Scala and Spark?

  • @DataSavvy
    @DataSavvy 3 years ago

    Hi, I am sorry, I did not understand your question completely.. are you asking to do a mock interview with me on Scala and Spark? If yes... please drop me a message at aforalgo@gmail.com and we can work this out.

  • @lavanyareddy310
    @lavanyareddy310 3 years ago

    Hello sir, your videos are very helpful..... I am unable to join your Telegram group..... please help me sir.

  • @mohammadjunaidshaik2664
    @mohammadjunaidshaik2664 3 years ago

    I am a fresher; can I start my career with big data?

  • @taherkutarwadli8368
    @taherkutarwadli8368 1 year ago

    I am new to the data engineering field. Which language should I choose, Scala or Python? Which language has more job roles?

  • @kshitizagarwal8389
    @kshitizagarwal8389 15 days ago

    Go with Python (PySpark, I suggest).

  • @newbeautifulworldq2936
    @newbeautifulworldq2936 1 year ago

    Any new video? I would appreciate it.

  • @DataSavvy
    @DataSavvy 1 year ago

    Just posted: kzread.info/dash/bejne/ooh6zcydftHNXbg.html

  • @NithinAP007
    @NithinAP007 3 years ago

    I liked some of the questions but not the answers completely; say, the HDFS block size, the memory used per file in the NameNode, and the type-safe equality. How do you plan to publish the right content/answers?

  • @DataSavvy
    @DataSavvy 3 years ago

    Hi Nithin... This was a completely impromptu interview... Are you looking for the answer to any specific question?

  • @MaheshKumar-yz7ns
    @MaheshKumar-yz7ns 3 years ago

    @4:40 is the interviewer expecting the answer DAG?

  • @pratikj2538
    @pratikj2538 3 years ago

    Can you make one interview video for a big data developer with 2-3 years of experience?

  • @DataSavvy
    @DataSavvy 3 years ago

    Sure Pratik... That is already planned... This interview also fits the less-than-4-years experience category.

  • @abhinee
    @abhinee 3 years ago

    Actually, asking to compare Spark and Hadoop is incorrect; one should ask MR vs Spark. Also, Spark has improved insanely, so please, interviewers, retire this question.

  • @KiranKumar-um2gz
    @KiranKumar-um2gz 3 years ago

    It's correct. HDFS has its own framework and Spark has its own. HDFS works by batch processing whereas Spark works by in-memory computation. Consider everything as an info dump.

  • @abhinee
    @abhinee 3 years ago

    @@KiranKumar-um2gz Spark is a framework in itself and it also does batch processing. Please don't spread half-knowledge.

  • @alokdutta4712
    @alokdutta4712 3 years ago

    True

  • @hasifs8139
    @hasifs8139 3 years ago

    I will still ask this question. When comparing Hadoop with Spark, it is assumed that we are comparing 2 processing engines, not a processing engine to a file system. We expect a candidate to be sane enough to understand that. Also, Spark is built on top of the MR concept, so this is a very good question to test your understanding of it.

  • @abhinee
    @abhinee 3 years ago

    @@hasifs8139 Anyone who has picked up Spark in the last 3 years does not need to understand MR to be good at Spark or data processing. Spark's implementation is way too different from MR to make any comparison. Do you take the same or similar steps to optimise joins in Spark and MR? No. You can keep asking this question and rejecting good candidates.

  • @ashutoshrai5342
    @ashutoshrai5342 3 years ago

    Sathiyan is a genius

  • @DataSavvy
    @DataSavvy 3 years ago

    I agree :)

  • @jeevithat6038
    @jeevithat6038 3 years ago

    Hi, it would be nice if you gave the correct answers when the answer is wrong..

  • @ferozqureshi5228
    @ferozqureshi5228 1 year ago

    If we use 800 executors for 100 GB of input data like you've mentioned in your example, Spark would then be busy managing the high volume of executors rather than processing data. So it could be better to use one executor per 5-10 GB, which would leave us using 10-20 executors for 100 GB of data. If you have any explanation for using 800 executors, then do post it.

  • @DataSavvy
    @DataSavvy 1 year ago

    Let me look into this.

  • @kshitizagarwal8389
    @kshitizagarwal8389 15 days ago

    Not 800 executors; he said to use 800 cores for maximum parallelism. Keep five cores in each executor, resulting in 160 executors in total.
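
    The arithmetic behind that, as a sketch (the 128 MB split size is the usual default assumption):

    ```scala
    object ParallelismArithmetic {
      def main(args: Array[String]): Unit = {
        val inputBytes = 100L * 1024 * 1024 * 1024 // 100 GB of input
        val splitBytes = 128L * 1024 * 1024        // default 128 MB split size
        val tasks      = (inputBytes / splitBytes).toInt // = 800 partitions/tasks

        val coresPerExec = 5                    // common rule of thumb
        val executors    = tasks / coresPerExec // = 160 executors

        println(s"$tasks tasks -> $executors executors x $coresPerExec cores")
      }
    }
    ```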

  • @msbhargavi1846
    @msbhargavi1846 3 years ago

    Hi, why doesn't Hadoop support small files?

  • @DataSavvy
    @DataSavvy 3 years ago

    He meant to ask why small files are not good for Hadoop... Hadoop can store small files, though.

  • @msbhargavi1846
    @msbhargavi1846 3 years ago

    @@DataSavvy Thanks, but why is it not good? A performance issue.. how exactly?

  • @rajeshkumardash1222
    @rajeshkumardash1222 3 years ago

    @@msbhargavi1846 If you have too many small files, then your NameNode will have to keep metadata for each of them; each metadata entry takes around 100-150 bytes, so just think how much memory the NameNode has to exhaust to manage this if you have millions of small files....

  • @msbhargavi1846
    @msbhargavi1846 3 years ago

    @@rajeshkumardash1222 yes got it.... thanks

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Rajesh
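
    A quick back-of-the-envelope for the metadata point above; the file count is an illustrative assumption:

    ```scala
    object NameNodeHeapEstimate {
      def main(args: Array[String]): Unit = {
        val files          = 10000000L // 10 million small files (illustrative)
        val bytesPerObject = 150L      // rough metadata cost per file object

        // Counting just one namespace object per file, to stay conservative;
        // each file also adds at least one block object in practice.
        val heapBytes = files * bytesPerObject
        println(f"~${heapBytes / (1024.0 * 1024 * 1024)}%.1f GB of NameNode heap")
      }
    }
    ```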

  • @nikhilmishra7572
    @nikhilmishra7572 3 years ago

    What's the purpose of using '===' while joining? Nice video, btw.

  • @DataSavvy
    @DataSavvy 3 years ago

    Thanks Nikhil... Will post a video about the answer in few days :)

  • @harshavardhanreddyakiti4655
    @harshavardhanreddyakiti4655 3 years ago

    @@DataSavvy Can you post something like this on Airflow?

  • @abhirupghosh806
    @abhirupghosh806 3 years ago

    @@DataSavvy My best guess is that = and == are already reserved operators. = is the assignment operator, like val a = 5. == is the comparison operator when the object type is the same, like how we use it in a normal string comparison, for example. === is the comparison operator when the object type is different, like when we compare two different columns of different datasets/DataFrames.

  • @Manisood001
    @Manisood001 3 years ago

    Wow, kindly make a video on Hive integration with Spark.

  • @DataSavvy
    @DataSavvy 3 years ago

    Hmmm... What is the problem you are facing in the integration?

  • @Manisood001
    @Manisood001 3 years ago

    @@DataSavvy In Databricks, when I am creating a managed Hive table with the "USING JSON" keyword it creates fine, but when I am creating an external table it shows an error.

  • @Manisood001
    @Manisood001 3 years ago

    @@DataSavvy Why doesn't the "USING" keyword work with external tables?

  • @sheshkumar8502
    @sheshkumar8502 3 months ago

    Hi how are you

  • @atanu4321
    @atanu4321 3 years ago

    Small file issue 16:45

  • @GauravSingh-dn2yx
    @GauravSingh-dn2yx 3 years ago

    Everyone is in nightwear 😂😂😂

  • @lavakumar5181
    @lavakumar5181 3 years ago

    Hi sir, if you provide interview guidance personally, please let me know.. I'll contact you personally... I need guidance.

  • @DataSavvy
    @DataSavvy 3 years ago

    Join our group... It's very vibrant and people help each other a lot

  • @subhaniguddu2870
    @subhaniguddu2870 3 years ago

    Please share the group link and we will join.

  • @DataSavvy
    @DataSavvy 3 years ago

    chat.whatsapp.com/KKUmcOGNiixH8NdTWNKMGZ

  • @DataSavvy
    @DataSavvy 3 years ago

    @@hasnainmotagamwala2608 Hi, the number of WhatsApp groups was becoming difficult to manage... So we have moved to Telegram... It allows us to have one big group with lots of other benefits... Join t.me/bigdata_hkr

  • @arunnautiyal2424
    @arunnautiyal2424 3 years ago

    @@DataSavvy It is giving 'not authorized to access'.

  • @sangeethanagarajan278
    @sangeethanagarajan278 2 years ago

    How much experience does this candidate have?

  • @THEPOSTBYLOT
    @THEPOSTBYLOT 3 years ago

    Please create a new WhatsApp group as it is full.

  • @DataSavvy
    @DataSavvy 3 years ago

    Hi XYZ, the number of WhatsApp groups was becoming difficult to manage... So we have moved to Telegram... It allows us to have one big group with lots of other benefits... Join t.me/bigdata_hkr

  • @seaofpoppies
    @seaofpoppies 3 years ago

    There is no such thing as in-memory processing.. memory is used to store data that can be reused. 4 years back I was grilled on this 'in-memory processing' stuff at one of the Big 4 firms.

  • @arindampatra6283
    @arindampatra6283 3 years ago

    You should google the meaning of in-memory processing once.. It doesn't mean that your memory will process the data for you 😂😂😂😂 Even kids in school know that the CPU does the actual computations..

  • @b__05
    @b__05 3 years ago

    No. In-memory is usually your RAM, where data is stored and computed on in parallel; hence it is fast. Can you just let me know how you got grilled on this?

  • @EnimaMoe
    @EnimaMoe 3 years ago

    Hadoop works in batches by moving data in HDFS. Meanwhile, Spark does its operations in memory: the data is cached in memory and all operations are done live. Unless you were asked questions about Hadoop, I don't see how you could get grilled on this...

  • @seaofpoppies
    @seaofpoppies 3 years ago

    @@EnimaMoe Spark does not do operations in memory; in fact, no processing happens in memory. I am not talking about the concept, I am talking about the phrase that is used, "in-memory processing". For those advising me to google Spark, just an FYI, I have been using Spark for years. You can always challenge whatever is written or told by someone. Tc.

  • @omkarkulkarni9202
    @omkarkulkarni9202 3 years ago

    Q1: What made you move to Spark from Hadoop/MR? Both the question and the answer are wrong. Hadoop is a file system whereas Spark is a framework/library to process data in a distributed fashion. There is no such thing as one being better than the other; it's like comparing apples and oranges.

  • @DataSavvy
    @DataSavvy 3 years ago

    Hi Omkar... Hadoop is a combination of MapReduce and HDFS.. HDFS is the file system and MR is the processing engine... The interviewer wanted to know why you prefer Spark compared to the Hadoop MR style of processing... You will generally see people who work in big data using this kind of language, generally people who started with Hadoop and then moved to the Spark processing engine later.

  • @omkarkulkarni9202
    @omkarkulkarni9202 3 years ago

    @@DataSavvy Can you tell me where and how you do MR using Hadoop? And can you elaborate on what exactly you mean by "Hadoop MR style of programming"? If the interviewer is using this language, clearly he has learnt from and limited his knowledge to tutorials. Someone who has worked on large-scale clusters using EMR or his own EC2 cluster won't use such vague language.

  • @DataSavvy
    @DataSavvy 3 years ago

    Hi Omkar... Please read en.m.wikipedia.org/wiki/MapReduce ... or google Hadoop MapReduce...

  • @omkarkulkarni9202
    @omkarkulkarni9202 3 years ago

    @@DataSavvy I understand what MapReduce is.. it's a paradigm, not the framework/library you are asking about. The interviewer asked this question: why do you prefer Spark compared to the Hadoop MR style of processing? This question itself is wrong, as Spark is a framework that allows you to process data using the MapReduce paradigm. There is no such thing as a "Hadoop MR style of processing".

  • @DataSavvy
    @DataSavvy 3 years ago

    I see... You seem to have an issue with the words used to frame the question... I think we should focus on the intent of the question, rather than thinking too much about the words...

  • @iam_krishna15
    @iam_krishna15 2 years ago

    This one can't be considered a Spark interview.

  • @DataSavvy
    @DataSavvy 2 years ago

    Hi Krishna... Please share your expectations... I will cover that in another video.

  • @gauthamn1603
    @gauthamn1603 3 years ago

    Please don't use the word "we"; use "I".

  • @hasifs8139
    @hasifs8139 3 years ago

    Never use 'I' unless you are working alone.

  • @MrTeslaX
    @MrTeslaX 3 years ago

    @@hasifs8139 Always use 'I' in an interview.

  • @hasifs8139
    @hasifs8139 3 years ago

    @@MrTeslaX Thanks for the advice; luckily I rarely have to go for interviews nowadays. Personally, I don't hire people who use a lot of 'I' in their answers, because they are just ignoring the existence of others in the team. Most likely they are bad team players, and I don't want such people on my team.

  • @MrTeslaX
    @MrTeslaX 3 years ago

    @@hasifs8139 Thanks for your response. I live and work in the US and have interviewed at FAANG companies and other small companies as well. One of the most important things I was told to keep in mind was to highlight my contributions and achievements and not talk about the overall work done. Be specific, talk about the work you have done, and use 'I' while talking about it.

  • @hasifs8139
    @hasifs8139 3 years ago

    @@MrTeslaX Thanks for your explanation. Yes, you must definitely highlight your contributions and achievements within the project. All I am saying is that you should not give the impression that you did it all on your own. Also, what difference does it make if you are living in the US or Germany (where I am) or anywhere else?

  • @ldk6853
    @ldk6853 1 month ago

    Terrible accent… 😮