Data Savvy

We cover topics from across the Big Data world. On this channel, we intend to bring you videos on big data, Hadoop, Spark, NoSQL databases, HBase, Cassandra, machine learning, deep learning, and more.

We also offer students free interview preparation. Please don't hesitate to reach out to us.

You can join our WhatsApp and Telegram groups to discuss and prepare interview questions.
Stay updated with all material on GitHub.
If you want to connect with me, please connect on LinkedIn.

Comments

  • @biswadeeppatra1726 · 8 days ago

    Please share the doc that you are using in this video

  • @VivekKBangaru · 9 days ago

    Very informative one. Thanks, buddy.

  • @akashhudge5735 · 10 days ago

    In lambda architecture, so far no one has explained how deduplication is handled when the batch and stream processing data are combined in the serving layer. Whatever data is processed by the streaming layer will eventually get processed by the batch layer, right? If this is true, then the data previously processed by the streaming layer is no longer required. So do we need to remove the data processed by the streaming layer?

  • @jasbirkumar7770 · 22 days ago

    Sir, can you tell me something about Spark data for a housekeeping executive? I don't understand the word 'Spark'. The facility company JLL requires him to have Spark experience.

  • @deepanshuaggarwal7042 · 28 days ago

    "flatMapGroupsWithState" is a statefull operation? Do you have any tutorial on it?

  • @thelazitech · a month ago

    To whom it may concern, on when to use groupByKey over reduceByKey: groupByKey() can be used for non-associative operations, where the order in which the operation is applied matters. For example, if we want to calculate the median of a set of values for each key, we cannot use reduceByKey(), since median is not an associative operation.
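
    A minimal PySpark sketch of this distinction, assuming a local SparkSession (the data and app name are illustrative):

        import statistics
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("median-demo").getOrCreate()
        sc = spark.sparkContext

        pairs = sc.parallelize([("a", 1), ("a", 5), ("a", 3), ("b", 2), ("b", 4)])

        # Associative operation: partial sums can be merged in any order.
        sums = pairs.reduceByKey(lambda x, y: x + y)

        # Non-associative operation: all values for a key must be gathered in one place.
        medians = pairs.groupByKey().mapValues(lambda vs: statistics.median(vs))

        print(sums.collect())     # e.g. [('a', 9), ('b', 6)]
        print(medians.collect())  # e.g. [('a', 3), ('b', 3.0)]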

  • @sreekantha2010 · a month ago

    Awesome!! Wonderful explanation. Before this, I had seen so many videos, but none of them explained the steps with such clarity. Thank you for sharing.

  • @BishalKarki-pe8hs · a month ago

    vak mugi

  • @ldk6853 · 2 months ago

    Terrible accent… 😮

  • @maturinagababu98 · 2 months ago

    Hi sir, please help me with the following requirement:

        +---+-----+
        | id|count|
        +---+-----+
        |  a|    3|
        |  b|    2|
        |  c|    4|
        +---+-----+

    I need the following output using Spark: a a a b b c c c c
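
    One hedged way to produce that expansion in PySpark (assumes Spark 2.4+ for array_repeat; column names follow the question):

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import expr

        spark = SparkSession.builder.appName("expand-demo").getOrCreate()
        df = spark.createDataFrame([("a", 3), ("b", 2), ("c", 4)], ["id", "count"])

        # Repeat each id `count` times, then flatten to one row per repetition.
        # The cast is needed because createDataFrame infers count as bigint.
        expanded = df.select(
            expr("explode(array_repeat(id, cast(count as int)))").alias("id"))
        expanded.show()   # a, a, a, b, b, c, c, c, c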

  • @ramyajyothi8697 · 2 months ago

    What do you mean by application needing a lot of joins? Can you please clarify how the joins are affecting the architecture decision?

  • @suresh.suthar.24 · 3 months ago

    I have one doubt: are reserved memory and YARN overhead memory the same? Because reserved memory also stores Spark internals. Thank you for your time.

  • @ahmedaly6999 · 3 months ago

    How do I join a small table with a big table while keeping all the data in the small table? The small table is 100k records and the large table is 1 million records. df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer') runs out of memory, and I can't broadcast the small df, I don't know why. What is the best approach here? Please help.
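
    A hedged sketch of the usual fix (the DataFrames below are stand-ins for the tables in the question):

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import broadcast

        spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
        smalldf = spark.range(100_000)     # stand-in for the 100k-row table
        largedf = spark.range(1_000_000)   # stand-in for the 1M-row table

        # Explicitly broadcast the small side of an inner join.
        joined = largedf.join(broadcast(smalldf), on="id", how="inner")

        # Caveat: with how="left_outer" and the small table on the LEFT,
        # Spark can only broadcast the right side, so a broadcast hint on
        # the small left table is ignored and a shuffle join runs instead.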

  • @naveena2226 · 3 months ago

    Hi all, I just got to know about the wonderful videos on the DataSavvy channel. In the "executor OOM - big partitions" slide: in Spark, every partition is of block size (128 MB), right? Then how can a big partition cause an issue? Can someone please explain? A little confused here. Even with a 10 GB file, when Spark reads it, it creates around 80 partitions of 128 MB. Even if one partition is larger, it cannot exceed 128 MB, right? So how does OOM occur?

  • @sheshkumar8502 · 3 months ago

    Hi how are you

  • @praptijoshi9102 · 4 months ago

    amazing

  • @adityakvs3529 · 4 months ago

    I have a question: who takes care of task scheduling?

  • @kaladharnaidusompalyam851 · 4 months ago

    If we maintain replicas of data in three different racks in Hadoop, and we submit a job, we get results, right? Why don't we get duplicate results from the copies of the data? What is the mechanism in Hadoop that ensures only one replica of a block is processed when two more duplicates exist?

  • @prathapganesh7021 · 4 months ago

    Great content thank you

  • @RakeshYadav-cw3zf · 4 months ago

    very well explained

  • @ayushigupta542 · 4 months ago

    Great content! Are you on Topmate or any other platform where I can connect with you? I need some career advice/guidance from you.

  • @Pratik0917 · 4 months ago

    Then why aren't people using Datasets everywhere?

  • @user-du9wb1oe7t · 4 months ago

    Hi Harjeet, I'm getting a "KafkaUtils not found" error while creating a DStream.

  • @harshitsingh9842 · 4 months ago

    Where is the volume?

  • @harshitsingh9842 · 4 months ago

    Having a diff table at the end of the video would be appreciated.

  • @pandurangbhadange25 · 4 months ago

    1. The cache method persists the DataFrame or RDD in memory by default. It is shorthand for calling persist() with the default storage level (MEMORY_ONLY for RDDs; DataFrames default to MEMORY_AND_DISK).
    2. The persist method allows you to specify a storage level for persisting the DataFrame or RDD, with options such as MEMORY_ONLY, MEMORY_ONLY_SER, DISK_ONLY, MEMORY_AND_DISK, etc.
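
    A small sketch of the difference (local session; the row count is illustrative):

        from pyspark import StorageLevel
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("cache-demo").getOrCreate()
        df = spark.range(1_000_000)

        df.cache()                           # shorthand for persist() with the default level
        df.count()                           # the cache is materialized on the first action

        df.unpersist()
        df.persist(StorageLevel.DISK_ONLY)   # explicit storage level
        df.count()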

  • @pandurangbhadange25 · 4 months ago

    map():
    1. Supported on RDDs, not on DataFrames.
    2. Operates on every row, one record at a time.
    mapPartitions():
    1. Heavy initialization executes only once per partition instead of once per record (row).
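
    A sketch of the once-per-partition behaviour (the "resource" below is a stand-in for expensive setup such as a DB connection):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("mappartitions-demo").getOrCreate()
        sc = spark.sparkContext
        rdd = sc.parallelize(range(8), 2)    # 2 partitions

        def with_setup(rows):
            resource = {"opened": True}      # runs once per partition, not per record
            for r in rows:
                yield r * 2

        print(rdd.mapPartitions(with_setup).collect())
        print(rdd.map(lambda r: r * 2).collect())   # map touches rows one at a time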

  • @pandurangbhadange25 · 4 months ago

    1. Parsing - creates an abstract syntax tree.
    2. Analysis - the Catalyst analyzer performs semantic analysis on the tree. This includes resolving references, type checking, and creating a logical plan. The analyzer also infers data types.
    3. Logical optimisation - rewrites the plan into a more efficient form. This includes predicate pushdown and constant folding.
    4. Physical planning - Spark stages and tasks are created.
    5. Physical optimisation - the plan is optimized further by considering factors like data partitioning, join order, and choosing the most efficient physical operators.
    6. Code generation - generates Java bytecode for the optimized physical plan.
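
    You can watch these phases for any query with explain; a quick sketch (the query is illustrative):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
        df = spark.range(100).filter("id > 10").selectExpr("id * 2 AS doubled")
        df.explain(True)   # prints the parsed, analyzed, and optimized logical plans plus the physical plan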

  • @pandurangbhadange25 · 4 months ago

    groupBy in PySpark:
    1. A transformation that groups the elements of a DataFrame or RDD based on the specified key or keys.
    2. Does not itself perform any aggregation on the grouped data.
    3. Works on non-key-value-pair data.
    reduceByKey in PySpark:
    1. Specifically designed for key-value-pair RDDs; it groups the data by key.
    2. Performs a reduction or aggregation operation.
    3. For key-value pairs, prefer reduceByKey over groupByKey.
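
    A word-count sketch of why reduceByKey is preferred for aggregations (illustrative data):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("reduce-demo").getOrCreate()
        sc = spark.sparkContext
        pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

        # reduceByKey combines values map-side before the shuffle...
        counts = pairs.reduceByKey(lambda x, y: x + y)
        # ...while groupByKey ships every value across the network first.
        grouped = pairs.groupByKey().mapValues(sum)
        print(counts.collect())   # e.g. [('a', 2), ('b', 1)]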

  • @pandurangbhadange25 · 4 months ago

    repartition:
    1. Used to increase or decrease the number of RDD/DataFrame partitions.
    2. Requires a full shuffle.
    coalesce:
    1. Can only reduce the number of partitions.
    2. Avoids a full shuffle.
    3. Less expensive.
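
    A quick sketch (partition counts are illustrative):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("partitions-demo").getOrCreate()
        df = spark.range(1_000_000)

        wider = df.repartition(200)            # full shuffle; can go up or down
        narrow = wider.coalesce(50)            # merges partitions; avoids a full shuffle
        print(narrow.rdd.getNumPartitions())   # 50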

  • @pandurangbhadange25 · 4 months ago

    1. Spark follows a master-slave architecture: the master is called the "Driver" and the slaves are called "Workers".
    2. The context is the entry point to your application.

  • @pandurangbhadange25 · 4 months ago

    Transformations:
    1. Operations on RDDs or DataFrames that create a new RDD or DataFrame from an existing one.
    2. Lazily evaluated, meaning execution is deferred until an action is called.
    3. Return a new RDD/DataFrame rather than a result to the driver.
    Actions:
    1. Operations that return a value to the driver program.
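
    A sketch of the laziness (nothing runs until the last line):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
        sc = spark.sparkContext

        rdd = sc.parallelize(range(10))
        doubled = rdd.map(lambda x: x * 2)             # transformation: lazy
        evens = doubled.filter(lambda x: x % 4 == 0)   # transformation: still lazy
        print(evens.count())                           # action: the job executes here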

  • @pandurangbhadange25 · 4 months ago

    1. pyspark.SparkContext is the entry point to PySpark functionality; it is used to communicate with the cluster and to create RDDs.
    2. There can be only one SparkContext per JVM.
    3. In order to create another, you first need to stop the existing one using the stop() method.
    4. The PySpark shell creates and provides the sc object, which is an instance of the SparkContext class.
    5. Creating a SparkSession creates a SparkContext internally and exposes the sparkContext variable for use.
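
    A sketch of point 5 (the app name is illustrative):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("ctx-demo").getOrCreate()
        sc = spark.sparkContext                 # the SparkContext created internally
        print(sc.parallelize([1, 2, 3]).sum())  # 6
        spark.stop()                            # also stops the underlying SparkContext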

  • @pandurangbhadange25 · 4 months ago

    Lineage - logical plan
    DAG - physical plan
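
    You can inspect an RDD's lineage directly; a small sketch:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
        sc = spark.sparkContext

        rdd = sc.parallelize(range(4)).map(lambda x: x + 1).filter(lambda x: x > 2)
        print(rdd.toDebugString().decode())   # the lineage: the logical chain of dependencies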

  • @pankajchikhalwale8769 · 4 months ago

    Great explanation. Excellent teaching. Please do a deeper dive into coalesce and repartition: the commands and several scenarios (e.g. coalesce after repartition, repartition after coalesce, coalesce after coalesce, etc.). I know that some of these may not be meaningful, but I have seen several of your videos and you are a GREAT teacher.

  • @rakeshchaudhary8255 · 4 months ago

    Still relevant as of today and frequently asked. The practical on Databricks made things crystal clear.

  • @RAVIKUMAR74 · 4 months ago

    How do you calculate the number of executors and the memory per executor when multiple jobs are run by multiple processes in a cluster? What is the best way to tune our process? Is giving 50 executors with 16 GB each a good approach? Could you please explain with an example?

  • @neerajbhanottt · 4 months ago

    Does watermarking handle late-arriving records along with flatMapGroupsWithState for complex scenarios in Databricks?

  • @prajaktadange6916 · 5 months ago

    Please add a video on optimization in Spark, and on how to monitor performance in the Spark UI.

  • @anshusharaf2019 · 5 months ago

    For this scenario-based question, can we create an end-to-end pipeline using Kafka and a Power BI dashboard? E.g. connect to your database with a source connector, use ksqlDB for the business-level transformations, store the results in a Kafka topic, and then connect Power BI for the dashboard. @DataSavvy or someone, can you check whether my thinking is right?

  • @pramodp8161 · 5 months ago

    That music is good and painful....

  • @DataSavvy · 5 months ago

    Feedback noted... This has been changed in all subsequent videos.

  • @mydreamprisha · 5 months ago

    So that's how "Parquet" is spelled..

  • @shaleenarora2894 · 5 months ago

    Very clearly explained. Thank you!

  • @DataSavvy · 5 months ago

    Thanks @shaleen

  • @Adi300594 · 5 months ago

    Why is it called lambda architecture? Isn't it the same as an ETL/ELT workflow? Is it only the nature of the source data (stream/event) that makes it a lambda architecture? My question is more about the etymology of this architecture. Could an architecture on the Databricks platform, like the lakehouse, work for the discussed use cases?

  • @gaurisharma6331 · 6 months ago

    Glad to see you back with useful videos

  • @DataSavvy · 6 months ago

    Thank you Gauri

  • @naveenbhandari5097 · 6 months ago

    helpful video!

  • @DataSavvy · 6 months ago

    Thanks Naveen

  • @sivakumark8734 · 6 months ago

    It's very clear. Very helpful for upgrading our system design and architecture skills 👍

  • @DataSavvy · 6 months ago

    Thank you Siva

  • @sourbhydv · 6 months ago

    Very good explanation. Thanks!

  • @DataSavvy · 6 months ago

    Thanks sourbh

  • @RohanKumar-mh3pt · 6 months ago

    Amazing! Can you make a video on backfilling strategies for when we need to process historical data?