We cover all the topics in the Big Data world. As part of this channel, we intend to bring videos related to Big Data, Hadoop, Spark, NoSQL databases, HBase, Cassandra, machine learning, deep learning, etc.
We also help students with free interview preparation. Please reach out to us without any hesitation.
You can join our WhatsApp group and Telegram group to discuss and prepare interview questions.
Stay updated with all material on GitHub.
If you want to connect with me, please connect on LinkedIn.
Comments
Please share the doc that you are using in this video
Very informative. Thanks, buddy.
In the Lambda architecture, so far no one has explained how deduplication is handled when the batch-processed and stream-processed data are combined in the serving layer. Whatever data is processed by the streaming layer will eventually get processed by the batch layer, right? If that is true, the data previously processed by the streaming layer is no longer required. So do we need to remove the data processed by the streaming layer?
Sir, can you tell me something about 'Spark' in a housekeeping executive job posting? I don't understand the word 'Spark'. The facility company JLL requires Spark experience.
Is "flatMapGroupsWithState" a stateful operation? Do you have any tutorial on it?
To whoever is wondering when to use groupByKey() over reduceByKey(): groupByKey() can be used for non-associative operations, where the order in which the operation is applied matters. For example, if we want to calculate the median of the values for each key, we cannot use reduceByKey(), since median is not an associative operation.
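A quick way to see the point above (a plain-Python sketch, not from the original comment): merging partial medians does not give the median of the full data set, which is why pair-wise reduction cannot compute it.

```python
from statistics import median

# Median is not associative: reducing one "partition" first and then
# combining its result with the rest gives the wrong answer. That is why
# reduceByKey() (pair-wise merging) cannot compute a median, while
# groupByKey() (collect all values per key first) can.
values = [1, 2, 3, 100, 200]

true_median = median(values)                # median of the full list
partial = median([1, 2, 3])                 # "reduce" one partition first
wrong_median = median([partial, 100, 200])  # then merge with the rest

print(true_median, wrong_median)  # prints 3 100
```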
Awesome!! Wonderful explanation. Before this, I had seen so many videos, but none of them explained the steps with such clarity. Thank you for sharing.
Terrible accent… 😮
Hi sir, please help me with the following requirement. Input:
id | count
a  | 3
b  | 2
c  | 4
I need the following output using Spark: a a a b b c c c c
What do you mean by the application needing a lot of joins? Can you please clarify how the joins affect the architecture decision?
I have one doubt: are reserved memory and YARN overhead memory the same? Because reserved memory also stores Spark internals. Thank you for your time.
How do I join a small table with a big table when I want to keep all the data from the small table? The small table has 100k records and the large table has 1 million records. df = smalldf.join(largedf, smalldf.id == largedf.id, how='left_outer') runs out of memory, and I can't broadcast the small df, I don't know why. What is the best approach here? Please help.
Hi all, I just got to know about the wonderful videos on the Data Savvy channel. About the "executor OOM - big partitions" slide: in Spark, isn't every partition only of block size (128 MB)? Then how can a big partition cause an issue? Can someone please explain? I'm a little confused here. Even if there is a 10 GB file, when Spark reads it, it creates around 80 partitions of 128 MB. Even if one of the partitions is large, it cannot exceed 128 MB, right? So how does OOM occur?
Hi how are you
amazing
I have a doubt: who takes care of task scheduling?
If we maintain replicas of the data in three different racks in Hadoop and we submit a job, we get results, right? Why don't we get duplicate results from the copies of the data? What is the mechanism in Hadoop that ensures only one replica of a block is processed when two more duplicates exist?
Great content thank you
very well explained
Great content! Are you on Topmate or any other platform where I can connect with you? I need some career advice/guidance from you.
Then why aren't people using Dataset everywhere?
Hi Harjeet, I'm getting a "KafkaUtils not found" error while creating a DStream.
where is the volume?
Having a comparison table at the end of the video would be appreciated.
1. The cache() method persists the DataFrame or RDD in memory with the default storage level; it is shorthand for calling persist() with no arguments (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
2. The persist() method lets you specify a storage level, such as MEMORY_ONLY, MEMORY_ONLY_SER, DISK_ONLY, MEMORY_AND_DISK, etc.
map() - 1. Supported on RDDs, not on PySpark DataFrames. 2. The function is applied to every row.
mapPartitions() - heavy initialization executes only once for each partition instead of for every record (row).
1. Parsing - create an abstract syntax tree.
2. Analysis - the Catalyst analyzer performs semantic analysis on the tree. This includes resolving references, type checking, and creating a logical plan. The analyzer also infers data types.
3. Logical optimization - rewrite the plan into a more efficient form. This includes predicate pushdown and constant folding.
4. Physical planning - the logical plan is converted into one or more candidate physical plans.
5. Physical optimization - the plan is optimized further by considering factors like data partitioning, join order, and choosing the most efficient physical operators.
6. Code generation - generate Java bytecode for the optimized physical plan.
groupBy in PySpark:
1. A transformation that groups the elements of a DataFrame or RDD by the specified key(s).
2. Does not itself perform any aggregation on the grouped data.
3. Works on non-key-value data.
reduceByKey in PySpark:
1. Specifically designed for key-value pair RDDs; it groups the data by key.
2. Performs a reduction or aggregation as it groups.
3. Requires key-value pairs.
repartition:
1. Can increase or decrease the number of RDD/DataFrame partitions.
2. Performs a full shuffle; more expensive.
coalesce:
1. Can only reduce the number of partitions.
2. Avoids a full shuffle.
3. Less expensive.
1. Spark uses a master-slave architecture; the master is called the "Driver" and the slaves are called "Workers".
2. The Spark context is the entry point to your application.
Transformations:
1. Operations on RDDs or DataFrames that create a new RDD or DataFrame from an existing one.
2. Lazily evaluated: execution is deferred until an action is called.
3. Do not return a result to the driver.
Actions:
1. Operations that trigger execution and return a value to the driver program.
1. pyspark.SparkContext is the entry point to PySpark functionality; it is used to communicate with the cluster and to create RDDs.
2. There can be only one SparkContext per JVM.
3. To create another, you first need to stop the existing one using the stop() method.
4. The PySpark shell creates and provides the sc object, which is an instance of the SparkContext class.
5. Creating a SparkSession creates a SparkContext internally and exposes it via the sparkContext variable.
Lineage - logical plan
DAG - physical plan
Great explanation. Excellent teaching. Please do a deeper dive into the coalesce and repartition commands and several scenarios (e.g. coalesce after repartition, repartition after coalesce, coalesce after coalesce, etc.). I know that some of these may not be meaningful, but I have seen several of your videos and you are a GREAT teacher.
Still relevant as of today and frequently asked. The practical on Databricks made things crystal clear.
How do we decide/calculate the number of executors and the memory per executor when multiple jobs are run by multiple processes on the cluster? What is the best way to tune our process? Is it a good approach to give 50 executors with 16 GB each? Could you please explain with an example?
Does watermarking solve late-arriving records, along with flatMapGroupsWithState for complex scenarios, in Databricks?
Please add a video on optimization in Spark, and how to monitor performance in the Spark UI.
For this scenario-based question: can we create an end-to-end pipeline using Kafka and a Power BI dashboard? I.e., connect to the database with a source connector, perform some business-level transformations in ksqlDB, store the results in a Kafka topic, and then connect Power BI for the dashboard. @dataSavvy or anyone, can you check whether my thinking is right?
That music is good and painful....
Feedback noted... This is changed in all subsequent videos.
So this is how "Parquet" is spelled..
Very clearly explained. Thank you!
Thanks @shaleen
Why is it called Lambda architecture? Isn't it the same as an ETL/ELT workflow? Is it only the nature of the source data (stream/event) that makes it a Lambda architecture? My question is more about the etymology of this architecture. Also, can an architecture on the Databricks platform, like the lakehouse, handle the discussed use cases?
Glad to see you back with useful videos
Thank you Gauri
helpful video!
Thanks Naveen
It's very clear. Very helpful to upgrade ourselves in system design and architect skills 👍
Thank you Siva
Very good explanation. Thanks!
Thanks sourbh
Amazing! Can you make a video on a backfilling strategy for when we need to process historical data?