SCD2 in Spark | Lec-24
In this video I have talked about Slowly Changing Dimension Type 2 (SCD2).
Directly connect with me on:- topmate.io/manish_kumar25
Discord channel:- / discord
SCD Data:-
customer_dim_data = [
(1,'manish','arwal','india','N','2022-09-15','2022-09-25'),
(2,'vikash','patna','india','Y','2023-08-12',None),
(3,'nikita','delhi','india','Y','2023-09-10',None),
(4,'rakesh','jaipur','india','Y','2023-06-10',None),
(5,'ayush','NY','USA','Y','2023-06-10',None),
(1,'manish','gurgaon','india','Y','2022-09-25',None),
]
customer_schema= ['id','name','city','country','active','effective_start_date','effective_end_date']
customer_dim_df = spark.createDataFrame(data= customer_dim_data,schema=customer_schema)
sales_data = [
(1,1,'manish','2023-01-16','gurgaon','india',380),
(77,1,'manish','2023-03-11','bangalore','india',300),
(12,3,'nikita','2023-09-20','delhi','india',127),
(54,4,'rakesh','2023-08-10','jaipur','india',321),
(65,5,'ayush','2023-09-07','mosco','russia',765),
(89,6,'rajat','2023-08-10','jaipur','india',321)
]
sales_schema = ['sales_id', 'customer_id','customer_name', 'sales_date', 'food_delivery_address','food_delivery_country', 'food_cost']
sales_df = spark.createDataFrame(data=sales_data,schema=sales_schema)
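The SCD2 merge built in the video from these two DataFrames can be followed in a pure-Python sketch (my illustration, not the video's exact PySpark code, which uses joins, filters and unions): for each sale, if the delivery address differs from the active dimension row's city, expire that row and open a new active version; if the customer is unknown, just open a new row.

```python
# Pure-Python sketch of the SCD2 merge over the sample data above
# (illustrative only; the video implements the same idea in PySpark).
customer_dim_data = [
    (1, 'manish', 'arwal', 'india', 'N', '2022-09-15', '2022-09-25'),
    (2, 'vikash', 'patna', 'india', 'Y', '2023-08-12', None),
    (3, 'nikita', 'delhi', 'india', 'Y', '2023-09-10', None),
    (4, 'rakesh', 'jaipur', 'india', 'Y', '2023-06-10', None),
    (5, 'ayush', 'NY', 'USA', 'Y', '2023-06-10', None),
    (1, 'manish', 'gurgaon', 'india', 'Y', '2022-09-25', None),
]
sales_data = [
    (1, 1, 'manish', '2023-01-16', 'gurgaon', 'india', 380),
    (77, 1, 'manish', '2023-03-11', 'bangalore', 'india', 300),
    (12, 3, 'nikita', '2023-09-20', 'delhi', 'india', 127),
    (54, 4, 'rakesh', '2023-08-10', 'jaipur', 'india', 321),
    (65, 5, 'ayush', '2023-09-07', 'mosco', 'russia', 765),
    (89, 6, 'rajat', '2023-08-10', 'jaipur', 'india', 321),
]

def apply_scd2(dim, sales):
    """Expire changed dimension rows and append new versions (SCD type 2)."""
    dim = [list(r) for r in dim]
    for _sid, cust_id, name, sale_date, addr, addr_country, _cost in sales:
        current = next((r for r in dim if r[0] == cust_id and r[4] == 'Y'), None)
        if current is None:
            # Unknown customer: open a brand-new active row.
            dim.append([cust_id, name, addr, addr_country, 'Y', sale_date, None])
        elif current[2] != addr:
            # Address changed: close the old version, open a new one.
            current[4], current[6] = 'N', sale_date
            dim.append([cust_id, name, addr, addr_country, 'Y', sale_date, None])
    return [tuple(r) for r in dim]

result = apply_scd2(customer_dim_data, sales_data)
```

With the sample data this yields nine rows: the expired versions keep their original city/country (the point of the correction discussed in the comments), and new active versions carry the latest delivery address.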
spark.apache.org/docs/latest/...
spark.apache.org/docs/latest/...
For more queries, reach out to me on my social media handles below.
Follow me on LinkedIn:- / manish-kumar-373b86176
Follow Me On Instagram:- / competitive_gyan1
Follow me on Facebook:- / manish12340
My Second Channel -- / @competitivegyan1
Interview series Playlist:- • Interview Questions an...
My Gear:-
Rode Mic:-- amzn.to/3RekC7a
Boya M1 Mic-- amzn.to/3uW0nnn
Wireless Mic:-- amzn.to/3TqLRhE
Tripod1 -- amzn.to/4avjyF4
Tripod2:-- amzn.to/46Y3QPu
camera1:-- amzn.to/3GIQlsE
camera2:-- amzn.to/46X190P
Pentab (Medium size):-- amzn.to/3RgMszQ (Recommended)
Pentab (Small size):-- amzn.to/3RpmIS0
Mobile:-- amzn.to/47Y8oa4 (You absolutely should not buy this one)
Laptop -- amzn.to/3Ns5Okj
Mouse+keyboard combo -- amzn.to/3Ro6GYl
21 inch Monitor-- amzn.to/3TvCE7E
27 inch Monitor-- amzn.to/47QzXlA
iPad Pencil:-- amzn.to/4aiJxiG
iPad 9th Generation:-- amzn.to/470I11X
Boom Arm/Swing Arm:-- amzn.to/48eH2we
My PC Components:-
intel i7 Processor:-- amzn.to/47Svdfe
G.Skill RAM:-- amzn.to/47VFffI
Samsung SSD:-- amzn.to/3uVSE8W
WD blue HDD:-- amzn.to/47Y91QY
RTX 3060Ti Graphic card:- amzn.to/3tdLDjn
Gigabyte Motherboard:-- amzn.to/3RFUTGl
O11 Dynamic Cabinet:-- amzn.to/4avkgSK
Liquid cooler:-- amzn.to/472S8mS
Antec Prizm FAN:-- amzn.to/48ey4Pj
Comments: 44
There was one mistake in the country name of the records where customer_name = Ayush. Instead of food_delivery_country it should be country. I have given the corrected code here. Please change the code accordingly.
old_records = joined_data.where(
        (col("food_delivery_address") != col("city")) &
        (col("active") == "Y"))\
    .withColumn("active", lit("N"))\
    .withColumn("effective_end_date", col("sales_date"))\
    .select(
        "customer_id", "customer_name", "city", "country",
        "active", "effective_start_date", "effective_end_date"
    )
@akshaychowdhary8534
2 months ago
What made you choose country? Please explain. I ran the code and found that country changes from russia to USA in the old_records df after this code modification, but it is still not clear to me.
A must-do question for experienced people. Very important.
You explain things in Hindi in very simple words... thank you so much for that.
Thank you Bhaiya, today my practical section is also finished... You are too good at explaining. I joined an institute for Azure data engineering but didn't get enough Databricks knowledge there. Topics I got to know from you:
read modes: failfast, permissive, dropmalformed
JSON (multi-line, single-line)
corrupt file handling
Parquet in detail
df write/save: bucketBy and partitionBy
lit(), union and unionAll
when/otherwise
count() as transformation and action
left anti / left semi joins
window functions
SCD2
Fundamentals: Spark UI, Catalyst Optimizer / Spark SQL engine, sort vs shuffle join, Spark memory, Adaptive Query Execution, salting
Bhai, loving your videos. Today I completed the whole practical playlist.
Hi Manish, great video. Eagerly waiting for the video on problems faced in Spark projects; please make it next. Thank you.
Clear explanation.
Very nice and easily explained Spark, sir.
Incredible work, Manish. I just completed your Spark practical series.. :)
Awesome 👌
Mind Bending !! :D
Thanks Manish. Very informative. Can you also make a video on the Databricks UI? How to interpret and understand Ganglia UI metrics.
Thank you
I am going through your playlist; it is a wonderful playlist. But I see there are missing lectures for Spark practical: Lectures 21, 22 and 23. Please provide them. Thank you.
Brother, I was waiting for exactly this one. I hope it will also be useful in real-time scenarios. Thanks.
Why didn't we use surrogate keys here to implement SCD2?
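Fair question. A common reason to add one: every row *version* gets its own surrogate key, so fact tables can join to the exact version that was active at sale time instead of the natural id. A minimal illustration (my sketch, not from the video; in Spark this is often done with row_number() over a window or monotonically_increasing_id()):

```python
# Illustrative sketch: assign a surrogate key to each dimension row version.
# Plain enumeration stands in for row_number()/monotonically_increasing_id().
def add_surrogate_keys(dim_rows):
    # Prepend a unique, increasing surrogate key to every row.
    return [(sk, *row) for sk, row in enumerate(dim_rows, start=1)]

# Two versions of the same customer (natural id 1) get distinct keys.
versions = [(1, 'manish', 'arwal'), (1, 'manish', 'gurgaon')]
keyed = add_surrogate_keys(versions)
```

The natural id stays duplicated across versions; only the surrogate key is unique per row.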
Very nice tutorial, Manish, but there is one error in the select clause of the old_records DataFrame: select the country column in place of food_delivery_country. It took me a lot of time to understand this error, as I am newly learning SCD. Please update.
@manish_kumar_1 I feel that, in the new_records df, withColumn('active', lit('Y')) is redundant, as we are already filtering on the (col('active') == 'Y') condition. One request from my end: please make a video on incremental loading as well. Anyway, excellent content as always. ♥
Bhai, which pen tablet are you using for writing? Please tell.
Is this the last video of this playlist?
Sir, in this playlist, after lecture 20 it jumps straight to lecture 24; is something missing there? And one doubt: in the end part, when we filter out records using rank() over (id, active), shouldn't row_number() be used instead? rank() will assign the same rank to rows with the same id and active values, but row_number() will give distinct numbers 1, 2, 3, ... for the same id and active with respect to date.
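The commenter's distinction can be checked with a tiny pure-Python model of the two window functions (illustrative only, not Spark's implementation):

```python
# rank() gives tied ordering keys the same rank (with gaps after ties);
# row_number() always increments, so ties still get distinct numbers.
def rank(values):
    ranked, prev, current = [], object(), 0
    for pos, v in enumerate(sorted(values), start=1):
        if v != prev:          # new key value: rank jumps to the position
            current = pos
        ranked.append((v, current))
        prev = v
    return ranked

def row_number(values):
    # Every row gets a distinct, consecutive number.
    return [(v, pos) for pos, v in enumerate(sorted(values), start=1)]
```

With ties, e.g. [10, 10, 20], rank() returns 1, 1, 3 while row_number() returns 1, 2, 3, which is why de-duplicating "keep the latest version per id" filters usually want row_number().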
Bhaiya, please make a video on Delta Lake, Delta tables and their implementation in PySpark. I couldn't find one on your channel.
Also, please make a video on incremental load with insert, update and delete.
@manish_kumar_1
10 months ago
Sure
How many more Spark videos will come?
Manish, please complete the Spark series!!
@manish_kumar_1
9 months ago
It's almost complete. I may add a few videos in the future. I am working on a new series on data modelling; videos will be out soon.
One small error in the final data: for the inactive record for customer name Ayush, its city is NY but its country is russia. Rest is great. Good tutorial overall. Thanks.
@saumyaranjannayak9179
6 months ago
Yes, there he has to select the country column, but he selected food_delivery_country.
Sir, I am getting the issue 'NoneType' object has no attribute 'union' while doing new_records_df.union(old_records_df).
@manish_kumar_1
3 months ago
You have probably attached a .show() to one of the DataFrames; that can cause this error.
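The root cause in plain terms: show() prints the DataFrame and returns None, so assigning its result throws the DataFrame away. A hypothetical stand-in class (my sketch, just to mimic the behavior without Spark):

```python
# FakeDF mimics how a pyspark DataFrame's show() returns None.
class FakeDF:
    def show(self):
        print("+---+")        # prints rows, like DataFrame.show()...
        return None           # ...and returns None, not the DataFrame

    def union(self, other):
        return FakeDF()       # a real union returns a new DataFrame

good = FakeDF()               # keep the DataFrame object itself
bad = FakeDF().show()         # bad is None: .show() swallowed the df
```

Calling bad.union(good) now raises AttributeError: 'NoneType' object has no attribute 'union', exactly the error in the comment; keeping the DataFrame and calling .show() on a separate line avoids it.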
@seleniumautomation6552
3 months ago
@@manish_kumar_1 Yes sir, it got resolved. Sir, will completing these 24 lectures be enough to clear interviews?
Has this series been completed, brother? Which topics and how many videos are left?
@manish_kumar_1
9 months ago
Yes, it is complete.
Brother, where are the next videos?
@manish_kumar_1
8 months ago
These are all the videos in this series. For the rest, practice on LeetCode.
You haven't provided the data in the description.
@manish_kumar_1
10 months ago
Added
@sabesanj5509
9 months ago
Manish bro, please provide us the document for SCD2, as I don't understand Hindi much..
@@manish_kumar_1 Brother, you forgot to put the data in the description..! :P No worries, when else would ChatGPT come in handy??! ;)
@manish_kumar_1
10 months ago
Added
Manish bro, please provide me the document or website link for SCD2, as I don't understand Hindi much..
@manish_kumar_1
9 months ago
Just check out the code written, and you will be good to go. I don't have any resource as such; when I faced this problem, I implemented it the way I have shown in my videos.