How to pick between Kimball, One Big Table, and Relational Modeling as a data engineer
We'll be covering:
- When to use One Big Table modeling vs Kimball
- How to use Struct, Array, and Array of Struct types to get what you want
Make sure you have a DataExpert.io account to get the most out of this session.
Join www.DataExpert.io/classroom/z... to get the queries!
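For reference, the struct/array/array-of-struct pattern covered in the session looks roughly like this in Trino/Presto-style SQL. This is a sketch only; the table and column names below are illustrative placeholders, not the exact schema from the video:

```sql
-- Hypothetical one-big-table (OBT): one row per post, with every
-- interaction packed into an ARRAY of ROW (struct) values instead
-- of living in a separate fact table.
CREATE TABLE posts_obt (
    post_id        BIGINT,
    poster_id      BIGINT,
    post_text      VARCHAR,
    interactions   ARRAY(ROW(user_id BIGINT, action VARCHAR, event_time TIMESTAMP)),
    partition_date DATE
);

-- "How many likes did this post get?" answered without a join:
SELECT post_id,
       CARDINALITY(FILTER(interactions, i -> i.action = 'like')) AS num_likes
FROM posts_obt
WHERE partition_date = DATE '2024-01-01';
```

FILTER and CARDINALITY are standard Trino array functions; other engines spell the same idea differently.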
Comments: 89
Literally this is the ONLY video on KZread on how to data model with a hands-on, real-world example. All other videos talk about concepts, which sounds very abstract. BTW, this guy seems high but how come he can be this sharp at the same time? amazing
@chandanjha3205
1 month ago
When he is high, he chooses the right data model. When he is not, he simply dumps anything into a DB and asks analytics to figure it out.
41 mins of absolute data modeling concepts explained.
Wow. What a video. It randomly popped up in my feed, and I've never before finished a random 40-minute video in one go.
@EcZachly_
2 months ago
Glad you liked it!
@shubhamsharma-ne2ke
2 months ago
@@EcZachly_ loved it. Planning to take your course soon. Need to brush up on the prerequisites :)
Please upload more of these long form videos that go in depth
Hi Zach, please keep making these, your videos are incredibly insightful.
Bro, when prod data isn't modeled correctly, asking a data engineer to address all data quality issues is like asking your plumber to get you clean water. That all said, the right integration of ETL Process, Data Quality, and Business metadata (and competent boss that can support you in doing so) can help solve many DQ problems. Populating that metadata repo requires lots of end user interaction and, if your company is mature, data governance.
Thank you very much Zach for explaining these concepts with real life examples. This content is gold mine and extremely helpful. Please keep making more and more such informative and insightful videos 😃
Wonderful content, Zach! God bless you!
An absolute gem of a video!
This is one of the best dimensional data modelling primers out there. I'm using it for my Meta DE onsite, and it has helped me clarify some concepts that I thought I "kind of" understood; with this video I now clearly see the reasoning and the real applications. Thank you!
Once again you killed it Zach! You just made my data engineering journey easy.
Glad you were able to get a version out with sound. Thanks.
@EcZachly_
3 months ago
Thanks! I felt so bad that the stream was silent
@murugesanrajasekaran5032
3 months ago
@@EcZachly_ no worries. You're doing a wonderful job for the DE community. Much appreciated
This video is priceless, thank you for sharing
😮 amazed by this OBT concept. I am definitely going to use this. Thanks Zach for this. You are an amazing engineer who teaches us real world scenarios 😍
This "sql sucks" example hits hard bruv.
Thanks Zach for amazing content
This is the exact video I am looking for! kudos for sharing great content.
thanks for the great content Zach!
Amazinggg…. Very in-depth, kept rewinding for parts I didn't understand…. Got to know about this from your Twitter page… Please, more of this 🚀
Wow thank you this was super digestible and really helped me get a deeper understanding of modeling. This is really going to help me on the data science side of things.
Very insightful content! Especially the examples from your work like Facebook really help drill in the use of the various functions. Really appreciate the content :)
thanks for this, zach
Really appreciate your content man
thank you for the content, bring more Kimball please!
I like Kimball a lot, but I thought it was pretty funny when the group published "facing the re-keying crisis" near the time of his retirement and NOT a long time before that. That all said, I thought that group was the most humble and accepting of other methodologies.
this was super helpful, learned a lot
@EcZachly_
2 months ago
Glad you liked it! Any other topics you think I could cover?
@arashriazi9597
2 months ago
@@EcZachly_ A video like this about big data would be very helpful, as there are so many tools and approaches out there, which makes it a bit difficult to enter the big data space.
Great content thank you Zach! Is there a reason why it cuts out at the end seemingly abruptly?
Thanks for the informative video on such an important (more so now than ever before) topic, Zach! Question: why not just keep only the posts_obt table and upsert it when there are updates? Then you don't need the join.
superrrr thanks!
Hey! Great content as always, thank you very much ❤ So you and the community use the term "relational data model" to refer implicitly to an "at least 3NF-compliant relational data model", right? (Strictly speaking, both dimensional models and OBTs are relational models too, right?)
Oo great topic
Thanks for this video! It was very informative and the examples really illustrated the different use cases. Can you post a follow up or link to help clarify why the partitions are needed on SCD and OBT tables?
@mikekenneth4773
3 months ago
I had the same question, because with partition_date in the SCD, the number of records doesn't really shrink.
This is fantastic content. Do you cover OBT in your data engineering boot camp? Where can I find the lab and lab demonstration you referred to in the video?
@EcZachly_
3 months ago
We spend a week on SCDs and OBT yep!
This is good shit
Zach, one of the best tutorials I have seen giving a real-world example. Quick questions: 1) What does the partition_date for posts_obt signify? Is it the post creation date? 2) At 33:48, why do we have a filter in the query for partition_date? Here we are simply trying to find the number of likes on a given post (using post_id). I am not able to see the role of the date column (the same column as in my first question).
@EcZachly_
1 month ago
1. No. partition_date is the date of the run of the pipeline. 2. You need to pick a version of your OBT. Is it today's or yesterday's? You don't overwrite here since that's risky. Keeping 5-6 days of OBT data is advised.
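A minimal sketch of what "picking a version of your OBT" looks like in practice, assuming a posts_obt table partitioned by run date (table and column names here are illustrative, not the exact ones from the video):

```sql
-- Each pipeline run writes a brand-new partition rather than
-- overwriting, so queries pin an explicit version of the table:
SELECT post_id, CARDINALITY(interactions) AS num_interactions
FROM posts_obt
WHERE partition_date = DATE '2024-01-02';   -- today's run

-- If today's run failed or looks wrong, fall back to yesterday:
--   WHERE partition_date = DATE '2024-01-01'
-- A retention job drops partitions older than ~5-6 days.
```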
@Escrotazo66
1 month ago
What is the point of saving partition_date in Kimball? It is going to be equal either to the start date if it is not true, or to the end date if it is true.
Great content, Zach. Thanks. One question: why can't or shouldn't we do CDC on the source relational tables (maybe keyed on a changed datetimestamp) to bring in only new and updated source records, rather than daily snapshots, to build SCD type 2 in the DW layer?
@EcZachly_
3 months ago
Every big tech company does daily snapshots. It’s duplicative but much simpler than CDC
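The daily-snapshot approach described here boils down to copying the whole production table into a dated partition every day. A sketch with hypothetical table names (not the exact pipeline from the video):

```sql
-- Nightly job: dump the entire prod table into today's partition.
-- Duplicative across days, but far simpler to operate than CDC:
-- a failed run is fixed by re-running the job for that date,
-- with no log/offset bookkeeping.
INSERT INTO posts_snapshot
SELECT p.*, CURRENT_DATE AS partition_date
FROM prod.posts p;
```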
Zach, what is your opinion on Bitemporal Modelling?
Where does the is_deleted column come from? How is it generated? Is there a custom logic behind it or some tooling that takes care of it?
Hi Zach, great content! I am wondering how you are appending data to the OBT. Are you using UPSERT or APPEND mode on your table? Is it on a daily or hourly basis?
@EcZachly_
3 months ago
Depends on the use case
Hi Zach, clear and practical sharing! May I ask about transaction consistency across the snapshot tables? I wonder if the snapshots are in an inconsistent state when they are being selected and loaded into the data lake. For example, selecting table 'A' at 5:00 am and selecting table 'B' at 5:05 am with new updates: a join of 'A' and 'B' would give inconsistent results.
@EcZachly_
3 months ago
Big tech generally doesn't care about that. The way to avoid it is to minimize the time between related table snapshots by dumping them in a single run.
Why do you have the field partition_date in user_scd and posts_scd?
3:55: Kimball
Hey Zach! Great video, I really enjoyed it! In your experience, what's the best way to overcome the maximum array size limit in some SQL engines? For example, in Amazon Athena (which uses Trino) the maximum array size is 254 (due to a Java limitation). In your example, a post could easily contain more than that number of actions. One simple approach would be to start a new array column every time you hit the limit, but I don't think that's very scalable. What's your opinion?
@EcZachly_
1 month ago
Trino doesn't have a 254 limit. I used Trino at Facebook; the limit is 65k, I thought? I definitely used Trino with 365-item arrays at Facebook. Maybe it's an Athena thing?
Great talk! However, I have a couple of questions. Even though posts_scd will be comparably small, wouldn't its partitions grow infinitely? Or are there going to be rules to retain posts only for the last n periods? I was also wondering: what are the drawbacks of modelling posts as a dim table? Probably I'm missing something, but it reminds me more of a factless fact table.
@EcZachly_
3 months ago
Yeah you only keep the last week of partitions around. Correct!
Hi Zach, I have been working with SCDs for a long time, but I didn't understand the partition_date column in your example. Is it the date of data ingestion into the SCD table, or something else? For finding the number of deleted posts in your example, why would someone use a partition_date WHERE clause? Once a post is deleted it can't be undeleted, so why not simply SELECT count(*) FROM table WHERE text = 'Sql_sucks' AND is_deleted = true? And if I want to know how many posts were deleted on a specific day, why would I again use partition_date? I would leverage the start_date and end_date columns: SELECT count(*) FROM table WHERE text = 'Sql_sucks' AND is_deleted = true AND start_date = current_date. That is the purpose of keeping the start_date and end_date columns in the first place.
@EcZachly_
3 months ago
Data lakes pull the data forward though. Partition_date here still has all the data since SCD type 2 is cumulative.
@mikekenneth4773
3 months ago
@@EcZachly_ Thanks, this clarifies my concern. So the SCD will still de-duplicate the data but hold the last partition_date of that record.
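To make the distinction in this sub-thread concrete: under this scheme you first pin one partition_date (one version of the cumulative SCD table), and only then use start_date/end_date/is_deleted for time travel. A sketch, assuming hypothetical posts_scd column names based on the discussion above:

```sql
-- Step 1: pick one version of the cumulative SCD type 2 table.
-- Step 2: filter history with is_deleted / start_date / end_date.
SELECT COUNT(*) AS deleted_posts
FROM posts_scd
WHERE partition_date = DATE '2024-01-02'  -- latest pipeline run
  AND post_text = 'SQL sucks'
  AND is_deleted = TRUE;
```

Without the partition_date filter, the count would double-count the same rows across every retained daily partition.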
Hey Zach, thanks for your guidance. It was super interesting, and now I fully understand the advantage of OBTs. Do you have a tip on how to translate this query into Snowflake?
@EcZachly_
3 months ago
They have similar array functions in Snowflake. ChatGPT is your friend.
@user-to4md9xm2d
3 months ago
@@EcZachly_ haha :D ChatGPT hates me and always produces GROUP BY queries, but I will proceed :)
@tobelesa
4 days ago
@@EcZachly_ Yo, data engineer working with Snowflake here. It seems that Snowflake does not handle array map and reduce very well: it does not have built-in array map and reduce functions.
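Where Snowflake's built-in array functions don't cover a Trino-style map/filter, the common workaround is to FLATTEN the array into rows, operate row-wise, and re-aggregate. A sketch assuming a hypothetical posts_obt table whose interactions column is a VARIANT array of objects:

```sql
-- Snowflake: replace Trino's FILTER(arr, lambda) with
-- LATERAL FLATTEN + a conditional aggregate.
SELECT p.post_id,
       COUNT_IF(i.value:action::STRING = 'like') AS num_likes
FROM posts_obt p,
     LATERAL FLATTEN(input => p.interactions) i
GROUP BY p.post_id;
```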
Can you explain more, and put up a YouTube Short or a video about the enrollment: what it covers, whom it benefits, etc.?
- Full-access Live Boot Camp starting May 6th ($2000)
- Analytics-Focused Live Boot Camp starting May 6th ($1650)
- Self-paced Data Engineering Course V4 Combined ($1750)
- Infrastructure-Focused Live Boot Camp starting May 6th ($1650)
@babuganesh2000
3 months ago
I am amazed by your YouTube content; I couldn't stop watching your shorts. FACEPAD: Funnel, Aggregation, Clustering, Experimentation, Prediction, Accumulation, Derivative. You are an amazing teacher and mentor.
Couldn't all 1-to-1 relations to the SCD, in reality, just reside in the OBT? For example here, all the posts table data could really just be another array column.
@EcZachly_
1 month ago
Glad this got you thinking
So basically, in the end, are you saying that the interactions of millions of users on apps such as Facebook or Instagram (I repeat, millions of users) are stored in one big table? Imagine the number of rows in that table 😧😧
@EcZachly_
18 days ago
Export it to excel and watch your computer melt before your eyes
20:55 what if the post contains PII? Does Facebook anonymise post content as well?
@EcZachly_
3 months ago
Yep scrubs it of PII
Hi Zach, I couldn't find your GitHub. Can you please provide it? I will learn from it, as you mentioned in the video.
@EcZachly_
3 months ago
The GitHub I use is www.github.com/DataEngineer-io. There is also a lot of content on www.DataExpert.io that isn't on KZread, which you can watch if you make a free account.
@srinubathina7191
3 months ago
Sure Zach Thanks for your support
23:30 instead of snapshots why not do change data capture?
@EcZachly_
2 months ago
CDC isn't as reliable
@berellevy2
2 months ago
@@EcZachly_ And it sounds like most companies don't need the added granularity…
@TheEjaay
1 month ago
@@EcZachly_ Care to explain?
Appreciate the content, Zach, but the feed just spontaneously cuts out.
@EcZachly_
3 months ago
This was a live stream that, I learned afterwards, had no audio.
"Give me all the likes that happened between 4:20PM & 4:25PM each day to catch the stoned people..."
@EcZachly_
3 months ago
One big table followed by one big joint
@jameshizon4861
3 months ago
@@EcZachly_ lol
there are people who hate SQL?👀
@abacusyou
1 month ago
With a passion!