How to pick between Kimball, One Big Table, and Relational Modeling as a data engineer

We'll be covering:
- When to use One Big Table modeling vs Kimball
- How to use Struct and Array and Array of Struct to get what you want
Make sure you have a DataExpert.io account to get the most out of this session.
Join www.DataExpert.io/classroom/z... to get the queries!

Пікірлер: 89

  • @sf-spark129
    @sf-spark129Ай бұрын

    Literally this is the ONLY video on KZread on how to data model with a hands-on, real-world example. All other videos talk about concepts, which sounds very abstract. BTW, this guy seems high but how come he can be this sharp at the same time? amazing

  • @chandanjha3205

    @chandanjha3205

    Ай бұрын

    When he is high, he chooses right data model. When he is not, he simply dumps anything in a DB and ask analytics to figure out

  • @nileshk611
    @nileshk6112 ай бұрын

    41 mins of absolute data modeling concepts explained.

  • @shubhamsharma-ne2ke
    @shubhamsharma-ne2ke2 ай бұрын

    Wow. What a video. Randomly popped in feed and never completed a random 40 min video in one go.

  • @EcZachly_

    @EcZachly_

    2 ай бұрын

    Glad you liked it!

  • @shubhamsharma-ne2ke

    @shubhamsharma-ne2ke

    2 ай бұрын

    @@EcZachly_ loved it. Planning to take your course soon. Need to brush up the prerequisites:)

  • @scottbrwnng
    @scottbrwnng3 ай бұрын

    Please upload more of these long form videos that go in depth

  • @shraddhathapa9446
    @shraddhathapa9446Ай бұрын

    Hi Zach, please keep making these, your videos are incredibly insightful.

  • @NeumsFor9
    @NeumsFor93 ай бұрын

    Bro, when prod data isn't modeled correctly, asking a data engineer to address all data quality issues is like asking your plumber to get you clean water. That all said, the right integration of ETL Process, Data Quality, and Business metadata (and competent boss that can support you in doing so) can help solve many DQ problems. Populating that metadata repo requires lots of end user interaction and, if your company is mature, data governance.

  • @arpittapwal4651
    @arpittapwal4651Ай бұрын

    Thank you very much Zach for explaining these concepts with real life examples. This content is gold mine and extremely helpful. Please keep making more and more such informative and insightful videos 😃

  • @manuelalejandropereztrujil2391
    @manuelalejandropereztrujil23913 ай бұрын

    Wonderful content, Zach! God bless you!

  • @rishadm1771
    @rishadm1771Ай бұрын

    An absolute gem of a video!

  • @pcl1923
    @pcl19232 ай бұрын

    This is one of the best dimensional data modelling primers out there. Using it for my Meta DE onsite, and it has helped me to clearify some concepts, that I thought I "kind of" understood but with this video now I clearly see the reasonign and real applications. Thank you!

  • @TechyBuild
    @TechyBuild3 ай бұрын

    Once again you killed it Zach! You just made my data engineering journey easy.

  • @Milhouse77BS
    @Milhouse77BS3 ай бұрын

    Glad you were able to get a version out with sound. Thanks.

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    Thanks! I felt so bad that the stream was silent

  • @murugesanrajasekaran5032

    @murugesanrajasekaran5032

    3 ай бұрын

    @@EcZachly_no worries. You’re doing a wonderful job for the DE community. Much appreciated

  • @raphaeldayan
    @raphaeldayan27 күн бұрын

    This video is priceless, thank you for sharing

  • @anandahs6078
    @anandahs60783 ай бұрын

    😮 amazed by this OBT concept. I am definitely going to use this. Thanks Zach for this. You are an amazing engineer who teaches us real world scenarios 😍

  • @matthewd4972
    @matthewd49723 ай бұрын

    This "sql sucks" example hits hard bruv.

  • @srinubathina7191
    @srinubathina71913 ай бұрын

    Thanks Zach for amazing content

  • @medon910
    @medon9102 ай бұрын

    This is the exact video I am looking for! kudos for sharing great content.

  • @fzrbigman
    @fzrbigman3 ай бұрын

    thanks for the great content Zach!

  • @oyindamolavictor
    @oyindamolavictor2 ай бұрын

    Amazinggg…. Very in-depth, kept rewinding for parts I didn’t understand…. Got to know about this from your Twitter page…Please more or this🚀

  • @Tyf13
    @Tyf13Ай бұрын

    Wow thank you this was super digestible and really helped me get a deeper understanding of modeling. This is really going to help me on the data science side of things.

  • @muhammadraza3290
    @muhammadraza32903 ай бұрын

    Very insightful content! Especially the examples from your work like Facebook really help drill in the use of the various functions. Really appreciate the content :)

  • @farahiyahsyarafina2183
    @farahiyahsyarafina21833 ай бұрын

    thanks for this, zach

  • @mridulbhardwaj3116
    @mridulbhardwaj31163 ай бұрын

    Really appreciate your content man

  • @Levy957
    @Levy9573 ай бұрын

    thank you for the content, bring more Kimball please!

  • @NeumsFor9
    @NeumsFor93 ай бұрын

    I like Kimball a lot, but I thought it was pretty funny when the group published "facing the re-keying crisis" near the time of his retirement and NOT a long time before that. That all said, I thought that group was the most humble and accepting of other methodologies.

  • @arashriazi9597
    @arashriazi95972 ай бұрын

    this is was super helpful, learned a lot

  • @EcZachly_

    @EcZachly_

    2 ай бұрын

    Glad you liked it! Any other topics you think I could cover?

  • @arashriazi9597

    @arashriazi9597

    2 ай бұрын

    @@EcZachly_ a video like this around big data would be very helpful as there are so many tools and approaches out there which makes it a bit difficult to enter to the big data space

  • @strawhatSEO
    @strawhatSEO3 ай бұрын

    Great content thank you Zach! Is there a reason why it cuts out at the end seemingly abruptly?

  • @Joey-vy1iu
    @Joey-vy1iu3 ай бұрын

    Thanks for the informative video on such a important (more so now than ever before) topic Zach! Question: why not just only keep the posts_obt table and upsert that when there's updates? Then you don't need the join

  • @thiffanyp
    @thiffanyp2 ай бұрын

    superrrr thanks!

  • @Enzo-jj5te
    @Enzo-jj5te3 ай бұрын

    Hey ! Great content as always, thank you very much ❤ So you and the community use the term "relational data model" to refer implicitly to "at least 3NF-compliant relational data model", right ? (strictly speaking, both dimensional models and OBTs are relational models too, right ?)

  • @jameshizon4861
    @jameshizon48613 ай бұрын

    Oo great topic

  • @tadbronson
    @tadbronson3 ай бұрын

    Thanks for this video! It was very informative and the examples really illustrated the different use cases. Can you post a follow up or link to help clarify why the partitions are needed on SCD and OBT tables?

  • @mikekenneth4773

    @mikekenneth4773

    3 ай бұрын

    I had same question. Because with partition_date in SCD, the number of records doesn't really reduce

  • @kentmaxwell1976
    @kentmaxwell19763 ай бұрын

    This is fantastic content. Do you cover OBT in your data engineering boot camp? Where can I find the lab and lab demonstration you referred to in the video?

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    We spend a week on SCDs and OBT yep!

  • @datbooii
    @datbooiiАй бұрын

    This is good shit

  • @chandanjha3205
    @chandanjha3205Ай бұрын

    Zach- one of the best tutorials I have seen giving a real world example. Quick question: 1) What does the partition_date for the posts_obt signify? Is it post creation date? 2) At 33:48 why do we have a filter in the query for partition_date. Here we are simply trying to find number of likes on a given post(using post_id). I am not able to think the role of a date column (same column in my first question)

  • @EcZachly_

    @EcZachly_

    Ай бұрын

    1. No. Partition date is the date of the run of the pipeline 2. You need to pick a version of your OBT. Is it todays or yesterdays? You don’t overwrite here since that’s risky. Keeping 5-6 days of OBT data is advised.

  • @Escrotazo66

    @Escrotazo66

    Ай бұрын

    What is the point of saving partition date in kimpball? It is going to be equal either the start date if it is not true or the end date it it is true

  • @princebansal7500
    @princebansal75003 ай бұрын

    Great Content Zach. Thanks. one question is to why we can't or shouldn't do CDC on source relational tables(May be by changed datetimestamp) to bring only new and updated source records rather than daily snapshots to build SCD type2 in DW layer?

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    Every big tech company does daily snapshots. It’s duplicative but much simpler than CDC

  • @AnnChu-tb4hp
    @AnnChu-tb4hp21 күн бұрын

    Zach, what is your opinion on Bitemporal Modelling?

  • @muhammadraza3290
    @muhammadraza32903 ай бұрын

    Where does the is_deleted column come from? How is it generated? Is there a custom logic behind it or some tooling that takes care of it?

  • @krzysztoflinke3870
    @krzysztoflinke38703 ай бұрын

    Hi Zach great content! I am wondering how are you appending data in OBT. Are you using UPSERT or APPEND mode in your table? Is it on daily on hourly basis?

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    Depends on the use case

  • @tonyyeung9481
    @tonyyeung94813 ай бұрын

    Hi Zach, clear and practical sharing! May I know about transaction consistency across the snapshot tables? I wonder if the snapshots are in an inconsistent state when they are being selected & loaded to the data lake. For example, selecting "A" table at 5:00am, and selecting "B" table at 5:05 am with new updates. A join of 'A' & 'B' table would cause inconsistent result.

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    Big tech generally doesn’t care about that. The way to avoid is minimize the time in between related table snapshots by dumping them in a single run

  • @AsifKhan-jq8zg
    @AsifKhan-jq8zg21 күн бұрын

    Why do you have field partition_date in user_scd and posts_scd?

  • @Milhouse77BS
    @Milhouse77BS3 ай бұрын

    3:55: Kimball

  • @Sfgarcia
    @SfgarciaАй бұрын

    Hey zach! Great video, I really enjoyed it! In your experience, what's the best way to overcome the limitation of the maximum array size in some SQL engines? For example, in Amazon Athena (which uses Trino) the maximum array size is 254 (due to a Java limitation). In your example, a post could probably contain more thans these number of actions. One simple approach would be to make a new array column every time you hit the limit, but I don't think that is very scalable. What's your opinion?

  • @EcZachly_

    @EcZachly_

    Ай бұрын

    Trino doesn’t have a 254 limit. I used trino at Facebook. The limit is 65k I thought? I definitely used Trino and 365 item arrays at Facebook. Maybe it’s an Athena thing?

  • @KPYCHIK
    @KPYCHIK3 ай бұрын

    Great talk! However I have couple questions. Even though partition_scd will be comparebly small, but wouldn’t its partitions grow infinitely? Or there going to be some rules to retain posts for the last n periods? And also I was wondering what are drawbacks of modelling posts as dim table? Probably I’m missing something but It more reminds me factless fact table.

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    Yeah you only keep the last week of partitions around. Correct!

  • @ManishJindalmanisism
    @ManishJindalmanisism3 ай бұрын

    hi Zach, I have been working on SCD sfrom long, but I didnt understand partition_date column in your example. Is it the date of data ingestion in SCD table or sth else? For finding deleted number of posts in your example why would someone use partition_date where clause. Once a post is deleted it cant be undeleted. So why not simply Select count(*) from table where text = 'Sql_sucks' and is_deleted= True. And if i want to know how many posts where deleted on a specific day why would I again use partition_date, I will leverage start_date and end_date columns. Select count(*) from table where text = 'Sql_sucks' and is_deleted= True and start_date = current_date. That is the purpose of keeping these startdate and enddate columns in first place.

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    Data lakes pull the data forward though. Partition_date here still has all the data since SCD type 2 is cumulative.

  • @mikekenneth4773

    @mikekenneth4773

    3 ай бұрын

    ​@@EcZachly_Thanks, this clarifies my concern.So, the SCD will still de-duplicate the Data but hold the last partition_date of that record.

  • @user-to4md9xm2d
    @user-to4md9xm2d3 ай бұрын

    Hey zach thanks for your guidance it was super interesting and now I fully understand the advantage of obt's. do you have a tip how to translate this query into snowflake?

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    They have similar array functions in snowflake. ChatGPT is your friend

  • @user-to4md9xm2d

    @user-to4md9xm2d

    3 ай бұрын

    @@EcZachly_ haha :D chatgpt hates me and always produces grouped by queries but I will proceed :)

  • @tobelesa

    @tobelesa

    4 күн бұрын

    ​@@EcZachly_ yo, data engineer working with snowflake here. And it seems that snowflake does not handle array map and reduce very well. It does not have built-in array functions for array map and reduce

  • @babuganesh2000
    @babuganesh20003 ай бұрын

    Can you explain more and put a youtube shorts or a video about the enrollment and what it covers it will benefit whom e.t.c Full-access Live Boot Camp starting May 6th ($2000) Analytics-Focused Live Boot Camp starting May 6th ($1650) Self-paced Data Engineering Course V4 Combined ($1750) Infrastructure-Focused Live Boot Camp starting May 6th ($1650)

  • @babuganesh2000

    @babuganesh2000

    3 ай бұрын

    I am amazed about your youtube content, couldn't stop watching your shots FACEPAD - Funnel, Aggregatioon, Clustering, Experimentation,Prediction,Accumulation, Derivative - you are an amazing teacher mentor

  • @TheEjaay
    @TheEjaayАй бұрын

    Couldn't all 1 to 1 relations to scd in reality just reside in the OBT? For example here. All posts table data could really just be another column of arrays.

  • @EcZachly_

    @EcZachly_

    Ай бұрын

    Glad this got you thinking

  • @mihirit7137
    @mihirit713718 күн бұрын

    so basically in the end are you saying that the interactions of millions of users on apps such as facebook or instagram , i am repeating again millions of users are stored in one big table ? imagine the number of rows on that table 😧😧

  • @EcZachly_

    @EcZachly_

    18 күн бұрын

    Export it to excel and watch your computer melt before your eyes

  • @keremen
    @keremen3 ай бұрын

    20:55 what if the post contains PII? Does Facebook anonymise post content as well?

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    Yep scrubs it of PII

  • @srinubathina7191
    @srinubathina71913 ай бұрын

    Hi Zach i couldn't find your git hub things can you please provide that i will learn the things from that as you mentioned in the video

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    The GitHub I use is www.github.com/DataEngineer-io There is alot of content on www.DataEngineer.io that you can watch that isn’t on KZread as well if you make a free account.

  • @srinubathina7191

    @srinubathina7191

    3 ай бұрын

    Sure Zach Thanks for your support

  • @berellevy2
    @berellevy22 ай бұрын

    23:30 instead of snapshots why not do change data capture?

  • @EcZachly_

    @EcZachly_

    2 ай бұрын

    CDC isn't as reliable

  • @berellevy2

    @berellevy2

    2 ай бұрын

    @@EcZachly_ and it sounds like most companies dont need the added granularity…

  • @TheEjaay

    @TheEjaay

    Ай бұрын

    @@EcZachly_ ? care to explain

  • @mitchtherighteous
    @mitchtherighteous3 ай бұрын

    appreciate the content Zach but the feed just spontaneously cuts out.

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    This was a live that I learned had no audio.

  • @jameshizon4861
    @jameshizon48613 ай бұрын

    "Give me all the likes that happened between 4:20PM & 4:25PM each day to catch the stoned people..."

  • @EcZachly_

    @EcZachly_

    3 ай бұрын

    One big table followed by one big joint

  • @jameshizon4861

    @jameshizon4861

    3 ай бұрын

    @@EcZachly_ lol

  • @mx953
    @mx953Ай бұрын

    there are people who hate SQL?👀

  • @abacusyou

    @abacusyou

    Ай бұрын

    With a passion!