17: Top K Leaderboard | Systems Design Interview Questions With Ex-Google SWE

Science & Technology

Anyone want to guess the top 5 in Jordan's DMs from women? It's a trick question

Comments: 78

  • @idiot7leon
    2 months ago

    Brief Outline
    00:00:23 Find the Top K...
    00:01:16 Problem Requirement
    00:02:16 Overall Approach
    00:03:23 Precise Solution
    00:06:11 Counting Phase
    00:08:57 Local Top K Step
    00:10:51 Global Top K
    00:12:27 Speeding Things Up - Approximation
    00:15:07 Windowed Top K, How To Calculate
    00:18:10 Windowed Top K, How To Read
    00:19:45 Publishing Faster Results
    00:21:11 Count Min Sketch
    00:23:19 Count Min Sketch - Continued
    00:26:01 Count Min Sketch - Continued
    00:27:44 Count Min Sketch - Fault Tolerance
    00:28:48 Final Diagram - Top K Problem
    Thanks, Jordan~

  • @jordanhasnolife5163

    2 months ago

    Thank you!!

  • @mimanshabhargav3266
    1 month ago

    I got this problem for my on-site today. Thanks to you I had some idea on how to do it.

  • @skinnycucumber
    2 months ago

    I have my first system design interview tomorrow and I've been cramming all your videos. Great content!

  • @jordanhasnolife5163

    2 months ago

    Good luck, let us know how it goes!

  • @rakeshvarma8091
    26 days ago

    As usual, great video Jordan.

  • @aliz166
    2 months ago

    I love your videos man! Any chance you might cover an ad click aggregation system soon? It's a bit similar to this actually, but of course has its own nuances. Thanks again for your help!

  • @jordanhasnolife5163

    2 months ago

    Hey Ali! I'd just watch this one and the tinyurl video, curious to hear what else you feel this incorporates beyond that

  • @SunilKumar-qh2ln
    2 months ago

    finally this problem, Giga Chad of system design for a reason

  • @jordanhasnolife5163

    2 months ago

    With great power (a 3 inch schlong) comes great responsibility

  • @Anonymous-ym6st
    2 months ago

    Thanks for the great content as always! One quick question (I might have missed it in the video): why don't we store more aggregated data (not just the top k for every hour, but the full counts) in the approximation + fast solution? It wouldn't be more data than the accurate solution, right? And it wouldn't sacrifice accuracy? Or is the trade-off of the fast read not at pre-calculation time, but at query time - say for a 3-hour window, where summing up more than each hour's top k would be the expensive part?

  • @jordanhasnolife5163

    2 months ago

    Having a little bit of trouble understanding you but: 1) That's a ton of data to store 2) Now to calculate the top k over all of those windows I need to aggregate the counts for every event, which is slow!

  • @truptijoshi2535
    11 days ago

    Hi Jordan! I have some questions.
    1. For the precise solution, the Spark node aggregates the event IDs. Why are we aggregating again in Hadoop?
    2. After aggregating, we are doing local top k and then global top k. Suppose the local top k of 3 nodes looks like this:
       Node 1: 1000 -> 999 -> 998 -> 997 -> 996, etc.
       Node 2: 100 -> 99 -> 98 -> 97
       Node 3: 200 -> 300 -> 400
       If we are only taking the top 3 from each node, we might miss some of the bigger values like 997 -> 996, right?

  • @jordanhasnolife5163

    10 days ago

    1) The stream consumer is just putting tuples of (eventId, timestamp) into hadoop via parquet files. 2) We do a local top k after aggregating as you mentioned, which means we did a shuffle to get all events with the same id on the same node. So we have the full counts of each event. Then, once we get the local top k of the eventIds that were hashed to a given node, we can do a global top k.
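
    A minimal PySpark sketch of the precise batch path described above - (event_id, timestamp) rows land in parquet, we count per event, take a local top k on each partition, then merge into a global top k. The path, column names, and K are illustrative placeholders, not the video's exact code.

    ```python
    import heapq
    from pyspark.sql import SparkSession

    K = 10
    spark = SparkSession.builder.appName("top-k-batch").getOrCreate()

    # Tuples of (event_id, ts) written by the stream consumer as parquet files.
    events = spark.read.parquet("hdfs:///events/2024-06-01/")

    # Counting phase: the shuffle puts every row for a given event_id on one node.
    counts = events.rdd.map(lambda row: (row.event_id, 1)).reduceByKey(lambda a, b: a + b)

    # Local top k: each partition keeps only its K largest fully-aggregated counts.
    def local_top_k(pairs):
        return heapq.nlargest(K, pairs, key=lambda kv: kv[1])

    # Global top k: merge the small per-partition survivor lists on the driver.
    top_k = heapq.nlargest(K, counts.mapPartitions(local_top_k).collect(),
                           key=lambda kv: kv[1])
    print(top_k)
    ```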

  • @kevinding0218
    1 month ago

    Thanks for the great content! I have a question about maintaining a K-size min-heap on a Spark streaming node. Since this is a streaming process, it's difficult to know the exact count of a specific keyword in real time as I assume the count may increase over time. If we only maintain a K-size heap, could we end up losing earlier counts for some keywords? Please correct me if I'm wrong.

  • @jordanhasnolife5163

    1 month ago

    Yup you can't do this if you have active incoming counts - you'd need to keep a sorted list (or treeset or something) of all incoming events to be precise
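
    A rough illustration of the point above, with names of my own choosing: with live increments you keep the full count table (a Counter here, standing in for a treeset/sorted structure) and derive the top k at read time, since a fixed-size heap may have already evicted an event whose count later grows.

    ```python
    import heapq
    from collections import Counter

    counts = Counter()

    def ingest(event_id: str) -> None:
        counts[event_id] += 1            # counts keep changing, so nothing is final

    def top_k(k: int):
        # Exact answer computed over the full table at read time.
        return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

    for e in ["a", "b", "a", "c", "a", "b"]:
        ingest(e)
    print(top_k(2))                      # [('a', 3), ('b', 2)]
    ```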

  • @vincentchee4108
    2 months ago

    Thanks for the video. I'm a bit curious about the need for a TSDB. I assume a Postgres DB with time-range partitioning would suffice for this use case. With time-range partitioning, you can still do:
    WITH cte AS (
      SELECT event_id, SUM(count) AS sum
      FROM table
      WHERE timestamp BETWEEN t1 AND t2
      GROUP BY event_id
    )
    SELECT * FROM cte ORDER BY cte.sum DESC LIMIT k

  • @jordanhasnolife5163

    2 months ago

    I imagine this would work, but would be curious to see how the benchmarks actually look between them

  • @monihazarika5628
    2 months ago

    I really liked the content and the way the technical concepts around building an approximate vs. real-time leaderboard were presented. I'm looking for some insights on how fantasy gaming platforms like Dream11 work - it would be a great video for your India fans, Jordan!!

  • @jordanhasnolife5163

    2 months ago

    Thanks for the kind words! I've never heard of dream11 so I'd have to do some research there!

  • @harshbhatt_
    2 months ago

    Hi Jordan, love the content! Quick question - how would this solution change if we need to find the top K for each user? Ex: Rather than finding the top 10 songs globally in the past 24 hours; we want to find the top 10 songs for each user in past 1day/7days/etc? (No complicated ML model; just based on counts of each song played)

  • @jordanhasnolife5163

    2 months ago

    Makes sense! This probably becomes a bit easier then, actually, in the sense that you could potentially store all of this data on a single node per user. Basically, we can sink all data to a time series database, partitioning it by user id and ordering by timestamp. Then we can just do the exact computation, since we're hitting just one node at a time.

  • @saber3112

    2 months ago

    We can use Flink's windowing capabilities to create time windows for the specified time ranges (e.g., 1 day, 7 days). Aggregate the data within each window to compute the count of each song played by each user. Then for each window and each user, compute the top N songs based on their play count.
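
    Not actual Flink code - just a toy, plain-Python tumbling-window aggregation to illustrate the idea in the comment above: bucket plays into fixed windows, count per (user, song) within each window, then take the top N per user. The window size and function names are assumptions.

    ```python
    import heapq
    from collections import defaultdict

    WINDOW_SECONDS = 24 * 3600                     # e.g. a 1-day tumbling window

    # window_start -> user_id -> song_id -> play count
    windows = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def on_play(user_id: str, song_id: str, ts: int) -> None:
        window_start = ts - (ts % WINDOW_SECONDS)  # assign the play to its window
        windows[window_start][user_id][song_id] += 1

    def top_n_for_user(window_start: int, user_id: str, n: int):
        plays = windows[window_start][user_id]
        return heapq.nlargest(n, plays.items(), key=lambda kv: kv[1])
    ```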

  • @mcee311
    2 months ago

    In the time series DB, how are you differentiating the results from the count min sketch versus the ones from Spark Streaming? Does the Spark one just override the count min sketch?

  • @jordanhasnolife5163

    2 months ago

    I think you could do either implementation but considering the count min sketch data is pretty small I figure you may as well keep it around

  • @parthpanchmatia224
    2 months ago

    Hi Jordan, what iPad app do you use to draw the system design diagrams and present them? I have an iPad so this info will help me put my system design learnings to test

  • @jordanhasnolife5163

    2 months ago

    I wanna say OneNote

  • @harman654321
    11 days ago

    I think you can use Redis to store an intermediate count min sketch on each Flink job, and aggregate

  • @jordanhasnolife5163

    10 days ago

    Can you elaborate on this? I'm not sure in the flow of the problem where it belongs.

  • @jiananstrackjourney370
    21 hours ago

    Thanks Jordan, I really appreciate the video. One question: is there any specific reason that Kafka has to partition on event id? Even if it doesn't, the downstream Spark Streaming will still be able to count and aggregate the data, right?

  • @jordanhasnolife5163

    6 hours ago

    Yes but it's going to take a lot less data shuffling if we partition by event Id

  • @MrBrown78
    2 months ago

    Got an interview today, been spamming your videos and they've been really helpful! Hopefully I can perform tho XD

  • @jordanhasnolife5163

    2 months ago

    Best of luck! Make sure to pop a Viagra before

  • @MrBrown78

    2 months ago

    @@jordanhasnolife5163 Popped them shits like they were fruit gummies but ended up getting rejected. Man fuck, time to drink the pain away🥳

  • @jordanhasnolife5163

    2 months ago

    @@MrBrown78 lol no worries, what was the problem and how do you feel it could have gone better

  • @MrBrown78

    2 months ago

    ​@@jordanhasnolife5163 Asked me to design a system to allow employees within an organization to connect with those they normally wouldn't contact. I aimed for a simple messaging platform because I didn't really understand what they were looking for and they said it was supposed to be built with a timeline of 2 weeks. I approached it same way you designed whatsapp, but I think I forgot some design choices during it.

  • @hrjhonor
    2 months ago

    Great video! Two questions: what's the downside of having the first Spark Streaming job write its local top k results directly to the time series DB, given the TSDB has many aggregation capabilities built in? The other question is: is it possible to use Redis + CDC/Kafka, where the real-time result is computed from Redis?

  • @hrjhonor

    2 months ago

    To extend the first question: why can't we use the TSDB in place of the HDFS? Is it mainly a concern about the ingestion rate?

  • @jordanhasnolife5163

    2 months ago

    1) You actually probably could do all of the writes to the time series database, and then run a big query on it! But keep in mind that we'd then have to store each of the events with their timestamps in the TSDB, which is a lot of data. Also, I do wonder (maybe it wouldn't) whether the TSDB would try to perform some locking for a read-only query (Hadoop definitely wouldn't, as the files are immutable). Great point though - I also think that a Spark job allows us to be specific about how we want to do the counting and joins, whereas I'm not sure how optimized a TSDB query would be for this.

    2) Yeah, you could probably just put all of the local top k counts in the TSDB, but I'd hope that you'd only actually store the global top k (meaning the local top k's should overwrite one another). Otherwise we have ourselves a pretty big storage cost. I was thinking about the overwriting of data to merge sorted lists though, and I felt that in a DB on disk that's probably quite a bit slower than just doing this in memory and then sinking it to a DB.

  • @hrjhonor

    2 months ago

    Thanks for the thoughts!

  • @dinar.mingaliev
    2 months ago

    Hi Jordan, hope you still have more DMs than Megan :) Could you please clarify how data from the count min sketch gets to the time-series DB? Basically it is a matrix of dimensions (# of hash functions) x (the modulus we use for hashing). The bigger the matrix, the more numbers we can store there - kind of a dictionary. But it does not store keys, which are our event types, for which we want to count how many times each appeared. How do we reconstruct those IDs from the count-min sketch?

  • @dinar.mingaliev

    2 months ago

    For example, if we keep counts of video types in the count-min sketch, we need to know all possible types and query them. If we keep IDs of videos to approximately count every video, we have to query our CM sketch as many times as there are videos, right? We can also shard the count min sketch :)

  • @jordanhasnolife5163

    2 months ago

    Hi Dinar! In the export to TSDB phase, we need to basically loop over every possible ID, and get a count for it, and then export that count to the tsdb. We could always pull these Ids from a database or something.

  • @jordanhasnolife5163

    2 months ago

    Good point though! It may be nice to have a second set of nodes listening to Kafka to ensure that we have a list of all the keys in some sort of database
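
    A sketch of that export step, with heavy caveats: a count-min sketch only answers point queries, so we need some side channel that knows all keys seen (the second set of Kafka listeners mentioned above). `cms.estimate`, `known_event_ids`, and `tsdb.write_point` are hypothetical stand-ins, not a real API.

    ```python
    def export_hourly_counts(cms, known_event_ids, tsdb, window_start):
        """Push an approximate count for every known event id into the TSDB."""
        for event_id in known_event_ids:           # full key set from the side channel
            approx_count = cms.estimate(event_id)  # CMS gives a point estimate per key
            tsdb.write_point(
                metric="event_count",
                tags={"event_id": event_id},
                value=approx_count,
                timestamp=window_start,
            )
    ```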

  • @harshitnarang6354
    2 months ago

    Chad content, thanks Jordan! Would it be okay for you to share the notes?

  • @jordanhasnolife5163

    2 months ago

    Thanks Chad! I'll share them, but planning to do so in batch in a few weeks. I'm out of the country at the moment so I don't have much time to export them.

  • @GuntherJones
    2 months ago

    For the top k don't you want a max heap? Thank you so much for these.

  • @jordanhasnolife5163

    2 months ago

    Nope min heap! If I want to get the top k I want to quickly compare with the smallest element of the current top k
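
    A minimal sketch of that min-heap trick (variable names are my own): the heap root is the smallest of the current top k, so each new count only has to beat the root to earn a spot.

    ```python
    import heapq

    def top_k(counts: dict, k: int):
        heap = []                                   # min-heap of (count, event_id)
        for event_id, count in counts.items():
            if len(heap) < k:
                heapq.heappush(heap, (count, event_id))
            elif count > heap[0][0]:                # beats the smallest of the top k
                heapq.heapreplace(heap, (count, event_id))
        return sorted(heap, reverse=True)           # largest first

    print(top_k({"a": 5, "b": 9, "c": 2, "d": 7}, k=2))   # [(9, 'b'), (7, 'd')]
    ```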

  • @andrewcisternino8271
    7 days ago

    Hi Jordan... So I'm under the impression the time series DB is storing the top k per hour in each chunk. Yet your slide for consumption via the leaderboard service has us using a hashmap to aggregate from 3 separate hours. Wouldn't the consumption be the same process as 'Global top k', utilizing the 'merge sorted lists' leetcode problem, since we are pulling 3 separate top k lists (1 per hour)? I guess I'm also wondering what format the TSDB is storing data in. Thanks for the videos.

  • @jordanhasnolife5163

    3 days ago

    This is a bit different here because the top K from those hour intervals likely overlap so you'll need to use a hashmap to merge the total counts together first. Then from there, nothing is sorted, so we'll just have to use a heap of size k and loop over the events to figure out the new top k in the 3 hour interval.
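
    A rough sketch of that read path, under the same assumptions as the reply above: the hourly top k lists can share event ids, so merge their counts in a hashmap first, then re-derive the top k for the combined window (heapq.nlargest maintains the size-k heap internally).

    ```python
    import heapq
    from collections import defaultdict

    def merge_hourly_top_k(hourly_lists, k):
        totals = defaultdict(int)
        for hour in hourly_lists:                  # each hour: list of (event_id, count)
            for event_id, count in hour:
                totals[event_id] += count          # overlapping ids get summed
        return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

    hours = [
        [("a", 50), ("b", 40), ("c", 30)],
        [("b", 45), ("d", 35), ("a", 20)],
        [("c", 60), ("a", 10), ("e", 5)],
    ]
    print(merge_hourly_top_k(hours, k=3))          # [('c', 90), ('b', 85), ('a', 80)]
    ```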

  • @andrewcisternino8271

    2 days ago

    @@jordanhasnolife5163 This makes sense. Thanks.

  • @andreybraslavskiy522
    1 month ago

    Hi Jordan, thank you for the great video. Can you please help me understand how the count-min sketch and the Spark jobs work together? Do we use the count-min sketch only for the current, unfinished hour, and as soon as the Spark job finishes, does it overwrite the count-min sketch result as more precise? Do we reset the count-min sketch every hour? If a user queries for the last 3 hours, does that mean the Spark job results are always available for previous hours and we don't need results from the count-min sketch at all? Actually, I can't think of a scenario where data from the count-min sketch is used - should it be combined with the Spark results somehow?

  • @jordanhasnolife5163

    1 month ago

    I'd think which implementation you use, and how you combine them, comes down to the application itself. It's your choice whether this is a worthy enough problem to even run a Spark job for. If it is, it depends how often you run it. I think having a count min sketch available just makes the hour-by-hour computation a bit cheaper compared to something like Flink, where we actually would perform the counts on a per-hour basis. I don't think you'd really ever combine the results from Spark - if you have the Spark results, you have the real answer.

  • @andreybraslavskiy522

    1 month ago

    @@jordanhasnolife5163 Thank you!

  • @gabbah79
    2 months ago

    Love the content. Just wish you would use a more legible font, it really slows down my reading. 😢

  • @jordanhasnolife5163

    2 months ago

    Sorry will try to work on this and make it a bit more clear

  • @tunepa4418
    18 days ago

    Thanks for the video. How do we get the top k counts from the count min sketch? I assume the count min sketch is just for counting, not for getting the top k.

  • @jordanhasnolife5163

    17 days ago

    In theory you can maintain an in memory heap of size k with the counts of the top k elements as you compute the count of each while using the count min sketch
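
    A toy version of that idea - the hash choice, width/depth, and the pruned candidate dict are illustrative assumptions, not the video's exact implementation: estimate each event's count with a count-min sketch and keep a small top-k candidate set alongside it as events stream in.

    ```python
    import hashlib
    import heapq

    class CountMinSketch:
        def __init__(self, width: int = 2048, depth: int = 4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _index(self, key: str, row: int) -> int:
            digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            return int(digest, 16) % self.width

        def add(self, key: str) -> int:
            for row in range(self.depth):
                self.table[row][self._index(key, row)] += 1
            return self.estimate(key)

        def estimate(self, key: str) -> int:
            # Take the minimum over rows to limit hash-collision overestimates.
            return min(self.table[row][self._index(key, row)]
                       for row in range(self.depth))

    # Keep (approximate) top-k candidates updated as counts come in.
    cms, k, top = CountMinSketch(), 2, {}
    for event in ["a", "b", "a", "c", "a", "b", "b", "b"]:
        top[event] = cms.add(event)                          # latest estimate
        top = dict(heapq.nlargest(k, top.items(), key=lambda kv: kv[1]))
    print(top)                                               # e.g. {'b': 4, 'a': 3}
    ```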

  • @Henry-wq7ki
    2 months ago

    Why have a count min sketch server in addition to the global top k in the time series DB? They are redundant, right?

  • @jordanhasnolife5163

    2 months ago

    In theory the count min sketch server would be able to expose metrics within a shorter period of time, albeit with less accuracy than the global top k

  • @varunigarg6492
    23 days ago

    Why can't we maintain state in Flink for some x hours and run the aggregation there itself? This would remove the need for Spark Streaming, right?

  • @jordanhasnolife5163

    22 days ago

    Well what about after those few hours? How do we combine hourly intervals to get the top k over the aggregate?

  • @varunigarg6492

    22 days ago

    In Flink, we can create 2 process functions: 1) one that computes the top N at an hourly cadence (which can be used to serve fast viz use cases, via Druid for example) and 2) one that computes the top N from the last 24 hourly computed results. The eviction policy is time based. To store state we can use the RocksDB state backend, which stores state as serialized bytes on disk with an in-memory cache. So all functionality is encapsulated within Flink.

  • @jordanhasnolife5163

    21 days ago

    @@varunigarg6492 Sure but these aren't accurate results. If I want to know the top K for the last week, how would I do that? You could combine the top k from each smaller interval within that week, but you then don't consider that an event that wasn't included from the top k of each mini interval could be in the total top k

  • @oakvillian5
    2 months ago

    Bro couldn't stay away from amsterdam

  • @jordanhasnolife5163

    2 months ago

    Bro had to go back for his top two hobbies

  • @explorer9004
    2 months ago

    let's get to Google bros

  • @jordanhasnolife5163

    2 months ago

    Best of luck my friend - just remember don't harp on one dream company and take the best offer you get :)

  • @AP-eh6gr
    1 month ago

    17:42 Each listening to a partition on a Kafka broker, to be more precise I guess...? Correct me if wrong

  • @jordanhasnolife5163

    1 month ago

    that works

  • @aa-kj5xi
    2 months ago

    any plans to start a merch shop for feet pics, farts, or bath water?

  • @jordanhasnolife5163

    2 months ago

    Yeah thinking bath water may scale nicely. Any thoughts on pricing? Currently going for $10, and $20 for my post lift bath

  • @Hangar1318

    10 days ago

    If it doesn't scale, you could always shart or fartition.

  • @AP-eh6gr
    1 month ago

    Red light district 15 times 😄 someone's an addict, jk

  • @loshmiification
    2 months ago

    kzread.info/dash/bejne/oIWkzbSIft2rgaw.html There is a 1-consumer-per-Kafka-partition limit in existence. Either you are not sharding Kafka using partitions, or you cannot have more than one consumer reading there. Care to elaborate more on it? (Either on the sharding mechanism, or the consumer limit). Thank you. Great content otherwise!

  • @jordanhasnolife5163

    2 months ago

    To be fair, I am using one consumer per Kafka partition

  • @jordanhasnolife5163

    2 months ago

    If you're referring to state machine replication the count min sketch replicas could always read from the replica partitions

  • @kbman99

    14 days ago

    Assuming you have a kafka topic with 1 partition, you can use 2 consumers as long as they are both set up with different consumer groups. In this case that would be what you want because you want both CMS state machines to receive the same data. In this manner, with different consumer groups, we are effectively implementing a one-to-many pub-sub.
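
    A hedged sketch of that consumer-group setup using the confluent-kafka Python client: two consumers with different group.id values each receive every record from the same topic/partition, so both count-min-sketch replicas see the full stream. The broker address, topic name, and group ids are placeholders.

    ```python
    from confluent_kafka import Consumer

    def make_cms_consumer(group_id: str) -> Consumer:
        consumer = Consumer({
            "bootstrap.servers": "localhost:9092",
            "group.id": group_id,            # distinct group => independent offsets
            "auto.offset.reset": "earliest",
        })
        consumer.subscribe(["events"])
        return consumer

    replica_a = make_cms_consumer("cms-replica-a")
    replica_b = make_cms_consumer("cms-replica-b")   # same partition, separate group

    msg = replica_a.poll(1.0)                        # each replica polls independently
    if msg is not None and msg.error() is None:
        print(msg.value())
    ```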
