15: Reddit Comments | Systems Design Interview Questions With Ex-Google SWE

Science & Technology

That chili put more load on my system than all of the upvotes and downvotes on all the comments for a single postId could put on a Kafka queue

Comments: 58

  • @FatherPhi · 3 months ago

    Congrats on 25k!!! You deserve it man

  • @jordanhasnolife5163 · 3 months ago

    Thanks Phi!!

  • @roshanzameer5020 · 3 months ago

    Another great video. Would love an occasional live system design session. Also, congratulations on the 25K!!!

  • @shandilyacodes · 3 months ago

    Binge watching all your content. It's content-dense and fun to watch at the same time. Thanks for doing this!

  • @undefinedrandomness · 3 months ago

    Watched all your videos over the holidays and locked in my first faang offer. Just wanna say thanks for the goated content, and keep up the good work! I genuinely learned a ton from this channel. Congrats on 25k!📈📈

  • @jordanhasnolife5163 · 3 months ago

    You're HIM!! Thanks man, I appreciate it!

  • @SkinnyPaddy · 3 months ago

    100% this channel helped me get a FAANG offer last week. Cheers dude! ❤

  • @jordanhasnolife5163 · 3 months ago

    Congrats man!! You're a legend!

  • @andriidanylov9453 · 2 months ago

    That was a strong description of this task. Thank you!

  • @antonvityazev4823 · 23 days ago

    Hello Jordan! Thanks for another good video with detailed explanation. I have a question though after watching it: Why don't we use Cassandra here instead of MySQL if we expect a huge load on writes?

  • @jordanhasnolife5163 · 23 days ago

    I wouldn't really say that I expect a huge load of writes; it's more that I don't want to two-phase commit a bunch of writes to four places. More importantly, reddit comments have parents, which implies we need to maintain causal relationships, which may be challenging in a leaderless setup.

  • @AP-eh6gr · 3 months ago

    Could you share some example docs/links on how to implement the logic in the 37th minute? That is: consume from a Flink operator by post+comment_id, consume from CDC of the upvotes DB, and have Spark Streaming process a comment and its corresponding score/upvotes only when the comment's presence is true. Not having used these in practice makes it a bit hard to come up with, so maybe do a smallish tutorial on implementing this in a future video so that it feels more real, so to speak, lol. Damn though, your final solution is impressive: definitely what I was imagining as the most optimized way to handle upvotes by user and upvotes by post, who they came from (which cares about userId), and their aggregated counts (which care about commentId and postId).

  • @jordanhasnolife5163 · 3 months ago

    Thanks! Just pseudocoding here, but:

        map = HashMap()

        def processVote(commentId, vote):
            if vote is upvote:
                map[commentId].score += 1
            else:
                map[commentId].score -= 1
            if map[commentId].commentPresent:
                updateDatabases(commentId, map[commentId].score)  # here we rearrange our graph edges

        def processComment(commentId):
            map[commentId].commentPresent = true
            if map[commentId].score != 0:
                updateDatabases(commentId, map[commentId].score)
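
A runnable sketch of the pseudocode above, assuming votes for a comment may arrive before the comment itself; `update_databases` and the field names are hypothetical stand-ins for the real sinks, not an actual API:

```python
# Buffer votes until the comment is known to exist, then flush the score.
from collections import defaultdict

state = defaultdict(lambda: {"score": 0, "present": False})
flushed = []  # stand-in for update_databases writing to the real stores


def update_databases(comment_id, score):
    flushed.append((comment_id, score))


def process_vote(comment_id, is_upvote):
    entry = state[comment_id]
    entry["score"] += 1 if is_upvote else -1
    if entry["present"]:  # comment already seen: safe to reorder graph edges
        update_databases(comment_id, entry["score"])


def process_comment(comment_id):
    entry = state[comment_id]
    entry["present"] = True
    if entry["score"] != 0:  # flush any votes that arrived early
        update_databases(comment_id, entry["score"])
```

For example, a vote on "c1" before `process_comment("c1")` is buffered and only flushed once the comment event arrives.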

  • @tfk933 · 26 days ago

    Could the CDC into Spark be used to calculate comment scores for each ranking system with each incoming vote? e.g. each comment would include: {hot_score: x, controversial_score: y, ...}. Since there are so few comments per post, instead of a derived view, the client could grab all the comments on a post (maybe with exceptions for extremely popular posts), each with all the scores, and handle the sorting client side.

  • @jordanhasnolife5163 · 23 days ago

    Sure, this is a fair approach, but it depends on how many comments there are. Doing client-side sorting implies doing a lot of repeat work client-side, and for posts where we want to paginate the comments this becomes infeasible.

  • @hristianiliev7979 · 2 months ago

    Hey Jordan, great video. I believe that you did not explain one aspect in the final diagram though. What is this arrow from Spark Streaming to the New Comments/Source of truth DB?

  • @jordanhasnolife5163 · 2 months ago

    Sorry, brushed over that one! That's just to change the upvote/downvote count per comment in the source of truth table.

  • @Kevin-jt4oz · 3 months ago

    This is amazing. A system design video with enough algorithmic depth. Can you share any links where I can dive into this further?

  • @jordanhasnolife5163 · 3 months ago

    For this one unfortunately not really, was mostly shit I made up since I couldn't find any great resources for it. I think my design might not even be necessary for reddit's current needs, but hey it's a fun thought experiment.

  • @ariali2067 · 3 months ago

    Not related to this specific video, but Saga and TC/C are not covered in your system design concepts series. Do you think we need to learn those for system design interviews? I do see several designs mention these two concepts when discussing distributed transactions; not sure whether I can just skip them if not super important. Curious to hear your thoughts, thank you!

  • @jordanhasnolife5163 · 3 months ago

    Probably not, but to be fair it's a pretty small topic; you could probably learn it in like 10 minutes.

  • @FoodpediaIndia · 3 months ago

    @jordan loving your videos. In a lot of your videos you mention having CDC on a database to get up-to-date info into Flink while the same Flink is consuming from Kafka. But say a user recently had no activity in the users DB and then tweeted; you said Flink would then have the user's friends available from CDC and fan out his tweet to all his friends. In that case, would Flink pull details from the user DB directly?

  • @jordanhasnolife5163 · 3 months ago

    Flink ideally should not have any TTL on that user data, so it should already be there. I suppose what you're describing is more of a caching mechanism, where after a while we drop the data, and then go fetch it from the db, which is also a viable option.

  • @scottyjacobson4655 · 3 months ago

    @jordanhasnolife5163 I wonder if @factsfuture9778 is asking the same thing I was wondering yesterday, which is the need to possibly "pre-populate" Flink's key store from the DB you'd be CDC-ing, since it's likely there are some users Flink hasn't heard any events about yet and thus doesn't know their state "pre-initialization". I asked ChatGPT about this, and it said to define an initialization step that, "using Flink's JDBC connectors, execute[s] a query against the MySQL database and fetch[es] the current [applicable state] of all users". Is this common? Does Flink have an idiom for this?

  • @idiot7leon · 2 months ago

    Brief Outline
    00:00:51 Reddit Comments
    00:01:58 Problem Requirements
    00:03:37 Capacity Estimates
    00:06:07 Main Consideration
    00:07:56 "Source of Truth" Table
    00:08:48 Base Table Options - Native Graph Database
    00:12:22 Base Table Options - Use a normal database with an index
    00:15:28 Relational vs Graph Based Approach for "New" Ordering
    00:17:31 Top Comments
    00:19:19 Graph DB for Top Comments
    00:21:56 In Memory Graph for Top Comments
    00:24:55 Controversial Comments
    00:27:04 Hot Comments
    00:29:41 Synchronous Updates vs Derived Data
    00:31:07 Causal Consistency
    00:33:58 Change Data Capture in Practice
    00:35:54 Change Data Capture
    00:38:42 Upvotes DB
    00:41:16 Obligatory Caching Slide
    00:53:12 Final Diagram - Reddit Comments
    Thanks, Jordan~

  • @savanatong7041 · 3 months ago

    Hope references for the decisions made in the video could be attached.

  • @jordanhasnolife5163 · 3 months ago

    There aren't too many about reddit comment views. If you look, I see one for materialized path (which I have discussed) and one for Celko trees (which are dumb). Other than that, I know the tradeoffs of graph databases, the tradeoffs of normalized vs. denormalized data, and also the tradeoffs of how you choose to partition within a stream processing framework.

  • @Onislayer · 3 months ago

    Hey Jordan, why use Spark Streaming and not just Flink for the whole setup, maintaining causal consistency on that second Kafka topic?

  • @jordanhasnolife5163 · 3 months ago

    Hey! Just wanted to use microbatching there, we're just writing a bunch of upvotes and downvotes to a particular comment id, and then potentially reordering the edges of the graph. That could be expensive, so I think batching them up makes for a nice improvement!

  • @Onislayer · 3 months ago

    Also, in this setup would you keep the linked list in Spark Streaming instead, to update the edge weights on the hot DB?

  • @jordanhasnolife5163 · 3 months ago

    @@Onislayer yep!

  • @vincentchee4108 · 3 months ago

    Flink can write in a microbatch manner using windows, but I agree it is not super intuitive.

  • @prashlovessamosa · 3 months ago

    Loved it.

  • @yrfvnihfcvhjikfjn · 3 months ago

    Next video I'm going to play the Jordan has no life drinking game where you have to take a shot every time he mentions flink

  • @jordanhasnolife5163 · 3 months ago

    I've gotta get them to sponsor me

  • @AP-eh6gr · 3 months ago

    flink fetish

  • @aj_agr · 6 days ago

    Was wondering about the cost of keeping this in-memory graph index for all the posts over the years, and whether it's worth building a hot/cold tier sort of index. Half a billion posts per year * 100 comments per post * 10 years * 8 bytes overhead per comment = 4TB of RAM. 4GB of RAM in AWS with reserved instances = $200 per year, so for 4TB you pay $200K. 100 comments per post on average feels like a lot, and you could also reduce the 8-byte overhead by a factor of 4 by making an assumption on the max # of comments per post. On a side note, persisting this index becomes attractive to avoid having to re-stream all of this data in the event of index node failures. Maybe restreaming isn't that bad from read replicas of the comments and upvotes DBs.
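A quick sanity check of the back-of-envelope arithmetic in the comment above (all inputs are the commenter's assumptions, not measured figures):

```python
# Memory footprint of the in-memory graph index, per the comment's estimates.
posts_per_year = 500_000_000   # half a billion posts per year (assumed)
comments_per_post = 100        # average comments per post (assumed)
years = 10
bytes_per_comment = 8          # per-comment pointer/overhead (assumed)

total_bytes = posts_per_year * comments_per_post * years * bytes_per_comment
total_tb = total_bytes / 1e12  # -> 4.0 TB

# Cost: $200/year per 4 GB of reserved-instance RAM (assumed ballpark).
cost_per_4gb_year = 200
cost_per_year = (total_bytes / 4e9) * cost_per_4gb_year  # -> $200,000/year
```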

  • @jordanhasnolife5163 · 4 days ago

    One of the reasons we use Flink is so that we don't have to restream all data and can restore from a snapshot. I see your point, though; it's an interesting idea to only keep things in Flink for x amount of time and otherwise recompute the data in a batch job or something!

  • @ramannanda · 7 days ago

    Question: why not a separate topic and a stream-stream join, rather than having two different producers publish to the same Kafka topic?

  • @jordanhasnolife5163 · 6 days ago

    I'm not sure which part of the design you're talking about; can you elaborate?

  • @ramannanda · 6 days ago

    ​@@jordanhasnolife5163 I think in the end you do show different kafka topics, but kzread.info/dash/bejne/dINmybSio9Gok6Q.html

  • @swanv951 · 3 months ago

    For ordering by "Hot comments", if a comment doesn't get any activity in x mins then some periodic job needs to update its score. It needs to do that for each and every comment every x mins. How would that scale?

  • @jordanhasnolife5163 · 3 months ago

    Our flink node that is receiving upvotes and downvotes and is partitioned by (postId, commentId) is maintaining a linked list locally, and checks the head of the list every once in a while to see if the node is expired
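The expiry check described in this reply can be sketched as follows, assuming a per-partition structure ordered by last activity. An `OrderedDict` plays the role of the linked list here: the head is always the least recently active comment, so the periodic check only looks at the front. Class and method names are illustrative, not Flink APIs:

```python
from collections import OrderedDict


class HotScoreTracker:
    """Tracks last vote activity per comment; expired heads get rescored."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.last_activity = OrderedDict()  # commentId -> last vote timestamp

    def record_vote(self, comment_id, now):
        # Remove and re-insert so the comment moves to the tail (most recent).
        self.last_activity.pop(comment_id, None)
        self.last_activity[comment_id] = now

    def pop_expired(self, now):
        """Return comments idle past the TTL; the caller recomputes their hot score."""
        expired = []
        while self.last_activity:
            comment_id, ts = next(iter(self.last_activity.items()))
            if now - ts < self.ttl:
                break  # head is still fresh, so everything behind it is too
            self.last_activity.popitem(last=False)
            expired.append(comment_id)
        return expired
```

Because only the head is inspected until a fresh entry is found, each periodic check costs time proportional to the number of comments that actually expired, not to the total number of comments.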

  • @sinsaremine · 2 months ago

    @jordanhasnolife5163 I don't understand how Flink has the memory to store all this data. It's 8MB of data per post; if there are 100Ks of posts this can grow pretty quickly. How do you make sure there is enough memory in Flink to store all comment trees across all posts?

  • @joonlee3863 · 3 months ago

    Hey Jordan, can you make a video on designing a donation system? E.g. donating to a list of charities.

  • @jordanhasnolife5163 · 3 months ago

    Hey Joon, besides a normal server and database here what do you consider to be the hard part of this problem?

  • @joonlee3863 · 3 months ago

    @jordanhasnolife5163 Hey Jordan, thanks for the reply. I just started studying system design, so: topics like real-time data/money aggregation with high load/peaks, how to reprocess failed payments, how to avoid hot spots if one charity is popular, and various other failure scenarios in an async design, etc.

  • @joonlee3863 · 3 months ago

    Also maybe go over the little tedious stuff like API designs and DB schemas/details for a noob like me? System design interviews seem to ask about these in more detail than what I've seen in YT videos :( Apologies for the big ask.

  • @jordanhasnolife5163 · 2 months ago

    @@joonlee3863 Thanks for elaborating! I think that for a problem like this, it probably would be employing a lot of the other design patterns that we come across in other problems that I've covered. For example, real time money aggregation is similar to the real time counting or real time order handling that I might cover in the tinyURL or Amazon retail case

  • @jordanhasnolife5163 · 2 months ago

    @@joonlee3863 As for this one, totally makes sense, I try to go over these where they are relevant and spent more time on them in my 1.0 series. That being said, I think that once you see the pattern in these, covering them in each video can take a decent amount of time and may also feel a bit redundant.

  • @mohamedmacow3610 · 14 days ago

    Can you do a video on API gateways?

  • @jordanhasnolife5163 · 13 days ago

    Will give it some consideration if there's anything non-trivial to be done there.

  • @davendrasingh8846 · 2 months ago

    Why not simply use a Redis sorted set for top/controversial posts?

  • @jordanhasnolife5163 · 2 months ago

    Well, you could use sorted sets within sorted sets within sorted sets, at which point you've just built the data structure I described.

  • @AP-eh6gr · 3 months ago

    Around the 22nd-minute mark, when you suddenly receive a large number of upvotes on a comment and it needs to be reordered, putting it ahead of the formerly most-upvoted comment, what sort of complexity are you looking at? Oh ok, missed the "maintaining a sorted list of edges" statement you make at 21:41.

  • @jordanhasnolife5163 · 3 months ago

    Yep, ideally we have to re-sort those edges. It shouldn't be n log n time: because the list is already sorted, just changing the existing sort order should be O(n), like an iteration of insertion sort.
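A small sketch of that single-pass reorder, under the assumption that edges are kept as a list of (score, commentId) pairs sorted descending by score; when one comment's score changes, we remove it and slide it to its new position, O(n) like one insertion-sort pass:

```python
def reorder_edge(edges, comment_id, new_score):
    """edges: list of (score, comment_id) sorted descending by score.

    Returns a new list with comment_id moved to the slot matching new_score.
    """
    # Drop the stale entry for this comment (one linear scan).
    edges = [(s, c) for s, c in edges if c != comment_id]
    # Find the first position whose score falls below the new score.
    i = 0
    while i < len(edges) and edges[i][0] >= new_score:
        i += 1
    edges.insert(i, (new_score, comment_id))
    return edges
```

As the next reply notes, a balanced tree structure would get the delete-plus-insert down to O(log n) instead.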

  • @AP-eh6gr · 3 months ago

    @jordanhasnolife5163 It could be simply O(log n) if it's a tree set (as opposed to a list); that way you delete the old count and insert the new count, both in log time.
