3: Dropbox + Google Drive | Systems Design Interview Questions With Ex-Google SWE

Science and technology

For some reason my ex-girlfriend kept asking me to call her my dropbox. I don't get it

Comments: 114

  • @ricardobedin2953 · 6 months ago

    Hey Jordan -- just wanna say thank you. I got multiple staff-level offers with big tech companies and your videos were the main resource I used for system design. You are doing a phenomenal job that no other channel (that I know of) is close to doing. Thanks my man!

  • @jordanhasnolife5163 · 6 months ago

    Let's gooo! Congrats Ricardo, makes me super happy to hear! Glad your hard work paid off

  • @johnlong9786 · 4 months ago

    And thank you, Ricardo, for referring me here in your article about your journey.

  • @truptijoshi2535 · 27 days ago

    Hey @ricardobedin2953, could you please let me know if you used any other source along with this one for staff-level offers?

  • @debarshighosh9059 · 3 months ago

    Hey Jordan, in the capacity estimation, I think the total doc size should be 10 PB (1B users * 100 docs/user * 100 kB/doc = 10^9 users * 10^7 bytes/user = 10^16 bytes = 10 PB), rather than 100 TB

  • @jordanhasnolife5163 · 3 months ago

    Thanks for catching!
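
For reference, the corrected back-of-envelope math as a runnable sketch (figures taken from the estimate above):

```python
# Storage estimate: 1 billion users x 100 docs/user x 100 kB/doc.
users = 1_000_000_000
docs_per_user = 100
bytes_per_doc = 100 * 1_000           # 100 kB

total_bytes = users * docs_per_user * bytes_per_doc
print(total_bytes)                    # 10_000_000_000_000_000 = 10^16 bytes
print(total_bytes / 10**15, "PB")     # 10.0 PB, not 100 TB
```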

  • @yrfvnihfcvhjikfjn · 6 months ago

    Hey Jordan. I just wanted to express my condolences. I know you are going through a tough time now after such a significant loss. 😢

  • @jordanhasnolife5163 · 6 months ago

    I appreciate the condolences. Not every day you lose your virginity

  • @yrfvnihfcvhjikfjn · 6 months ago

    @@jordanhasnolife5163 your voice does sound pretty raspy in this video

  • @zhonglin5985 · 2 months ago

    hey Jordan, nice content again, and thanks for answering all my previous questions! One question for this video -- how would you handle hash collisions when you use MD5 as the unique identifier for blocks?

  • @jordanhasnolife5163 · 2 months ago

    I think these would be pretty darn infrequent, but if we needed to we could perform the hashing check; we only skip the re-upload if the chunk with the same hash corresponds to this fileId
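
A minimal sketch of that dedup check, assuming a per-file set of already-stored chunk hashes (the helper names are illustrative, not from the video):

```python
import hashlib

def chunk_fingerprint(chunk: bytes) -> str:
    # MD5 collisions on benign data are astronomically unlikely, and scoping
    # the check to a single fileId shrinks the candidate set even further.
    return hashlib.md5(chunk).hexdigest()

def needs_upload(hashes_stored_for_file: set, chunk: bytes) -> bool:
    # Skip the upload only if this exact fingerprint already exists for
    # *this* file, per the rule described above.
    return chunk_fingerprint(chunk) not in hashes_stored_for_file

print(needs_upload({"9e107d9d372bb6826bd81d3542a419d6"}, b"some new chunk"))  # True
```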

  • @siddharthgupta6162 · 4 months ago

    This is so much better than the other Dropbox system design and so easy to understand. You are a gem, bro! However, I do see you use a lot of CDC in the system designs - is it used in industry as well? Or is it basically because we don't want to assign another server for this stuff and let the DB take care of it? Also, can we not use it with spark consumers? I have never seen CDC used in grokking (didn't like the content so stopped using it), and also didn't see it in my career (so far), so just curious.

  • @jordanhasnolife5163 · 4 months ago

    From what I understand CDC is a pretty new design pattern, however it's really semantically similar to just putting a write in Kafka and then using a stateful consumer to handle it and upload it to multiple places; it just changes which data sink is the source of truth! The main reason I use CDC is that, to keep multiple data sources in line, using stream processing frameworks with guarantees about message delivery is to me a lot better than using two-phase commit all over the place or having to reason about failure scenarios.
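
A rough sketch of that CDC fan-out pattern, assuming a Debezium-style change topic and the kafka-python client; the topic name and sink functions are stand-ins:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

def update_search_index(change):
    print("index sink:", change)      # stand-in for a real sink write

def refresh_user_cache(change):
    print("cache sink:", change)      # stand-in for a real sink write

# The chunk-metadata DB stays the source of truth; its change log drives
# every downstream sink, so no two-phase commit is needed across them.
consumer = KafkaConsumer(
    "chunkdb.changes",                # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode()),
    group_id="cdc-fanout",
    enable_auto_commit=False,         # commit only after the sinks succeed
)

for msg in consumer:
    change = msg.value                # e.g. {"fileId": 7, "version": 2, ...}
    update_search_index(change)
    refresh_user_cache(change)
    consumer.commit()                 # at-least-once: sinks must be idempotent
```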

  • @1986yatinpatil · 11 days ago

    Hey Jordan, Great content! Your system design videos offer a refreshing perspective compared to the repetitive patterns seen online. I have a question regarding storing files (and posts from your previous video) in a per-user cache. Is this cache stored in memory or persistently? Also, in the event of data loss from this per-user cache, would the setup involving Kafka, Flink, and the cache be capable of rebuilding the state from scratch? And similar question about the caching we do on the flink node. If the Flink node goes down, can it restore the state of the in memory cache or will it have to replay all messages from Kafka to restore the state? Thanks!

  • @jordanhasnolife5163 · 10 days ago

    Realistically, I'd think it would have to be on disk, since otherwise that would be memory usage per user which could be expensive! While there's not a great way to rebuild the cache downstream (without it being in the flink ecosystem and having one flink node routing messages to a per user flink cache via kafka), the cache will refill itself as more people post!

  • @deepitapai2269 · 3 days ago

    Great video, Jordan! Could you elaborate a little bit on why we chose a Kafka queue over Cache for Unpopular file changes?

  • @jordanhasnolife5163 · 2 days ago

    While we could aggregate the full document state in a server and send that to the user, we want to only send incremental changes that a user needs to apply. We can do this by having them listen to a kafka queue of these ordered incremental changes to those files.
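
A small sketch of how a client might fold those ordered incremental changes into its local state (the change shape and field names are assumptions):

```python
# Fold ordered incremental changes from the queue into the local view of the
# file, instead of re-downloading the whole document.
def apply_changes(local_chunks: dict, changes: list) -> dict:
    for change in sorted(changes, key=lambda c: c["version"]):
        local_chunks[change["chunkIndex"]] = change["s3Url"]
    return local_chunks

state = {0: "s3://chunks/aa", 1: "s3://chunks/bb"}
updates = [{"version": 2, "chunkIndex": 1, "s3Url": "s3://chunks/cc"}]
print(apply_changes(state, updates))  # chunk 1 now points at the new blob
```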

  • @huguesbouvier3821 · 6 months ago

    Thank you for the video! I really like your videos; the quality of the second batch is excellent. The way you dive deep into the problems makes for very interesting viewing. For the chunkDB, how would you solve the issue of 2 writers creating a new version at the same time (phantom reads, I think?). Would the solution be materializing the conflict? How could we do that?

  • @jordanhasnolife5163 · 6 months ago

    Assuming it's single leader replication, basically the idea is just to use a predicate lock on the fileId and version you're claiming. This should ideally be fast as we have an index on those fields!

  • @kanani749 · 2 months ago

    @@jordanhasnolife5163 I want to build off the other question. Could you provide clarification if possible. In order to achieve fast throughput when writing to chunkDB would serializable snapshot isolation be the most optimal way of resolving potential concurrency issues? Additionally, would the predicate lock be on the File-ID and Version #? Would this be to prevent write skew and specifically a phantom write? Also I'm assuming we would need to use zookeeper or etcd to manage holding these distributed locks?

  • @jordanhasnolife5163 · 2 months ago

    @@kanani749 I think it really depends on how often you expect to have to lock. Even something like a primary key I would think needs to get a lock if it's to be monotonically increasing, so in reality, the true answer is as always, it depends :) This being said, I think this is something that gets abstracted from you in the db so you likely don't have to worry much about it.
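
One simple way to materialize the conflict discussed here: make (fileId, version) unique, so exactly one writer's claim succeeds. A sketch with sqlite3 standing in for whatever single-leader SQL database is actually used:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_versions (
        file_id INTEGER NOT NULL,
        version INTEGER NOT NULL,
        PRIMARY KEY (file_id, version)  -- uniqueness materializes the conflict
    )
""")

def claim_version(file_id: int, version: int) -> bool:
    # Two writers racing for the same (fileId, version): exactly one INSERT wins.
    try:
        with conn:
            conn.execute(
                "INSERT INTO file_versions (file_id, version) VALUES (?, ?)",
                (file_id, version),
            )
        return True
    except sqlite3.IntegrityError:
        return False

print(claim_version(1, 5))  # True: the first writer claims v5
print(claim_version(1, 5))  # False: the loser must rebase and retry at v6
```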

  • @davidwang9350 · 18 days ago

    Thanks for these great videos Jordan! Very thorough and well-explained. Quick question: if we wanted to support the additional functionality of "given a file, return which users already have access to it", what would you opt for? I was thinking that the query to go from "fileId" --> "users with access" is slow since the Permissions DB is indexed/partitioned on userId. Would the ideal solution be similar to what you employed in the Twitter design? Use CDC from the Permissions DB to construct a new DB (basically derived data) which partitions and indexes on fileId?

  • @jordanhasnolife5163 · 17 days ago

    I think that would be very reasonable yeah

  • @YTJones · 5 months ago

    Thanks for the excellent vids Jordan! I had two questions: 1. Could we use something like change data capture to keep S3 and the chunks DB in sync? the chunks DB could be seen as derived data from what's in S3, so could you turn the S3 uploads into an event stream and use the same Kafka + flink combo as in other vids? 2. In the hybrid approach where we push most docs into user caches, what do we do to handle cache invalidation if the doc is edited? Do we just write to the cache directly first to make sure the user sees it, then write to S3/the appropriate chunks DB and propagate to any other users? Or do we first need to write to the central sources then update the cache, allowing a brief period where the user can't see their own update (sounds no bueno)?

  • @jordanhasnolife5163 · 5 months ago

    1) I'm actually not sure whether s3 supports cdc, but if it does then yeah I'm all for it! 2) I think every file change would basically go through the same pipeline of into the object store then into flink and off to user caches. I'd imagine that after a user makes an edit we'd probably perform a client side optimization where we locally cache their edit and if the version number is higher than whatever is in their cache we show them their local version. Nice questions!

  • @CompleteAbsurdist · 2 months ago

    @@jordanhasnolife5163 > I'm actually not sure whether s3 supports cdc, but if it does then yeah I'm all for it! Lil late to the party but here goes - you can configure S3 to trigger a Lambda function on a file operation, which can in turn call your application to do what you want. Alternatively, you can trigger an SQS or SNS event on the addition of a new S3 file that your application can listen to.

  • @jordanhasnolife5163 · 2 months ago

    @@CompleteAbsurdist Very cool! You'd have to be able to derive the metadata from the document, perhaps you could embed it in the name somehow

  • @jordanhasnolife5163 · 2 months ago

    @@CompleteAbsurdist Very cool! Thanks for sharing!
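
A sketch of that Lambda route, assuming an object key convention of "fileId/version/chunkHash" to carry the metadata Jordan suggests embedding in the name (real S3 events URL-encode keys; decoding is omitted here):

```python
# Invoked by an S3 ObjectCreated notification; "Records" is the standard
# S3 event envelope shape.
def lambda_handler(event, context):
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        file_id, version, chunk_hash = key.split("/", 2)
        print(f"chunk {chunk_hash} uploaded for file {file_id} v{version}")
        # ...write the corresponding metadata row to the chunk DB here...
```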

  • @just_a_normal_lad · 4 months ago

    Wow, I truly appreciate your videos! Every time I watch them, it serves as a powerful reminder of how much there is to learn. One doubt: what do you mean by sharding in Kafka? AFAIK there is no sharding concept in Kafka. Two approaches I can think of: 1. Have one Kafka cluster set up for each shard, with a single topic, and push all events to it. 2. Have one Kafka cluster set up with multiple topics, one per shard, so clients push events to the specific topic based on the sharding logic. Anything else you wanted to convey, or any other approach which I missed?

  • @just_a_normal_lad · 4 months ago

    @jordanhasnolife5163 Also, whichever approach we select, we are sure that each user needs to have a separate consumer group so that it can subscribe to the single Kafka queue/topic. Are we OK with having billions of consumer groups in Kafka? One more tradeoff with this approach is that if User1 and User2 belong to the same shard, then both of them will be reading the same set of events; in that case some events won't be applicable to most of the users, but they will still consume and ignore them.

  • @jordanhasnolife5163 · 4 months ago

    When I say sharding I mean partitioning, which definitely exists in Kafka. I don't need a single consumer per userId, but I do want to ensure that each userId has only one consumer reading messages that correspond to it, hopefully this makes sense! Yes, I don't particularly mind that the same consumer will be handling requests of multiple users, otherwise we'll have billions of consumer groups like you mentioned.
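
A minimal sketch of that keying scheme with the kafka-python client (topic and payload are illustrative). Kafka hashes the key to pick a partition, so every event for a given user lands on one partition and is handled by exactly one consumer in the group:

```python
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=str.encode,
)

# Keying by userId is the "partitioning/sharding" meant above: consistent
# hashing over the key space, not one partition per user.
producer.send("unpopular-file-changes",       # hypothetical topic name
              key="user-42",
              value="file 7 advanced to v3")
producer.flush()
```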

  • @chingyuanwang8957 · 4 months ago

    Hi Jordan, amazing video, it really helped me out! Quick question on the file routing part: if the file cache only contains the file ID, and we still need to check the file's details, why is it necessary to propagate file changes to the user? Thanks for clarifying!

  • @jordanhasnolife5163 · 4 months ago

    Hey Ching Yuan! I'm not 100% sure what you mean here. We need to let users know when the chunks of a given file have changed so that they can go to S3 and fetch the new chunks. Feel free to clarify and I'll get back to you.

  • @saimunshahee5823 · 5 months ago

    Dumb question maybe, but should the CDNs be in front of the S3 service? I.e., should the file readers/writers be hitting the CDN first and then the CDN routes to S3? I might have an incorrect understanding of this. Awesome channel btw, learned a lot from your System Designs 2.0 playlist!

  • @jordanhasnolife5163 · 5 months ago

    I think that's potentially a fair option to lower write latency, but it could also be possible that we then evict data that we want to be in the cdn from it, so there are tradeoffs

  • @kushalsheth0212 · 3 months ago

    I'm not getting why we are uploading the file to S3 when we are already saving chunks to the chunk DB. Why do we need S3? Can you please help me understand?

  • @adityagarg6466 · 2 months ago

    @@kushalsheth0212 ChunksDB contains the metadata info, and not the real file.

  • @meenalgoyal8933 · 3 months ago

    Hey Jordan, thank you so much for these videos. I find them very helpful and in-depth. I had 2 questions. 1. How would the client know for the first time about any new file shared with them, especially popular files? For non-popular files, we can still update the per-user cache to reflect that, so when the user checks the cache they know about it. But for popular files, we rely on the user already knowing the fileId. Do you think the user needs to do periodic calls to the permission DB to get an idea of files shared with them, or maybe a long-polling request to the permission DB? 2. I understand that Apache Flink seems like a good option for many design questions due to partitioning, and caching capabilities to maintain the state from one stream so we can join it when a message from the other stream comes. But out of curiosity, what would be a good alternative to Flink? Thoughts?

  • @jordanhasnolife5163 · 3 months ago

    1) If I recall correctly, in this design the client doesn't need to be aware of its file access, it stays connected to one server which has all of its changes delivered directly to it. As for popular files, you're correct that we need to do extra work because we are polling for changes. I'd say that on each poll the client should first poll the permissions db to see what files it has permissions for, and which of those are popular. From there it can poll the servers that hold changes to popular files. See the twitter design, it's very similar. 2) Technology wise: kafka streams, spark streaming. Design wise? I'm not entirely sure, I think at the end of the day each client wants to minimize its open connections, and so you can't really avoid using some intermediary layer that delivers the relevant changes to the server that the client is listening to. Message queues with a stream processing framework seems like the best choice here to avoid constant polling.

  • @meenalgoyal8933 · 3 months ago

    @@jordanhasnolife5163 Thank you! 1) makes sense. For 2), yes, I was asking technology-wise. I got introduced to Flink through your videos only. Do you know if Kafka Streams also offers similar things to Flink: 1. fault tolerance and restoring from a checkpoint after a crash? 2. a cache which stores state when consuming multiple streams and joining them in real time? I suspect #2 might not be offered in Kafka Streams.

  • @__noob__coder__ · 6 months ago

    Timestamp - 19:40: can we use S3 events to detect new chunks being uploaded to S3? A worker, when it detects this kind of event, can put an entry in the database. The chunks being uploaded to S3 will contain some metadata, like which file id and version they belong to, so the worker can get the file id and version for the chunk from the S3 object metadata.

  • @jordanhasnolife5163 · 6 months ago

    Possibly! Hadn't heard of s3 events before but something like that could work too. I think for an interview you generally want to be as technologically agnostic as possible, so unless every cloud storage provider has handlers like these maybe I'd avoid it. IRL would work great tho

  • @dmitrigekhtman1082 · 5 months ago

    Good stuff! I agree with the guy further down in the comments that the final read interaction with the queues is confusing -- what is the end-user doing with the file changes coming off the queue and how does the user read from the queue?

  • @jordanhasnolife5163 · 4 months ago

    Sure! So ideally what the messages are in the queues are basically just telling you the diff between the current chunks of the file and the new version of the file. So from there you could basically see the S3 urls of those different chunks and go fetch them. The user reads from the queue by subscribing to their own partition (based on userId). It can also subscribe to 10-20 "popular partitions" for documents with tons of users.

  • @AmitYadav-cn6gj · 4 months ago

    @@jordanhasnolife5163 - How would this design deal with permissioning of the popular files? It seems like the users would have visibility into all the popular files (since they are subscribing to 10-20 "popular partitions"), regardless of whether they have permissions to even be aware of the presence of these files?

  • @jordanhasnolife5163 · 4 months ago

    @@AmitYadav-cn6gj Yep, good point! I think that for this particular case you would basically have to interact with these queues via an intermediate server which is aware of the permissions and sends you documents back accordingly.

  • @rakeshvarma8091 · 9 days ago

    @@jordanhasnolife5163 In this scenario, the intermediate server will end up sending the document to several users, right, since it is popular?

  • @jordanhasnolife5163 · 8 days ago

    @@rakeshvarma8091 I'd say it would be more likely that users would query the popular server rather than it pushing the documents to users. You could also have many consumers on the popular queue so that you can have many servers that serve requests here.
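
A sketch of that permission-aware intermediate server's read path, with in-memory structures standing in for the permissions DB and the popular-partition consumer (all names are illustrative):

```python
popular_changes = []                    # filled by a Kafka consumer elsewhere
permissions = {"user-42": {"file-7"}}   # userId -> fileIds the user can read

def poll_popular(user_id: str, since_version: int) -> list:
    # Serve only changes the caller is permitted to see, so subscribing to
    # the popular partitions never leaks files a user lacks access to.
    allowed = permissions.get(user_id, set())
    return [c for c in popular_changes
            if c["fileId"] in allowed and c["version"] > since_version]

popular_changes.append({"fileId": "file-7", "version": 3})
print(poll_popular("user-42", 2))   # [{'fileId': 'file-7', 'version': 3}]
print(poll_popular("user-99", 0))   # [] -- no permission, nothing leaks
```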

  • @saurabhmittal6947 · 13 days ago

    Hey Jordan, loving your content, but I was wondering how the version would increment here. Let's say a user has created a file, which is v0, then he makes some changes; we would be continuously uploading those changes to S3, right? If that's the case, would we keep incrementing the version with a timestamp, or what? If, say, uploading to S3 is based on a trigger, then we keep incrementing the file's version by 1 over the currently loaded version, right?

  • @jordanhasnolife5163 · 12 days ago

    I envision it as uploads only get made when you hit save, and then you increment the version number of the highest known version by 1. If there are conflicts you have to resolve them locally.

  • @5atc · 5 months ago

    Thank you for the video! Does each Flink instance effectively maintain an up-to-date, local slice of the permissions DB for the files contained within the Flink instance's shard of fileIds? How does that get stored inside the Flink instance?

  • @jordanhasnolife5163 · 5 months ago

    Yep, basically! It gets there via change data capture.

  • @lagneslagnes · 8 days ago

    Chunks DB: how do you get the latest version across all chunks of a document? Might you want to add another boolean field called "latest"? I'm assuming each chunk of the same document can have a different latest version. If we tried to increment the version of all chunks at the same time, that kind of defeats the purpose of having chunks.

  • @jordanhasnolife5163 · 7 days ago

    You could always upload metadata rows for an entire version of a document, and then only upload the "new" chunks to S3, since that's the expensive operation. Adding a metadata row for each is fine.
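
A sketch of that scheme with sqlite3 (schema illustrative): every version gets a full set of metadata rows, rows for unchanged chunks reuse the existing S3 keys, and "latest" is simply MAX(version), which also answers the boolean-field question above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE chunks (
    file_id INT, version INT, chunk_index INT, s3_key TEXT,
    PRIMARY KEY (file_id, version, chunk_index))""")

# v1 uploads two chunks; in v2 only chunk 1 changed, so chunk 0's row
# reuses the S3 object that already exists (no re-upload).
conn.executemany("INSERT INTO chunks VALUES (?,?,?,?)", [
    (7, 1, 0, "s3://chunks/aa11"), (7, 1, 1, "s3://chunks/bb22"),
    (7, 2, 0, "s3://chunks/aa11"), (7, 2, 1, "s3://chunks/cc33"),
])

latest = conn.execute("""
    SELECT chunk_index, s3_key FROM chunks
    WHERE file_id = 7
      AND version = (SELECT MAX(version) FROM chunks WHERE file_id = 7)
    ORDER BY chunk_index""").fetchall()
print(latest)  # [(0, 's3://chunks/aa11'), (1, 's3://chunks/cc33')]
```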

  • @mytabelle · 4 months ago

    When talking about hot/popular files: - if I understand correctly for your proposed solution, we have one connection for normal files, and one additional one for every popular file we have access to. For the unpopular files, the notification service would push these updates to the user, but for the popular ones, we'd push to some service and its replicas, and users would pull from that service, correct? - is there an issue with sending update notifications in a batch? We could keep one connection for each user, but simply send, say, 100 updates per second for popular files. This means there's a minor delay for the end user, but makes our system less complex.

  • @jordanhasnolife5163 · 4 months ago

    1) yep! 2) I'm a bit confused what you're saying here, we could certainly send batch updates, but then we have to send them to a lot of users automatically. This can definitely work, but it may be easier for users to just poll the changes as they want them. Nice questions!

  • @5atc · 5 months ago

    Wouldn't the chunk versioning result in a lot of additional metadata DB rows if a small change is made to a large file? E.g. if a single 4MB chunk is modified in a 1GB file, does that mean we increment the version number and add 250 new rows to the metadata DB, where 249 rows point to the existing S3 chunks (but with the incremented version number) and 1 row points to the newly modified chunk?

  • @5atc · 5 months ago

    Also, am I right in thinking that by using a SQL DB, ACID transactions mean that all the rows for a new version number would be written together? Which is good, because we avoid situations where the leader or follower replicas have some but not all rows for the chunks of a specific version number. I think this would be a challenge if we chose a NoSQL DB?

  • @jordanhasnolife5163 · 5 months ago

    Yup, that's my reasoning for using SQL here! That being said, some NoSQL databases still support atomic writes I think. As to your second point, which is a good one (and I should have touched upon more within the video), you can basically just upload metadata rows for the new chunks (and then for the chunks that already exist you'd use the existing metadata rows from the highest possible version). Of course there would have to be some smart logic regarding how to know which of the previous version chunk metadata rows that we'd have to pull in.

  • @5atc · 5 months ago

    That makes sense, thanks! These videos are great, keep it up!

  • @kushalsheth0212 · 3 months ago

    Hey, good video, great explanation, but I have a doubt: why are we storing in S3 and also in the chunk DB at 18:40? Shouldn't we only save chunks in the chunk DB?

  • @jordanhasnolife5163 · 3 months ago

    The chunk DB is really for the metadata about the chunk; S3 is for the actual file content

  • @AdityaRaj-bo9qe · 1 month ago

    Hi Jordan, regarding Flink, there is the concept of a window, i.e., for what interval of time we do the joining part, so that we don't store a stream's entire lifetime of data in memory. What window type should be used here?

  • @jordanhasnolife5163 · 1 month ago

    Hey! No window here, as I'm attempting to limit the amount of data that we're performing joins on (so... infinite window)

  • @zachlandes5718 · 4 months ago

    You mention zookeeper near the end -- what will you manage with zookeeper in the diagrammed design?

  • @jordanhasnolife5163 · 4 months ago

    We pretty much need zookeeper for managing which partitions belong on which nodes, as well as when to perform failovers when a given leader goes down.

  • @zhonglin5985 · 2 months ago

    Suddenly got a low-level question about Flink's partitioning -- is it required that the ChunkTable CDC topic, the PermissionTable CDC topic, and the Flink cluster all be partitioned in the same way? By "the same way" I mean not only that they use the same partition key (fileId), but also the same hash function and the same number of partitions.

  • @jordanhasnolife5163 · 2 months ago

    Yeah, it basically has to be, or else the data could be going to different Flink nodes

  • @levyshi · 1 month ago

    For pushing file changes to users: does it mean that when the user is viewing the file and the owner changes the content, the file should change in real time? If not, couldn't the user just fetch the latest version from S3 when they open the file? Or is the purpose of this to reduce fetching of entire files: if the file is already on the user's device, we just need to make sure the updated changes are also there.

  • @levyshi · 1 month ago

    also, if I add permission for another user, does it require a distributed transaction? Since it's sharded on user id, if two users are on different nodes, we'll need to read from one and then write to another

  • @jordanhasnolife5163 · 1 month ago

    1) I think this is more of a design question, but yes it could mean they would see changes in real time. More likely, what it means is that clicking to open a file on your computer doesn't take a few seconds to poll the latest version of it. 2) Not sure what you mean here. We're just adding permissions for one user at a time (userId, fileId) to the appropriate partition in the database

  • @levyshi · 1 month ago

    @@jordanhasnolife5163 Yeah, sorry, ignore the second question

  • @michaelparrott128 · 6 months ago

    Would be interesting to hear about how exactly the file reader connects to these queues. Are the queues something like SQS? How would a client device (e.g. phone or computer) connect to that? When thinking about it, I thought one solution could be to broadcast that there is a new version for file X, then the client goes to a file reader service to read the new chunk data.

  • @jordanhasnolife5163 · 6 months ago

    The queues are just kafka queues, super easy to subscribe to a given topic

  • @bshm1718 · 6 months ago

    @@jordanhasnolife5163 Since we have 1 billion users, can we have 1 billion topics in Kafka, with each user subscribing to their own topic? Can Kafka scale to so many topics?

  • @deantodoroski5059 · 5 months ago

    @@jordanhasnolife5163 Do you suggest client apps (android, ios, web, desktop) connect directly to Kafka? I have no experience with Kafka, and a quick Google search shows people recommending against it and putting some proxy in front of Kafka. Do you have experience with this? Btw, great videos! :)

  • @truptijoshi2535 · 28 days ago

    Hey Jordan, how do we divide the file into chunks, and how do we determine their order? How do we know which chunk was modified? Thank you!

  • @jordanhasnolife5163 · 25 days ago

    I'd imagine they probably use something like merkle trees for this
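
A sketch of the leaf level of that idea: hash fixed-size chunks and compare the hash lists, so only chunks whose hashes differ get re-uploaded. (A full Merkle tree would add interior nodes on top to localize a diff in O(log n) comparisons; shrinking files are not handled in this sketch.)

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int = 4 * 1024 * 1024) -> list:
    # Fixed-size chunking; 4 MB is an assumed chunk size.
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def changed_chunks(old: list, new: list) -> list:
    # Indices whose hashes differ (or that are brand new) need uploading.
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]

old = chunk_hashes(b"a" * 10_000_000)
new = chunk_hashes(b"a" * 5_000_000 + b"b" * 5_000_000)
print(changed_chunks(old, new))  # [1, 2] -- chunk 0 is untouched
```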

  • @priteshacharya · 2 months ago

    I have a question on the "unpopular file changes" queue. Since it's partitioned on userID, a single partition would cover multiple users. Let's say a single partition has 10,000 users; does that mean a client would have to listen to updates from the other 9,999 users when there's only 1 that they need? Someone else mentioned it's not a good idea to expose the Kafka queue directly to the user. I agree with them. There has to be a service that connects the Kafka queue to the user.

  • @jordanhasnolife5163 · 2 months ago

    Very reasonable, another very easy solution is to just have a server read from that kafka queue and use a websocket to push the changes to the user. See messenger, notification service design.

  • @SaiAkhil-my8ps · 2 months ago

    If the unpopular file changes are partitioned on userId, and the user has not opened the file which has a change while the message is at the top of the queue, what are we doing with that Kafka message?

  • @jordanhasnolife5163 · 2 months ago

    Not entirely sure what you mean here. If the user doesn't have the file open, and they miss multiple changes on a file, they can go ahead and poll the most recent state of the document from the db

  • @Nisshant1112 · 4 months ago

    Hey Jordan! When we are partitioning Kafka on file id, won't we end up having billions of partitions in Kafka?

  • @jordanhasnolife5163 · 4 months ago

    Hey Nisshant - when I say partition by fileid, that really means "use consistent hashing over the range of fileId hashes" as far as Kafka is concerned. Hopefully this makes sense.

  • @Prakhar1405 · 3 months ago

    Nice video Jordan. I have a question on how we read file data. For example, let's say the file has version 4, which contains only changed blocks, and the customer adds a new device which does not have the file. How do we construct the file on the device?

  • @jordanhasnolife5163 · 3 months ago

    You go to the database and ask it for all of the blocks of the file! Ideally each chunk has a reference to the chunk before it so we can reconstruct easily by reading the database

  • @Prakhar1405 · 3 months ago

    @jordanhasnolife5163 Thanks for the explanation. I did well in my design interview by following your videos. Thanks a lot!!

  • @jordanhasnolife5163 · 3 months ago

    @@Prakhar1405 That's great man!! Super glad to hear :)

  • @maxvettel7337 · 1 month ago

    Hi Jordan, great explanation, but I don't understand how the client assembles the file from chunks after downloading. In what order should the chunks be assembled?

  • @jordanhasnolife5163 · 1 month ago

    We can have either 1) the chunk number or 2) a pointer to the next chunk id in the metadata table per chunk

  • @maxvettel7337 · 1 month ago

    @@jordanhasnolife5163 Thanks a lot
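
A sketch of reassembly under the "pointer to the next chunk" variant (the row shape is illustrative):

```python
def assemble(rows: list) -> list:
    # The head is the only chunk that no other row points at; follow the
    # next-pointers from there to recover the order.
    by_id = {r["chunkId"]: r for r in rows}
    pointed_at = {r["next"] for r in rows if r["next"] is not None}
    cur = next(r for r in rows if r["chunkId"] not in pointed_at)
    order = []
    while cur is not None:
        order.append(cur["s3Key"])
        cur = by_id.get(cur["next"])
    return order

rows = [
    {"chunkId": "b", "next": "c", "s3Key": "s3://chunks/2"},
    {"chunkId": "a", "next": "b", "s3Key": "s3://chunks/1"},
    {"chunkId": "c", "next": None, "s3Key": "s3://chunks/3"},
]
print(assemble(rows))  # ['s3://chunks/1', 's3://chunks/2', 's3://chunks/3']
```

With the "chunk number" variant instead, assembly is just sorting the rows by chunk index.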

  • @andreybraslavskiy522 · 1 month ago

    Hi Jordan, thank you for your effort to help us prepare for the interview. Can you help me with a question? Flink reads from Kafka and caches data about file-to-user permissions. What if Flink crashes? How will it restore its state? Reread both Kafka topics from the start? Or does it have a replica of its state somewhere?

  • @jordanhasnolife5163 · 1 month ago

    It occasionally snapshots state in S3, and then you can reread from the Kafka offsets that correspond to that state

  • @andreybraslavskiy522 · 1 month ago

    @@jordanhasnolife5163 thanks, I should learn more about Flink.
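
A minimal PyFlink sketch of that recovery story (treat the exact configuration calls as an assumption about recent PyFlink versions): state is snapshotted periodically to durable storage, and on failure the job restarts from the snapshot plus the Kafka offsets recorded inside it.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # snapshot all operator state every 60s
# Where the snapshots land; an S3 bucket here, per the answer above.
env.get_checkpoint_config().set_checkpoint_storage_dir("s3://my-bucket/ckpts")
```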

  • @visiablehanle · 6 months ago

    10 PB?

  • @jordanhasnolife5163 · 6 months ago

    Ah shoot did I mess it up? Lol

  • @bengillman1123 · 4 months ago

    Yeah, it's 10 PB. BOE calculations are tricky

  • @ankitagarwal4022 · 1 month ago

    Hi Jordan, how long will Flink keep the file-to-user permission information?

  • @jordanhasnolife5163 · 1 month ago

    Forever

  • @indraneelghosh6607 · 5 months ago

    Does Flink store the state of all files? What if there are a billion files? Will it still be able to store the updated state of all the files? How does it do this exactly?

  • @jordanhasnolife5163 · 5 months ago

    Partitions, partitions, and more partitions :) We're sharding on fileId.

  • @manishasharma-hy5mj · 19 days ago

    Hi Jordan, you talked in your videos about CRDTs being useful in multi-leader replication, but here you are saying single-leader replication for document permissions. Can you please help clarify?

  • @jordanhasnolife5163 · 18 days ago

    Keyword is "document permissions". That's just a table mapping a docId to all the userIds that have access. The actual content of the document itself is maintained by a CRDT.

  • @zalooooo · 5 months ago

    Nice vid, but no mention of the operational transform algorithm or CRDTs for concurrent writes to the same block? I think that's a critically important part of this problem to at least acknowledge. If I gave this question as an interviewer, I'd expect to see an acknowledgement of the challenges of multiple users modifying the same part of a document, which your solution doesn't address.

  • @jordanhasnolife5163 · 5 months ago

    This is Google Drive (which tries to act like a distributed file system), you're talking about Google docs. I'll do that one soon enough

  • @zalooooo · 5 months ago

    Makes sense! I definitely misread G Drive as G Docs

  • @John-nhoJ · 6 months ago

    Not covered: how do you enforce a storage limit, especially on things like live document edits AND file uploads?

  • @jordanhasnolife5163 · 6 months ago

    What do you mean by live document edits here? I think that might come into play for Google Docs. For the file uploads, you do know in advance how big each chunk is and could perform some client-side validation on the size (hit a backend service to tell you how much data you've uploaded, which gets changed every time that you upload new chunks).

  • @John-nhoJ · 6 months ago

    @@jordanhasnolife5163 Uploading the file metadata isn't really secure. You could intercept the request and say that your massive chunk is 1 kb. By live document edits, I mean like 40 people collaborating in Google docs. You have to take the text into consideration when computing the overall storage usage of the document.

  • @user-se9zv8hq9r · 6 months ago

    @@John-nhoJ did you know that before 2021 Google Docs/Sheets space was infinite? That's how there were those 2 projects on GitHub which chunked files, encoded them to base64, and put them in Sheets and Docs for infinite space - storing the Ubuntu ISO took 534 docs, lol. Then Google caught on, and now it counts as part of the Drive space. I'm curious how they do that efficiently, though

  • @alekseyklintsevich4601 · 4 months ago

    The way you're using Flink is not really what it was designed for. You're using it as an in-memory store.

  • @jordanhasnolife5163 · 4 months ago

    I don't agree with you here. Flink is meant for processing messages in real time. I'm using it to handle incoming messages and distribute them to a variety of sink locations

  • @justlc7 · 17 days ago

    Where can we find your notes?

  • @jordanhasnolife5163 · 16 days ago

    See my channel description
