98% Cloud Cost Saved By Writing Our Own Database

Science and Technology

Recorded live on twitch, GET IN
Article
hivekit.io/blog/how-weve-save...
By: / hivekit_io
My Stream
/ theprimeagen
Best Way To Support Me
Become a backend engineer. It's my favorite site
boot.dev/?promo=PRIMEYT
The best way to support me is to support yourself in becoming a better backend engineer.
MY MAIN YT CHANNEL: Has well edited engineering videos
/ theprimeagen
Discord
/ discord
Have something for me to read or react to?: / theprimeagenreact
Kinesis Advantage 360: bit.ly/Prime-Kinesis
Hey, I am sponsored by Turso, an edge database. I think they are pretty neat. Give them a try for free, and if you want you can get a decent amount off (the free tier is the best, better than PlanetScale or any other)
turso.tech/deeznuts

Comments: 556

  • @Fik0n
    @Fik0n 1 month ago

    The best thing about saving 98% cloud cost is that developer hours are free and that this will be super easy to maintain when the original devs quit.

  • @JeremyAndersonBoise

    @JeremyAndersonBoise

    29 days ago

    😂😂😂🎉😂😂😂

  • @7th_CAV_Trooper

    @7th_CAV_Trooper

    29 days ago

    Read my mind

  • @trueriver1950

    @trueriver1950

    29 days ago

    WARNING Irony detected

  • @lupf5689

    @lupf5689

    29 days ago

    Why would that be a problem? Was anything shown here that hard to understand? You should know how to do file and network io. What's left is a bit of domain knowledge and a one-time effort to encode and decode a rather simple data structure. Sometimes I really don't get that "let's better not do it ourselves" mentality.

  • @andrasschmidthu

    @andrasschmidthu

    28 days ago

    Skill issue.

  • @TomNook.
    @TomNook. 1 month ago

    I saved 99% of my cloud costs by connecting my frontend to an excel spreadsheet. Such a great idea!

  • @SimonBuchanNz

    @SimonBuchanNz

    29 days ago

    What's the 1%

  • @marcogenovesi8570

    @marcogenovesi8570

    28 days ago

    @@SimonBuchanNz the frontend

  • @StuermischeTage

    @StuermischeTage

    21 days ago

    You are a true genius. Are you available to optimize our IT department?

  • @opposite342

    @opposite342

    13 days ago

    tom is a genius jdsl deez

  • @Melpheos1er

    @Melpheos1er

    11 days ago

    99.9% for me because it's connected to LibreOffice Calc. I'm even saving on Microsoft Office costs!

  • @bfors8498
    @bfors8498 1 month ago

    I call this impressive-sounding-blogpost-driven-development

  • @Fik0n

    @Fik0n

    1 month ago

    Medium-driven-development

  • @Peter-UK-nl6cv

    @Peter-UK-nl6cv

    1 month ago

    RDD - resume driven development

  • @neo-vj4zq

    @neo-vj4zq

    1 month ago

    Did these numbers before getting out of the garage office stage

  • @EagerEggplant

    @EagerEggplant

    28 days ago

    How about bdedd, read bidet: big-dick-energy-driven-development

  • @zimpoooooo

    @zimpoooooo

    4 days ago

    I call it fun.

  • @ivanjermakov
    @ivanjermakov 1 month ago

    TLDR: they wrote their own log file. No ACID = not a DB.

  • @krux02

    @krux02

    1 month ago

    you forgot to put in the nerd emoji 🤓

  • @monolith-zl4qt

    @monolith-zl4qt

    1 month ago

    @@krux02 is it nerdy to know the absolute basics of CS?

  • @jerrygreenest

    @jerrygreenest

    1 month ago

    Log file stores entire stream of data, and they seem to store both «last state» data (as last as possible), and a log file, too. So technically it’s kinda like a simple database after all. From log file they can probably write entire path of car movement for example, as it is a series of data. For rare cases when you truly need this history. In database, they have their current position, battery/fuel levels, etc. For common cases.

  • @andreffrosa

    @andreffrosa

    1 month ago

    Then no-sql dbs are not dbs?

  • @tropicaljupiter

    @tropicaljupiter

    1 month ago

    @@monolith-zl4qt considering how self-taught everyone is: yes, sort of

  • @PeterSteele111
    @PeterSteele111 1 month ago

    I do GIS at work and have several hand held units that connect over bluetooth to iOS and Android on my desk right now that can get sub meter accuracy. I have even played with centimeter accuracy. I have trimble and juniper geode units on hand. I built the mobile apps we use for marking assets in the field and syncing back to our servers, and am currently working on an offline mode for that. So yeah, GPS has come a long way since last you looked. Internal hardware is like 10-20 meters on a phone, but dedicated hardware that can pass over as a mock location on Android or whatever can get much much more accurate results.

  • @mcspud

    @mcspud

    1 month ago

    It's not GPS, it's the tessellators that process it.

  • @7th_CAV_Trooper

    @7th_CAV_Trooper

    29 days ago

    We used to average over N readings to get pretty good sub meter precision, but I don't think GPS is any better than 5 to 10 meters today. The trade off was battery life. More readings allows better precision, but burns battery. Less precision means the device can run longer without charge or replacement.
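
A rough sketch of the averaging trade-off described above. It assumes each reading carries independent, zero-mean Gaussian noise, which is an idealization: real GPS errors are correlated over time, so the gain in practice is smaller. The function names and the 5 m noise figure are made up for illustration.

```python
import random
import statistics

def averaged_fix(true_pos_m, sigma_m, n, rng):
    """Average n noisy 1-D position readings (metres)."""
    readings = [true_pos_m + rng.gauss(0.0, sigma_m) for _ in range(n)]
    return sum(readings) / n

def spread_of_average(sigma_m, n, trials=2000, seed=1):
    """Standard deviation of the averaged fix, estimated over many trials."""
    rng = random.Random(seed)
    fixes = [averaged_fix(0.0, sigma_m, n, rng) for _ in range(trials)]
    return statistics.pstdev(fixes)

# Assumed 5 m noise per reading; averaging 100 readings shrinks the
# spread by roughly sqrt(100) = 10 under the independence assumption.
single = spread_of_average(5.0, 1)
avg100 = spread_of_average(5.0, 100)
print(f"1 reading: {single:.2f} m, 100 readings: {avg100:.2f} m")
```

Each extra reading costs receiver-on time, which is where the battery trade-off comes from.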

  • @Hyperlooper

    @Hyperlooper

    28 days ago

    Isn't it a restriction put in place for government?

  • @7th_CAV_Trooper

    @7th_CAV_Trooper

    28 days ago

    @@Hyperlooper The US Gov used to limit, but no longer: www.gps.gov/systems/gps/modernization/sa/

  • @honkhonk8009

    @honkhonk8009

    28 days ago

    I only know about Trimble because of the fucking Tractor edits where they play on that whole "missile guidance system" meme lmfao

  • @dv_xl
    @dv_xl 1 month ago

    You mentioned at the beginning of the video that making your own language makes sense if it's designed to actually solve a problem in a better way. This is that. They did not attempt to write a general-purpose DB. They wrote a really fast log file that is queryable in real time for their domain. This wins them points on cost (margins matter) but, more importantly, gives them a marked advantage against competitors. Note that they're storing and querying way more efficiently. Quality of product is improving while cost of competition is increasing. Seems like a no-brainer on the business side.

  • @polariseve1391

    @polariseve1391

    1 month ago

    How does one make a log file?

  • @TurtleKwitty

    @TurtleKwitty

    1 month ago

    A MAJOR part of this that went unmentioned: they didn't try to get all their data in there either, just the specific domain that they operate in, so they're clearly still using a regular DB for anything else. That's why the version field is a lot less important for their use case: it's well-known data they've been dealing with for a while, and it's a specific subset of the data they use.

  • @krisavi633

    @krisavi633

    1 month ago

    @@TurtleKwitty Yep, like writing parts of python in rust, just the ones that hit performance the most in python.

  • @domogdeilig

    @domogdeilig

    29 days ago

    @@TurtleKwitty Less important doesn't mean unimportant. They will have to change this at some point, and it will cause the worst headache in the universe.

  • @wwjdtd1

    @wwjdtd1

    28 days ago

    ​@@domogdeilig Depending on the wrapper, they might not. If you change a device ID on update, then you can just point to the old ID for archive retrieval and store what version the device is using. Possibly even encoding it into the ID itself. You can even run a different database backend for v1 and v2 since I doubt you would make a breaking change very often. Then you just query the right DB.

  • @michaelcohen7676
    @michaelcohen7676 1 month ago

    Chat misunderstanding RTK. RTK is literally just correcting GPS data using a known point in realtime. It is not better than GPS, it just enhances the way the measurements are interpreted

  • @dobacetr

    @dobacetr

    29 days ago

    Let's expand this a little for the curious. GNSS works by measuring the travel time of a signal between the (constellation of) satellites and the receiver. Since we know the speed of these signals, we can calculate the distance from the timing. In 3D space, we need 3 (linearly independent) measurements to pinpoint a location. In space-time (4D) we need 4: time is an unknown because all clocks are imprecise, and since we are talking about the speed of light, every nanosecond matters, by about 30 cm :). This is how we get a position from GNSS. However, there are other factors to consider. The signal travels from space to the ground through the atmosphere, where it is corrupted by various effects. Some of these are tracked and accounted for (look up tropospheric correction and ionospheric correction). After these you may get your position accuracy down to a few meters. However, some errors that could be predicted still remain, and you would need a nearby station to measure them. RTK is when you use a station with a known position to measure these residuals. You can then apply the same correction to any nearby device to improve its accuracy. Depending on conditions you may get centimeter-to-decimeter accuracy. Generally speaking, though, in a city I would not expect better than a meter of accuracy, and I would probably not trust it to be that precise either. RTK relies on having similar conditions, and there may be interference that isn't similar. Or it may just be my paranoia. Let me know if I have missed anything or made a mistake.
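
The "every nanosecond matters, by about 30 cm" figure can be checked with one multiplication. A small sketch; the function name is made up for illustration:

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum (exact by definition)

def range_error_m(clock_error_s):
    """Distance error caused by a given receiver clock error."""
    return C_M_PER_S * clock_error_s

per_nanosecond = range_error_m(1e-9)   # just under 0.30 m
per_microsecond = range_error_m(1e-6)  # just under 300 m
print(f"{per_nanosecond:.3f} m per ns, {per_microsecond:.1f} m per us")
```

That is why the receiver clock offset has to be solved as a fourth unknown alongside the three position coordinates.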

  • @zerker2000

    @zerker2000

    28 days ago

    And GPS is not better than dead reckoning, it just corrects data location drift :^)

  • @GrizikYugno-ku2zs
    @GrizikYugno-ku2zs 1 month ago

    I signed in and made a youtube account just now to say THANK YOU! 15:00 I DIDN'T THINK ABOUT VERSIONING MY DATA! Sometimes, the things you don't know when self taught are just jaw dropping. This has been very humbling.

  • @GrizikYugno-ku2zs

    @GrizikYugno-ku2zs

    1 month ago

    Follow up note: I can see why this is particularly dangerous for binary, but it certainly applies to all data driven applications. Workarounds in JSON would be possible, especially with Rust, but versioning makes it so, so simple. I am nothing close to a novice or junior - despite how naive I was here - so it really goes to show that you must always remain a student. Seven years of building all types of systems, and yet that means nothing when it comes to things I haven't done. I've only ever built systems and thrown them away when they didn't make money. Running something long term requires maintenance which is something I NEVER thought about. Wow. This is why Prime is great. Everyone else gives useless tips and tricks to people learning JavaShit. Nobody else is out here helping programmers who are already competent and capable. THANK YOU THANK YOU THANK YOU!!!!

  • @precumming

    @precumming

    1 month ago

    Well, if you're making your own binary format you ought to have looked at how other people have done it and you always see a version first thing which should tip you off

  • @diadetediotedio6918

    @diadetediotedio6918

    1 month ago

    It is an excellent thing, but it also has nothing to do with being "self-taught" or not.

  • @andrasschmidthu
    @andrasschmidthu 29 days ago

    Great solution! If they want to optimize further, they should use fixed point instead of floating point and do variable-length difference encoding. Most numbers would fit in 8 or 16 bits. With that, the memory requirement could easily be half or even less. The size of the entry should be stored in a uint16 or even a uint8. If a size above 65,535 is possible, then use variable-length encoding for the size too. The whole data stream should be stored as just that: a stream. 30,000 34-byte entries a second is about 1 MB/s, which is a joke. Write all logs into a stream and, in parallel, collect them per data source in RAM until a disk block's worth of data is collected. Only flush whole blocks to disk. This would optimize storage access, and you could reach the bandwidth limit of the hardware. In case of power failure the logs have to be re-processed, like a transaction log is re-processed by a database. We once optimized such a logger: we used no FS, just raw access to a spinning HDD, and we could sustain very good write bandwidth on cheap hardware.
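
A minimal sketch of the fixed-point plus variable-length difference encoding suggested above. This is not the article's format: the 1e-7 degree resolution, the zigzag mapping and the LEB128-style varint are assumptions borrowed from common practice, and all names are illustrative.

```python
SCALE = 10_000_000  # assumed resolution: 1e-7 degrees per unit

def to_fixed(deg):
    return round(deg * SCALE)

def zigzag(n):
    # Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3
    return 2 * n if n >= 0 else -2 * n - 1

def unzigzag(z):
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)

def write_varint(z, out):
    # 7 payload bits per byte; the high bit means "more bytes follow"
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)

def read_varint(buf, pos):
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7

def encode_track(degrees):
    """Store the first value in full, then only the difference to the previous one."""
    out, prev = bytearray(), 0
    for d in degrees:
        cur = to_fixed(d)
        write_varint(zigzag(cur - prev), out)
        prev = cur
    return bytes(out)

def decode_track(buf):
    vals, pos, prev = [], 0, 0
    while pos < len(buf):
        z, pos = read_varint(buf, pos)
        prev += unzigzag(z)
        vals.append(prev / SCALE)
    return vals

track = [52.5200000, 52.5200012, 52.5200031, 52.5199990]
encoded = encode_track(track)
print(len(encoded), "bytes instead of", 8 * len(track))
```

For this made-up track the first value takes 5 bytes and each delta takes 1, so four latitudes fit in 8 bytes rather than 32 as doubles. The saving depends on consecutive readings being close together, which is the case for a vehicle reporting every few seconds.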

  • @Michaeltje01
    @Michaeltje01 1 month ago

    8:58 KeyboardG: "high write and buffered is Kafka" Yeah I'm with this comment. I still don't understand why they couldn't use Kafka instead of some custom DB.

  • @themichaelw

    @themichaelw

    1 month ago

    100% this is literally just kafka but shittier. Kafka is crazy fast because it uses DMA and can move data from network card buffers to disk without copy and without CPU involvement. This article honestly reads like some engineers with too much ego to realize that the optimal solution here is to write to kafka and dump that to static storage as an intermediate step, or just right to a data warehouse.

  • @rainerwahnsinn2150

    @rainerwahnsinn2150

    1 month ago

    Kafka and then writing into Delta was my first thought.

  • @guptadagger896

    @guptadagger896

    1 month ago

    would you still have to aggregate out of kafka into something else for reporting

  • @Serizon_

    @Serizon_

    29 days ago

    @@themichaelw I understand, Kafka seems nice, though I don't understand what Kafka does :/

  • @7th_CAV_Trooper

    @7th_CAV_Trooper

    28 days ago

    @@guptadagger896 why? Kafka is an event db. It supports SQL style queries.

  • @jsax01001010
    @jsax01001010 28 days ago

    5:35 Prepping for scale can be worthwhile if they manage to get a contract with a very large company. A company I work for recently contracted with a company that provides a similar service. The small-scale test with 2,000 GPS trackers was straining their infrastructure. The full rollout of 200,000 trackers broke their service for a week or two while they had to rush to scale up their service by about 20x.

  • @woodendoorgarage

    @woodendoorgarage

    28 days ago

    Meaning they only have a very vague idea of how their system performs and scales. Not a great sign, to be honest. They should have emulated your production load themselves to figure out the scaling issues beforehand.

  • @christ.4977
    @christ.4977 1 month ago

    Isn't streaming data like this what kafka was made for?

  • @thomas-sinkala

    @thomas-sinkala

    29 days ago

    Read this post like 3 weeks ago and that was my question. Kafka, just use kafka!

  • @retagainez

    @retagainez

    28 days ago

    Perhaps it wasn't considered due to the fact that majority of customers deploy on-prem?

  • @georgehelyar

    @georgehelyar

    28 days ago

    ​@@retagainez you can deploy Kafka on prem easily enough. Also they said they were coming from AWS Aurora so their on prem thing is a bit weird.

  • @pieterrossouw8596

    @pieterrossouw8596

    27 days ago

    Exactly, there's plenty of data streaming stuff available. If you don't need exactly-once delivery, NATS Jetstream is also worth a look.

  • @oggatog3698

    @oggatog3698

    27 days ago

    I was just thinking this...

  • @avwie132
    @avwie132 1 month ago

    Saved cloud cost, now they have maintenance and ultra-specific-high-paid-developer cost and a self-induced vendor lock-in. Well done. Tens of thousands of vehicles and people isn't special and isn't big at all. Somehow everybody thinks their problem is a unique one. But it isn't. Looking at their proposition, it looks like something FlightTracker has been doing for ages... Writing a blog post about something you _just_ built is always easy because everything appears to work like it should. Now fast forward 5 years into the future, and see how you handled all the incoming business requirement changes in your bespoke binary format.

  • @woodendoorgarage

    @woodendoorgarage

    28 days ago

    The whole company could be 2 developers in a garage, in which case a custom solution that saves OPEX may just be the necessary thing to make the company profitable for a year or two. I agree it is not a very robust solution, but the whole thing is so simple you could migrate it to any alternative storage backend (like Kafka, an improved storage format, etc.) in under a week.

  • @complexity5545
    @complexity5545 1 month ago

    The title is a play on [ not knowing ] the difference between "database" and "database-engine." Databases are just files that store content. A Database-engine is a CPU process that manages connections (and users) that read||write specific blocks of a data file. It was still an interesting article. Good Video.

  • @hz8711
    @hz8711 18 days ago

    In similar project, i implemented elastic stack like this: A lot of live logs from thousands of machines > rabbitmq cluster (for buffer if logstashes are not able to handle the load) > logstash cluster (aggregating and modifying logs ) > elasticsearch cluster with well designed indexing and hot-warm-cold index rotation. Sounds like each ride can be a single record, and you can query by ID.

  • @mikeshardmind
    @mikeshardmind 1 month ago

    The thing about the version in the header is spot on, but unlikely to help them here since they want to be able to directly access specific bytes for fast indexing, so all the bytes for that can't ever change meaning. Assuming they haven't already used all of the available flags, the last flag value could be used to indicate another flag-prefixed data section.

  • @hck1bloodday

    @hck1bloodday

    29 days ago

    That would be true if the package had a fixed length, but since you can skip sections (hence the "has xxx" flags in the header) the length is variable and they can't just go to specific bytes via indexing.

  • @mikeshardmind

    @mikeshardmind

    29 days ago

    The normal purpose of a version as the first field in the header allows everything, including the header, to change. The article (and the video) both discuss indexing on bytes in the header which are always there and not part of the variable capabilities.

  • @siquod

    @siquod

    28 days ago

    Why would you need a version field in every database record? One for the whole database is enough. Or did you think this was a network protocol? As I understand it, it's a file format.

  • @bkucenski
    @bkucenski 29 days ago

    There's a workaround for the version-in-header thing. You can run V2 on a different port. But that's less safe than getting your minimum header right out of the gate. Error checking is also a good idea so that if something gets munged up in transit (or you send the wrong version to the wrong destination), a simple math function will mostly guarantee the bits won't math and the message can be rejected. You can also then check what version the bits do math for and give a nice error message.
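
One way to sketch the "simple math function" idea is a CRC-32 trailer over a version byte plus payload. The framing below (one version byte, payload, four-byte little-endian CRC) is an assumption for illustration, not the format from the article; `frame` and `unframe` are made-up names.

```python
import struct
import zlib

def frame(version, payload):
    """Prefix a one-byte version and append a CRC-32 of version + payload."""
    body = struct.pack("<B", version) + payload
    return body + struct.pack("<I", zlib.crc32(body))

def unframe(message):
    """Return (version, payload), or raise ValueError if the bits don't add up."""
    if len(message) < 5:
        raise ValueError("message too short")
    body = message[:-4]
    (crc,) = struct.unpack("<I", message[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch")
    return body[0], body[1:]

msg = frame(2, b"lat=52.52,lon=13.40")
version, payload = unframe(msg)

# Flip a single bit in transit: the receiver rejects the message.
damaged = bytes([msg[0], msg[1] ^ 0x01]) + msg[2:]
try:
    unframe(damaged)
    rejected = False
except ValueError:
    rejected = True
```

A CRC catches accidental corruption only. It offers no protection against deliberate tampering, which would need a keyed MAC or a signature.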

  • @michaellatta
    @michaellatta 1 month ago

    In their case I would use Kafka to collect the data, and materialize to a database for queries.

  • @7th_CAV_Trooper

    @7th_CAV_Trooper

    1 month ago

    Or just leave it in Kafka.

  • @Sonsequence

    @Sonsequence

    29 days ago

    They're not logging anything they don't already need for querying so if they materialized to a DB it would just be the same throughput problem with extra steps. I don't know whether or not they could have made Kafka work performantly for them on the read end for their GIS queries

  • @artursvancans9702

    @artursvancans9702

    29 days ago

    @@Sonsequence not really. the aggregate view might be buffered, flatmapped and updated every 5 seconds or so. the main thing they want is being able to have the data for whatever reason they might need in the future.

  • @michaellatta

    @michaellatta

    29 days ago

    @@Sonsequence if they need every data point yes. But, given they are only keeping 30GB of data they could hold that in RAM using one of the in-memory databases, and let Kafka tiered storage hold the history. No custom serialization required, and a full query engine.

  • @Sonsequence

    @Sonsequence

    29 days ago

    @@michaellatta yeah, just going with in-memory for the recent might be a good option but I don't think there's a GIS DB for that. Would still have to be custom.

  • @nightshade427
    @nightshade427 1 month ago

    If queries aren't frequent but collecting the data needs to be fast, I wonder whether something like Kafka/Redpanda capturing the data (the 30k+ throughput specified shouldn't be an issue for these) and processing it at their leisure into a view DB for easy querying after it's captured would have worked. I don't know their specifics, but it seems like it might have been simpler? It would even work on premises for some of their clients.

  • @paulmdevenney
    @paulmdevenney 1 month ago

    I thought the first rule was "don't invent your own security", but I think a close second might be don't invent your own database. If your entire business workflow isn't focused around making that thing better, then you're in for a bad time.

  • @EraYaN
    @EraYaN 1 month ago

    You really don’t need the version per record, per chunk is more than good enough. You are going to do time based migrations anyway so it’s all good (as in start a new chunk at time stamp x with version n+1).
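
A sketch of what per-chunk versioning could look like: one small header per chunk, with the record layout chosen by the version in that header. The magic bytes, field layout and both record versions are hypothetical and not taken from the article.

```python
import struct

MAGIC = b"TRK1"  # hypothetical file magic
# Header: magic, format version, chunk start time (unix seconds), record count
HEADER = struct.Struct("<4sHqI")

# Hypothetical record layouts: v1 = fixed-point lat/lon, v2 adds a heading field
RECORD = {
    1: struct.Struct("<ii"),
    2: struct.Struct("<iih"),
}

def write_chunk(version, start_ts, rows):
    rec = RECORD[version]
    body = b"".join(rec.pack(*row) for row in rows)
    return HEADER.pack(MAGIC, version, start_ts, len(rows)) + body

def read_chunk(buf):
    magic, version, start_ts, count = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("not a chunk")
    rec = RECORD.get(version)
    if rec is None:
        raise ValueError(f"unknown format version {version}")
    rows = [rec.unpack_from(buf, HEADER.size + i * rec.size) for i in range(count)]
    return version, start_ts, rows

# Time-based migration: a new chunk starting at timestamp x uses version n+1.
old_chunk = write_chunk(1, 1_700_000_000, [(525200000, 134050000), (525200012, 134050020)])
new_chunk = write_chunk(2, 1_700_003_600, [(525200031, 134050044, 90)])
```

The header here is 18 bytes per chunk, so the version costs nothing per record, and old chunks stay readable as long as their entry in the layout table is kept.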

  • @MagusArtStudios
    @MagusArtStudios 4 days ago

    In my experience writing databases went extremely well with some caveats over time as branching databases emerged to keep different data sources organized. It was actually so good I turned some data into a chatbot AI with context labeled networks weights and synonym, antonym, noun, and reflection attention mechanisms. Long story short writing databases is so much fun. :)

  • @Amit-sp4qm
    @Amit-sp4qm 28 days ago

    Also i think, hiring extra would not be much issue as same application developers are adding this functionality to their app .. In a more simple term they replaced dedicated database and all the handling code to some relatively simple memory writes .. Also probably saved on a few database developers themselves in the team ..

  • @mikemcaulay9507
    @mikemcaulay9507 13 days ago

    I worked for a company that did tracking of mining and construction equipment, and from what I recall they were able to set up Wi-Fi access points at a location to help with triangulation. Pretty sure this is why your iPhone can give such precise locations if they have access to your APs.

  • @hanskessock3941
    @hanskessock3941 29 days ago

    The amount of storage they claim to create, 100 GB per month, does not remotely match the storage rates they claim they need: even if they only stored two doubles for lat/long, they would store 40 GB per day. Supposedly they store a ton of extra stuff, and they are (weirdly) rolling their own compression as deltas, but those deltas require significant-digit precision. It seems like they're just making things up.
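
The arithmetic behind the 40 GB per day figure, as a small worked example. It assumes the 30,000 updates per second quoted elsewhere in this thread is a sustained rate, which the thread does not confirm, and a 30-day month.

```python
UPDATES_PER_SECOND = 30_000   # rate quoted elsewhere in the thread (assumed sustained)
BYTES_PER_UPDATE = 2 * 8      # two 64-bit doubles: latitude and longitude
SECONDS_PER_DAY = 86_400

bytes_per_day = UPDATES_PER_SECOND * BYTES_PER_UPDATE * SECONDS_PER_DAY
gb_per_day = bytes_per_day / 1e9

# Working backwards from the claimed 100 GB per month:
claimed_bytes_per_second = 100e9 / (30 * SECONDS_PER_DAY)
implied_updates_per_second = claimed_bytes_per_second / BYTES_PER_UPDATE

print(f"{gb_per_day:.1f} GB/day at the quoted rate")
print(f"about {implied_updates_per_second:.0f} updates/s would fit in 100 GB/month")
```

At 16 bytes per update the two figures line up only if the sustained rate is well below 30,000 per second, or if the stored records are much smaller than two raw doubles.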

  • @ldybdahl
    @ldybdahl 2 days ago

    We did something similar: it took 2 Ukrainian programmers a couple of months to create an insanely fast system that runs at negligible cost. The costs of developing and using the database engine were lower than the costs of introducing a database like PostgreSQL into production. The complexity level is comparable to writing Parquet files.

  • @danieltumaini7037
    @danieltumaini7037 29 days ago

    21:02 best take, for any custom project. gracias concurreagen

  • @josecanciani
    @josecanciani 28 days ago

    About the version, my take: I think due to the big size they have, new versions can just be implemented in entirely new nodes. The new nodes will run with the new binaries, only for new data. The reads will do parallel connections to different nodes anyway. There's no need to mix different versions in the same nodes, just clusterize based on version. It doesn't seem they would need it, but if they do, they can migrate the old data eventually, although probably won't make sense unless they need to change format too often.

  • @hemmper
    @hemmper 27 days ago

    Storing diffs is a good idea, like in video codecs, as Mr. Prime said. Also, if some accuracy can be sacrificed, as with lossy video compression, you can prune the GPS track: skip records and interpolate (linear or "curvy") or calculate them when you need them. Maybe look at alternative float formats, including storing the logarithms of the numbers as ints instead of the floats themselves, which is a little bit like what the usual float formats do, though the traditional float formats may carry more precision bits than you really need. Traditional RDBMSes can have user-defined functions programmed in common languages, including Java and C, and compiled into the database. Those functions can pack/unpack the data directly in SQL and run quite fast. Postgres can also index on the result of such functions. I think most of us should go far to avoid creating our own database systems. Also, most larger database systems need secondary analytics databases into which only the data most needed for analytics/statistics is transformed and copied.

  • @darkwoodmovies
    @darkwoodmovies 1 month ago

    At first I thought saving $10k per month was worth it, but then I realized that a single entry-level software engineer costs more

  • @alexsherzhukov6747

    @alexsherzhukov6747

    29 days ago

    merica

  • @darkwoodmovies

    @darkwoodmovies

    29 days ago

    @@alexsherzhukov6747 Huh?

  • @alexsherzhukov6747

    @alexsherzhukov6747

    29 days ago

    @@darkwoodmovies entry level 10k/mo? there is one single place on earth where that could be happening

  • @darkwoodmovies

    @darkwoodmovies

    29 days ago

    @@alexsherzhukov6747 Ooh yeah, true

  • @Narblo

    @Narblo

    28 days ago

    entry level software engineer are 3k/mo

  • @bdafeesh
    @bdafeesh 1 month ago

    This is such a huge decision; I would only trust the most competent teams/coworkers to pull off writing our own database solution... Such a cost to undertake for such a generic use-case. Sure, they have customers and looking to grow, great, pick any of the many open-source options for efficiently storing time-series data. So much more reliable using an already battle-tested product. Not to mention that material already exists for everyone/new team-members to reference and learn from... Don't roll your own database folks. Even when you win, you'll still lose. And to my business friends: Keep simple, more engineers = more cost. Efficient engineers = happy engineers = faster + better products..

  • @research417

    @research417

    28 days ago

    They're going to need to pay for a dedicated team of people to manage and work on this, and I'm pretty sure it'll come out to more than 10k a month...

  • @steffenbendel6031

    @steffenbendel6031

    12 days ago

    Well, I would say it is mainly a very simple binary file. They didn't even do any tricky compression. (I once did a binary file for storing exchange data that used arithmetic compression on the diffs of the values. That would also fit the requirements, since compressing is faster than decompressing.)

  • @ProzacgodAI
    @ProzacgodAI 29 days ago

    Versioning can be done per file/chunk, which is how I've handled that in the past, instead of versioning per record. Another... less reliable way... um, "if (createDate(file_chunk) oh my god have I seen the latter a lot over the years.

  • @blarghblargh
    @blarghblargh 28 days ago

    Version 0 is the version without a version. Only would work if you get lucky and the fields in that spot don't conflict with the potential version values.

  • @vsolyomi
    @vsolyomi 27 days ago

    GPS can get down to cm accuracy with some auxiliary ground-based stuff and/or post-processing adjustments.

  • @aaronjamt
    @aaronjamt 25 days ago

    About the versioning issue: there may be flag bit(s) reserved for future versioning, even one bit is enough. That way, then you can say "if you see this bit set, parse the rest differently" and maybe add a version field at that point. Also, maybe there's some reserved lat/long value they use as an update flag, like 65536 degrees or similar.
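
A sketch of the reserved-flag-bit idea: records written before versioning existed have the bit clear, and a set bit tells the parser that a version byte follows. The flag names and bit positions are invented for illustration, and whether the real format has a spare bit is not known from the article.

```python
FLAG_HAS_LOCATION = 0x01  # hypothetical existing flags
FLAG_HAS_BATTERY = 0x02
FLAG_EXTENDED = 0x80      # hypothetical reserved bit: a version byte follows

def pack_header(flags, version=0):
    """Version 0 keeps the original one-byte header; later versions add a byte."""
    if version == 0:
        return bytes([flags & ~FLAG_EXTENDED & 0xFF])
    return bytes([(flags | FLAG_EXTENDED) & 0xFF, version])

def parse_header(buf):
    """Return (flags, version, header_length)."""
    flags = buf[0]
    if flags & FLAG_EXTENDED:
        return flags & ~FLAG_EXTENDED & 0xFF, buf[1], 2
    return flags, 0, 1

legacy = pack_header(FLAG_HAS_LOCATION | FLAG_HAS_BATTERY)
upgraded = pack_header(FLAG_HAS_LOCATION | FLAG_HAS_BATTERY, version=2)
```

Old records parse exactly as before, and only records written after the change pay the extra byte. This only works if the original writers never set that bit.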

  • @steffenbendel6031

    @steffenbendel6031

    12 days ago

    And there might be a header for the file. They only showed a single data entry.

  • @m4cias

    @m4cias

    12 days ago

    @@steffenbendel6031 That's what I thought. Repeating version in each entry would cost extra few % of storage. It would make more sense in case of the broadcasting data between nodes idea.

  • @adamszalkowski8226
    @adamszalkowski8226 1 month ago

    Reading the requirements, it sounds like they would be fine saving the data in S3.

  • @hanswoast7

    @hanswoast7

    16 days ago

    Yes, they say so in the article^^

  • @tylerbakeman
    @tylerbakeman 27 days ago

    0:50, “When you write your own language, usually it’s after decades of experience”. Part of the reason there are so many languages is our ability to build frameworks in other languages. If I have a String format that can be parsed from a file, and I have a custom file extension, that’s essentially the same thing. Magic-value String formatting is a common practice (and issue), probably more so in the gaming industry: there are different formats for object data, and it is not uncommon to see a game import an asset, build off of those assets, and create either a JSON or a custom file format. So in a sense, developers create their own languages all of the time (not necessarily large-scale multi-purpose languages like Python), and they probably shouldn’t most of the time, because there are common formats for just about everything.

  • @JeremyAndersonBoise
    @JeremyAndersonBoise 29 days ago

    Should have used Redis/ValKey, honestly. Kafka is also a great choice, but I like what they did even though it’s an AOF type log not a DB

  • @tomipanula-ontto2607
    @tomipanula-ontto2607 1 month ago

    I am not so concerned about the versioning. They could simply have one version per file or data directory, or maybe it depends on the software version. If they need to upgrade, they can easily write a program to read the old data and spit out the new format. That is quite common in "real databases" too.

  • @JamesMurphy1984
    @JamesMurphy1984 23 days ago

    You can adapt a relational database to use a stored proc (for speed) and just have the coordinates in a bounded box since that’s much more efficient than calculating with a circle. Why did they need a brand new DB solution for it and how much money did it take to build AND maintain it? What about security upgrades and costs associated with that? Crazy stuff.
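
A sketch of the bounding-box point, written in plain Python rather than as a stored procedure: the cheap box test rejects most candidates, and the exact great-circle distance is only computed for points inside the box. It assumes a spherical Earth and a search area away from the poles and the antimeridian; all names are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean radius, spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def bounding_box(lat, lon, radius_m):
    """Smallest lat/lon box containing the circle (not valid near the poles)."""
    ang = radius_m / EARTH_RADIUS_M
    dlat = math.degrees(ang)
    dlon = math.degrees(math.asin(math.sin(ang) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

def within_radius(points, lat, lon, radius_m):
    lat_lo, lat_hi, lon_lo, lon_hi = bounding_box(lat, lon, radius_m)
    hits = []
    for p_lat, p_lon in points:
        if not (lat_lo <= p_lat <= lat_hi and lon_lo <= p_lon <= lon_hi):
            continue  # cheap rejection: four comparisons, no trigonometry
        if haversine_m(lat, lon, p_lat, p_lon) <= radius_m:
            hits.append((p_lat, p_lon))
    return hits

points = [
    (52.5205, 13.4055),  # roughly 65 m away: inside the circle
    (52.5208, 13.4062),  # inside the box, but outside the 100 m circle
    (53.0000, 13.4050),  # far away: rejected by the box test alone
]
hits = within_radius(points, 52.52, 13.405, 100.0)
```

In a relational database the box test maps onto plain range predicates on indexed latitude and longitude columns, which is where most of the speed comes from. Whether the final circle check is needed at all depends on the application.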

  • @gammalgris2497
    @gammalgris2497 29 days ago

    Sounds rather that they don't need a relational database but a transactional database (don't remember the actual name) where they just store each incoming data record. At any given time you can retrace the movement pattern of each tracked entity by going through all stored records. There surely are numerous implementations for that I would guess.

  • @0x0404
    @0x0404 28 days ago

    This could be a rare example of not having a version field in the header, or even a header at all. They've got the database itself. If they have to change anything, new whatever, stick it in a new database. 1 extra byte per entry when you've got data coming in as fast as it sounds like they are might be too expensive on something that effectively doesn't change.

  • @bergels9408
    @bergels9408 29 days ago

    It looks like he's describing an avionics data bus standard? ARINC 429 came to mind and seems to fit the application. It could be that the application is so generic that any shoe could fit, but I wonder if that's what's being used behind the scenes?

  • @velo1337
    @velo1337 1 month ago

    We track around 400 vehicles and our Postgres DB is burning. But we also do a lot of computation on that data. It's around 12-14k transactions/second.

  • @SandraWantsCoke

    @SandraWantsCoke

    Ай бұрын

    What about optimizing the tracking by not tracking too often when the car is on a straight road with no intersections? Or when the speed is 0 track less?

  • @muaathasali4509

    @muaathasali4509

    Ай бұрын

    You should at least use TimescaleDB with Postgres. It's just an extension and it will significantly improve performance. But if your use case is very analytics-heavy, then you should use ClickHouse, TDengine, VictoriaMetrics, etc., which are also better for a distributed setting compared to Postgres.

  • @velo1337

    @velo1337

    29 күн бұрын

    @SandraWantsCoke Standing time is valuable data.

  • @BosonCollider

    @BosonCollider

    29 күн бұрын

    If you are not using TimescaleDB, make sure to use BRIN indexes.

  • @LtdJorge

    @LtdJorge

    28 күн бұрын

    @muaathasali4509 Those other database engines you suggest are all OLAP, which are very, very bad at many TPS. OP is better served by something like Timescale or InfluxDB.

  • @zxuiji
    @zxuijiАй бұрын

    15:59 They could just make one of the available flags mean "has extended flags" or "has version"

  • @Bozebo

    @Bozebo

    29 күн бұрын

    They might be able to get away with assuming the version from the time too? Similar to flags it'd be better at the start of the header though if used for version; could get away with it if production is only expected to read the latest version and not older versions too.

  • @tsx7878

    @tsx7878

    29 күн бұрын

    It’s simpler than that really. Prime is confused here: this is not a wire protocol. It’s an on-disk format. You put the version in the file header. When you deploy a new version, it starts writing to a new file, but it can still read the old files.
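    A rough sketch of that file-header approach: the version is written once per file, and the reader picks the record layout from it. The magic bytes, the version numbers, and both record layouts below are invented for illustration; the article does not describe the real on-disk format at this level of detail.

```python
import io
import struct

MAGIC = b"LOCS"                 # hypothetical 4-byte file signature
HEADER = struct.Struct("<4sH")  # magic + format version, written once per file

# One fixed-size record layout per format version (both layouts are made up).
RECORD_LAYOUTS = {
    1: struct.Struct("<Iff"),   # timestamp, lat, lon
    2: struct.Struct("<Ifff"),  # timestamp, lat, lon, speed
}


def write_file(stream, version, records):
    """Write the header once, then the records with no per-record version."""
    layout = RECORD_LAYOUTS[version]
    stream.write(HEADER.pack(MAGIC, version))
    for record in records:
        stream.write(layout.pack(*record))


def read_file(stream):
    """Read the header, then decode every record with the matching layout."""
    magic, version = HEADER.unpack(stream.read(HEADER.size))
    if magic != MAGIC:
        raise ValueError("not a location file")
    layout = RECORD_LAYOUTS.get(version)
    if layout is None:
        raise ValueError(f"unsupported format version {version}")
    records = []
    while True:
        chunk = stream.read(layout.size)
        if len(chunk) < layout.size:
            break
        records.append(layout.unpack(chunk))
    return version, records
```

    A newer build would write only version 2 files while keeping the version 1 layout around for reading, so no record pays for a version field.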

  • @zxuiji

    @zxuiji

    29 күн бұрын

    @tsx7878 I know what flags are. I suggested using them for adding the versioning because that's the easiest way to check what type of object was handed to them without modifying the original object. The lack of the appropriate flag says it's the original object; anything else is a newer object. The header can then be modified to expose a new function that accepts a void pointer instead of a predefined type. The old function's source can be renamed to this new function and slightly modified to be a wrapper around the new function, thus retaining the "one root function" rule for stable behaviour without breaking the existing ABI.

  • @sullivan3503

    @sullivan3503

    28 күн бұрын

    @tsx7878 Thank you. I had this exact thought.

  • @DKLHensen
    @DKLHensenАй бұрын

    If stuff like this counts then I'm a database developer as well, putting that on my resume right now! thanks, another good video

  • @jobko88
    @jobko88Ай бұрын

    Who am I to question their decision, but wouldn't it have been easier/faster to have a black box in each car instead and do writes to the server in batches?..

  • @petermeshackjobs8076
    @petermeshackjobs8076Ай бұрын

    I created my own, plus a database API, hosted it locally, and I'm doing fine.

  • @musicalducky6623
    @musicalducky662329 күн бұрын

    Instant like for the version field.

  • @pawol9315
    @pawol93154 күн бұрын

    "That's probably Machhhron's creation" Love it!

  • @scottspitlerII
    @scottspitlerII9 күн бұрын

    I literally just saved $15k a month moving off of AWS to another cloud vendor. It’s insane how expensive the cloud is getting

  • @Delfigamer1
    @Delfigamer129 күн бұрын

    I don't think the individual update frames are ever present by themselves. In the storage, they must be bundled together into large blocks - and then you can write the version in the file header, since you won't ever mix multiple update-frame versions in a single block. The same goes for the real-time updates - they must be happening in the context of some persistent connection, and so there you can negotiate the version during the connection's handshake. Thus, you don't need to version each individual frame, that would actually be a waste of already precious space. It's like if, in HTML, you'd be writing a DOCTYPE for _every individual tag_ instead of just having a single one for an entire document.

  • @bigbug1991

    @bigbug1991

    28 күн бұрын

    Thank you! Thought exactly the same while watching the video.

  • @sullivan3503

    @sullivan3503

    28 күн бұрын

    Yeah, him saying this caught me off guard. Pretty sure the only reason we have versions in things like internet packets is because there is physical hardware in the loop that has static circuits based on the version of the packets.

  • @steffenbendel6031

    @steffenbendel6031

    12 күн бұрын

    I agree.

  • @635574
    @635574Ай бұрын

    I think they bet on the fact that the location data format will never ever change and the version of the software will be irrelevant for it.

  • @fb-gu2er
    @fb-gu2erАй бұрын

    We do about 60-70k transactions per second on average. With peak hour much higher. This is not a whole lot

  • @r9999t

    @r9999t

    Ай бұрын

    Yeah, we were doing 20K transactions per node at an adtech company 10 years ago, so 60-70K should be cake today.

  • @muaathasali4509

    @muaathasali4509

    Ай бұрын

    Yeah... I don't really get it. A cheap PC can handle 100k+ writes per second with batching

  • @TheofilosMouratidis

    @TheofilosMouratidis

    Ай бұрын

    per database node?

  • @jdahern

    @jdahern

    Ай бұрын

    I was thinking the same thing. Their load is not that high. It sounds more like poor indexing or a bad set of hardware for the on-premise clients.

  • @SimonBuchanNz

    @SimonBuchanNz

    29 күн бұрын

    Transaction != Transaction. You can't just compare incrementing a view counter to whatever GIS magic is going on here; 30k/s might be easy, it might be impossible.

  • @gjermundification
    @gjermundification27 күн бұрын

    3:47 5m x 5m; however, with an accelerometer and other movement trackers, such as RTK, it's possible to calculate much better data. The same goes for triangulation over 5G...

  • @mup3217
    @mup321727 күн бұрын

    You don't need to save "version" at the row level (in your format). It wastes storage space! When the format changes, just save the data in another table and call it "list_v2".

  • @manafount2600
    @manafount260025 күн бұрын

    I'm at a company that ended up writing their own DB for time-series data. The scale is much larger, both in terms of our engineering organization (thousands) and the amount of data processed (~500 billion writes/day, trillions of reads/day). We can accept similar sacrifices in consistency, but our use case and the data we store aren't quite as specific. All of the things you pointed out about engineering hours for developing and maintaining a custom DB are spot on - cost savings, even at our scale, are not a good reason to roll your own DB. Maybe if we were Google-scale that'd be different, though...

  • @neo-vj4zq
    @neo-vj4zqАй бұрын

    Honestly, we use an off-the-shelf enterprise solution from an external company, and this level of throughput is trivial.

  • @Kenjuudo
    @Kenjuudo2 күн бұрын

    They have an "entry length" field that effectively works as a version number.

  • @davidjohnston4240
    @davidjohnston424021 күн бұрын

    Writing a DB doesn't scare me. I've done it. Local, ACID, backups and stuff. Very closely tied to the application (a physical store), with a great/fast/easy curses-based text UI at the checkout that the staff loved. When you know the theory and you know your target, a custom DB is 1000X more efficient and 1000X faster.

  • @Lampe2020
    @Lampe202027 күн бұрын

    17:07 Well, then I've done everything right with L2DB (a binary config file format originally intended to be a database, therefore the name), which has its eight magical bytes (\x88L2DB\x00\x00\x00), directly followed by three unsigned shorts for major, minor and patch format version. After that comes all the fun part with the flags and other stuff.

  • @r9999t
    @r9999tАй бұрын

    Why not use some streaming database solution? There's Flink and I'm not sure if they've changed name after they were acquired, but there used to be SQLstream (which I think is called Kinesis Analytics, or something similar, inside AWS). Both of those would probably handle all these requirements, and not require you to write your own database. Also if you really need to, you can always add blobs with a custom data format to compress the on-disk size. There will probably be a bit of shoehorning to get every element to work, but nothing like the effort of writing your own database. Also various messaging systems might work as well, but then the querying might be more difficult and/or limited.

  • @steffenbendel6031

    @steffenbendel6031

    12 күн бұрын

    But if you mainly just want to write a file, just write the file. Like Elon would say, it is not that complicated. Does Netflix put their movies into normal databases? (Well, certainly not into a DB hosted with someone else.)

  • @krellin
    @krellinАй бұрын

    They effectively did what fintech companies do. You can't use a DB as a service if you truly want exceptional performance; your application must be a DB on its own, like a specialised DB. They all have many common features. Also, the engineering cost, while high, is a one-off cost compared to your ever-increasing cloud costs. Once engineering has done the job, it's significantly cheaper to run and maintain.

  • @efkastner
    @efkastnerАй бұрын

    With that many writes a second, UDP is probably not a great idea. Without tweaking the network stack, they'll likely be dropping a lot of packets and need to spend more engineering effort there. (Source: I wrote* StatsD. *Original idea from Flickr, taken with permission. NB: I haven't touched Linux networking in a decade; it might be WAY better now.)

  • @creativecraving

    @creativecraving

    Ай бұрын

    Maybe they rewrote TCP atop UDP, too? 😂

  • @rogierlodewijks8646
    @rogierlodewijks864627 күн бұрын

    Just introduce a bit-flag indicating a version field. Also, the 0x80 bit being set means: additional flag field next... and bam... scalez 2 infinity!

  • @jordixboy
    @jordixboyАй бұрын

    Don't forget about load balancers, they are expensive as hell as well...

  • @MichaelScharf
    @MichaelScharfАй бұрын

    Use one of the flags as "next byte is version", and that's it. No need to have the overhead of a version now.
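    The flag trick can be sketched roughly like this. It assumes a 16-bit flags field whose highest bit is currently unused and treats any entry without that bit as version 1; the bit position and the rest of the entry layout are assumptions, not the article's actual format.

```python
import struct

FLAG_HAS_VERSION = 0x8000  # assumed: highest bit of the 16-bit flags is unused
IMPLICIT_VERSION = 1       # entries written before versioning existed


def encode_entry(flags, payload, version=None):
    """Prefix the payload with flags, plus a version byte only when needed."""
    if version is None:
        # Old-style entry: no version byte, the flag bit stays clear.
        return struct.pack("<H", flags & ~FLAG_HAS_VERSION & 0xFFFF) + payload
    return struct.pack("<HB", flags | FLAG_HAS_VERSION, version) + payload


def decode_entry(data):
    """Return (flags without the version bit, version, payload)."""
    (flags,) = struct.unpack_from("<H", data, 0)
    offset = 2
    version = IMPLICIT_VERSION
    if flags & FLAG_HAS_VERSION:
        version = data[offset]  # the optional byte right after the flags
        offset += 1
    return flags & ~FLAG_HAS_VERSION, version, data[offset:]
```

    Existing entries stay byte-for-byte valid under this scheme, and only entries in a newer format pay the extra byte.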

  • @nathanpotter1334
    @nathanpotter133428 күн бұрын

    I showed my co-workers the Tigerbeetle demo - Easily the coolest demo ever

  • @qkktech
    @qkktech28 күн бұрын

    As always, it is about the encoding: when you use GPS WGS data rather than metric coordinates, you get very bad calculations, but when you use Mercator or a similar encoding, the calculations are metric.

  • @Youtube_Stole_My_Handle_Too
    @Youtube_Stole_My_Handle_Too25 күн бұрын

    RTK is completely irrelevant in this context. RTK is for base-rover configuration, not loose vehicles on random roads. Better accuracy comes from differential GPS, better receivers and signal processing, accelerometers, and in some cases, steering data from the onboard computer.

  • @mrcuddles90
    @mrcuddles9027 күн бұрын

    Don't try to re-invent the wheel. There are a lot of nicely polished wheels out there.

  • @orlovskyconsultinggbr2849
    @orlovskyconsultinggbr284928 күн бұрын

    Hm, let me think for a minute: how would I solve this problem? Well, Java clustering is a thing. Why write to disk? Why not write to memory and store to disk only when necessary and when it makes sense? Their dependency on the cloud is beyond me. They would probably be better off with their own server equipment. Really, I respect this guy; his wish was to be creative, very creative. I am most certainly not creative. Before I build something new, I will look for business-ready ERP systems, hook them into a chain, and have my operating costs covered, and if those ERPs do not cover all of my requirements, then I will build extensions.

  • @PeterVerhas
    @PeterVerhas12 күн бұрын

    1 byte for version is enough. If not, then before you run out allocate a new version byte in the new version (sub version kind of).

  • @technokicksyourass
    @technokicksyourassАй бұрын

    I wonder if they just needed to write an encoding scheme, rather than an entire database. One thing developers often miss is the operational impact in production of doing things custom. It's easy to find a support/operations team that can handle backups, routing, monitoring, disk management and so on for off-the-shelf solutions. When you do it all custom, you often end up having to manage the operations yourself as well.

  • @research417

    @research417

    28 күн бұрын

    Very true. I'm actually a little confused about why they needed to do this at all. Their main problems were that they needed extremely high write performance (up to 30k locations per sec), and they wanted the data to take up as little space as possible. To solve this, they developed a binary encoding scheme, only stored the full object state every 200 writes (and between updates just stored the changed fields), and they batched their updates to one write per second per node. They also moved everything above 30 GB in their new scheme to AWS Glacier because they don't need speed for data that old. Like you said, I feel like they could've stopped after the encoding scheme part? I feel like they would've saved enough money to the point where the rest was achievable without writing their own database. Even if they did nothing at all, 10k a month isn't bad at all? That's basically the salary for a single database engineer, and now they need to handle maintaining and updating the software, backups, routing, monitoring, disk management, etc. And that's not counting how much it costs to pay the engineers to build and validate this system originally. Maybe after 10 years it'll be viable (not accounting for the fact that AWS and other options will likely be cheaper then), but I feel like I'm missing something because this doesn't seem like a good solution at all?
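    To make the "full state every 200 writes, changed fields in between" part concrete, here is a small sketch using plain dictionaries instead of a binary encoding. The function names and the field names in the usage note are made up, and the sketch assumes fields are only added or changed, never removed.

```python
FULL_STATE_INTERVAL = 200  # per the summary above: a full state every 200 writes


def encode_updates(states, interval=FULL_STATE_INTERVAL):
    """Turn a sequence of full states into full snapshots plus deltas."""
    entries = []
    previous = None
    for index, state in enumerate(states):
        if previous is None or index % interval == 0:
            entries.append(("full", dict(state)))
        else:
            # Keep only the fields whose value differs from the previous state.
            delta = {k: v for k, v in state.items() if previous.get(k) != v}
            entries.append(("delta", delta))
        previous = state
    return entries


def decode_updates(entries):
    """Rebuild every full state by replaying deltas on the last snapshot."""
    states = []
    current = {}
    for kind, fields in entries:
        if kind == "full":
            current = dict(fields)
        else:
            current = {**current, **fields}
        states.append(current)
    return states
```

    With states like {"lat": 1, "lon": 2, "speed": 0} where only one field changes per update, most entries shrink to a single field, and a reader never has to replay more than one interval's worth of deltas to recover any state.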

  • @michaellatta
    @michaellattaАй бұрын

    Building a special purpose database can make sense, but it requires very special skills and requirements to make sense.

  • @creativecraving

    @creativecraving

    Ай бұрын

    It takes specialized skills to make a general-purpose database; but if you're targeting a specific use case, where the workload is greatly reduced, it makes sense to try something out and see how it goes. If it's bad, you can quickly iterate.

  • @cravecode1742
    @cravecode174229 күн бұрын

    I feel like they solved the wrong problem. Use something like Azure’s Event Hub or AWS’s MSK for the ingestion. Partition and store as you’ve chosen. Keep metadata in a common DB for assembling the needed queries. I’ve faced similar-sounding obstacles with financial transactional data.

  • @Omnifarious0
    @Omnifarious029 күн бұрын

    I wrote my own database. But, it was also 1991. And I could guarantee that there was only one reader or writer at any given point in time.

  • @Tobarja
    @TobarjaАй бұрын

    As Mr. Warbucks said: "Did I just do a commercial?!"

  • @daniivanov4554
    @daniivanov455425 күн бұрын

    you remind me of my first boss, very cool person

  • @7th_CAV_Trooper
    @7th_CAV_TrooperАй бұрын

    Seems like an off the shelf LSM storage engine would get the job done.

  • @2Fast4Mellow
    @2Fast4Mellow28 күн бұрын

    A database that loses data is NEVER fine! If your application is fine with losing data, it should be the application that filters what actually is written to a database. Databases cannot 'buffer' writes. Databases should be atomic (that is why we use transactions), and once the database chooses which updates to keep and which not, this is very dangerous. Versioning could be written as part of the file identifier, so you don't have to store the version in every record. I feel they needed a time-series database.

  • @emaayan
    @emaayan29 күн бұрын

    Good luck hiring people to maintain that. Anyone who's not desperate and knows what they're getting into would run as soon as they hear "we've written our own custom DB", meaning they won't get experience in well-known frameworks and will get vendor-locked, and it would get worse once the original architects move away.

  • @Valeriano.A.R
    @Valeriano.A.R29 күн бұрын

    The field format is not defined as a network protocol format. The version could be in the header of the file/blob.

  • @dandogamer
    @dandogamerАй бұрын

    Not sure if I'm missing something, but couldn't this have been done with any streaming technology, e.g. NATS or Kafka, with EBS targeted as the sink?

  • @pertsevds
    @pertsevdsАй бұрын

    The name - is The Versiongen!

  • @tamtrinh174
    @tamtrinh17429 күн бұрын

    they said it's better but they didn't say it is a lot more expensive

  • @roadrunner3563
    @roadrunner3563Ай бұрын

    Not surprising. I almost always wrote my own database tailored to the specific dataset. Performance was always MUCH better than a commercial (or freeware) database. Multiuser access across network users was included, but I didn't require a web interface. It was all native app.

  • @CamembertDave
    @CamembertDave24 күн бұрын

    "In process storage engine that's part of the same executable as our core server." This is the way.

  • @LionKimbro
    @LionKimbro28 күн бұрын

    They might be versioning via the flags; They've got 16 bits in there for it, and I imagine a bunch are reserved.

  • @Amit-sp4qm
    @Amit-sp4qm28 күн бұрын

    Why put a version in every packet when you don't have to? Would the protocol change during a single session of continuous writes?

  • @linkfang9300
    @linkfang9300Ай бұрын

    Before I clicked play, I thought it's "using cloud services VS using a VM hosting a whatever database". So, it is actually writing a database from scratch...😱

  • @belabour1
    @belabour128 күн бұрын

    You can put the protocol version to the file name, no need to include it in every data item.

  • @ashutoshmittal3292
    @ashutoshmittal329228 күн бұрын

    Feels like I discovered a new genre of KZread videos

  • @johnneiberger7311
    @johnneiberger731118 күн бұрын

    lol I was thinking that this should have been written in Erlang and someone else mentions it.

  • @romaincramard5301
    @romaincramard530125 күн бұрын

    The take on header version is not to be on each object but more on the storage file level

  • @smooth1x
    @smooth1x16 күн бұрын

    For a TCP stream, why not have a header that describes which version of the protocol you are using for the whole stream, say 4 bytes? The receiver then maps that protocol version to the version of the encoding of each object on the stream. You hardly ever change versions in the middle of a TCP connection!

  • @sullivan3503
    @sullivan350328 күн бұрын

    Why can't you just store the version field at the beginning of each database file? Why does it need to be in the "packet?"

  • @anonym-hub
    @anonym-hub27 күн бұрын

    @ 15:25 The "Timestamp" can-be/is the "version", no big mistake.

  • @magfal
    @magfalАй бұрын

    I wonder how close the performance would be for Clickhouse, Hydra Columnar with postgis or Timescale.

  • @LtdJorge

    @LtdJorge

    28 күн бұрын

    Edit: now that I reread the article, I’ve noticed I was thrown off by the claim that they need extremely high write performance. In reality they have high writes, but they don’t seem to need that data instantly and they don’t need consistency. So now I think just put a Kafka cluster in front of the writers and index every X minutes into ClickHouse, then let your app read from CH. 30k/s for CH might be high, but by batching the inserts from Kafka it doesn’t sound that big. I don’t think Clickhouse would be a good candidate. It’s extremely optimized for read queries on massive amounts of data. What these guys seem to require is a DB with a high rate of transactions per second.

  • @lassemelcher7749

    @lassemelcher7749

    11 күн бұрын

    +1 same idea

  • @siquod
    @siquod28 күн бұрын

    Why would you need a version field in every database record? One for the whole database is enough. Or did you think this was a network protocol? As I understand it, it's a file format.