Dev Deletes Entire Production Database, Chaos Ensues

Science & Technology

If you're tasked with deleting a database, make sure you delete the right one.
Sources:
about.gitlab.com/blog/2017/02...
about.gitlab.com/blog/2017/02...
Notes:
1:05 - The middle bullet point, about the account that had 47,000 IPs, was never mentioned in the postmortem (there was an initial report the day of, and a more detailed postmortem a bit over a week after that). Perhaps it was a red herring which they later figured out didn't really matter.
3:07 - I made the error say "too many open connections" since that's easier to understand than semaphores.
3:39 - This part was confusing, since the postmortem and the initial report conflicted. The postmortem said the engineers believed pg_basebackup was failing because previous attempts had created some files in the data directory, but the initial report said the theory was that the data directory existed (despite being empty). So for some reason the engineers really wanted to delete the data directory, but why, who knows.
4:37 - They probably didn't check for backups in this order. I'm sure team-member-1 immediately called out that he had taken a backup 6 hours earlier, and then they just had to verify the other backups in case there was a better one.
6:21 - Being reported by a troll will not automatically remove a user; it flags the account for manual review. The account was then incorrectly deleted after that review.
Chapters:
0:00 Seconds before disaster
0:16 Part 1: Database issues
2:21 Part 2: The rm -rf moment
4:32 Part 3: Restore from backup
6:13 Part 4: Post incident discoveries
7:27 Lessons learned
9:46 The fate of team-member-1
10:11 ???
Music:
- Thriller Trailer Teaser Tense by Cold Cinema
- Finding the Balance by Kevin MacLeod
- Eyes Gone Wrong by Kevin MacLeod
- Desert City by Kevin MacLeod
- Jane Street by TrackTribe

Comments: 2,600

  • @VestigialHead · 1 year ago

    Damn I cannot even imagine the stress that admin was feeling after he realised he deleted DB1. He must have aged twenty years.

  • @1996Pinocchio · 1 year ago

    The legendary Onosecond.

  • @NS-sd3mn · 1 year ago

    @1996Pinocchio I see you watch Tom Scott

  • @youngstellarobjects · 1 year ago

    The stress should really be minimal if you have a backup and restore procedure that actually works and that you know how to use. Mistakes happen. The problem wasn't the delete command, it was the nonexistent backups and documentation.

  • @LeoVital · 1 year ago

    @youngstellarobjects Nah, still stressful. Most companies aren't making a backup on every write that happens to a DB, so whoever deletes a DB knows that they've just made an oopsie that will cause a lot of headache for multiple people. And probably cost the company a lot of money as well.

  • @pqsk · 1 year ago

    As long as you have a backup there's no problem. I've done this before, but if there's no backup you prolly die of stress 😅😅😅

  • @Misanthrope84 · 1 year ago

    "You think it's expensive to hire a professional? Wait till you hire an amateur" - some old wise businessman.

  • @urbexingTss · 1 year ago

    that indeed is wise

  • @shahriar0247 · 1 year ago

    Loll

  • @blue5659 · 1 year ago

    A professional costs you in bold, italic, and underline. An amateur mostly costs you in fine print.

  • @-na-nomad6247 · 1 year ago

    The person here is not an amateur; anyone can get brain farts, especially when working an unexpected overnight. You should try it sometime, you'll start seeing ducks and rabbits in the shell.

  • @Misanthrope84 · 1 year ago

    @-na-nomad6247 I'm a veteran in the DevOps field. This comedy of mistakes could never have happened to me, since I follow a protocol, which these guys obviously did not. They were guessing and experimenting as if it were an ephemeral development environment. Their level of fatigue had little to do with their incompetence in understanding the commands they were running.

  • @Chris_Cross · 1 year ago

    The fact they live streamed while trying to restore the data is a truly epic move.

  • @xpusostomos · 7 months ago

    Hope it was monetized

  • @godjhaka7376 · 6 months ago

    @xpusostomos that's why they live stream and post anyway. Not to educate, but rather to make money

  • @Elesario · 5 months ago

    Sounds like they had the spare bandwidth ;P

  • @joseaca1010 · 4 months ago

    Programmer vtuber when?

  • @kv4648 · 3 months ago

    @joseaca1010 already have one: Vedal

  • @Webmage101 · 11 months ago

    I think the biggest problem (seemingly addressed at 6:21) is the fact they could delete an employee account by spam reporting it.

  • @alex_zetsu · 11 months ago

    Actually, at the time of the video, what they addressed was the fact that deleting an account could cause problems with the server; it seems they didn't actually stop trolls from deleting an employee's account. I'd have thought employee accounts would be protected. The trolls didn't even get admin powers through privilege escalation, they just reported the target.

  • @Milenakos · 10 months ago

    read the video description

  • @DevinDTV · 10 months ago

    @Milenakos every company says they do a manual review, but none of them actually do

  • @Milenakos · 10 months ago

    @DevinDTV source??? (edit: I was mostly complaining about you just saying they are lying out of thin air)

  • @Therealpro2 · 10 months ago

    @Milenakos source????????????????????????????????????????????

  • @SIMULATAN · 1 year ago

    So you're telling me a platform as big as GitLab went down because one engineer picked the wrong SSH session? Damn, that makes me feel way better about my mistakes lol

  • @shahriar0247 · 1 year ago

    I would highly suggest people use customized shells. I use oh-my-zsh, and I customize my themes to show git info, hostname (sometimes) and a lot more; not because I want to know which ssh session I'm in, but because I like the design :)
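
    A minimal sketch of the same idea in plain bash (oh-my-zsh themes do this for you; everything here is illustrative):

      # show user@host and the current git branch in the prompt
      git_branch() { git branch --show-current 2>/dev/null; }
      export PS1='\u@\h $(git_branch) \w\$ '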

  • @0xCAFEF00D · 1 year ago

    @Syed Mohammad Sannan No, someone has to have that. The general problem is that there are no safety nets. I don't mean to suggest this is a good solution, because safe-rm is just jank, but using safe-rm would most likely have saved this situation. If you replace rm with a symlink to safe-rm, you can configure a blacklist on production that doesn't allow deleting the database or other critical data.

    I find many things about safe-rm to be unsafe. It doesn't protect you if you cd into a directory and then do rm -rf *. A better program would simply evaluate the path it's trying to delete and disallow it if the blacklist covers it. It also doesn't allow custom messages through its blacklist. What you want is for a bad rm -rf to send a warning to the user; otherwise there's no way of guaranteeing they don't just start avoiding the issues. For example, most likely you're not going to leave your backup unprotected by the blacklist just to create differences between production and backup, so a developer in this situation would expect to run into issues deleting the postgres db on either server. It doesn't tell the user anything, really. If you instead configure messages, you can call attention to the hostname.

    The goal is just to induce further friction for dangerous actions. rm has always been so risky because it's so easy.
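
    A rough sketch of that kind of guard rail as a shell function - not safe-rm itself, and the protected path is hypothetical:

      # refuse to rm anything under a blacklisted path, naming the host in the warning
      rm() {
        local protected="/var/opt/gitlab/postgresql/data"
        local arg
        for arg in "$@"; do
          case "$(realpath -m -- "$arg" 2>/dev/null)" in
            "$protected"|"$protected"/*)
              echo "BLOCKED on $(hostname): refusing to rm $arg" >&2
              return 1 ;;
          esac
        done
        command rm "$@"
      }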

  • @Darkk6969 · 1 year ago

    @0xCAFEF00D I always check the hostname of the server and triple-check the directory before using the rm -rf command. If in doubt, I use the mv command to move it to a different directory as a backup; if everything works OK, then I go in there and delete the old directory. The same thing happened to Pixar with Toy Story 2 while they were working on it: a storage admin used rm -rf on a directory by mistake and practically wiped out the movie. Luckily someone had a copy of the data on a laptop that was offsite at the time, and they were able to rebuild the movie from that data.

  • @BuyHighSellLo · 1 year ago

    @0xCAFEF00D No, NO single employee should have enough privilege to bring down anything business-sensitive, except maybe the CTO. These operations should all require a flag or check from someone else first. Just like how one person usually shouldn't be able to push code by themselves; they need 1 or more checks before that.

  • @desoroxxx · 1 year ago

    @shahriar0247 I try to make my prod env glow red, so that even if I am tired I can see it

  • @jarrod752 · 1 year ago

    _Luckily team 1 took a snapshot 6 hours before..._ This happened to me. I copied a client's database to my development environment about 2 hours before they accidentally wiped it. They called our company explaining what happened, and it got around that I had a copy. Our company looked like a hero that day, and I got a bunch of credit for good luck.

  • @abelkibebe577 · 1 year ago

    You are a Legend :)

  • @mipmipmipmipmip · 1 year ago

    I think this was how most of Toy Story was saved. It's also bad security practice :)

  • @ilyasziani5504 · 1 year ago

    @mipmipmipmipmip Why is it bad security practice?

  • @amyx231 · 1 year ago

    And now you routinely copy the client database every 24 hours?

  • @jarrod752 · 1 year ago

    @amyx231 Actually, due to the nature of my current work, I have a script I run on demand, approximately every few days as needed, that takes a snapshot. I usually get around to deleting everything that's more than a month old about twice a year, or when my dev server starts btching about space.

  • @Nick77ab2 · 1 year ago

    This is why problems like this are actually sometimes good. Of course extremely stressful, but they found sooo many issues and fixed them all. Amazing.

  • @federicocaputo9966 · 1 year ago

    You are assuming they fixed them all. Until it breaks again.

  • @JeyC_ · 1 year ago

    @federicocaputo9966 at least next time they'll have the experience to know what to do and what not to do

  • @brett2258 · 1 year ago

    That's a really good positive approach right there!

  • @djweavergamesmaster · 1 year ago

    reminds me of that one ProZD skit, where the villain fixes everything

  • @mikabakker1 · 10 months ago

    @federicocaputo9966 that is life

  • @dragonfire4869 · 1 year ago

    This reminds me of Toy Story, and how like a month before release the entire animation was accidentally deleted, causing absolute panic and hell at Disney. Luckily, one employee had the whole thing on a hard drive that they were taking home to work on. Her initials are on one of the number plates of one of the cars in the film. Always make a backup.

    Edit: She was a project manager who had to work from home, and the number plate was actually "Rm Rf", in reference to the notorious command that did it.

  • @mrsharpie7899 · 1 year ago

    I don't remember if it was the day-saving employee's initials, or RM-RF, that was on the license plate

  • @alimanski7941 · 1 year ago

    It was Toy Story 2, and the easter egg was in Toy Story 4, where the license plate had "rm rf" in it

  • @ScruffyNZ. · 1 year ago

    they fired that person recently

  • @atulyadav3197 · 1 year ago

    @ScruffyNZ. Yes, I heard this too

  • @GoatzombieBubba · 11 months ago

    @ScruffyNZ. That person should be happy to not work for a woke company like Disney.

  • @rosscads · 1 year ago

    Given the trouble they were in after the deletion, a recovery time of 24h and a recovery point of 6h is actually pretty heroic. Especially considering the stress they would have been under. 😰

  • @TheDaern · 1 year ago

    @L2002 Because of this? They were open and honest about their screwups which, for me, makes them a pretty good organisation to deal with. Plenty of others would not be and, at the end of the day, this stuff does happen from time to time. My measure of a company is not how well they work day to day, but how they handle adversity. Everyone screws up eventually, and it's how you handle it that marks out the good ones from the bad ones. Also, a company who almost lost a production DB because of failed backups is unlikely to do it again ;-)

  • @MunyuShizumi · 1 year ago

    @L2002 Ah, yes, because Microsoft never has outages, data loss, or data leak incide- oh wait..

  • @sinnlos229 · 1 year ago

    @L2002 Care to elaborate? Cause everyone else here, including me, disagrees.

  • @titan5064 · 1 year ago

    Don't feed the troll, clearly not someone who's ever worked with computers on a proper level

  • @realpillboxer · 1 year ago

    @titan5064 exactly. Their handle is "L" -- they are a literal walking loss (loser).

  • @maxcohn3228 · 1 year ago

    Something my first boss taught me (when I broke something big in production in my first few weeks) is that post-mortems are for identifying problems in a system and how to prevent them, while avoiding blaming individuals. This is huge. Making sure to identify why it was even possible for something like this to happen, and how to prevent it in the future, is a great way to handle a post-mortem like this. Good on the GitLab team.

  • @lhpl · 1 year ago

    Good boss. Bad ones often like when things are done fast and "efficiently". And when this establishes a culture of unsafe practices, things will go fine, maybe for a long time. Then one day, a human error occurs. Typically, such a boss will then blame the person who "did" it, even if the cause was the unsound culture. If as an employee you try to work safely, you get criticised for being slow and inefficient (and you technically are).

  • @FireWyvern870 · 1 year ago

    Yeah, things like this are a problem of the system, not the fault of the operators

  • @honkhonk8009 · 1 year ago

    You only fire people for their character, not for the inevitable fuckup. Also, you basically sunk money into training this dude through that fuckup, so sacking him right after you paid to get him that experience is counterproductive.

  • @gownerjones · 1 year ago

    Also very cool that they did it completely in public, even with livestreams. This will hopefully help other companies avoid mistakes like that.

  • @FlabbyTabby · 1 year ago

    Depends. Many times it's used as an opportunity to kick out people they consider undesirable, even if they're great employees.

  • @CryShana · 1 year ago

    When I was still a junior developer at some startup company, I was working on a specific PHP online store. Every time we upgraded the site, we would first do it on Staging, then copy it over to Production. The whole process was kinda annoying, as there was no streamlined upgrade flow yet and no documentation anywhere - it was a relatively new project we had taken over. I had upgraded it before, so I knew what to do, and I just did the thing I always did.

    I was close to finishing it up, and we had an office meeting coming up soon and lunch afterwards, so I wanted to be done with this before that - so I rushed a bit. And when I was copying files to Production, I overlooked something - I had also copied the staging config file (which contained database access info etc.) to the production location, overwriting the production config file. After the copying had finished, thinking I was finally done, I relaxed and prepared myself for the meeting. As I was closing everything, I also tried refreshing the production site, just to see if it worked. And then I realized... Articles weren't appearing, images weren't loading, errors everywhere. Initially I didn't believe this was production at all - probably just localhost or something, RIGHT?? However, after re-refreshing it and confirming I had actually broken production, panic set in.

    Instead of informing anyone, I quietly moved closer to my computer, completely silent, and started looking at what was wrong - with 100% focus. I don't think I was ever as focused as I was then. I didn't have time to inform anyone; it would only cause unnecessary delays. I had to restore this site ASAP. I remember sweating... the meeting was starting, and I remember colleagues asking me "if I am coming" - and I just blurted "ye ye, just checking some things..." completely "calmly" as I was PANICKING to fix the site as soon as possible. Luckily I found the source of the mistake within a minute, then had to find a backup config file - and after recovering the config file, everything was fixed. Followed by a huge sigh of relief. The site must have been down for only around 2 minutes.

    No one actually noticed what I had done, and I just joined the meeting as if nothing had happened - even though I was sweating and breathing quickly to calm myself down, I hid it pretty well. This was a long time ago, and still to this day I remember that panic very well. Now I always make sure I have quick recovery options available at all times in case something goes wrong, and where possible I automate the upgrade process to minimize human error.

  • @valdimer11 · 3 months ago

    Well done. Having made mistakes like that, I can completely understand how you were feeling in that moment and how your brain just went "in the zone". It's only ever happened to me twice but I will NEVER forget them.

  • @yt-sh · 28 days ago

    Good lessons, thank you

  • @vjndr32 · 26 days ago

    Mann, we all have our fair share of breaking production.

  • @obanjespirit2895 · 25 days ago

    I did something similar, but testing on what I thought was a dev server. I'd had some close calls before, but this time I fcked up. I was super high, but I was always high, so I doubt that was it. I quickly had to go and undo the changes, but I was so shook that I made a Chrome extension that puts up some graphics and ominous 40k Mechanicus music whenever I go on a live domain. Haven't made the same mistake since.

  • @JeffThePoustman · 1 year ago

    Ugh, felt that "he slammed CTRL+C harder than he ever had before" (3:55). The only thing worse than deleting your own data is deleting everyone else's. In this case the poor guy kinda did both. Great story arc.

  • @ic6406 · 5 months ago

    Yeah, I guess it was the most stressful moment of his life, after realizing what he'd done. I think he had a huge blackout

  • @ludoviclagouardette7020 · 1 year ago

    The rule I apply for backups is that no one should connect to both a backup server and a primary at the same time; two people should be working together. The employee that was logged into both DBs should really have been two physically separate employees

  • @act.13.41 · 1 year ago

    That is an excellent rule.

  • @refuzion1314 · 1 year ago

    Yes, but in the case that there is only one employee available and he has to connect to both, he should either have different color schemes for the different servers OR do it all in one shell window, disconnecting and reconnecting to whichever server he has to edit; that way it is a lot harder to execute commands on the wrong server.

  • @thoriumbr · 1 year ago

    I try to follow this rule myself. Every time I have to connect to a prod server to get anything, I disconnect as soon as I get the info, before getting back to the test/dev server window.

  • @thoriumbr · 1 year ago

    @refuzion1314 Different color schemes look good but don't work during an outage, when you are stressed, exhausted, or anything distracts you. Sounds nice, but the mental load during a crisis is too large to pay attention to that.

  • @onemprod · 1 year ago

    I can't tell you enough how easy it is to accidentally overwrite the wrong file. While I was working on something on a test machine, with a USB stick plugged in to save the current progress, I saved the script, thought I had saved it in the local directory, and copied the unmodified script over the version I had just saved to the USB stick...

  • @gosnooky · 1 year ago

    Imagine for a moment that you're that guy. That feeling of pure dread and the adrenaline rush immediately after the realization of what you've just done. We've all felt it at some point.

  • @omniphage9391 · 11 months ago

    At my first job, I got a 2 am call: in my first two weeks at the company, I had accidentally left a process in prod shut down after maintenance, leading to intensive care patient data not making it into connected systems. Looking back, the entire company was set up super amateurishly, yet they operate in several hospitals in my country.

  • @PixelSlayer247 · 11 months ago

    Having exited my game without being sure I saved my progress before, this is very relatable.

  • @thephlophers · 10 months ago

    the onosecond

  • @stacilynn604 · 10 months ago

    like hitting a car in a parking lot 😵

  • @ashesagainst7236 · 10 months ago

    At my second IT job I accidentally truncated an important table in the prod DB. The stress was immense, but we identified a ton of issues and the team was pretty supportive. My boss ended up begging upper management to get us a backup server, but they determined it wasn't important enough. The company went belly-up a few years later because of a ransomware attack they couldn't recover from.

  • @mxbx307 · 1 year ago

    There is an awful lot that could be learned from this:

    1) You should "soft delete", i.e. use mv to either rename the data, e.g. renaming MyData to something like MyData_old or MyData_backup, or just mv it out of the way so you can restore it later if needed. Don't just rm -rf it from orbit.

    2) Script all your changes. Everything you need to do should be wrapped in a peer-reviewed script, and you just run the script, so that the pre-agreed actions are all that gets done. Do not go off piste, do not just SSH into prod boxes and start flinging arbitrary commands around. (See the sketch after this list.)

    3) Change Control - as above.

    4) If you have Server A and Server B, you should NOT have both shell sessions open on the same machine. Either use a separate machine entirely or - better still - get a buddy to log onto Server A from their end while you get on Server B from yours. Total separation.

    5) Do not ever just su into root. Use sudo, or some kind of carefully managed solution such as CyberArk, to get the root creds when needed.
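
    A minimal sketch of points 1 and 2 combined, assuming a hypothetical hostname and data path:

      #!/usr/bin/env bash
      # peer-reviewed change script: refuses to run on the wrong host,
      # and soft-deletes instead of rm -rf
      set -euo pipefail
      EXPECTED_HOST="db2.cluster.example"   # hypothetical
      DATA_DIR="/var/opt/db/data"           # hypothetical
      actual="$(hostname -f)"
      if [ "$actual" != "$EXPECTED_HOST" ]; then
        echo "Refusing to run: this is $actual, not $EXPECTED_HOST" >&2
        exit 1
      fi
      mv "$DATA_DIR" "${DATA_DIR}.old.$(date +%Y%m%d%H%M%S)"   # reversible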

  • @magicmulder · 8 months ago

    Also, for (2): never try to "improve" anything during the actual action. I once prepared a massive Oracle migration that I had timed to take about 3 hours. Preparation took three weeks. As I was watching the export script for the first schema during the actual migration, I thought "why not run two export jobs concurrently, it's gonna save some time". Yeah, that made the whole thing slow down to a crawl, so it ended up taking 6 hours. The boss was furious. So no, never try to "improve" during the actual operation, no matter how big you think your original oversight was.

  • @lashlarue7924 · 8 months ago

    100%, upvoted.

  • @xpusostomos · 7 months ago

    I religiously never delete anything

  • @thedemolitionsexpertsledge5552 · 7 months ago

    I have no idea what any of this means but I feel like this is bad

  • @alvinbontuyan8083 · 5 months ago

    Fucking up catastrophically with Bash commands is a canon event. It is religion for me to always copy a file/directory to "xxx.bak" before doing anything sensitive

  • @TheDrTrouble · 1 year ago

    The best practice is to rename the directory or file to something else. Idk how the developers are so calm when using deletion commands

  • @setasan · 11 months ago

    Well, when you live in a poor country, being underpaid by a fucking contractor company, with an overloaded team... shit hapnz

  • @schwingedeshaehers · 11 months ago

    I "deleted" one of my programs with the cp command (I wanted to copy the config and the main file into a subdirectory, but forgot to add the directory after it, so it wrote the config over the main file). I could get an older version of the file back from the SD card by manually reading the contents of that region and finding a copy, since the card doesn't overwrite a save in place but writes to a new location.

  • @Funnywargamesman · 11 months ago

    On a home system? Absolutely. In a working environment? Doubtful. Maybe with a small company it would be acceptable, but creating an orphan database that may or may not contain sensitive information with no one in charge of it, or worse, no one who KNOWS ABOUT it, would be awful. God help you if it contains financial, medical, or government records.

  • @AndrewARitz · 11 months ago

    @Funnywargamesman you don't create it to keep it around forever; you create it as a failsafe for when you are doing potentially dangerous stuff, like deleting a whole database.

  • @Funnywargamesman · 11 months ago

    @AndrewARitz I cannot tell you how many times "temporary" things become permanent on purpose, let alone the times people have said they were going to do something - like deleting a temp database they copied locally because their permissions didn't let them use it remotely - and then proceeded to forget to delete it. This will be especially true with the most sensitive databases, "because it's more important, so we should make a copy first, right?"

    Security is everyone's job, and if you do (typically) irresponsible things like copying databases "as a failsafe," chances are you are going to form a habit that means you will do it with a sensitive database. If you think YOU won't do it, that's fine, but assuming you are of average intelligence, you need to remember 50% of people are dumber than you, and some of them get REAL dumb. If you set policy saying it is allowed, then THEY will do it.

    This is exactly why I said that home environments and really tiny companies could be different; there it could/would be fine. Chances are, if you don't know the names of every single person in your company off the top of your head, it is too large to be that lax with data protection and management. Take it or leave it, it's my opinion.

  • @randomgeocacher · 1 year ago

    A helpful hack is to set the production terminal to red and the test terminal to blue, or something like that. Just a small helper to avoid human f'ups if you need to run manual commands on sensitive systems.

  • @tacokoneko · 1 year ago

    I second this, I also use colors to differentiate multiple environments

  • @vaisakhkm783 · 1 year ago

    it's as easy as changing the prompt color... but it makes a huge difference

  • @Wampa842 · 1 year ago

    I use colored bash prompts to differentiate machine roles - my work PC uses a green scheme, non-production and testing servers use blue, backups use orange, and production servers use yellow letters on red background. It's very hard to miss.
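
    A minimal sketch of such a scheme in bash (colors illustrative, set per machine in ~/.bashrc):

      # production: yellow letters on a red background - hard to miss
      export PS1='\[\e[1;33;41m\]\u@\h:\w\$\[\e[0m\] '
      # a non-production box might use plain green instead:
      # export PS1='\[\e[1;32m\]\u@\h:\w\$\[\e[0m\] '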

  • @darrionwhitfield46 · 1 year ago

    I use oh-my-posh with different themes

  • @iUUkk · 1 year ago

    Both database servers were actually used in production.

  • @helmchen1239 · 1 year ago

    I once accidentally ran chmod -R 0777 /var because I missed a dot before the slash (in a web project with a /var folder), which (as I've now learned) can make a Unix system totally unresponsive. I can very well understand how it feels, the moment you realize what you have just done. That cost us a few hundred euros and kept 2 technicians busy for an afternoon on the weekend. Lessons learned; today we can laugh about it.
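
    The whole difference is one character; a sketch of the two commands (project path hypothetical):

      cd /srv/www/myproject
      chmod -R 0777 ./var   # intended: the project's own var folder
      chmod -R 0777 /var    # actually typed: the system's /var, recursively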

  • @Darkk6969 · 1 year ago

    Yeah, Unix / Linux will do what you tell it to do, without any warnings. I'm pretty sure you sat there wondering why that command was taking so long to finish before you realized your mistake. Right then and there it's the "Oh Shit" moment. 😀 Lucky for me, I use VMs, so I can always revert to previous snapshots.

  • @desoroxxx · 1 year ago

    the onosecond

  • @parlor3115 · 1 year ago

    @Darkk6969 What if you ran it on the host?

  • @FurriousFox · 1 year ago

    @parlor3115 he doesn't; Noah only runs things in virtualized environments, making snapshots every minute

  • @aarondewindt · 1 year ago

    Why does it make it unresponsive? I accidentally chmod 0777'd the entire "/" once and, well, I had to start again from scratch. Thankfully I was just creating a custom Ubuntu image with some preinstalled software for one of my professors, so it just cost me time. Still, I never figured out why opening up the permissions would lock everything up.

  • @Dairunt1 · 1 year ago

    One of my most stressful moments as a software designer was when I accidentally broke a test environment right before a meeting with our client. I managed to get the project running on a 2nd test environment, but that really taught me the importance of backups and of telling the rest of the staff about a problem ASAP.

  • @christopherg2347 · 1 year ago

    If you are working with multiple shells, VMs, remote sessions or the like - make sure they are color-coded based on the machine you are running against! It can be as simple as picking a different color scheme in Windows. But it is just too easy to mess up when the only visual difference is a single number somewhere in the header.

  • @neekfenwick · 5 months ago

    Yep, I came here to say this. For any serious system I connect to, I use different params for my session; in my case I like old fashioned xterm, something like:

      alias u@s="xterm -fg white -bg '#073f00' -e 'ssh user@server'"

    It's very useful to see the green, red, blue etc. colouring and be sure which system you're talking to.

  • @Kalmaro4152 · 3 months ago

    It's very nice that Linux shells actually support setting session colors

  • @LordHonkInc · 1 year ago

    "rm -rf" is one of those commands I have huge respect for cause it reminds me of looking down the barrel of a gun (or any similar example of your choosing): Best case, you do it a) seldom, b) after a lot of strict and practiced checks, and c) if there's no alternative; unfortunately, the worst case is when you _think_ you're in that best case scenario.

  • @givenfool6169 · 1 year ago

    I sourced my bash history like an idiot about a week ago. I have so many cd's and "rm -rf ./"'s and other awful things in there. I somehow got lucky and hadn't used sudo in that terminal at the time, so I got caught on a sudo check before it ran anything absolutely hell-inducing. Just a bunch of cd's and some commands that require a sourced environment to execute. Super lucky. I could have wiped out everything, because just a couple of commands after that was a "rm -rf ./", and it had already cd'd into root.

  • @henningerhenningstone691 · 1 year ago

    @givenfool6169 Lmao, it had never once occurred to me what havoc it could wreak if you accidentally source the bash history, since it had never occurred to me that that's even possible (because why the hell would you?!). But of course it is; what an eye opener!

  • @givenfool6169 · 1 year ago

    @henningerhenningstone691 Yeah, I was trying to source my updated .bashrc, but my tab completion is set up to cycle through anything that starts with whatever's been typed (it even ignores case), so I tabbed and hit enter. Big mistake. I guess this is why the default tab completion requires you to type out the rest of the file name if there are multiple potential completions.

  • @Shadowserpant00 · 1 year ago

    @henningerhenningstone691 bro idk wtf you're talking about and it's scaring me

  • @oliverford5367 · 1 year ago

    Do ll first, make sure you really want to delete that directory, then press up and change ll to rm
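
    Roughly, assuming ll is the usual alias for ls -l and a hypothetical path:

      ll /srv/app/old-data      # eyeball what you're about to remove
      rm -rf /srv/app/old-data  # recall the line with the up arrow, swap ll for rm -rf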

  • @MechMK1 · 1 year ago

    For this reason, all our servers have color-coded prompts. Dev/testing servers are green. Staging is yellow. Prod is bright red. When you enter a shell, you immediately see whether you are on a server that is "safe" to mess around with, or not.

    The advantage of doing this in addition to naming your server something like "am03pddb" is that you don't have to consciously read anything. It doesn't matter if you accidentally SSH into the wrong server: if you meant to SSH into a "safe" server, the bright red prompt will alert you that you are on prod. And if you meant to SSH into a prod server, then you'd better take the time to read which server it actually is.
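
    One way to wire that up centrally is a snippet in /etc/profile or similar; a rough sketch, assuming hostnames encode the environment the way "am03pddb" does:

      # pick the prompt color from the environment code embedded in the hostname
      case "$(hostname)" in
        *pd*) color='1;37;41' ;;  # prod: white on red
        *st*) color='1;33' ;;     # staging: yellow
        *)    color='1;32' ;;     # dev/test: green
      esac
      export PS1="\[\e[${color}m\]\u@\h:\w\\\$\[\e[0m\] "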

  • @tacokoneko · 1 year ago

    I agree, except there are only so many colors, so if you're manually controlling a lot of different machines (something that could maybe be avoided depending on what the servers do), I believe it's important to use unique, memorable hostnames. The two servers in this story had hostnames 1 character apart and the same length, unless the names were changed for the artwork.

  • @seedmole · 1 year ago

    @tacokoneko Yeah, like imagine if those two characters were visually similar ones, like any combo of 3, 5, 6 and 8. Fatigued eyes could easily misleadingly "confirm" that you're on the right one when you're not.

  • @makuru_dd3662 · 1 year ago

    Also, don't ever ever work on the live database; a lesson I have learned the hard way many times on my own.

  • @MunyuShizumi · 1 year ago

    @makuru_dd3662 That statement makes no sense. No matter how critical a system is, you'll have to perform some kind of maintenance at least semi-regularly.

  • @makuru_dd3662 · 1 year ago

    @MunyuShizumi You make a backup first. Yes, you need to maintain it, but not by making massive untested changes.

  • @robbybankston4238 · 2 months ago

    I'm glad they didn't fire the engineer. It goes to show the difference in mindset of organizations that care about it being a learning experience (albeit an expensive one). Many corporations would have fired the engineer as soon as the issue was resolved, without hesitation. Thanks to those orgs who care about their team members and are more concerned with lessons learned.

  • @ErikPelyukhno · 11 months ago

    Your editing is phenomenal. What an insane series of events 😂 Glad GitLab was able to get back up and running. All that public documentation was refreshing to see, since it shows they were being transparent about their mistakes and their recovery process.

  • @build-things · 1 year ago

    As an engineer at a large company, you got me in the feels talking about asking for help, or posting a PR and then seeing all the mistakes you made 😊

  • @stingrae789 · 1 year ago

    In my previous position I worked closely with one guy, and we used to joke about how we were using each other as a rubber duck :D

  • @EChan-eu2co · 1 year ago

    The buzzword is SRE, and postmortems are supposed to be blameless now...

  • @jillfizzard1018 · 1 year ago

    This is why you first mark the PR as a draft and read over the changes one more time before marking it as ready.

  • @mortache · 1 year ago

    @stingrae789 Damn, I didn't know this thing had a name! I've legit done this before while discussing weird math problems

  • @xmorse · 1 year ago

    The real problem here is that you can delete any user's data by simply mass-reporting them

  • @technicolourmyles · 1 year ago

    I'm seeing a lot of serious problems here... I guess this is why I never heard of GitLab before.

  • @PatalJunior · 1 year ago

    I highly doubt it's instantly deleted; probably someone made the decision to delete it (it could just be an account spamming a bunch of mess onto repositories, and that isn't good either).

  • @FighteroftheNightman · 1 year ago

    @technicolourmyles they're literally the 2nd largest enterprise git solution provider in the world.

  • @nonamepasserbya6658 · 1 year ago

    When in doubt, it's probably 4chan. That low-hanging fruit aside, it's not a good thing if someone can just do that with a bot account. Maybe granting employees special anti-report protection could help until they find a more permanent solution against those trolls

  • @Webmage101 · 11 months ago

    @PatalJunior 6:21 literally says they fucked up by not making it check the details before deletion

  • @Tmccreight25Gaming · 2 months ago

    Ultimate workplace comeback: "At least I've never nuked the entire database"

  • @usellstech-ip2sg · 2 months ago

    Better to have someone who knows what to do than someone who has never experienced it

  • @reyynerp · 20 days ago

    they work remotely

  • @danusminimus9557 · 1 year ago

    I've seen your video history and the evolution of your videos - this format is amazing and you're really good at it :D

  • @matthias916 · 1 year ago

    I once accidentally deleted 2000 rows in one of my company's production databases. Everything was restored 5 minutes later, but it felt so bad; I can't imagine what deleting an entire database would feel like

  • @marco56702 · 1 year ago

    terrible, sending the queries makes you shiver

  • @varunkhadse5869 · 10 months ago

    I guess the panic was at the next level, because both DBs were deleted.

  • @Rncko · 9 months ago

    It feels like putting a torch to a sea of bank notes... that belong to the company (and the company is just about to release the year-end bonus)

  • @Atulnavadiya · 8 months ago

    I have good hands-on experience with the SQL database at my company, but I'd check my query at least 10 times before executing it. We had more than 10 years of clients' data saved in the database.

  • @TrevoltIV · 8 months ago

    @marco56702 Right, I'm always quadruple-checking every query to make sure my retarded ass didn't type delete * or something

  • @GanerRL · 1 year ago

    imagine messing with some employee by flagging them and managing to bring down the entire site by proxy

  • @batorerdyniev9805 · 1 year ago

    What

  • @hypenheimer · 1 year ago

    Bot

  • @GanerRL · 1 year ago

    @hypenheimer beep boop

  • @Jacob-ABCXYZ · 1 year ago

    How to take down a site, the stealthy way

  • @kulled · 1 year ago

    @hypenheimer nah, it was probably a Minecraft shorts bot account before he bought it though.

  • @sortebill · 11 months ago

    Your content is really good; please keep making these mini documentaries about tech failures!

  • @jamesrosemary2932 · 1 year ago

    A long time ago we implemented a policy that absolutely nobody operates the production console alone. There always has to be someone else looking over your shoulder to point out oversights like the one in the video.

  • @HazySkies · 1 year ago

    "Slams Ctrl+C harder than he ever had before" As a relatively new linux user, I felt that one.

  • @ss-to7ii · 9 months ago

    As a new Linux user, use the "-i" ("interactive") flag with rm and a couple of other commands.
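
    For example, with GNU coreutils (-I is the less chatty variant):

      alias rm='rm -i'   # prompt before every single removal
      alias rm='rm -I'   # or: prompt once when deleting more than 3 files or recursing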

  • @KR-tk8fe · 6 months ago

    As a windows user, I was very confused

  • @LC-uh8if · 5 months ago

    @KR-tk8fe Ctrl+C. On most Unix/Linux CLIs, this combination aborts whatever command you were running. Technically, it sends a SIGINT (interrupt) to the foreground process (the active program), which usually causes the program to terminate, though it can be programmed to handle it differently. It's basically the "Oh Shit" or "This is taking too long" button.
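
    A tiny illustration of a script choosing to handle it differently:

      #!/usr/bin/env bash
      # trap SIGINT so Ctrl+C prints a message instead of just killing the script
      trap 'echo "interrupted - cleaning up"; exit 130' INT
      sleep 60   # hit Ctrl+C during this to see the trap fire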

  • @MrCmon113 · 3 months ago

    @LC-uh8if Isn't that the same in Windows terminals? 🤔

  • @jhyland87 · 1 year ago

    At a few places where I worked as a Linux admin or engineer, the shell prompts (PS1) were color-coded: green was dev, yellow was QA, and red meant you're in prod. Worked like a charm.

  • @blackbot7113 · 9 months ago

    Yeah, that's the way I do it as well, just the other way round (red being test). Extends to the UI as well - if the theme is red, you're on the test instance of Jira, not the real one.

  • @jhyland87 · 9 months ago

    @blackbot7113 Yeah, it's a very wise thing to do imo. Currently I work at a bank, and I recommended we make the header in the UI of the colleague and customer portals different colors for lower environments, as well as the PS1 prompt on the servers. And I kinda got snickered at, with a reply along the lines of "How about we just pay attention to the server and page we're on?" It's crazy, because it's such an easy change to implement and it almost entirely prevents anyone making such silly (yet catastrophic) mistakes.

    Edit: I make the PS1 prompt for my own user on the servers different colors, but that only helps so much, since I sudo into other service users (or root). Additionally, we "rehydrate" the servers every couple of months, which means they get re-provisioned/deployed, so any of those settings get wiped out entirely. For it to be permanent, it needs to be added to the Dockerfile.

  • @theultimatetrashman887 · 1 year ago

    The realization of what you're doing before it finishes is so cruel, and it happens so often. That's why, when you're doing a job, you always do it slowly but correctly.

  • @DomskiPlays · 1 year ago

    Our prod server has no staging environment or anything like that. I've asked the DB admin if the data and schema are safe in case someone accidentally deletes everything, and they told me everything is backed up daily. Kinda scary that I don't know how or where this happens, beyond it being some scheduled job.

  • @indyalx · 1 year ago

    I checked my database backup script a couple of days ago and noticed it hadn't backed up in 5 days O_O I SLAMMED the manual backup immediately, then went and fixed the issue and made sure it would notify us if there was no backup within 6 hours.

  • @CMDRSweeper · 1 year ago

    The next question is... "Have you tested the backups?" If they can't say for sure WHEN they were tested... Be very afraid...

  • @indyalx · 1 year ago

    @CMDRSweeper we load the prod backup into staging nightly

  • @forbiddenera · 1 year ago

    6-hour full backups, mirroring/replicas, multiple servers and daily volume backups..

  • @robertbeisert3315 · 1 year ago

    "Trust me, bro" only works in Dev. Every other environment needs regular verification.

  • @karmatraining · 1 year ago

    An old best practice that so many people these days seem to have forgotten, or never heard about, is that every week you try to pull a random file from your backup system, whatever that is (or systems, in this case). You will learn SO MUCH about how horribly your backups are structured by doing this. So many people think they've set up good backup systems but never continuously test them in any way, and then they get big surprises (like the GitLab team) when they do need to fall back on them.
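
    A rough sketch of that weekly spot-check as a script (backup path and format are hypothetical):

      #!/usr/bin/env bash
      # pull one random file out of the latest backup and prove it restores
      set -euo pipefail
      mkdir -p /tmp/restore-test
      f="$(tar -tzf /backups/latest.tar.gz | grep -v '/$' | shuf -n 1)"
      tar -xzf /backups/latest.tar.gz -C /tmp/restore-test "$f"
      echo "restore OK: $f"   # wire a failure here into real alerting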

  • @minsiam · 10 months ago

    When I was just starting at a company, I accidentally deleted all the ticket intervals from the database, causing all the tickets to close immediately and generate massive spam to the admins. I was really terrified and didn't know what to do, and we didn't have any backup either. I apologized as much as I could and didn't make another mistake like this for years; sometimes mistakes make you work harder and be more careful in life.

  • @AndreGreeff · 1 year ago

    I must say, I've heard many stories about this, but that was a very nice summary of the nitty-gritty details, thank you. (:

  • @daigennki · 1 year ago

    Awesome work on the video!! I love the editing being both funny and straight to the point, and your narration is easy to understand too. You seriously deserve more attention.

  • @matthewstott3493 · 1 year ago

    Testing to verify backups, replication, failover and the like is absolutely critical. As new scenarios occur, having a feedback loop to update the plan is key. It's a continuous process that most shops have learned the hard way. It is boring and tedious, but if you don't test you will experience catastrophic consequences.

  • @-TheBugLord · 1 year ago

    Exactly. Just like a dam: if there is a weak point at the bottom, it all may come crumbling down. There needs to be a lot of redundancy when it comes to backups, especially for a big server. An engineer accidentally removing a database should not have such catastrophic consequences.

  • @esa4573 · 1 year ago

    Yeah, the general rule is (or should be) that you have to be ready for stuff like that. If your fuckup is non-recoverable, or a massive pain to recover from, you did something wrong. I'm sure a lot of companies are practically "trained" for when someone yeets the whole database or service.

  • @ToastyWalrus7 · 2 days ago

    The voiceover: outstanding. The editing: premium. The humor: drier than the Sahara *inhales*, just how I like it. I've never hit the sub button so fast; keep 'em coming man!

  • @derpnerpwerp · 1 year ago

    This reminds me of all the times I have been in the wrong ssh session just before doing something that would have been pretty bad. I set up custom PS1 prompts to tell me exactly what environment, cluster, etc. I am in, and even colorize them accordingly, but the problem is... you start to just ignore them after a while. It's also kinda dangerous when manual, potentially damaging work becomes fairly routine.

  • @WackoMcGoose · 1 year ago

    As a former Amazonian (only QA for the now-ended Scout program, sadly), I read quite a few cautionary tales on the internal wiki about Wrong Window Syndrome. Sometimes, not even color-coded terminals and "break-glass protocols" (setting certain Very Spicy commands to only be usable if a second user grants the first user a time-limited permission via LDAP groups) is enough to save you from porking a prod database.

  • @Skyline_NTR · 1 year ago

    This interests me. Got any resources/links for setting that up (dangerous commands temporarily allowed via time-limited LDAP permissions)?

  • @WackoMcGoose · 1 year ago

    @Skyline_NTR Afraid not; it was several pay grades above me, both in job role and in coding knowledge, and I lost access to the company Slack back in December, so I can't really ask anyone...

  • @ProgrammingP123 · 1 year ago

    @WackoMcGoose Ahh, were you laid off also??? I was lol

  • @WackoMcGoose · 1 year ago

    @ProgrammingP123 Yup, they disbanded the entire Scout division and then put a company-wide hiring freeze in place a month later, so I had no hope of transferring...

  • @ChosenOne-wz6km · 1 year ago

    This video is awesome! The step-by-step analysis of what occurred during the outage, coupled with the storytelling format, helped me learn some things I didn't know about database recovery procedures. Please make more videos in this format!

  • @shashankh7768 · 8 months ago

    The storytelling/editing is unmatched. Hands down the best docu/short movie on YouTube 😂!

  • @MichaelJordan-hi4ed · 1 year ago

    This genuinely made my day.

  • @TonytheCapeGuy · 1 year ago

    I can just imagine the relief that team felt when they found SOMETHING they could use to restore files.

  • @Simone-uu8ne · 1 year ago

    All things aside, that wasn't that bad. Yeah, they weren't operational for 24h, but it made many other companies reexamine their fault management. For example, my uni professor told us about this incident, and it drove home the importance of backups and testing

  • @gblargg · 1 year ago

    I think the biggest issue was losing 6 hours of commits and comments.

  • @kookie-py · 1 year ago

    @gblargg people will cope

  • @gblargg · 1 year ago

    @kookie-py Agreed, virtually all of them will have the commits locally as well. Just noting that the data loss is a bigger deal than mere downtime.

  • @kookie-py · 1 year ago

    @gblargg right

  • @_Titanium_ · 1 year ago

    This is why programming in general is great: nobody dies if you fuck up. (Obvious exceptions: medical, aviation, etc.)

  • @swaggy3987 · 8 months ago

    What's far more impressive about this whole situation is how calm the engineers were in handling it. That to me is far more valuable than having engineers who are too gun-shy to make prod DB changes at 12 AM and who panic when something goes wrong.

  • @Dobaspl · 5 months ago

    Even before I started working at one company, an IT specialist there deleted the directories of the new CC-supporting system, shortly after its implementation into production. Worse still, it turned out that the backup process was not working properly. For a week, the team responsible for programming this system practically lived at work, recreating the environment almost from scratch. :D

  • @streetchronicles5693 · 1 year ago

    Yesterday I was added to a support team because we're getting a lot of tickets from users not waiting long enough for a service to load and closing the connection early. I died laughing at this story.

  • @SteveAcomb · 1 year ago

    Great video! Well-produced content about software engineering war/horror stories is exactly what I've been looking for; keep it up!

  • @felixbluwox · 1 year ago

    One idea to help prevent this is setting up the ssh sessions so each one has a different foreground/background color. Let's say the prod machine has a green foreground and the backup is blue, and make that standard for everyone working with the terminals; that way it'll be harder to confuse the two. You can even have multiple ssh terminals and assign each one a different foreground color.
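
    One way to automate that is a wrapper that recolors the terminal before connecting; a sketch using the xterm-compatible OSC 11 escape (host patterns hypothetical):

      ssh() {
        case "$1" in
          *prod*) printf '\e]11;#401010\a' ;;  # reddish background for prod
          *)      printf '\e]11;#102040\a' ;;  # bluish for everything else
        esac
        command ssh "$@"
        printf '\e]11;#000000\a'               # restore when the session ends
      }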

  • @jfbeam · 1 year ago

    The #1 thing I learned WAY EARLY in my IT career (three decades): never delete anything you can't _immediately_ put back. Never do anything you can't undo. Instead of deleting the data directory, _rename_ it. If you're on the wrong system, that can easily be fixed. (And on a live DB server, that alone will be enough of a mess to clean up.)

    As for backups: if you aren't actively checking that (a) they've run, (b) they've completed successfully, and (c) they're actually usable... well, this is the shit you end up in. (The fact they're actively hiding ("lying") about this fiasco should be criminal.)
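
    In shell terms, the reversible version of the fateful command might look like this (paths hypothetical):

      ts=$(date +%Y%m%d-%H%M%S)
      mv /var/opt/pgsql/data "/var/opt/pgsql/data.removed.$ts"   # undoable in seconds
      # weeks later, once everything is verified healthy:
      # rm -rf /var/opt/pgsql/data.removed.*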

  • @kurenaigames5357 · 1 year ago

    Yeah, renaming is the key: first rename, then set everything up, and then delete the renamed folder a few months later.

  • @bennythetiger6052 · 1 year ago

    This video made me say "Oh... my... God..." way too many times 😂😂. Felt like some Chernobyl documentary about a bad sequence of actions. Love it! It's very insightful as to what can happen in these types of environments, as well as what measures can prevent major fails like that. It's also super interesting to see that, no matter how perfect a software system is, humans will still find a way to screw it up 😂

  • @blazi_0 · 1 year ago

    Bro, let's also not forget the damage that had already been done: the server was down for like 18 hours, and thousands of PRs, comments, issues and projects were deleted permanently. This should be a bigger deal

  • @mrsharpie7899 · 1 year ago

    I'd love to see the USCSB do an animation on this incident lmao

  • @daryl9915 · 1 year ago

    A couple of jobs ago, I had a colleague who managed to do worse than this. I think they were playing about with learning Terraform and managed to delete the entire account: prod servers, databases, the dev/QA servers, disk images, even the backups. Luckily it was a smaller account hosting a handful of tiny, trivial legacy sites, but even so, we didn't see them for the rest of the week after that mishap

  • @lashlarue7924 · 8 months ago

    😱😱😱😱😱😱😱😱😱😱😱😱😱😱

  • @k7y · 1 year ago

    At my previous job, anything done on production servers required a change request, which took about a week to get approved, and complex commands had to be tested in a lab environment before they could be copy-pasted to a production server.

  • @thetophattedanon
    @thetophattedanon10 ай бұрын

    I do not know how I got here, I don't get most of the video, But I am absolutely lovin' It as It's bloody entertaining.

  • @markh3684 · 1 year ago

    Mistakes in the moment happen. I'm focusing more on the "we thought things were working as expected" parts: the backup process familiarity, backups not going to S3, the Postgres version mismatch, insufficient WAL space, alert email failures, diligence on abuse deletes... These were all things that could have been, and should have been, caught way before the actual incident.

  • @CarrotCastle · 1 year ago

    One of my first jobs in IT was working as a big data admin, and this video lets me re-live the spicy moments of that job with none of the responsibility attached

  • @NexusGamingRadical · 1 year ago

    My first tech lead told me almost the exact same story after I got stressed out from breaking some layout on our production website. Instantly de-stressed; it was a great intro to software engineering :D

  • @oliver-shi · 1 year ago

    I never expected to see a public postmortem for something like this. Cool video!

  • @jeromesimms · 1 year ago

    Wow! This was great and so interesting. I'm so glad I found this channel. I would love to see more in-depth analyses of software engineering fails

  • @hchris96 · 1 year ago

    I didn't realize I would like these videos, but you are a good storyteller for production issues and I hope to see more in the future. I'm gonna share this with some of my coworkers

  • @hasanpatel9029 · 11 months ago

    The "GG" in the "what to do if you delete your production DB" part always gets me. Nice content.

  • @chrisfung443 · 2 months ago

    Lucky guy, that he found the data in a manual snapshot instead of the backups. Can't imagine how team-member-1 was feeling in that moment.

  • @wojtekpolska1013
    @wojtekpolska1013 Жыл бұрын

    respect for not firing the guy, it was obviously just a small mistake, and it wasn't his fault that the backups didn't work. it shouldn't be possible for 1 command to completely delete everything in the first place. Good that they didn't just use him as a scapegoat :p

  • @yerpderp6800
    @yerpderp6800 A year ago

    If they fired him, they would just reintroduce the possibility of the same thing happening again in the future. I'm pretty sure that employee will be paranoid for a loooong time and will double-check from now on lol. An expensive lesson, but a lesson nonetheless.

  • @tuxie93
    @tuxie93 A year ago

    Yep, and he'll train new employees, making super sure to emphasize triple-checking before deleting from prod.

  • @D00000T
    @D00000T 10 months ago

    That's Unix systems for you. Their open nature makes them super useful for a lot of things, but it also makes them so easy to break. Plus, that old trick of telling new Linux users that sudo rm -rf is a cool easter egg command wouldn't be the same with more safeties and preventions.
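
    For what it's worth, GNU rm does ship a couple of opt-in safeties; a minimal sketch (the saferm whitelist is a made-up convention, not anything from the postmortem):

        # --preserve-root (the default in modern coreutils) refuses to act on / itself:
        #   rm -rf /    ->  "it is dangerous to operate recursively on '/'"
        # In ~/.bashrc, -I prompts once before removing >3 files or recursing:
        alias rm='rm -I'

        # A home-grown guard: only allow recursive deletes under known scratch paths.
        saferm() {
            case "$1" in
                /tmp/*|/var/tmp/*) command rm -rf -- "$1" ;;
                *) echo "saferm: refusing to recursively delete '$1'" >&2; return 1 ;;
            esac
        }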

  • @BitTheByte
    @BitTheByte 10 months ago

    What if I want to delete everything? I don't want a baby-proofed OS. I want an OS that does what I want, even if I want to burn it all.

  • @wojtekpolska1013
    @wojtekpolska1013 10 months ago

    @@BitTheByte why buy a computer at that point lol

  • @hummel6364
    @hummel6364 A year ago

    In my vocational school I had a subject simply called "Databases", and our teacher there once told us a story about how one of his co-workers lost his job. In essence, the guy did everything right: he created his backups and backup scripts, and everything worked. At some point during the lifetime of the server this was running on, someone replaced a hard drive for whatever reason. This led to a change of the device UUID, which he had hard-coded into his backup script, so when the main database failed a year or two later, they tried restoring from this backup only to find that there was none. It wasn't even really his fault; the only mistake he made was not implementing enough fail-safes. Nowadays we have it comparatively easy with all the automatic monitoring and notifications, but this was at least 30 years ago.
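
    In bash terms the failure mode looks roughly like this (the UUID and paths are invented); the fix is simply to scream when the hard-coded device can no longer be found, rather than carrying on:

        #!/usr/bin/env bash
        set -euo pipefail
        BACKUP_UUID="c0ffee00-dead-beef-0000-000000000000"   # hypothetical UUID

        # blkid -U exits non-zero when no device carries this UUID,
        # e.g. after the drive has been swapped out.
        device=$(blkid -U "$BACKUP_UUID") || {
            echo "FATAL: backup disk $BACKUP_UUID not found; was the drive replaced?" >&2
            exit 1
        }
        mountpoint -q /mnt/backup || mount "$device" /mnt/backup
        tar czf "/mnt/backup/db-$(date +%F).tar.gz" /var/lib/postgresql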

  • @thewhitefalcon8539
    @thewhitefalcon8539 A year ago

    I guess that could have been solved by testing the backups: install the database software on a spare server, or just your own workstation, and then restore the backup onto it.
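
    A minimal sketch of that restore test for Postgres, assuming a pg_dump custom-format file and a throwaway database; the dump path and the users table are placeholders:

        #!/usr/bin/env bash
        set -euo pipefail
        DUMP=/mnt/backup/db-latest.dump   # hypothetical path

        createdb restore_test
        pg_restore --dbname=restore_test "$DUMP"

        # A backup that restores zero rows is as bad as no backup at all.
        rows=$(psql -At -d restore_test -c "SELECT count(*) FROM users;")
        (( rows > 0 )) || { echo "FATAL: users table came back empty" >&2; exit 1; }

        dropdb restore_test
        echo "Restore test passed: users has $rows rows"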

  • @hummel6364
    @hummel6364 A year ago

    @@thewhitefalcon8539 Well, the backup ran properly for years; he just never thought that the UUID might change.

  • @thewhitefalcon8539
    @thewhitefalcon8539 A year ago

    @@hummel6364 I suppose as long as he's employed he should probably be checking the backup at least every couple of months. Would I have remembered to do that? I dunno, but I'm not employed as a database admin.

  • @yerpderp6800
    @yerpderp6800 A year ago

    @@hummel6364 Yeah, he kind of deserved to be fired... it feels like it should be common sense that the HDD could fail; there's no good excuse not to expect that. You should almost never hardcode stuff; not sure why they thought it was okay to hardcode the UUID of a drive that would one day fail.

  • @hummel6364
    @hummel6364 A year ago

    @@yerpderp6800 I think the idea was that the device might change from sdX to sdY when other drives are added or removed, so using the UUID was the only simple and safe way to do it.

  • @jim2lane
    @jim2lane 11 months ago

    OMG, we have all been there, haven't we? That awful, dreadful realization after deleting something that you shouldn't have. Mine was back in the days of manual code backups, before ALM tools were ubiquitous like they are today. I thought I had taken the last three days of code changes and overwritten the old backups that were no longer needed. Then I realized I had done the exact opposite and just deleted three complete days of coding, which I would now have to recreate from scratch 😒😭

  • @stevencoetzee1597
    @stevencoetzee1597 A year ago

    By far the most suspense I have ever felt during a dev story.

  • @edc2186
    @edc2186 A year ago

    As a dev for a large company who has been on a number of late-night calls, I literally gasped at this. But good on the team for working through the issue, and good on management for keeping these guys around.

  • @CharlesChacon
    @CharlesChacon A year ago

    I'm pretty sure this event only ended up affecting things like comments and issues, but not the actual git repositories themselves, which would have been a huge relief, I imagine. Still, this was one of the most interesting things I've ever followed, and it ended up motivating me to learn a ton about databases, cloud practices, DevOps, and everything-as-code culture. Thanks for providing such a great lesson, GL. And huge kudos to them for the transparency.

  • @yaboirairai
    @yaboirairai 11 months ago

    Omg, I have never felt as stressed watching a YouTube video as I did at 4:02. I had to pause and catch my breath.

  • @CryptbloomEnjoyer
    @CryptbloomEnjoyer 8 months ago

    I know the exact feeling of terror the moment you realize the command you just ran is about to cause havoc.

  • @malborboss
    @malborboss A year ago

    We need more videos like this one. This was amazingly interesting.

  • @hououinkyouma2426
    @hououinkyouma2426 A year ago

    Can't wait for part 2

  • @kevinfaang
    @kevinfaang A year ago

    Could just be missing the sarcasm, but if you're referring to the ending, Google Bard isn't exactly the best at being factually accurate...

  • @Xanhast
    @Xanhast A year ago

    @@kevinfaang maybe he's being ominous :o

  • @enesemrebulut4044
    @enesemrebulut4044 8 months ago

    Loved the content. Thanks, @Kevin Fang!

  • @thunderchild4816
    @thunderchild4816 A year ago

    This is my worst nightmare. I was on a release where a database administrator did something similar, and he spent a good 10 minutes swearing. We had backups, but it was a pain to find which one to use and get it set up.

  • @justdoityourself7134
    @justdoityourself7134 A year ago

    Having a live screen share with team members watching might seem a little wasteful, but for critical procedures like this it is well worth the added cost.

  • @Navak_
    @Navak_ 10 months ago

    Most people don't see the importance of such an extreme level of caution until it's too late. It's like handling a firearm.

  • @eboubaker3722
    @eboubaker3722 A year ago

    Wow, the amount of stuff I learned here is huge. Please make more reviews like these. I subscribed and turned on notifications, so please don't disappoint me!

  • @eswarnichtsmehrfrei
    @eswarnichtsmehrfrei 4 months ago

    All my backup jobs have to report to an uptime service.
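
    The pattern is simple enough to sketch: ping a dead-man's-switch URL only after the backup verifiably succeeds, so a silent failure turns into a missed heartbeat and an alert. The URL and database name below are placeholders for a healthchecks.io-style service:

        #!/usr/bin/env bash
        set -euo pipefail
        HEARTBEAT_URL="https://hc-ping.com/00000000-0000-0000-0000-000000000000"

        if pg_dump -Fc mydb > "/mnt/backup/mydb-$(date +%F).dump"; then
            # Only a successful dump reaches this line; a failed one sends no ping,
            # and the uptime service raises the alarm after its grace period.
            curl -fsS --retry 3 "$HEARTBEAT_URL" > /dev/null
        fi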

  • @sarthakbhutani9460
    @sarthakbhutani9460 A year ago

    Haha, thanks for turning the whole thing into a story. Really entertaining, and I learned a lot :P

  • @johnthomas2970
    @johnthomas2970 A year ago

    This gives me good insight into why our tech team keeps breaking shit….

  • @iTsBadboyJay
    @iTsBadboyJay A year ago

    Absolute nightmare. Loved every minute of this.

  • @loupassakischristos9758
    @loupassakischristos9758 A year ago

    I experienced something similar a couple of years ago; it's the kind of thing you think can only happen to others, but yeah... I had to delete some specific data from the production database, so I wrote the SQL statements and ran them against the testing environment. The datasets in those databases are completely different, and the statements passed without any issue. But when I ran them against production they were taking way too long, and then I realised. I almost had a panic attack. I reported the incident immediately and was mentally prepared to be fired. Fortunately we could retrieve most of the data from a backup, and what was lost wasn't that big of an issue. I still work at the same company :p
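
    One guard that catches exactly this class of mistake is bracketing the destructive statement in a transaction, so you see the affected row count against the real production data before anything is committed. A sketch; the table, predicate, and database name are made up:

        # Dry-run inside a transaction: see the real "DELETE n" count, then roll back.
        {
            echo "BEGIN;"
            echo "DELETE FROM events WHERE created_at < '2016-01-01';"
            echo "SELECT count(*) AS rows_remaining FROM events;"
            echo "ROLLBACK;   -- change to COMMIT only once the counts look right"
        } | psql -d production -v ON_ERROR_STOP=1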

  • @beatrizdominguez9149
    @beatrizdominguez9149 11 months ago

    This is really good! What editing software do you use?

  • @Socsob
    @Socsob A year ago

    It's so cool to get a look at the inner workings of a team like this.

  • @jumbo_mumbo1441
    @jumbo_mumbo1441 A year ago

    Honestly, the worst part of this was all the backup failures.

  • @james123428
    @james123428 5 months ago

    Very interesting and easy to understand for a layman. I'm sure most of us could also learn from the mistake even if we don't deal with databases or code. PS: If you meant to blur/delete the names at 9:50, you missed the "replying to" part.

  • @ChandravijayAgrawal
    @ChandravijayAgrawal 5 months ago

    One thing I learnt from all this: never run a delete command lightly, and if you have to, paste a screenshot of the command in your group chat before running it.

  • @bmo3778
    @bmo3778 A year ago

    I barely understand anything here, but all I can say is a massive thanks to the teams who have worked hard advancing our computer tech to its current state!

  • @jsvanderburgh
    @jsvanderburgh A year ago

    Great video, nice editing, and just very entertaining overall!

  • @joseaca1010
    @joseaca1010 A year ago

    I can't even imagine the sheer terror team-member-1 felt when he saw which DB he had run the delete command in.

  • @JaxVideos
    @JaxVideos 4 months ago

    A long time ago, in our thriving software shop, on a corporate network of 50 or so SGI workstations and some heavier iron, a script's rm -rf line accidentally picked up a space character after one of the leading '/'s of a file name. As all disks were remotely mounted, this became a company-wide total deletion, after midnight, with the main server room locked tight.
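
    In bash terms the footgun and its modern guards look roughly like this; the paths and variable names are invented, and the dangerous lines are left commented out:

        # rm -rf / usr/local/scratch     # one stray space: '/' becomes its own argument

        BUILD_DIR=""                      # imagine an upstream typo left this empty
        # rm -rf "$BUILD_DIR/"            # empty variable: expands to rm -rf "/"

        # ${var:?} makes the shell abort with an error instead of expanding to nothing:
        rm -rf "${BUILD_DIR:?must not be empty}/"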

  • @rishavmasih9450
    @rishavmasih9450 A year ago

    Oh God, my heart started sinking when you said he noticed which shell he was running the command in.

  • @Rametesaima
    @Rametesaima A year ago

    I've always been paranoid when working in prod. I always make it a point to have at least the ops lead on a screen-sharing session where I show what I'm doing, requesting affirmative acknowledgement of each step before proceeding. It's annoying. It's slow. But boy oh boy does it make me feel safer.

  • @isaiahsmith6016
    @isaiahsmith6016 A year ago

    It may be slow, but look at it this way: you're probably saving a lot more time in the long run by preventing something horrible from happening in the first place.

  • @MPSmaruj
    @MPSmaruj 11 months ago

    Also, one thing I used to scoff at when I was a newbie was assigning names as aliases to your servers: actual words instead of numbers. It seemed a little asinine to me at first, but even in this scenario, it's much easier to confuse db1 and db2 than, e.g., amelie and bertrand.
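
    That idea can be pushed into the tooling too. A small sketch, with hostnames and addresses invented, using SSH aliases with human names plus a prompt that makes production unmistakable:

        # ~/.ssh/config: memorable names instead of db1/db2
        #   Host amelie
        #       HostName 10.0.0.11    # primary database
        #   Host bertrand
        #       HostName 10.0.0.12    # secondary database

        # ~/.bashrc on the production box: a loud red prompt, so a shell on
        # amelie is never mistaken for one on bertrand.
        PS1='\[\e[1;97;41m\] PROD \h \[\e[0m\] \w \$ '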

  • @Sakrosankt-Bierstube
    @Sakrosankt-Bierstube A year ago

    That is actually a very entertaining yet very educational video. I love it!

  • @MrB10N1CLE
    @MrB10N1CLE A year ago

    3:52: it was at this moment that the viewers collectively screamed, transcending space-time and raising a cosmic choir of dread and regret.
