Kubernetes Events Are Broken (If You Are Building a Developer Portal)

Science & Technology

Are you building a developer portal and relying on Kubernetes events to monitor and manage your cluster? You might want to rethink your strategy. In this video, we delve into the pitfalls and challenges of using Kubernetes events in developer portal contexts. Learn about the limitations of Kubernetes events and how they hinder day-2 operations in Internal Developer Platforms (IDPs).
#Kubernetes #DeveloperPortal #InternalDeveloperPlatform
Consider joining the channel: / devopstoolkit
▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬
➡ Gist with the commands: gist.github.com/vfarcic/ba423...
▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬
If you are interested in sponsoring this channel, please use calendar.app.google/Q9eaDUHN8... to book a timeslot that suits you, and we'll go over the details. Or feel free to contact me over Twitter or LinkedIn (see below).
▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬
➡ Twitter: / vfarcic
➡ LinkedIn: / viktorfarcic
▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬
🎤 Podcast: www.devopsparadox.com/
💬 Live streams: / devopsparadox

Comments: 67

  • @DevOpsToolkit · 5 months ago

    How do you propagate events to parent resources?

  • @SeanChitwood · 5 months ago

    You don't propagate Events to parents; the controller watches for events related to the objects it creates, or at least the kind of objects it creates.

  • @droogielamer · 5 months ago

    Event propagation would probably give rise to scalability issues. Just imagine an application running on hundreds of nodes, with thousands of pods, and all the events propagating up to, for example, one Deployment object. How many write operations per second would we push to etcd just to keep the event log for this deployment up to date? I understand the frustration, but I believe it's actually more or less correctly implemented. If you want to dig deeper into events, then I believe the way Crossplane does it - by digging - is the correct way.

    If I remember correctly, EndpointSlices were introduced to (among other things) reduce the need to update Service objects on all nodes whenever pods got started/terminated, which made it difficult to scale k8s to thousands of nodes. I believe propagating events would put orders of magnitude more pressure on sync.

    It's probably possible to create some sort of controller which listens for all events, annotates them with contextual information (cluster, pod, namespace, metadata, whatnot), and pushes them off to a designated system, which could map them to parent resources and be eventually consistent. That way it won't degrade the k8s control-plane performance that much.

  • @sergeyp2932 · 5 months ago

    I can see two problems: aggregating events from child resources, and filtering events by urgency. The first problem can be solved relatively easily on the client side by adding a "recursive walk" mode (searching for resources with an ownerReference to the requested one, then searching for resources with references to the found ones, et cetera) for events as well as for statuses. Maybe it can also be done on the server side via some API extension, by creating an "aggregated" resource whose events and statuses are populated from a specific subset of resources (for example, aggregated ClusterRoles do something similar for RBAC rules).

    The second problem is event filtering, and it looks not so easy to solve, as it requires reconsidering the usage of the "type" property of the Event object. For now, Kubernetes supports two standard event types: "Normal" and "Warning". Most syslog implementations, for example, support eight levels of urgency (emergency, alert, critical, error, warning, notice, info, debug). It looks like Kubernetes events need something similar.

  • @DevOpsToolkit · 5 months ago

    💯

  • @juanbreinlinger · 5 months ago

    As a k8s "implementor", I'd say I have been bitten by that at some point. I believe k8s has some dark corners where it is kind of hard to understand what's going on when things don't go as expected. The design is not consistent: on one hand you have built-in resources like Deployments, ReplicaSets, Namespaces, etc., and on the other, CRDs. Maybe the answer would be to have a bare-bones k8s without resources, and then install Deployments, ReplicaSets, and Namespaces as CRDs with their corresponding controllers? We could even have simpler models where there is no need for a ReplicaSet, for example. That would be great. But I believe there must be some side effects, like chicken-and-egg problems or deadlocks... who knows. These are really complex problems to solve, and normally very smart people are behind them.

    Finally, after working many years in a combination of big companies and small startups, I have serious doubts that developers can deal with infrastructure and vice versa. It's too much for a single role. Each time I land in a company that decided to go with "everybody does everything" in a "devops" way, it is far from ideal, to say the least. Infrastructure folks (SRE/DevOps/sysadmins, you name it) can make a huge impact on teams' efficiency, performance, and maintainability.

  • @makinote · 5 months ago

    I completely agree, but also, developers have other, more important stuff to learn/do: actually improving the product that the company sells and its business logic.

  • @tobiaskasser2700 · 5 months ago

    Great video Viktor, feeling it every day.

  • @autohmae · 5 months ago

    My workaround, which I haven't built yet because I'm still at the start of that journey, is to also generate all the monitoring for all the things and have the monitoring show that a lower-level thing failed. I suggest you create a Kubernetes Enhancement Proposal. 🙂

  • @fanemanelistu9235 · 5 months ago

    I don't have a solution. I didn't even know what was bothering me before you spelled it all out.

  • @SeanChitwood · 5 months ago

    So, the propagation should be done by the operators: when an event related to a resource they created happens, they apply their knowledge of their resource to create a new event (if necessary) and write it, setting the related-resource field to the resource they are responding to. So the batch operator watches the Jobs that it creates, and when one fails or does something the operator feels the need to report on, it creates the new event.

    What I think may be the break in the system (I agree it could be better) is that the cronjob-controller doesn't populate the related field with a reference to the Job that completed/failed. The information is in the message field, which makes it human-readable but not machine-readable. Which is kind of funny, because too many times error messages are machine-readable and not human-readable. Basically, I'm saying the Events are not broken (all the tools are there), but the controllers are (not using the tools).

  • @DevOpsToolkit · 5 months ago

    You're right. It is indeed the fault of controllers, at least in part. We still need a mechanism to choose which events will be propagated by controllers.

  • @SeanChitwood · 5 months ago

    @@DevOpsToolkit Who should choose? I feel like the expertise always lies with the team creating the controller. My point is that we likely have the events we need, but we need more metadata around why something failed. The controller has the expertise to decide what to publish and what not to, and the Event object has the structure to allow a causal chain to be built up, at least for resources that are in Kubernetes. External resources are harder, but some kind of ExtendedEvent CRD should be able to help. If all controllers, when creating their event, set the related field to the event they are responding to, then you end up with a chain that goes directly to the initial failure, because Events are objects and you can create an ObjectReference for them.

  • @DevOpsToolkit · 5 months ago

    @SeanChitwood oh yeah. If you are creating a controller, you are in control.

  • @randomcontrol · 5 months ago

    Even as someone who understands all that, it's a pain having to traverse all the sub-resources to find out if there is a problem and where it is.

  • @MrEvgheniDev · 5 months ago

    @DevOpsToolkit Thank you. This is exactly the problem we're facing every day; another point for our platform roadmap, which is partly collected from your videos :)

  • @DevOpsToolkit · 5 months ago

    The current situation is, in a way, normal. It's to be expected to start with day-zero operations (creation of abstraction instances). I feel that we (the industry) have figured that out now, so it's time to move on to the next areas, and I feel that's observability for service consumers.

  • @MrEvgheniDev · 5 months ago

    @@DevOpsToolkit OK, but you mentioned it's not the best idea to propagate events to the parent resource; so where is the alternative place to collect this chain of events that is native to k8s? OpenTelemetry?

  • @DevOpsToolkit · 5 months ago

    I don't think I said that it is not a good idea to propagate events to parent resources. Quite the opposite. We need to do that, but through filters. We need to figure out what should be propagated and what shouldn't. In other words, I think we need to be able to define filters and propagate events that match them. That being said, there are additional complications, like the performance of etcd, that need to be solved as well.

  • @MrEvgheniDev · 5 months ago

    @@DevOpsToolkit But if we need to figure out what to propagate and what not, this is a custom operation for almost every resource type, which makes the question a bit more complex. What about not using etcd for that, but making another optional abstraction to keep the tracing pipe, and filtering it later when displaying/analyzing the problematic resource with standard tools?

  • @DevOpsToolkit · 5 months ago

    @MrEvgheni070 That is the key question. Etcd is problematic on quite a few levels, performance being one of them. I don't think the solution is to add something that will serve a similar function in parallel. That would result in a whole new set of problems; Kubernetes would suffer even more if it needed to sync everything to two data stores. I think we need to replace etcd altogether. I did not want to dive into the etcd subject in that video since that would mean diving into a completely separate subject. I'll try to explore it in one of the upcoming videos.

  • @neomotsumi · 5 months ago

    We have implemented a way of doing this for our IDP platform using Argo notifications (with a hard-refresh annotation, although it seems to have performance implications on Argo itself), pushing the notifications to a Redis instance and querying the resource state from there, allowing a custom IDP plugin to consume the events from Redis.

  • @DevOpsToolkit · 5 months ago

    A better way would be to replace etcd with something more capable to run at scale.

  • @pirolas001 · 5 months ago

    I have a different opinion. Maybe the problem is not that the people working with it are not experts; they are just not familiar with it, and that, for me, is the real problem. There are many ways to solve the issue, and we can build as many abstractions as we want; in the end, we need to learn how to drive a car before driving it. It's the same situation here. The problems you describe are not for "experts to solve"; they're the bare minimum for driving the car on our roads. Experts work on configuring and managing everything, as well as fixing really hard problems, not some deployment that a developer got wrong because he doesn't even know how to shift gears, or won't recognize a signal on the road because he never learned it. Nevertheless, there is work to be done at the logging level, and on that we both agree. Simply not for the same reasons.

  • @javisartdesign · 5 months ago

    Good point

  • @meyogi · 5 months ago

    Regarding specific Crossplane troubleshooting, I find Komoplane very useful (by the way, quite surprising that Upbound does not provide a similar tool ;))

  • @DevOpsToolkit · 5 months ago

    It's coming... :)

  • @meyogi · 5 months ago

    @@DevOpsToolkit Ahah, I was pretty sure that something was going to be proposed; thanks for the teaser 😄 Maybe a demo for Paris KubeCon in March...? (I will be there 😉)

  • @DevOpsToolkit · 5 months ago

    @meyogi can't reveal anything just yet :( See you in Paris.

  • @holgerwinkelmann6219 · 5 months ago

    In our (not yet Crossplane-based) operators, we just expose CR-related events, as expected, but that's easy since you execute the operations in the scope of the managed resources. That might be the case with all other tools as well; I can't see an automatic abstraction translating a low-level k8s event into a high-level CR event. But if your platform (i.e. Crossplane) supports functions and pipelines, the tool should be able to emit human-written events to the actual CR it is operating on, based on the state information the function has. BTW, I never understood why so many operators do not expose CR-specific events.

  • @mrgdevops · 4 months ago

    apple watch: its time to STAND UP

  • @Alex-li8nb · 5 months ago

    Maybe "kubectl describe" can benefit from a knob that acts similarly to "crossplane trace"?

  • @DevOpsToolkit · 5 months ago

    The problem with `kubectl describe` is that you need to first find out which child resource produced the events you're interested in. The issue is really with the design of controllers: it should be their responsibility to propagate events up.

  • @abhishekpareek1983 · 5 months ago

    New subscriber. Unrelated, but are your courses already out? Your linked site doesn't show any.

  • @DevOpsToolkit · 5 months ago

    Going out Thursday next week.

  • @abhishekpareek1983 · 5 months ago

    @@DevOpsToolkit looking forward to it!

  • @wishmeheaven · 2 months ago

    @@DevOpsToolkit Where can I find it? P.S. Thanks for the video!

  • @DevOpsToolkit · 2 months ago

    Where to find the Crossplane tutorial?

  • @wishmeheaven · 2 months ago

    @@DevOpsToolkit In your courses.

  • @hugolopes5604 · 5 months ago

    In theory, the end developer should never have to understand why the claim deployment fails. Validations of what is possible through the claim interface should be done before applying it, and if it fails, then it's the platform team's job to troubleshoot and add validations to the claim to properly inform the developer of what is wrong when he tries to apply the claim with the same issue.

    Imagine the same with APIs: why should the end user receive all the errors of a chain of API calls that happen beyond the API interface he knows? End users should not even have access to the events of issues beyond their claim. There is also the security angle: exposing exception stacks in API calls is a known vulnerability and should never be done. The same logic can apply to k8s CRD interfaces.

  • @DevOpsToolkit · 5 months ago

    I don't think it is so black and white. I do agree that there are issues that platform teams should solve, but I also think that there are issues that should be exposed to end users. You can observe the same with almost any service provider. There are issues in AWS that you don't see because they are internal, and there are those that you do see since they are meant to be seen by users. The same can be said for Heroku, Fly.io, or almost any other cloud service. The goal is to be able to surface those that are fixable by users and hide those that are not.

    Also, I did not say that all events should be propagated to parent resources. Quite the opposite: I said that's just as bad as having no visibility. What I'm advocating is a mechanism that allows service providers (platform teams?) to decide what should be propagated up the chain and what should not. That can be a subset of issues, or tailor-made messages based on issues, or something else. I'm advocating for middle ground: neither full visibility nor blindness.

  • @oleksandrsimonov9200 · 5 months ago

    In my experience, it's a bad idea to give such freedom to developers. Without a good architect, they will make a mess. So the architect and DevOps should define how the software will be delivered, and the developers must implement it.

  • @holgerwinkelmann6219 · 5 months ago

    Of course, propagation must go together with aggregation and relation; blindly propagating low-level events to a higher-level object does not make sense in terms of scalability and the noise presented to the user.

  • @ffelegal · 4 months ago

    Isn't this a problem with "microservice" architecture? Everything is so decoupled that only an observability tool can put everything together to present a unified, meaningful picture. There's so much of this with cloud computing as well. I guess in Argo CD you'd have a perfect visualization of the problems, no? Exactly what you want: a parent created a child, which created another child that failed?

  • @DevOpsToolkit · 4 months ago

    What I'm trying to say is that for a person experienced with Kubernetes that makes a lot of sense, but for those building platforms that will be used by others, it does not.

  • @ffelegal · 4 months ago

    @@DevOpsToolkit Nobody said it was easy haha. But I feel your pain. I guess this is the price of decoupling right there. Everything is decoupled: even the events, the errors, the understanding lol

  • @kurniadwin4597 · 5 months ago

    Even though developers have access to it, I don't think it will be helpful, because they can't do anything about it; that is the primary reason we abstract it. We want to hide low-level detail, so just popping the low-level event out to developers will not help, because in the end the developer will ask the tools (developer portal) team to fix it. Instead, I think it is better for the tools (developer portal) team to create their own monitoring; for example, if some resource provisioning is taking too long, it alerts the tools team so they can investigate right away.

  • @DevOpsToolkit · 5 months ago

    There should be a balance between the things a service consumer and a service provider should fix. For example, AWS is a service provider, and it will fix issues with its own infrastructure, but it will not fix issues caused by you doing something wrong with EC2. The same goes for developer portals. There must be a way to distinguish what is in the domain of consumers and what is in the hands of providers.

  • @DevOpsToolkit · 5 months ago

    It's definitely not easy, but it is solvable. Take a look at hyperscalers. They do not show you all the issues, but mostly those related to you as a service consumer. You don't see issues related to hypervisors when you create an EC2 instance, but you do see those related to EC2 itself. Going back to your example, I would assume that there are policies in place that allow only specific namespaces to be used (one of your examples); that's easy to do today. Now, to be clear, I am not saying that we will always be able to distinguish problems related to consumers from those related to providers. What I am saying is that we can get closer to solving it. Any percentage is better than 0%. That's also similar to how hyperscalers operate: it is not always clear whether an issue is related to what users did or is in the domain of providers (e.g. AWS). Still, even if it's clear in 50% of cases, that's not a bad deal.
