How We Power the Largest AI Deployments on the Planet: Running Virtual Clusters at Scale - Brandon Jacobs, CoreWeave & Lukas Gentele, Loft Labs
Running and managing a large number of Kubernetes clusters on bare metal poses significant challenges, from security to GPU provisioning to scalability. Specialized cloud provider CoreWeave experienced these first-hand, operating 3,000+ Kubernetes clusters on top of 5,000 bare metal nodes with massive numbers of GPUs to power modern AI applications at scale. In this session, we'll dive into these challenges and how CoreWeave partnered with Loft Labs, the maintainers of vcluster, to create a serverless Kubernetes experience for numerous companies running AI workloads at scale. The session covers the pitfalls, design choices, and architectural challenges the teams have dealt with over the course of 3 years while evolving the serverless Kubernetes offering, including:

- Secure isolation of tenants on a shared infrastructure
- Challenges in achieving 10-second autoscaling
- On-demand cluster and compute provisioning for tenants
- Day 2 operations and managing a fleet of clusters at scale
