Batch Systems in Production with Kueue: Multi-Tenancy and Fungibility - Yuki Iwai & Aldo Culquicondor


Batch Systems in Production with Kueue: Multi-Tenancy and Fungibility - Yuki Iwai, CyberAgent, Inc. & Aldo Culquicondor, Google
Kueue is a cloud-native job scheduler with which you can build a multi-tenant batch system on a Kubernetes cluster. Kueue implements job queueing, deciding when jobs should wait and when they should start, based on quotas, priority, and a hierarchy for sharing heterogeneous resources among teams. Kueue works on-premises and in autoscaled environments in the cloud. In this talk, you will learn about Kueue’s architecture and its extensibility to support a variety of workloads. You will also learn how Kueue is used in production in self-managed clusters, serving multiple machine-learning researchers, MLOps engineers, and data scientists. Kueue provides fair use while maximizing the utilization of accelerators and other resources through its borrowing and preemption mechanisms. Kueue is used with frameworks and workload types such as DeepSpeed, PyTorch, the Kubernetes Job, RayJob, and Jupyter.
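
As a rough illustration of the quota and borrowing idea described above, here is a toy model in Python. It is not Kueue's API or its actual admission algorithm: the `Queue` and `Cohort` classes, the `try_admit` method, and the GPU counts are all hypothetical names and numbers chosen for this sketch. It only shows how a job might start within its team's own quota, start by borrowing quota that another team is not using, or wait.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    """One team's queue with a nominal GPU quota (hypothetical model)."""
    name: str
    nominal_quota: int
    used: int = 0


@dataclass
class Cohort:
    """A group of queues that may lend unused quota to one another."""
    queues: dict = field(default_factory=dict)

    def add(self, queue: Queue) -> None:
        self.queues[queue.name] = queue

    def total_quota(self) -> int:
        return sum(q.nominal_quota for q in self.queues.values())

    def total_used(self) -> int:
        return sum(q.used for q in self.queues.values())

    def try_admit(self, queue_name: str, gpus: int) -> str:
        """Decide whether a job asking for `gpus` may start now."""
        q = self.queues[queue_name]
        within_nominal = q.used + gpus <= q.nominal_quota
        fits_in_cohort = self.total_used() + gpus <= self.total_quota()
        if fits_in_cohort:
            # Capacity is physically free: start the job, noting whether
            # the queue stays inside its own quota or borrows from peers.
            q.used += gpus
            return "admitted" if within_nominal else "admitted (borrowing)"
        if within_nominal:
            # The queue is entitled to this capacity, but a peer is
            # currently borrowing it. A real system could preempt the
            # borrower here; this sketch only reports the situation.
            return "pending (would need to reclaim lent quota)"
        return "pending"


cohort = Cohort()
cohort.add(Queue("team-a", nominal_quota=4))
cohort.add(Queue("team-b", nominal_quota=4))

print(cohort.try_admit("team-a", 3))  # inside team-a's own quota
print(cohort.try_admit("team-a", 3))  # exceeds team-a's quota, borrows from team-b
print(cohort.try_admit("team-b", 4))  # team-b's share is partly lent out
```

In this sketch the third request stays pending because team-a is using capacity that nominally belongs to team-b. That is the situation in which a preemption mechanism would reclaim the borrowed resources.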
