Distributed Checkpoint - Iris Zhang & Chien-Chin Huang, Meta

This talk presents checkpointing features for distributed training. Distributed checkpoint supports saving and loading from multiple ranks in parallel. It handles load-time resharding, which enables saving in one cluster topology and loading into another. It also supports saving under one parallelism strategy and loading into another. It is currently adopted by IBM, Mosaic, and XLA for FSDP checkpointing, and it is also used for checkpointing support in the Shampoo OSS release. We will talk about what distributed checkpoint supports today and what is coming up next.
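To make the idea of load-time resharding concrete, here is a minimal toy sketch in plain Python. It is not the PyTorch distributed checkpoint API: the helper names (`shard_bounds`, `save_sharded`, `load_shard`), the flat-list "tensor", and the even-chunking layout are all assumptions chosen for illustration. Each saving rank writes only its own slice, keyed by its global offsets, and each loading rank reassembles its slice from whichever saved shards overlap it, so the world size at load time may differ from the one at save time.

```python
from typing import Dict, List, Tuple

# Toy model only (hypothetical helpers, not the PyTorch API).
# A checkpoint maps a half-open [start, end) range of the global
# flat tensor to the values that one saving rank owned.
Checkpoint = Dict[Tuple[int, int], List[float]]


def shard_bounds(total: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) range owned by `rank` under even chunking."""
    chunk = -(-total // world_size)  # ceiling division
    start = min(rank * chunk, total)
    return start, min(start + chunk, total)


def save_sharded(full: List[float], world_size: int) -> Checkpoint:
    """Simulate every rank saving its own shard, keyed by global offsets."""
    ckpt: Checkpoint = {}
    for rank in range(world_size):
        start, end = shard_bounds(len(full), world_size, rank)
        if start < end:
            ckpt[(start, end)] = full[start:end]
    return ckpt


def load_shard(ckpt: Checkpoint, total: int, world_size: int, rank: int) -> List[float]:
    """Rebuild the shard `rank` needs by copying overlapping saved ranges."""
    start, end = shard_bounds(total, world_size, rank)
    out = [0.0] * (end - start)
    for (s, e), values in ckpt.items():
        lo, hi = max(s, start), min(e, end)
        if lo < hi:  # the saved shard overlaps the requested range
            out[lo - start:hi - start] = values[lo - s:hi - s]
    return out


weights = [float(i) for i in range(10)]
ckpt = save_sharded(weights, world_size=4)  # saved by 4 ranks
# Loaded on 3 ranks: a different topology than the one used to save.
resharded = [load_shard(ckpt, len(weights), 3, r) for r in range(3)]
```

In this sketch the saved metadata (the offset ranges) is what makes resharding possible: the loader never needs to know how many ranks wrote the checkpoint, only which global ranges exist.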
