Large Language Model inference with ONNX Runtime (Kunal Vaishnavi)

Science and technology

Learn how to combine the powerful capabilities of LLaMA-2, Mistral, Falcon, and similar models with the optimization and quantization improvements in ONNX Runtime (ORT). To make these models run efficiently and be available on all devices, we have introduced several optimizations, such as graph fusions and kernel improvements, in ORT's inference engine. In this talk, we go over the details of these optimizations and demonstrate the performance gains.
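
The talk itself is a walkthrough rather than a live-coding session, but the graph-level optimizations it discusses can be exercised through ORT's public Python API. Below is a minimal sketch, assuming a placeholder model file ("llama2.onnx") exported ahead of time; the model path and provider list are assumptions, not artifacts from the talk:

import onnxruntime as ort

# Enable the full set of graph-level rewrites (including operator fusions)
# that ORT applies before the session is built.
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# "llama2.onnx" is a placeholder path; ORT falls back to the CPU provider
# if the CUDA provider is unavailable on the machine.
session = ort.InferenceSession(
    "llama2.onnx",
    sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())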
Links:
github.com/microsoft/onnxruntime
github.com/microsoft/onnxrunt...
onnxruntime.ai/blogs
onnxruntime.ai/docs/
Internal optimizations: github.com/microsoft/onnxrunt...
External optimizations: github.com/microsoft/onnxrunt...
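
For the quantization improvements mentioned in the abstract, ORT's standard quantization utilities offer one entry point. A minimal sketch of dynamic INT8 weight quantization follows; both file paths are placeholders, not files distributed with the talk:

from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the exported model's weights from FP32 to INT8; activations
# are quantized dynamically at run time. Paths are placeholders.
quantize_dynamic(
    model_input="llama2.onnx",
    model_output="llama2.int8.onnx",
    weight_type=QuantType.QInt8,
)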
